2.2 Data Lakes

2.2.1 Data Lake Architecture

Semi-structured and unstructured data do not fit neatly into a fixed schema. Data lakes address this by providing a central repository for storing large volumes of data with no predefined schema or set of transformations. Instead, they use a schema-on-read pattern - structure is applied when data is queried, not when it is written.
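As a rough illustration of schema-on-read, the sketch below stores raw JSON lines exactly as they arrive and applies a schema only when the data is read. The sample events, the `CLICK_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names made up for this example, not part of any particular data lake product.

```python
import json

# Raw events as they might land in a lake: nothing is validated at write time,
# so records can differ in shape. (Hypothetical sample data.)
RAW_EVENTS = [
    '{"user_id": "42", "event": "click", "ts": "2024-01-05T10:00:00"}',
    '{"user_id": "43", "event": "view"}',
    '{"user_id": "44", "event": "click", "ts": "2024-01-05T10:02:00", "extra": {"ab": "B"}}',
]

# The schema lives with the reader, not the storage: field name -> type caster.
CLICK_SCHEMA = {"user_id": int, "event": str, "ts": str}


def read_with_schema(lines, schema):
    """Parse JSON lines and project each record onto `schema` at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None; unknown fields are simply ignored.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, CLICK_SCHEMA))
for row in rows:
    print(row)
```

A different reader could apply a different schema to the same raw lines, which is the flexibility the pattern buys. The cost is that malformed or missing fields only surface at query time.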

| Property | Data Warehouse | Data Lake |
|---|---|---|
| Schema | Schema-on-write (defined before loading) | Schema-on-read (applied at query time) |
| Data types | Structured only | Structured, semi-structured, unstructured |
| Storage cost | Higher (optimized storage engines) | Lower (object storage like S3) |
| Query performance | Fast (pre-modeled, indexed) | Slower (no pre-optimization) |
| Flexibility | Low (rigid schema changes) | High (store anything, decide later) |

2.2.2 Data Lake 1.0 and Its Shortcomings

The first generation of data lakes combined storage technologies (Hadoop HDFS, Amazon S3) with processing engines (Apache Pig, Presto, Hive). While functional, they suffered from significant shortcomings:

  • Data swamp - without proper data management, cataloging, or discovery tools, there was no guarantee of data integrity or quality
  • Write-once storage - files could not be modified in place, so DML operations like deleting or updating rows required rewriting entire tables, making regulatory compliance painful
  • No schema management or data modeling - data was not optimized for query operations like joins, making it difficult to process

Large companies like Facebook built custom tooling to work around these issues, but most organizations struggled to extract value from Data Lake 1.0.

2.2.3 Next-Generation Data Lakes

Next-generation data lakes introduced several key improvements:

| Improvement | Description |
|---|---|
| Zones | Data organized by processing stage - raw landing, cleaned/transformed, and curated/enriched zones with appropriate governance at each stage |
| Data partitioning | Datasets divided by criteria like time or location, allowing queries to scan only relevant partitions |
| Data catalog | Centralized metadata about each dataset - owner, source, partitions, schema, and schema evolution history |
[Figure: Data lake zone architecture with raw, cleaned, and curated zones]
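The partitioning and catalog ideas can be sketched together in a few lines. The example below assumes a Hive-style `key=value` directory layout on the local filesystem as a stand-in for object storage. The dataset, the owner and source values, and the helper functions are all hypothetical. A query for one date opens only that date's directory, and the catalog entry records the metadata a reader would need to find it.

```python
import json
import os
import tempfile


def write_partitioned(root, records, key):
    """Write records into `key=value` directories under root; return partition values."""
    partitions = {}
    for rec in records:
        partitions.setdefault(rec[key], []).append(rec)
    for value, recs in partitions.items():
        part_dir = os.path.join(root, f"{key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0000.jsonl"), "w") as f:
            for rec in recs:
                f.write(json.dumps(rec) + "\n")
    return sorted(partitions)


def read_partition(root, key, value):
    """Read only the requested partition; returns (rows, files_scanned)."""
    part_dir = os.path.join(root, f"{key}={value}")
    rows, files_scanned = [], 0
    for name in sorted(os.listdir(part_dir)):
        files_scanned += 1
        with open(os.path.join(part_dir, name)) as f:
            rows.extend(json.loads(line) for line in f)
    return rows, files_scanned


# Hypothetical curated-zone dataset, partitioned by date.
curated_zone = os.path.join(tempfile.mkdtemp(), "curated", "orders")
orders = [
    {"order_id": 1, "date": "2024-01-01", "amount": 10.0},
    {"order_id": 2, "date": "2024-01-01", "amount": 25.5},
    {"order_id": 3, "date": "2024-01-02", "amount": 7.25},
]
partition_values = write_partitioned(curated_zone, orders, "date")

# A minimal catalog entry: the kind of metadata a data catalog might hold.
catalog_entry = {
    "dataset": "orders",
    "owner": "sales-data-team",          # hypothetical owner
    "source": "orders-service export",   # hypothetical source
    "zone": "curated",
    "partition_key": "date",
    "partitions": partition_values,
    "schema": {"order_id": "int", "date": "string", "amount": "double"},
}

# Partition pruning: only the 2024-01-02 directory is opened.
rows, files_scanned = read_partition(curated_zone, "date", "2024-01-02")
print(rows, files_scanned)
```

Real catalogs and query engines track far more detail (file statistics, schema versions, access policies), but the principle is the same: metadata tells the engine which files it can skip.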

The Two-System Problem

Even with these improvements, a fundamental limitation remained: organizations still needed both a data lake (for low-cost, high-volume storage) and a separate data warehouse (for high-performance analytical queries). Moving data between the two was expensive, introduced bugs and failures, and risked data quality, duplication, and consistency issues. This "two-system problem" motivated the next evolution - the data lakehouse.