2.3 Data Lakehouse

2.3.1 The Data Lakehouse Architecture

The data lakehouse, a term popularized by Databricks, combines the flexibility and low-cost storage of a data lake with the query performance and data management features of a data warehouse, so that teams no longer need to maintain two separate systems.

The key insight is that an open table format layer sits between the query engines and the raw storage, adding transactional capabilities and metadata management on top of cheap object storage.

[Figure: Data lakehouse architecture with query engines, open table format, and object storage layers]

2.3.2 Open Table Formats

The lakehouse architecture is made possible by open table formats - specifications that add a transactional metadata layer over data files stored in a lake. They enable record-level updates and deletes with ACID guarantees.

Format | Origin | Key Differentiator
Delta Lake | Databricks | Tight Spark integration, transaction log-based, most mature ecosystem
Apache Iceberg | Netflix | Engine-agnostic design, hidden partitioning, best multi-engine support
Apache Hudi | Uber | Optimized for incremental upserts and streaming ingestion

Shared Capabilities

The three formats provide, to varying degrees, capabilities that were previously only available in traditional data warehouses:

  • ACID transactions - concurrent reads and writes without data corruption, even at scale
  • Time travel and snapshots - query data as it existed at any earlier snapshot that is still retained, enabling auditing and rollback
  • Schema evolution - add, drop, or rename columns without breaking existing queries or rewriting data (in some formats, renames and drops require specific table settings)
  • Partition evolution - change partitioning strategies on existing tables without a full rewrite; this is native to Iceberg, while support in Delta Lake and Hudi is more limited
  • Open source - the formats are openly specified, so multiple query engines (Spark, Trino, Flink, Athena) can read and write the same data
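
To illustrate how a column rename or addition can be a metadata-only change, here is a minimal sketch that tracks columns by stable ID rather than by name, which is the idea behind Iceberg's field IDs. The `ToySchema` class, its methods, and the sample column names are hypothetical and do not correspond to any format's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchema:
    """Maps stable column IDs to their current names (hypothetical structure)."""
    names_by_id: dict = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name):
        """Register a new column and return its permanent ID."""
        col_id = self.next_id
        self.names_by_id[col_id] = name
        self.next_id += 1
        return col_id

    def rename_column(self, old, new):
        """Change only the name in metadata; stored data is untouched."""
        for col_id, name in self.names_by_id.items():
            if name == old:
                self.names_by_id[col_id] = new
                return
        raise KeyError(old)

    def project(self, stored_row):
        """Present a stored row (keyed by column ID) under the current names."""
        return {name: stored_row.get(col_id)
                for col_id, name in self.names_by_id.items()}


schema = ToySchema()
id_col = schema.add_column("id")
amount_col = schema.add_column("amt")

# A "data file" written under the original schema, keyed by column ID.
old_file = [{id_col: 1, amount_col: 9.5}]

# Evolve the schema afterwards: rename one column, add another.
schema.rename_column("amt", "amount")
schema.add_column("currency")

# The old file is read under the new schema without being rewritten;
# the column added later simply reads as None for old rows.
rows = [schema.project(row) for row in old_file]
print(rows)
```

Because the stored rows never reference column names, the rename does not touch `old_file` at all. Real formats apply the same principle to columnar files at much larger scale.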

How Open Table Formats Work

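Under the hood, data remains stored in open columnar file formats such as Parquet (and, in some table formats, ORC) on object storage. The table format adds a metadata layer - a set of manifest files or transaction logs that track which data files belong to each table version. When a write occurs, new data files are created and the metadata is atomically updated to point to the new snapshot. Because readers only ever see a fully committed snapshot, this provides ACID semantics without holding locks on the data files themselves; conflicts between concurrent writers are typically resolved through optimistic concurrency control.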
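
The write path described above can be sketched as a toy implementation on a local filesystem. This is only an illustration under simplifying assumptions: the `ToyTable` class, the `data/` and `_log/` directory names, and the use of JSON instead of Parquet are all hypothetical, and exclusive file creation stands in for whatever atomic commit primitive a real format relies on.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Toy table: immutable data files plus a numbered commit log (hypothetical layout)."""

    def __init__(self, root):
        self.data_dir = os.path.join(root, "data")
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.data_dir, exist_ok=True)
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty table."""
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def files_at(self, version):
        """Return the data files that make up the given table version."""
        if version < 0:
            return []
        with open(os.path.join(self.log_dir, f"{version:08d}.json")) as f:
            return json.load(f)["files"]

    def append(self, rows):
        """Write a new data file, then publish a snapshot that includes it."""
        base = self.latest_version()
        file_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.data_dir, file_name), "w") as f:
            json.dump(rows, f)
        snapshot = {"files": self.files_at(base) + [file_name]}
        commit_path = os.path.join(self.log_dir, f"{base + 1:08d}.json")
        # Mode "x" raises FileExistsError if another writer already
        # committed this version, so only one writer can win the race.
        with open(commit_path, "x") as f:
            json.dump(snapshot, f)
        return base + 1

    def read(self, version=None):
        """Read all rows as of a version (defaults to the latest)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for file_name in self.files_at(version):
            with open(os.path.join(self.data_dir, file_name)) as f:
                rows.extend(json.load(f))
        return rows


with tempfile.TemporaryDirectory() as root:
    table = ToyTable(root)
    v0 = table.append([{"id": 1, "city": "Oslo"}])
    v1 = table.append([{"id": 2, "city": "Lima"}])
    latest_rows = table.read()           # rows from both commits
    rows_at_v0 = table.read(version=v0)  # "time travel" to the first commit
```

Data files are never modified after they are written; a table version is just a list of files recorded in the log. A writer that loses the commit race would, in a fuller implementation, re-read the latest version and retry.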

2.3.3 Medallion Architecture

The Medallion Architecture is a widely adopted organizational pattern for structuring data within a lakehouse. Data flows through three layers - Bronze, Silver, and Gold - each adding progressively more structure and business value.

[Figure: Medallion Architecture: Bronze, Silver, and Gold layers]

Layer | Purpose | Data Characteristics
Bronze | Raw ingestion - exact copy of source data | Unprocessed, may contain duplicates, nulls, and schema inconsistencies
Silver | Cleaned and conformed | Deduplicated, type-cast, validated, and joined across sources
Gold | Business-level aggregations | Modeled into star schemas, aggregated metrics, or ML feature tables

The key benefit of this layered approach is reprocessability. If a transformation bug is discovered in the Gold layer, engineers can re-derive it from the Silver layer without re-ingesting from source systems. The Bronze layer serves as an immutable audit trail of everything that entered the pipeline, while the Silver layer provides a clean, reusable foundation that multiple Gold-layer consumers can build on independently.
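
The flow between layers can be sketched with plain Python data structures. This is a minimal illustration, not a production pipeline: the sample order records, the field names, and the `to_silver` and `to_gold` functions are all hypothetical, and a real lakehouse would run equivalent logic in an engine such as Spark over tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (strings, duplicates, nulls).
bronze = [
    {"order_id": "1", "amount": "10.50", "country": "us"},
    {"order_id": "1", "amount": "10.50", "country": "us"},  # duplicate
    {"order_id": "2", "amount": None, "country": "DE"},     # missing amount
    {"order_id": "3", "amount": "7.25", "country": "de"},
]


def to_silver(raw_rows):
    """Deduplicate, validate, and type-cast raw rows."""
    seen = set()
    clean = []
    for row in raw_rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue  # drop invalid rows and repeated order IDs
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "country": row["country"].upper(),
        })
    return clean


def to_gold(silver_rows):
    """Aggregate cleaned rows into a business-level metric: revenue per country."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

If `to_gold` turned out to contain a bug, only that function would need to be fixed and rerun against `silver`; `bronze` stays untouched as the record of what was ingested.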