2.1 Data Warehouses

2.1.1 Storage Abstractions Overview

This week covers three storage abstractions, each building on the last:

| Abstraction | Description | Key Idea |
| --- | --- | --- |
| Data Warehouse | The classic analytical store, now available as cloud-managed services | Structured data, modeled for fast queries |
| Data Lake | A central repository for raw data at any scale | Schema-on-read, supports all data types |
| Data Lakehouse | Combines warehouse query performance with lake flexibility | Open table formats enable ACID on object storage |

2.1.2 Data Warehouse Concepts

Bill Inmon's definition: "A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions."


Key Structural Properties

| Property | Meaning |
| --- | --- |
| Subject-oriented | Organized around business domains (sales, inventory, customers) rather than applications |
| Integrated | Data from multiple source systems is cleaned, standardized, and consolidated into a single consistent format |
| Nonvolatile | Once loaded, data is not modified or deleted - historical records are preserved for auditing and trend analysis |
| Time-variant | Every record carries a time dimension, enabling analysis across different time periods |
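The nonvolatile and time-variant properties can be made concrete with a small sketch. Below is a hypothetical append-only customer dimension where each row carries a validity window; the table names, fields, and values are illustrative, not from any real system:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch: a nonvolatile, time-variant customer dimension.
# Rows are only ever appended (old rows are kept, never updated in place),
# so the warehouse can answer "what was this customer's tier on date X?".

@dataclass(frozen=True)
class CustomerRow:
    customer_id: int
    tier: str
    valid_from: date
    valid_to: date  # open-ended rows use date.max

history = [
    CustomerRow(42, "bronze", date(2023, 1, 1), date(2023, 6, 30)),
    CustomerRow(42, "gold",   date(2023, 7, 1), date.max),  # appended; old row preserved
]

def tier_as_of(rows, customer_id, as_of):
    """Return the tier that was valid for customer_id on the given date."""
    for r in rows:
        if r.customer_id == customer_id and r.valid_from <= as_of <= r.valid_to:
            return r.tier
    return None

print(tier_as_of(history, 42, date(2023, 3, 15)))  # bronze
print(tier_as_of(history, 42, date(2024, 1, 1)))   # gold
```

Because history is preserved rather than overwritten, the same query run with different `as_of` dates reconstructs different points in time, which is what enables trend analysis and auditing.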

How Data Flows Into a Warehouse

Source systems (CRM, ERP, flat files) feed data through an ETL or ELT process. Inside the warehouse, data moves through a staging area (raw landing zone), gets modeled into dimension and fact tables (star or snowflake schema), and is then served to downstream consumers through data marts - subsets of the warehouse tailored to specific business units.
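The staging-to-star-schema step above can be sketched in a few lines. This is a hypothetical, minimal example: the raw rows, the surrogate-key scheme, and all column names are invented for illustration:

```python
# Hypothetical sketch of the flow described above: raw source rows land in
# a staging area, are cleaned and standardized, then split into a dimension
# table and a fact table (a minimal star schema).

staging = [  # raw landing zone: rows straight from the source system
    {"order_id": 1, "customer": " Alice ", "amount": "19.99", "date": "2024-01-05"},
    {"order_id": 2, "customer": "bob",     "amount": "5.00",  "date": "2024-01-06"},
]

customer_keys = {}  # standardized name -> surrogate key
fact_orders = []    # one row per order, referencing the dimension by key

for row in staging:
    name = row["customer"].strip().title()  # cleanse / standardize
    if name not in customer_keys:
        customer_keys[name] = len(customer_keys) + 1  # assign surrogate key
    fact_orders.append({
        "order_id": row["order_id"],
        "customer_key": customer_keys[name],   # foreign key into the dimension
        "amount": float(row["amount"]),        # cast to a proper numeric type
        "order_date": row["date"],
    })

dim_customer = [{"customer_key": k, "name": n} for n, k in customer_keys.items()]
```

A data mart would then expose a filtered subset of these modeled tables (e.g. only the sales facts) to one business unit.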

[Figure: Data warehouse architecture showing ETL flow from sources through staging to modeled tables]

Separation of Storage and Compute

Traditional on-premises warehouses coupled storage and compute on the same hardware. Cloud warehouses decouple them - storage lives on cheap object storage (S3, GCS) while compute scales independently via on-demand clusters. This means you can store petabytes affordably and pay for compute only while queries are running.

2.1.3 Modern Cloud Data Warehouses and Redshift

Modern cloud data warehouses implement MPP (massively parallel processing) and can scale clusters dynamically based on workload.

| Warehouse | Provider | Key Differentiator |
| --- | --- | --- |
| Amazon Redshift | AWS | Deep AWS integration, Redshift Spectrum for S3 queries |
| Google BigQuery | GCP | Serverless, automatic scaling, slot-based pricing |
| Snowflake | Multi-cloud | Virtual warehouses, time travel, data sharing |

Common characteristics:

  • ELT pattern - raw data is loaded first, then transformed in place using MPP compute
  • Columnar storage - optimized for analytical queries that scan specific columns across many rows
  • Separation of compute and storage - allows independent scaling to optimize cost and performance
  • Highly structured data - modeled to support analytical queries with predefined schemas
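The columnar-storage point is worth seeing side by side. Here is a minimal sketch (invented data, plain Python lists standing in for on-disk layouts) contrasting a row-oriented and a column-oriented layout for the same table:

```python
# Minimal sketch contrasting row-oriented and columnar layouts.
# An aggregate over one column in a column store touches only that
# column's values; a row store must read every field of every row.

rows = [  # row-oriented: all fields of each record stored together
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 2, "region": "US", "amount": 25.0},
    {"order_id": 3, "region": "EU", "amount": 7.5},
]

columns = {  # columnar: each column stored contiguously
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [10.0, 25.0, 7.5],
}

# SELECT SUM(amount): the columnar scan reads one array and nothing else.
total_row_store = sum(r["amount"] for r in rows)  # touches every field
total_columnar  = sum(columns["amount"])          # touches one column
assert total_row_store == total_columnar == 42.5
```

For analytical queries that scan a few columns across millions of rows, skipping the untouched columns is where the speedup (and the compression benefit, since similar values sit together) comes from.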

Amazon Redshift MPP Architecture

When a client application sends a query, the leader node parses it, creates an execution plan, compiles the code, and distributes it to the compute nodes. Each compute node is divided into slices - independent processing units that each hold a portion of the data. Slices execute query fragments in parallel, then the leader node aggregates the results and returns them to the client.

[Figure: Redshift MPP architecture with leader node distributing queries to compute node slices]

Key Redshift concepts:

  • Leader node - receives client queries, builds execution plans, coordinates compute nodes, and aggregates final results
  • Compute nodes - execute query fragments against their local data slices in parallel
  • Node slices - each node is partitioned into slices based on CPU count; each slice gets its own memory and disk allocation
  • Distribution styles - control how data is distributed across slices (KEY, EVEN, ALL) to minimize data movement during joins
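The leader/slice pattern and KEY distribution described above can be simulated in a short sketch. This is an illustrative model only (a thread pool standing in for compute-node slices, and Python's `hash` standing in for Redshift's distribution function), not how Redshift is implemented:

```python
# Hypothetical sketch of the MPP flow above: rows are hash-distributed
# across slices by a distribution key (KEY style), each slice computes a
# partial aggregate in parallel, and the leader merges the partials.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

N_SLICES = 4

def slice_for(key):
    """KEY distribution: rows with the same key land on the same slice."""
    return hash(key) % N_SLICES

# "Leader" distributes rows (region, amount) across slices.
slices = defaultdict(list)
rows = [("EU", 10.0), ("US", 25.0), ("EU", 7.5), ("APAC", 3.0), ("US", 4.5)]
for region, amount in rows:
    slices[slice_for(region)].append((region, amount))

def run_fragment(local_rows):
    """Each slice runs SUM(amount) GROUP BY region over its local data only."""
    partial = defaultdict(float)
    for region, amount in local_rows:
        partial[region] += amount
    return partial

with ThreadPoolExecutor(max_workers=N_SLICES) as pool:
    partials = list(pool.map(run_fragment, slices.values()))

# Leader merges the partial results into the final answer.
result = defaultdict(float)
for partial in partials:
    for region, total in partial.items():
        result[region] += total

print(dict(result))  # {'EU': 17.5, 'US': 29.5, 'APAC': 3.0} (key order may vary)
```

Note why the distribution style matters: because KEY distribution puts all rows for a given region on the same slice, each group is aggregated entirely locally and the leader's merge is trivial; with EVEN distribution, partial results for the same group would arrive from several slices.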