2.1 Data Warehouses
2.1.1 Storage Abstractions Overview
This week covers three storage abstractions, each building on the last:
| Abstraction | Description | Key Idea |
|---|---|---|
| Data Warehouse | The classic analytical store, now available as cloud-managed services | Structured data, modeled for fast queries |
| Data Lake | A central repository for raw data at any scale | Schema-on-read, supports all data types |
| Data Lakehouse | Combines warehouse query performance with lake flexibility | Open table formats enable ACID on object storage |
2.1.2 Data Warehouse Concepts
Bill Inmonโs definition: โA subject-oriented, integrated, nonvolatile, time-variant collection of data in support of managementโs decisions.โ
Key Structural Properties
| Property | Meaning |
|---|---|
| Subject-oriented | Organized around business domains (sales, inventory, customers) rather than applications |
| Integrated | Data from multiple source systems is cleaned, standardized, and consolidated into a single consistent format |
| Nonvolatile | Once loaded, data is not modified or deleted - historical records are preserved for auditing and trend analysis |
| Time-variant | Every record carries a time dimension, enabling analysis across different time periods |
How Data Flows Into a Warehouse
Source systems (CRM, ERP, flat files) feed data through an ETL or ELT process. Inside the warehouse, data moves through a staging area (raw landing zone), gets modeled into dimension and fact tables (star or snowflake schema), and is then served to downstream consumers through data marts - subsets of the warehouse tailored to specific business units.
Separation of Storage and Compute
Traditional on-premise warehouses coupled storage and compute on the same hardware. Cloud warehouses decouple them - storage lives on cheap object storage (S3, GCS) while compute scales independently via on-demand clusters. This means you can store petabytes affordably and only pay for compute when queries are running.
2.1.3 Modern Cloud Data Warehouses and Redshift
Modern cloud data warehouses implement MPP (massively parallel processing) and can scale clusters dynamically based on workload.
| Warehouse | Provider | Key Differentiator |
|---|---|---|
Amazon Redshift | AWS | Deep AWS integration, Redshift Spectrum for S3 queries |
Google BigQuery | GCP | Serverless, automatic scaling, slot-based pricing |
Snowflake | Multi-cloud | Virtual warehouses, time travel, data sharing |
Common characteristics:
- ELT pattern - raw data is loaded first, then transformed in place using MPP compute
- Columnar storage - optimized for analytical queries that scan specific columns across many rows
- Separation of compute and storage - allows independent scaling to optimize cost and performance
- Highly structured data - modeled to support analytical queries with predefined schemas
Amazon Redshift MPP Architecture
When a client application sends a query, the leader node parses it, creates an execution plan, compiles the code, and distributes it to the compute nodes. Each compute node is divided into slices - independent processing units that each hold a portion of the data. Slices execute query fragments in parallel, then the leader node aggregates the results and returns them to the client.
Key Redshift concepts:
- Leader node - receives client queries, builds execution plans, coordinates compute nodes, and aggregates final results
- Compute nodes - execute query fragments against their local data slices in parallel
- Node slices - each node is partitioned into slices based on CPU count; each slice gets its own memory and disk allocation
- Distribution styles - control how data is distributed across slices (
KEY,EVEN,ALL) to minimize data movement during joins