1.1 Storage Fundamentals
1.1.1 Introduction and Storage Hierarchy
Storage in data engineering spans multiple layers, from physical hardware up through high-level abstractions. Understanding this hierarchy is essential for making informed decisions about cost, performance, and scalability.
| Layer | Components | Role |
|---|---|---|
| Physical Components | Magnetic disks, SSDs, RAM, CPU cache | The raw hardware that stores and retrieves bits |
| Processes | Serialization, compression, networking, CPU | Transform data between formats and move it between systems |
| Storage Systems | OLTP databases, OLAP databases, object stores, graph/vector DBs | Organize and manage data for specific access patterns |
| Storage Abstractions | Data warehouses, data lakes, data lakehouses | High-level paradigms built on top of storage systems |
Course Overview
Week 1 covers serialization and compression, physical component characteristics, row vs. column databases, graph and vector databases, cloud storage paradigms (block, object, file), and storage cost-performance trade-offs.
Week 2 focuses on choosing the right abstraction for storing data.
Week 3 digs into how queries work, how storage choices affect query performance, and techniques for optimization.
1.1.2 Raw Ingredients of Storage Systems
Every storage system is built from a combination of persistent media, volatile memory, and supporting processes.
Persistent Storage
| Medium | How it works | Characteristics |
|---|---|---|
| Magnetic disks (HDD) | Rotating platters with a read/write head that encodes binary data by flipping magnetic field directions | Data addressed by circular tracks and sectors. Latency depends on seek time and rotational delay. Cheapest per GB. |
| SSDs | Store data as electrical charge in flash memory cells | Significantly faster reads and writes than HDDs. Can be partitioned for logical separation. |
Volatile Memory
| RAM | CPU Cache | |
|---|---|---|
| Transfer speed | Up to 100 GB/s (25x faster than SSD) | Up to 1 TB/s |
| Cost | ~$3/GB (30-50x more expensive than SSD) | Built into the CPU chip |
| Use case | Stores code and data being actively processed. Volatile โ power loss means data loss. | Ultrafast data fetch (~0.1 ns). Used for browser cache, database query result cache. |
Supporting Processes
- Networking โ storage systems are typically distributed across many servers, improving read/write performance, durability, and availability.
- Serialization โ any data stored in a file, database, or sent over a network must be serialized into a portable format.
Serialization
Serialization transforms in-memory data structures (optimized for CPU) into a disk format (binary bytes) for persistent storage. De-serialization reverses the process.
Data can be serialized in row-based or column-based order. In row-based serialization, each row is stored as a contiguous sequence of bytes. In column-based serialization, all values of a single column are stored together.
Serialization Formats
| Format | Type | Layout | Key characteristics |
|---|---|---|---|
| CSV | Text | Row-based | Human-readable, no defined schema, error-prone. Adding rows/columns requires manual handling. |
| XML | Text | Row-based | Extensible markup language. Legacy format, slow to serialize and de-serialize. |
| JSON | Text | Row-based | Plain-text object serialization. The standard for data exchange over APIs. |
| Parquet | Binary | Column-based | Optimized for analytical queries and big data processing. Up to 100x faster than CSV. |
| Avro | Binary | Row-based | Schema-defined structure with schema evolution support. Good for streaming and write-heavy workloads. |
Compression
Compression algorithms reduce redundancy to make serialized data smaller. Smaller files mean faster queries and less I/O time when loading data from disk.
| Type | Behavior | Use case |
|---|---|---|
| Lossless | Data can be reconstructed bit-for-bit | Required for data engineering (gzip, bzip2, Snappy, Zstandard, LZ4) |
| Lossy | Some data is permanently discarded | Media files (MP3, JPEG) โ not used in data pipelines |
Columnar Encoding Techniques
Two encoding techniques are particularly effective for columnar data with repeated values:
-
Run-length encoding โ repeated values stored as
(value, start_index, run_length)tuples. For example,[5,5,5,5,4,4,2,2,2,2,2]becomes(5,1,4), (4,5,2), (2,7,5). -
Bit-vector encoding โ each distinct value gets a binary vector with a
1at every index where that value appears.
1.1.3 Cloud Storage Options
Cloud providers offer three main storage paradigms, each optimized for different access patterns.
| File Storage | Block Storage | Object Storage | |
|---|---|---|---|
| Structure | Directory tree with metadata (name, owner, permissions) | Fixed-size blocks on disk, each with a unique key | Flat structure โ immutable objects in a top-level container |
| Access pattern | Path-based: /home/user/file.txt | Key-based block lookup table | Key-based: s3://bucket/file.json |
| Best for | Centralized file sharing across users/hosts | OLTP systems with frequent, small read/write ops | Data lakes, OLAP, ML pipelines, cloud warehouses |
| Scaling | Built on top of block storage | Blocks distributed across multiple disks | Horizontal โ each node holds its own disk, objects sharded and replicated |
| AWS service | EFS (Elastic File System) | EBS (Elastic Block Store) | S3 (Simple Storage Service) |
| Tradeoff | Easy to manage, slower due to file hierarchy overhead | Low latency, persistent VM storage (EC2) | Not suited for small transactional workloads, immutable (update = full rewrite) |
1.1.4 Schema Evolution
As data systems grow, schemas inevitably change - new columns are added, types are widened, fields are renamed or deprecated. Schema evolution is the ability of a storage format or table to accommodate these changes without breaking existing readers or requiring a full data rewrite.
Forward vs. Backward Compatibility
| Direction | Meaning | Why It Matters |
|---|---|---|
| Backward compatible | New readers can read data written with an older schema | Ensures upgraded consumers donโt break on historical data |
| Forward compatible | Old readers can read data written with a newer schema | Allows producers to evolve without forcing all consumers to upgrade simultaneously |
Schema Evolution by Format
Different serialization formats handle evolution with varying degrees of flexibility:
| Format | Evolution Support | Approach |
|---|---|---|
| Apache Avro | Strong - supports adding, removing, and renaming fields | Schema stored alongside data; readers reconcile old and new schemas at read time using field names |
| Apache Parquet | Good - supports adding and removing columns | Column-based storage means missing columns return NULL; extra columns are ignored by old readers |
| CSV / JSON | Weak - no formal schema enforcement | Schema changes are undetected until downstream code fails on missing or unexpected fields |
Schema Evolution in Open Table Formats
Open table formats (Apache Iceberg, Delta Lake, Apache Hudi) add a metadata layer that tracks schema changes as part of the tableโs version history. This enables operations like adding a column that automatically applies to all future queries while leaving historical data untouched. Some formats also support partition evolution - changing how data is partitioned without rewriting existing files.