1.1 Storage Fundamentals

1.1.1 Introduction and Storage Hierarchy

Storage in data engineering spans multiple layers, from physical hardware up through high-level abstractions. Understanding this hierarchy is essential for making informed decisions about cost, performance, and scalability.

Storage hierarchy from physical components to abstractions
| Layer | Components | Role |
|---|---|---|
| Physical Components | Magnetic disks, SSDs, RAM, CPU cache | The raw hardware that stores and retrieves bits |
| Processes | Serialization, compression, networking, CPU | Transform data between formats and move it between systems |
| Storage Systems | OLTP databases, OLAP databases, object stores, graph/vector DBs | Organize and manage data for specific access patterns |
| Storage Abstractions | Data warehouses, data lakes, data lakehouses | High-level paradigms built on top of storage systems |

Course Overview

Week 1 covers serialization and compression, physical component characteristics, row vs. column databases, graph and vector databases, cloud storage paradigms (block, object, file), and storage cost-performance trade-offs.

Week 2 focuses on choosing the right abstraction for storing data.

Week 3 digs into how queries work, how storage choices affect query performance, and techniques for optimization.

1.1.2 Raw Ingredients of Storage Systems

Every storage system is built from a combination of persistent media, volatile memory, and supporting processes.

Persistent Storage

| Medium | How it works | Characteristics |
|---|---|---|
| Magnetic disks (HDD) | Rotating platters with a read/write head that encodes binary data by flipping magnetic field directions | Data addressed by circular tracks and sectors. Latency depends on seek time and rotational delay. Cheapest per GB. |
| SSDs | Store data as electrical charge in flash memory cells | Significantly faster reads and writes than HDDs. Can be partitioned for logical separation. |

Volatile Memory

| | RAM | CPU Cache |
|---|---|---|
| Transfer speed | Up to 100 GB/s (25x faster than SSD) | Up to 1 TB/s |
| Cost | ~$3/GB (30-50x more expensive than SSD) | Built into the CPU chip |
| Use case | Stores code and data being actively processed. Volatile - power loss means data loss. | Ultrafast data fetch (~0.1 ns). Used for browser cache, database query result cache. |

Supporting Processes

  • Networking - storage systems are typically distributed across many servers, improving read/write performance, durability, and availability.
  • Serialization - any data stored in a file or database, or sent over a network, must be serialized into a portable format.

Serialization

Serialization transforms in-memory data structures (optimized for CPU) into a disk format (binary bytes) for persistent storage. De-serialization reverses the process.

Serialization and compression flow from in-memory to storage

Data can be serialized in row-based or column-based order. In row-based serialization, each row is stored as a contiguous sequence of bytes. In column-based serialization, all values of a single column are stored together.
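The difference between the two layouts can be sketched with a small example. This is an illustrative sketch only (the records and column names are made up); it uses JSON text to make the byte order visible, whereas real columnar formats like Parquet use binary encodings.

```python
import json

# Three records of a small table with columns "id" and "city".
rows = [
    {"id": 1, "city": "Lima"},
    {"id": 2, "city": "Oslo"},
    {"id": 3, "city": "Lima"},
]

# Row-based serialization: each row is written as one contiguous unit
# (here, one JSON object per line).
row_based = "\n".join(json.dumps(r) for r in rows)

# Column-based serialization: all values of a single column are stored
# together, so a query touching only "city" reads one contiguous run.
column_based = json.dumps({
    "id": [r["id"] for r in rows],
    "city": [r["city"] for r in rows],
})

print(row_based)
print(column_based)
```

Note how the column-based layout groups the repeated "Lima" values together, which is exactly what makes the encoding techniques described later (run-length, bit-vector) effective.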

Serialization Formats

| Format | Type | Layout | Key characteristics |
|---|---|---|---|
| CSV | Text | Row-based | Human-readable, no defined schema, error-prone. Adding rows/columns requires manual handling. |
| XML | Text | Row-based | Extensible markup language. Legacy format, slow to serialize and de-serialize. |
| JSON | Text | Row-based | Plain-text object serialization. The standard for data exchange over APIs. |
| Parquet | Binary | Column-based | Optimized for analytical queries and big data processing. Up to 100x faster than CSV. |
| Avro | Binary | Row-based | Schema-defined structure with schema evolution support. Good for streaming and write-heavy workloads. |

Compression

Compression algorithms reduce redundancy to make serialized data smaller. Smaller files mean faster queries and less I/O time when loading data from disk.

| Type | Behavior | Use case |
|---|---|---|
| Lossless | Data can be reconstructed bit-for-bit | Required for data engineering (gzip, bzip2, Snappy, Zstandard, LZ4) |
| Lossy | Some data is permanently discarded | Media files (MP3, JPEG) - not used in data pipelines |
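A quick way to see the lossless round-trip is with Python's standard-library `zlib` (the same DEFLATE algorithm used by gzip); the sample data here is invented for illustration:

```python
import zlib

# Serialized data with heavy redundancy, as log or event files often have.
data = b"2024-01-01,click,user_42\n" * 1000

compressed = zlib.compress(data, level=6)
restored = zlib.decompress(compressed)

# Lossless: the original bytes are reconstructed bit-for-bit.
assert restored == data
print(f"{len(data)} bytes -> {len(compressed)} bytes")
```

The more repetition in the input, the better the ratio, which is one reason columnar layouts (which group similar values together) compress so well.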

Columnar Encoding Techniques

Two encoding techniques are particularly effective for columnar data with repeated values:

  • Run-length encoding - repeated values stored as (value, start_index, run_length) tuples. For example, using 1-based indices, [5,5,5,5,4,4,2,2,2,2,2] becomes (5,1,4), (4,5,2), (2,7,5).

  • Bit-vector encoding - each distinct value gets a binary vector with a 1 at every index where that value appears.
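Both techniques fit in a few lines of Python; this is a minimal sketch for the example column above (function names are my own):

```python
def run_length_encode(values):
    """Encode a list as (value, start_index, run_length) tuples, 1-indexed."""
    runs = []
    i = 0
    while i < len(values):
        j = i
        # Advance j to the end of the current run of equal values.
        while j < len(values) and values[j] == values[i]:
            j += 1
        runs.append((values[i], i + 1, j - i))
        i = j
    return runs

def bit_vector_encode(values):
    """Map each distinct value to a 0/1 vector marking where it appears."""
    return {
        v: [1 if x == v else 0 for x in values]
        for v in dict.fromkeys(values)  # distinct values, first-seen order
    }

col = [5, 5, 5, 5, 4, 4, 2, 2, 2, 2, 2]
print(run_length_encode(col))     # [(5, 1, 4), (4, 5, 2), (2, 7, 5)]
print(bit_vector_encode(col)[4])  # [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```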

1.1.3 Cloud Storage Options

Cloud providers offer three main storage paradigms, each optimized for different access patterns.

| | File Storage | Block Storage | Object Storage |
|---|---|---|---|
| Structure | Directory tree with metadata (name, owner, permissions) | Fixed-size blocks on disk, each with a unique key | Flat structure - immutable objects in a top-level container |
| Access pattern | Path-based: /home/user/file.txt | Key-based block lookup table | Key-based: s3://bucket/file.json |
| Best for | Centralized file sharing across users/hosts | OLTP systems with frequent, small read/write ops | Data lakes, OLAP, ML pipelines, cloud warehouses |
| Scaling | Built on top of block storage | Blocks distributed across multiple disks | Horizontal - each node holds its own disk, objects sharded and replicated |
| AWS service | EFS (Elastic File System) | EBS (Elastic Block Store) | S3 (Simple Storage Service) |
| Tradeoff | Easy to manage, slower due to file hierarchy overhead | Low latency, persistent VM storage (EC2) | Not suited for small transactional workloads, immutable (update = full rewrite) |
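The object-storage model (flat key space, whole-object writes) can be sketched with a toy in-memory class. This is a hypothetical illustration of the access pattern, not how any real service such as S3 is implemented:

```python
class ObjectStore:
    """Toy model of an object store's flat, key-addressed namespace."""

    def __init__(self):
        self._objects = {}  # flat mapping: full key -> bytes, no directory tree

    def put(self, key: str, data: bytes):
        # Objects are immutable: an "update" rewrites the entire object.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

store = ObjectStore()
# "raw/2024/" looks like a directory path, but it is just a key prefix
# in a flat namespace.
store.put("raw/2024/events.json", b'{"event": "click"}')
print(store.get("raw/2024/events.json"))
```

This is why appending a single record to an object means rewriting it in full, and why object stores suit large, write-once analytical files rather than small transactional updates.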

1.1.4 Schema Evolution

As data systems grow, schemas inevitably change - new columns are added, types are widened, fields are renamed or deprecated. Schema evolution is the ability of a storage format or table to accommodate these changes without breaking existing readers or requiring a full data rewrite.


Forward vs. Backward Compatibility

| Direction | Meaning | Why It Matters |
|---|---|---|
| Backward compatible | New readers can read data written with an older schema | Ensures upgraded consumers don't break on historical data |
| Forward compatible | Old readers can read data written with a newer schema | Allows producers to evolve without forcing all consumers to upgrade simultaneously |
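A common way readers achieve both properties is to supply defaults for missing fields and ignore unknown ones. The sketch below uses JSON and invented field names; Avro applies the same idea formally via schema resolution:

```python
import json

# A v2 reader's view of the schema; "email" was added in v2.
# (Field names and defaults here are illustrative, not from a real schema.)
SCHEMA_V2_DEFAULTS = {"id": None, "name": "", "email": ""}

def read_user(raw: str) -> dict:
    record = json.loads(raw)
    # Backward compatible: v1 data (no "email") still parses, via a default.
    # Forward compatible: fields written by newer producers are ignored.
    return {field: record.get(field, default)
            for field, default in SCHEMA_V2_DEFAULTS.items()}

old_record = '{"id": 1, "name": "Ada"}'                                   # v1 writer
new_record = '{"id": 2, "name": "Lin", "email": "l@x.io", "phone": "5"}'  # v3 writer

print(read_user(old_record))  # "email" filled with the default
print(read_user(new_record))  # unknown "phone" dropped
```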

Schema Evolution by Format

Different serialization formats handle evolution with varying degrees of flexibility:

| Format | Evolution Support | Approach |
|---|---|---|
| Apache Avro | Strong - supports adding, removing, and renaming fields | Schema stored alongside data; readers reconcile old and new schemas at read time using field names |
| Apache Parquet | Good - supports adding and removing columns | Column-based storage means missing columns return NULL; extra columns are ignored by old readers |
| CSV / JSON | Weak - no formal schema enforcement | Schema changes are undetected until downstream code fails on missing or unexpected fields |

Schema Evolution in Open Table Formats

Open table formats (Apache Iceberg, Delta Lake, Apache Hudi) add a metadata layer that tracks schema changes as part of the table's version history. This enables operations like adding a column that automatically applies to all future queries while leaving historical data untouched. Some formats also support partition evolution - changing how data is partitioned without rewriting existing files.