1.2 Storage Tiers and Distributed Systems
1.2.1 Storage Tiers
Cloud providers organize storage into tiers that trade off access speed against cost. Frequently accessed data lives in hot tiers with low latency, while rarely accessed or archival data moves to cold tiers at a fraction of the price.
| | Hot Storage | Warm Storage | Cold Storage |
|---|---|---|---|
| Access frequency | Very frequent | Less frequent | Infrequent / archive |
| Example use case | Product recommendation app | Regular reports and analyses | Compliance archives, backups |
| Storage medium | SSD & memory | Magnetic disks or hybrid systems | Low-cost magnetic disks |
| Storage cost | High | Medium | Low |
| Retrieval cost | Low | Medium | High |
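The storage-versus-retrieval tradeoff in the table can be made concrete with a small cost comparison. The sketch below is illustrative only: the `Tier` class, the `TIERS` list, and every price in it are hypothetical values chosen to mirror the table's high/medium/low pattern, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    storage_per_gb: float    # hypothetical $ per GB stored per month
    retrieval_per_gb: float  # hypothetical $ per GB retrieved


# Assumed prices for illustration: storage gets cheaper and
# retrieval gets more expensive as the tier gets colder.
TIERS = [
    Tier("hot", storage_per_gb=0.023, retrieval_per_gb=0.001),
    Tier("warm", storage_per_gb=0.0125, retrieval_per_gb=0.01),
    Tier("cold", storage_per_gb=0.004, retrieval_per_gb=0.03),
]


def monthly_cost(tier: Tier, stored_gb: float, retrieved_gb: float) -> float:
    """Total monthly cost: storage charge plus retrieval charge."""
    return tier.storage_per_gb * stored_gb + tier.retrieval_per_gb * retrieved_gb


def cheapest_tier(stored_gb: float, retrieved_gb: float) -> Tier:
    """Pick the tier with the lowest total cost for this access pattern."""
    return min(TIERS, key=lambda t: monthly_cost(t, stored_gb, retrieved_gb))
```

Under these assumed prices, a dataset that is never read back lands in the cold tier, while one that is read several times over each month is cheapest in the hot tier, because retrieval charges dominate the bill.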
AWS S3 Storage Classes
AWS S3 maps the hot/warm/cold model to specific storage classes, each with different pricing and retrieval characteristics:
| Tier | S3 Class | Notes |
|---|---|---|
| Hot | S3 Express One Zone | Single-AZ, lowest latency |
| Hot | S3 Standard | Multi-AZ, general purpose |
| Warm | S3 Standard-IA | Infrequent access, multi-AZ |
| Warm | S3 One Zone-IA | Infrequent access, single-AZ (cheaper) |
| Cold | S3 Glacier Instant Retrieval | Archive with millisecond retrieval |
| Cold | S3 Glacier Flexible Retrieval | Archive with minutes-to-hours retrieval |
| Cold | S3 Glacier Deep Archive | Lowest cost, retrieval within 12 to 48 hours |
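Objects are usually moved between these classes by a bucket lifecycle rule rather than by hand. The sketch below builds such a rule as a plain dictionary. The helper name `build_lifecycle_config`, the prefix, and the day thresholds are assumptions for illustration; the `StorageClass` strings are the identifiers S3 uses for Standard-IA, Glacier Flexible Retrieval, and Glacier Deep Archive.

```python
def build_lifecycle_config(prefix: str,
                           warm_after_days: int = 30,
                           cold_after_days: int = 90,
                           deep_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that tiers objects down as they age."""
    if not (warm_after_days < cold_after_days < deep_after_days):
        raise ValueError("transition days must be strictly increasing")
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix.strip('/') or 'all'}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # Hot -> warm
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
                    # Warm -> cold (Glacier Flexible Retrieval)
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                    # Cold -> deepest archive
                    {"Days": deep_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_config("logs/")
```

With boto3 installed and credentials configured, a dictionary of this shape could be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call; the bucket name would be your own.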
1.2.2 Distributed Storage Systems
Distributed storage systems spread data across multiple nodes, each a machine with its own magnetic disks or SSDs, and replicate that data for fault tolerance.
How Distributed Storage Works
Each node has its own processing capabilities to handle data management, replication, and access control. A group of nodes forms a cluster, which provides:
- Fault tolerance — data survives individual node failures
- High availability — the system remains accessible during maintenance or outages
- Parallel I/O — reads and writes spread across nodes for higher throughput
- Horizontal scaling — add more nodes rather than upgrading a single machine
The cluster's raw storage capacity is the sum of the capacity of its individual nodes; usable capacity is lower when data is replicated, because every copy consumes space. Many technologies rely on distributed storage, including HDFS, S3, Cassandra, and Kafka.
Data Distribution Strategies
There are two primary strategies for distributing data across nodes:
| Strategy | How it works | Tradeoff |
|---|---|---|
| Replication | Copy the same data to multiple nodes | Higher durability and read throughput, but more storage cost and write overhead |
| Partitioning (sharding) | Split data into disjoint subsets across nodes | Better write scalability, but queries spanning partitions are more complex |
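Many systems combine the two strategies: data is first partitioned, and each partition is then replicated. A minimal sketch of that idea follows, assuming hash-based partitioning and a simple "next N nodes" replica placement. The function names and the placement scheme are hypothetical and do not reproduce the algorithm of any particular system.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replica_nodes(partition: int, nodes: list, replication_factor: int) -> list:
    """Choose consecutive nodes to hold copies of a partition (replication)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    start = partition % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def place(key: str, nodes: list, num_partitions: int, replication_factor: int) -> list:
    """Return the nodes that should store the given key."""
    return replica_nodes(partition_for(key, num_partitions), nodes, replication_factor)


nodes = ["node-0", "node-1", "node-2", "node-3"]
owners = place("user:42", nodes, num_partitions=16, replication_factor=3)
```

Each key is stored on three distinct nodes, so it survives the loss of any two of them, while different keys spread across the cluster and share the write load.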
CAP Theorem
The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously:
| Property | Meaning |
|---|---|
| Consistency | Every read returns the most recent write |
| Availability | Every request receives a response (even if not the latest data) |
| Partition tolerance | The system continues operating despite network partitions between nodes |
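In practice, network partitions are unavoidable in distributed systems, so the real choice during a partition is between consistency (CP systems such as HBase and MongoDB) and availability (AP systems such as Cassandra and DynamoDB). These labels describe typical default behavior; several of these systems expose tunable consistency settings.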
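The difference between the two choices shows up only while a partition is in effect. The toy model below is a deliberately simplified sketch, with hypothetical class names and a two-replica store. Real CP systems typically keep the majority side writable instead of rejecting all writes, and real AP systems reconcile diverged replicas after the partition heals.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.value = None


class TwoReplicaStore:
    """Toy store with two replicas and a switchable network partition."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica: Replica, value) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: both replicas receive the write.
            self.a.value = value
            self.b.value = value
            return True
        if self.mode == "CP":
            # Consistency first: refuse a write that cannot reach both replicas.
            return False
        # Availability first: accept locally and let the replicas diverge.
        replica.value = value
        return True

    def read(self, replica: Replica):
        return replica.value
```

In CP mode a write during the partition is refused, so both replicas keep returning the same value. In AP mode the write succeeds on one side, and a read from the other side returns stale data until the replicas are reconciled.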