1.2 Storage Tiers and Distributed Systems

1.2.1 Storage Tiers

Cloud providers organize storage into tiers that trade off access speed against cost. Frequently accessed data lives in hot tiers with low latency, while rarely accessed or archival data moves to cold tiers at a fraction of the price.

                    Hot Storage                  Warm Storage                       Cold Storage
Access frequency    Very frequent                Less frequent                      Infrequent / archive
Example use case    Product recommendation app   Regular reports and analyses       Compliance archives, backups
Storage medium      SSD & memory                 Magnetic disks or hybrid systems   Low-cost magnetic disks
Storage cost        High                         Medium                             Low
Retrieval cost      Low                          Medium                             High
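The table above boils down to a decision rule on access frequency. A minimal sketch of such a rule (the function name and thresholds are made up for illustration, not part of any provider's API; real tier choices should weigh actual storage and retrieval prices):

```python
def choose_tier(accesses_per_month: int) -> str:
    """Pick a storage tier from expected access frequency.

    Thresholds are illustrative, not provider-defined; tune them
    against your workload's storage vs. retrieval costs.
    """
    if accesses_per_month >= 100:   # served constantly: pay for speed
        return "hot"
    if accesses_per_month >= 1:     # touched occasionally
        return "warm"
    return "cold"                   # archival: pay per retrieval instead

print(choose_tier(5000))  # hot
print(choose_tier(4))     # warm
print(choose_tier(0))     # cold
```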

AWS S3 Storage Classes

AWS S3 maps the hot/warm/cold model to specific storage classes, each with different pricing and retrieval characteristics:

[Figure: AWS S3 storage tiers from hot to cold]
Tier   S3 Class                        Notes
Hot    S3 Express One Zone             Single-AZ, lowest latency
Hot    S3 Standard                     Multi-AZ, general purpose
Warm   S3 Standard-IA                  Infrequent access, multi-AZ
Warm   S3 One Zone-IA                  Infrequent access, single-AZ (cheaper)
Cold   S3 Glacier Instant Retrieval    Archive with millisecond retrieval
Cold   S3 Glacier Flexible Retrieval   Archive with minutes-to-hours retrieval
Cold   S3 Glacier Deep Archive         Lowest cost, 12-48 hour retrieval
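In practice, objects are usually demoted down these classes automatically via S3 lifecycle rules rather than by hand. A sketch of such a rule as a Python dict (the rule ID, key prefix, and day counts are placeholders; with boto3, a dict of this shape is what `put_bucket_lifecycle_configuration` accepts):

```python
# Sketch of an S3 lifecycle configuration that demotes objects through
# the tiers as they age. Day counts and the "logs/" prefix are
# hypothetical; adjust to your retention policy.
lifecycle = {
    "Rules": [
        {
            "ID": "demote-with-age",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},    # warm
                {"Days": 90, "StorageClass": "GLACIER"},        # cold
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # coldest
            ],
        }
    ]
}
```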

1.2.2 Distributed Storage Systems

Distributed storage systems spread data across multiple nodes (magnetic disks or SSDs), replicating it for fault tolerance.

How Distributed Storage Works

Each node has its own processing capabilities to handle data management, replication, and access control. A group of nodes forms a cluster, which provides:

  • Fault tolerance — data survives individual node failures
  • High availability — the system remains accessible during maintenance or outages
  • Parallel I/O — reads and writes spread across nodes for higher throughput
  • Horizontal scaling — add more nodes rather than upgrading a single machine

The total storage capacity is the sum of storage across all individual nodes. Many technologies rely on distributed storage, including HDFS, S3, Cassandra, and Kafka.
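Note that raw capacity sums across nodes, but replication divides what is actually usable. A quick sketch of that arithmetic (node sizes and the helper name are made up; the 3-copy default matches HDFS's default replication factor):

```python
def usable_capacity_tb(node_capacities_tb, replication_factor=3):
    """Raw capacity is the sum over all nodes; with N-way replication,
    usable capacity shrinks by that factor (HDFS defaults to 3 copies)."""
    raw = sum(node_capacities_tb)
    return raw / replication_factor

nodes = [10, 10, 10, 10]            # four hypothetical 10 TB nodes
print(sum(nodes))                   # 40 TB raw
print(usable_capacity_tb(nodes))    # ~13.3 TB usable with 3 copies
```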


Data Distribution Strategies

There are two primary strategies for distributing data across nodes:

  • Replication: copy the same data to multiple nodes. Higher durability and read throughput, but more storage cost and write overhead.
  • Partitioning (sharding): split data into disjoint subsets across nodes. Better write scalability, but queries spanning partitions are more complex.

[Figure: Replication vs partitioning data distribution strategies]
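The two strategies compose: partitioning decides a key's primary node, and replication places extra copies on further nodes. A minimal sketch under assumed names (`NODES`, `partition`, `replicate` are all hypothetical; real systems use richer schemes such as consistent hashing):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def partition(key: str) -> str:
    """Hash-partition a key onto one node. md5 keeps the mapping stable
    across runs (Python's built-in hash() is salted per process)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def replicate(key: str, copies: int = 2) -> list[str]:
    """Replication on top of partitioning: the primary node plus the
    next (copies - 1) nodes in ring order hold the same data."""
    start = NODES.index(partition(key))
    return [NODES[(start + i) % len(NODES)] for i in range(copies)]
```

With this scheme, losing one node leaves every key readable from its second replica, at the cost of writing each value twice.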

CAP Theorem

The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously:

Property              Meaning
Consistency           Every read returns the most recent write
Availability          Every request receives a response (even if not the latest data)
Partition tolerance   The system continues operating despite network partitions between nodes

In practice, network partitions are unavoidable in distributed systems, so the real choice is between consistency (CP systems like HBase, MongoDB) and availability (AP systems like Cassandra, DynamoDB).

[Figure: CAP theorem triangle showing consistency, availability, and partition tolerance tradeoffs]
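The CP-versus-AP choice can be made concrete with a toy model: during a partition, a CP replica refuses to answer, while an AP replica answers with whatever it has. A sketch (class and attribute names are invented for illustration; no real database works exactly like this):

```python
class Replica:
    """Toy replica illustrating the CP vs AP choice during a partition."""

    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"         # last value this replica has seen
        self.partitioned = False  # cut off from the other replicas?

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # CP systems refuse to answer rather than risk returning
            # stale data: availability is sacrificed.
            raise TimeoutError("cannot confirm latest write")
        # AP systems always answer, possibly with a stale value:
        # consistency is sacrificed.
        return self.value

ap = Replica("AP")
ap.partitioned = True
print(ap.read())  # "v1" -- possibly stale, but a response

cp = Replica("CP")
cp.partitioned = True
# cp.read() would raise TimeoutError until the partition heals
```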