3.2 Hadoop - DE Specialization Notes

3.2 Hadoop

3.2.1 Hadoop Background and HDFS

Apache Hadoop

Hadoop emerged in the mid-2000s, inspired by Google’s GFS (2003) and MapReduce (2004) papers, when traditional monolithic data warehouses could not handle massive data volumes.

Hadoop Distributed File System (HDFS)

HDFS combines compute and storage on the same nodes, unlike object storage which has limited compute support for internal processing. Large files are broken into blocks of a few hundred megabytes each.

Component	Role
NameNode	Manages directories, file metadata, and a catalog describing where file blocks reside in the cluster
DataNodes	Store the actual data blocks - each block is replicated across 3 DataNodes for durability and availability

Key properties:

Replication - each block stored on 3 nodes increases durability and availability
Combined compute and storage - enables in-place data processing via MapReduce

3.2.2 Hadoop MapReduce

MapReduce sends computation code to the nodes that contain the data, favoring data locality rather than moving data to the application. The computation has three phases:

Phase	Action
Map	Read individual data blocks inside DataNodes and produce key-value pairs
Shuffle	Redistribute results across the cluster so each DataNode holds unique keys
Reduce	Aggregate data on each node into final results

For example, counting user events: the Map phase scans log blocks and emits (user_id, 1) pairs, the Shuffle phase groups all pairs by user_id, and the Reduce phase sums the counts per user.

MapReduce flow: Map emits key-value pairs, Shuffle groups by key, Reduce aggregates

Shortcomings of MapReduce

MapReduce writes to disk at every intermediate step - never to memory. This simplifies state management and minimizes memory consumption, but results in high disk bandwidth utilization and slow processing for iterative algorithms.

Spark improved on this with in-memory caching. RAM is faster than SSD/HDD in both transfer speed and seek time, delivering dramatic speedups. Spark treats data as a distributed set that resides in memory, using disk only as a fallback when data overflows available memory.