1.1 How to Think Like a Data Engineer

1.1.1 The Data Engineering Lifecycle

Data Engineering Definition (from Fundamentals of Data Engineering):

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.

Data Pipeline Definition

The combination of architecture, systems, and processes that move data through the data engineering lifecycle.

The Job of the Data Engineer

At its core, the data engineer’s job is to get raw data from somewhere, transform it into something useful, and make it available to downstream consumers.
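
The three steps in that sentence (get raw data, transform it, make it available) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sample CSV, the column names, and the function names `extract`, `transform`, and `serve` are hypothetical, and a real pipeline would read from an actual source system and write to a real destination.

```python
import csv
import io
import json

# Hypothetical raw input; in practice this would come from a source system.
RAW_CSV = """order_id,amount,currency
1001, 19.99 ,usd
1002,,usd
1003,5.00,USD
"""


def extract(raw_text):
    """Read raw CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Drop rows with no amount, then normalize types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(amount),
            "currency": row["currency"].strip().upper(),
        })
    return cleaned


def serve(rows):
    """Serialize cleaned rows as JSON lines for a downstream consumer."""
    return "\n".join(json.dumps(row) for row in rows)


if __name__ == "__main__":
    print(serve(transform(extract(RAW_CSV))))
```

Keeping each stage as its own function is one possible design choice: it lets each step be tested and replaced independently, for example swapping the in-memory CSV for a database query without touching the transform logic.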

1.1.2 The History of Data and Data Engineering

Data engineering didn’t appear overnight; it evolved alongside decades of database and computing innovation.

1960s - IMS (Information Management System) was IBM’s hierarchical database, one of the first to manage data at scale. CODASYL defined the network database model, an early standard for how programs interact with databases.
1970s - Edgar Codd’s relational model (1970) reshaped how databases were designed. Oracle (first released in 1979) and, in the early 1980s, IBM DB2 turned that theory into commercial products, and SQL became the standard query language that still dominates today.
1980s - Teradata pioneered massively parallel data warehouses, enabling analytics on large datasets.
1990s - Ralph Kimball and Bill Inmon defined competing approaches to data warehouse design (Kimball’s dimensional modeling vs. Inmon’s normalized enterprise data warehouse). OLAP (Online Analytical Processing) enabled multidimensional analysis. Informatica, founded in 1993, commercialized ETL tooling to move and transform data between systems. The dot-com boom drove massive investment in web backends and databases.
2000s - Google’s papers on GFS (Google File System) and MapReduce laid the foundation for distributed computing. Hadoop made those ideas open source, kicking off the Big Data era.
2010s - Spark largely displaced MapReduce with faster in-memory processing. Kafka enabled real-time event streaming. Redshift made cloud data warehousing widely accessible.
Today - dbt brought software engineering practices to SQL transformations. Snowflake decoupled storage from compute in the cloud warehouse. Airflow became a de facto standard for pipeline orchestration. The “Big Data Engineer” role has matured into the modern “Data Engineer”, focused on building with existing tools rather than inventing infrastructure from scratch.

1.1.3 The Data Engineer Among Other Stakeholders

Data engineers sit between the teams that produce data and the teams that consume it.

Communication with upstream stakeholders matters for understanding the data you’re ingesting and catching anything that might disrupt the pipeline. Communication with downstream stakeholders matters for understanding how the data you serve relates to their goals and where it adds business value.

1.1.4 Business Value

The most important question for a data engineer: how does your work add value to the organization? Don’t get hung up on every new technology; focus on what drives business outcomes.