1.1 How to Think Like a Data Engineer
1.1.1 The Data Engineering Lifecycle
Data Engineering Definition (from Fundamentals of Data Engineering):
Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.
Data Pipeline Definition
The combination of architecture, systems, and processes that move data through the data engineering lifecycle.
The Job of the Data Engineer
At its core, the data engineer's job is to get raw data from somewhere, transform it into something useful, and make it available to downstream consumers.
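That loop can be sketched in a few lines. This is a minimal illustration, not a production pattern; the file names, fields, and sample data are all hypothetical:

```python
import csv
import json

# Hypothetical sample input standing in for a real upstream source system.
with open("raw_users.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "email"])
    writer.writerow(["1", "  Alice@Example.COM "])
    writer.writerow(["2", "bob@example.com"])

def extract(path):
    """Get raw data from somewhere (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Turn raw rows into consistent, typed records."""
    return [
        {"user_id": int(r["user_id"]), "email": r["email"].strip().lower()}
        for r in rows
    ]

def load(records, path):
    """Make the result available to downstream consumers (here, a JSON file)."""
    with open(path, "w") as f:
        json.dump(records, f)

load(transform(extract("raw_users.csv")), "clean_users.json")
```

Real pipelines swap the CSV and JSON files for databases, object stores, or streams, but the extract-transform-serve shape stays the same.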
1.1.2 The History of Data and Data Engineering
Data engineering didn't appear overnight - it evolved alongside decades of database and computing innovation.
- 1960s - IMS (Information Management System) was IBM's hierarchical database, one of the first to manage data at scale. CODASYL defined the network database model, an early standard for how programs interact with databases.
- 1970s - Edgar Codd's relational model changed everything. Oracle and IBM DB2 turned that theory into commercial products, and SQL became the standard query language that still dominates today.
- 1980s - Teradata pioneered massively parallel data warehouses, enabling analytics on large datasets. Informatica introduced ETL tooling to move and transform data between systems.
- 1990s - Ralph Kimball and Bill Inmon defined competing approaches to data warehouse design (dimensional vs. enterprise). OLAP (Online Analytical Processing) enabled multidimensional analysis. The dot-com boom drove massive investment in web backends and databases.
- 2000s - Google's papers on GFS (Google File System) and MapReduce laid the foundation for distributed computing. Hadoop made those ideas open source, kicking off the Big Data era.
- 2010s - Spark replaced MapReduce with faster in-memory processing. Kafka enabled real-time event streaming. Redshift brought cloud-native data warehousing to the masses.
- Today - dbt brought software engineering practices to SQL transformations. Snowflake decoupled storage from compute in the cloud warehouse. Airflow became the standard for pipeline orchestration. The "Big Data Engineer" role has matured into the modern "Data Engineer" - focused on building with existing tools rather than inventing infrastructure from scratch.
1.1.3 The Data Engineer Among Other Stakeholders
Data engineers sit between the teams that produce data and the teams that consume it.
Communication with upstream stakeholders matters for understanding the data you're ingesting and catching anything that might disrupt the pipeline. Communication with downstream stakeholders matters for understanding how the data you serve relates to their goals and where it adds business value.
1.1.4 Business Value
The most important question for a data engineer: how does your work add value to the organization? Don't get hung up on every new technology - focus on what drives business outcomes.
1.1.5 Gathering System Requirements
Once you understand the business need, translate it into system requirements. These fall into two categories: functional requirements (what the system must do - which data it ingests, how it transforms it, and what it serves) and nonfunctional requirements (how well it must do it - latency, reliability, scalability). Cost and security constraints should be factored in from the start.
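One way to keep the two categories - and the cost and security constraints - visible from the start is to capture them in a single structured record. A sketch, with purely illustrative field names and values:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineRequirements:
    # Functional requirements: what the system must do with the data.
    sources: list
    transformations: list
    consumers: list
    # Nonfunctional requirements: how well it must do it.
    freshness_sla_minutes: int
    # Constraints factored in from the start, not bolted on later.
    max_monthly_cost_usd: float
    pii_fields_to_mask: list = field(default_factory=list)

# A hypothetical pipeline spec for an orders-reporting use case.
reqs = PipelineRequirements(
    sources=["orders_db"],
    transformations=["dedupe", "currency_normalize"],
    consumers=["finance_dashboard"],
    freshness_sla_minutes=60,
    max_monthly_cost_usd=500.0,
    pii_fields_to_mask=["email"],
)
```

Writing requirements down this concretely forces the vague parts ("fresh enough", "not too expensive") into numbers stakeholders can agree on.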
1.1.6 Translating Stakeholder Needs into Specific Requirements
Turning vague stakeholder needs into concrete requirements takes structured discovery.
1.1.7 Thinking Like a Data Engineer
Putting it all together, here is a repeatable framework for approaching any data engineering problem: