2.1 The Data Engineering Lifecycle
2.1.1 Data Generation in Source Systems
The data engineering lifecycle begins at the source. Data can originate from a wide variety of systems, and understanding those systems is the first step toward building reliable pipelines.
- Databases - relational databases, NoSQL stores (key-value, document)
- Files - text, audio, video, and other formats
- APIs - data returned as JSON or XML from programmatic requests
- Data sharing platforms - internal datasets or third-party providers
- IoT devices - fleets of sensors that typically feed into a database, API, or sharing platform
The upstream stakeholders for data generation are usually software engineers or third-party platform owners. Source systems are often unpredictable, so it is important to build relationships with source system owners and understand how the data and its schema might change over time.
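One practical way to stay ahead of source-system change is to check incoming records against the schema you last agreed on with the source owners. A minimal sketch (the field names here are hypothetical):

```python
# Minimal schema-drift check: compare the fields of an incoming record
# against the schema we expect from the source system.
# Field names are illustrative, not from any real system.

EXPECTED_FIELDS = {"order_id", "customer_id", "amount"}

def detect_drift(record: dict) -> dict:
    """Return fields the source added or dropped relative to expectations."""
    observed = set(record)
    return {
        "added": sorted(observed - EXPECTED_FIELDS),
        "removed": sorted(EXPECTED_FIELDS - observed),
    }

# Example: the source silently renamed `amount` to `total_amount`.
drift = detect_drift({"order_id": 1, "customer_id": 7, "total_amount": 9.99})
# drift == {"added": ["total_amount"], "removed": ["amount"]}
```

A check like this turns an unpredictable upstream change into an alert rather than a silently broken pipeline.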
2.1.2 Ingestion
Ingestion is the process of moving raw data from source systems into the data pipeline for further processing. The key design decision here is frequency.
Batch ingestion processes data on a predetermined time interval or once a size threshold is reached. Stream ingestion uses an event-streaming platform or message queue to provide continuous, near-real-time data availability shortly after production. Streaming adds cost, complexity, and maintenance burden, so it should only be adopted when there is a clear business use case. In practice, data engineers often decide where the boundary between batch and streaming falls.
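The two batch triggers described above - a time interval or a size threshold - can be sketched in a few lines. This is an illustrative buffer, not a production ingestor; a real pipeline would flush to durable storage rather than an in-memory list:

```python
import time

class BatchIngestor:
    """Buffer events and flush when either a size threshold or a time
    interval is reached -- the two batch triggers described above.
    Illustrative sketch only; flushes land in an in-memory list."""

    def __init__(self, max_size=100, max_age_seconds=60.0, clock=time.monotonic):
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.clock = clock
        self.buffer = []
        self.opened_at = clock()
        self.flushed = []  # stands in for durable batch storage

    def add(self, event):
        self.buffer.append(event)
        too_big = len(self.buffer) >= self.max_size
        too_old = self.clock() - self.opened_at >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []
        self.opened_at = self.clock()
```

Shrinking `max_age_seconds` toward zero pushes this design toward micro-batching, which is one concrete way the batch/streaming boundary becomes a tuning decision rather than a binary choice.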
2.1.3 Data Storage
Storage sits at every stage of the lifecycle. The raw hardware ingredients - magnetic disks, SSDs, and RAM - trade off cost, access speed, and durability in predictable ways. Process-level components such as networking, compression, serialization, and caching also influence storage performance.
Data engineers typically work with database management systems, object storage, Apache Iceberg, cache/memory-based stores, and streaming storage. These sit behind higher-level abstractions - data warehouses, data lakes, and data lakehouses - that let you configure latency, scalability, and cost to match your workload.
2.1.4 Data Transformation
Transformation is where raw data becomes something useful. It breaks down into three parts: queries, data modeling, and the transformations themselves.
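Even the simplest transformation illustrates the shape of the work: cast raw string fields to proper types, then reshape or aggregate them for downstream use. A minimal sketch with hypothetical field names:

```python
from collections import defaultdict

# Raw rows as they might arrive from ingestion: everything is a string.
raw_rows = [
    {"customer_id": "1", "amount": "9.50"},
    {"customer_id": "1", "amount": "0.50"},
    {"customer_id": "2", "amount": "3.00"},
]

def cast(row: dict) -> dict:
    """Map raw string fields into correct types."""
    return {"customer_id": int(row["customer_id"]),
            "amount": float(row["amount"])}

# Aggregate the typed rows: total spend per customer.
totals = defaultdict(float)
for row in map(cast, raw_rows):
    totals[row["customer_id"]] += row["amount"]

# totals == {1: 10.0, 2: 3.0}
```

In practice this logic usually lives in SQL or a dataframe framework rather than hand-written loops, but the cast-then-aggregate structure is the same.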
2.1.5 Serving Data
The final stage delivers data to end consumers across three main channels: analytics, machine learning, and reverse ETL.
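Reverse ETL, the least familiar of the three channels, pushes results computed in the warehouse back into operational tools. A minimal sketch, with a plain dict standing in for a hypothetical CRM that would normally be reached over HTTP (all names illustrative):

```python
# Reverse ETL sketch: push warehouse aggregates back into an
# operational system. The dict below simulates a hypothetical CRM API.

warehouse_rows = [
    {"customer_id": 1, "lifetime_value": 120.0},
    {"customer_id": 2, "lifetime_value": 45.0},
]

crm = {}  # stand-in for a third-party operational system

def sync_to_crm(rows: list) -> int:
    """Write each warehouse row into the operational store; return count."""
    for row in rows:
        crm[row["customer_id"]] = {"ltv": row["lifetime_value"]}
    return len(rows)

synced = sync_to_crm(warehouse_rows)
```

The point is the direction of flow: analytics and ML consume data out of the pipeline, while reverse ETL feeds derived data back into the systems where day-to-day work happens.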