3.1 Batch Transformation Patterns
3.1.1 Batch Transformation Overview
The transformation stage is where raw data is turned into something useful for the business. Transformations serve three broad purposes:
- Manipulate and enhance data for downstream consumers
- Leverage massively parallel processing for data modeling (star schemas, data vaults, etc.)
- Define zones for stages of transformed data (raw, cleaned, enriched)
Technical Considerations
| Factor | Batch | Streaming |
|---|---|---|
| Primary constraint | Data size, hardware specs, performance requirements | Latency requirements |
| Approach | Single machine or distributed processing framework | Event-driven or micro-batch |
| Logic | SQL or Python | SQL, Python, or framework-specific APIs |
3.1.2 ETL Patterns and Use Cases
ETL vs. ELT vs. EtLT
| Pattern | How it Works | Best For |
|---|---|---|
| ETL | Transform data before loading into the target | Legacy warehouses, limited target compute |
| ELT | Load raw data first, then transform inside the target | Cloud warehouses with strong compute (Redshift, BigQuery) |
| EtLT | Light transforms before load (cleaning), heavy transforms after (modeling) | Hybrid - clean early, model late |
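A minimal sketch of the ELT pattern, assuming the raw order data has already been loaded verbatim into a hypothetical raw.orders_raw table and that the warehouse supports CREATE OR REPLACE TABLE ... AS (as Snowflake, BigQuery, and Databricks do); the heavy transformation runs on the warehouse's own compute:

```sql
-- ELT: the raw data is already in the warehouse; transform it in place.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT
    order_id,
    customer_id,
    CAST(order_ts AS TIMESTAMP)    AS order_ts,   -- fix types on the way out of "raw"
    CAST(amount AS DECIMAL(12, 2)) AS amount,
    LOWER(TRIM(status))            AS status
FROM raw.orders_raw
WHERE order_id IS NOT NULL;
```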
Transformations for Data Modeling
Convert normalized source data into analytical models - star schemas, data vaults, one-big-table - optimized for downstream queries.
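As an illustrative sketch only (table and column names are hypothetical), normalized source tables can be reshaped into a small star schema with plain CREATE TABLE ... AS SELECT statements: descriptive attributes go to the dimension, keys and measures go to the fact table.

```sql
-- Dimension: one row per customer, holding descriptive attributes.
CREATE TABLE dim_customer AS
SELECT customer_id, name, country, signup_date
FROM raw.customers;

-- Fact: one row per order, carrying foreign keys and measures.
CREATE TABLE fct_orders AS
SELECT order_id, customer_id, order_date, amount
FROM raw.orders;
```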
Transformations for Data Cleaning (Wrangling)
| Operation | Example |
|---|---|
| Deduplication | Remove duplicate records |
| Type casting | Convert string dates to date types |
| Null handling | Impute or drop missing values |
| Standardization | Normalize formats (phone numbers, addresses) |
| Filtering | Remove invalid or irrelevant rows |
| Validation | Assert business rules (e.g., price > 0) |
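Several of these operations typically appear together in a single cleaning query. A hedged sketch, assuming a hypothetical raw.orders_raw table with a loaded_at column:

```sql
-- One pass that deduplicates, casts, imputes, standardizes, filters, and validates.
SELECT
    order_id,
    CAST(order_ts AS TIMESTAMP)  AS order_ts,    -- type casting
    COALESCE(discount, 0)        AS discount,    -- null handling (impute)
    UPPER(TRIM(currency))        AS currency,    -- standardization
    amount
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY loaded_at DESC) AS rn
    FROM raw.orders_raw
) deduped
WHERE rn = 1          -- deduplication: keep only the latest copy of each order
  AND amount > 0;     -- filtering + validation: enforce the business rule
```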
3.1.3 Data Updating and Change Data Capture
A common use case is keeping the data warehouse in sync with source systems.
Truncate and Reload
Delete all records in the target and reload from source. Works for small datasets or infrequent updates, but becomes expensive at scale.
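In SQL this is just two statements (target and source names here are placeholders):

```sql
-- Wipe the target, then repopulate it in full from the source.
TRUNCATE TABLE analytics.customers;

INSERT INTO analytics.customers
SELECT customer_id, name, country, signup_date
FROM source.customers;
```

Where the warehouse allows it, wrapping both statements in a transaction (or building the new table separately and swapping it in atomically) prevents readers from seeing an empty table mid-reload.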
Change Data Capture (CDC)
CDC identifies changes in the source system and applies only those changes to the target. Changes can be detected through a last_updated column on source tables or by reading the database transaction log, which records the type of each operation (I for insert, U for update, D for delete).
| Strategy | Behavior |
|---|---|
| Insert-only | Append new records without modifying old ones - each record carries metadata (e.g., a load timestamp or version number) so consumers can tell versions apart |
| Upsert / Merge | Match by primary key - update on match, insert on no match |
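A hedged sketch of the upsert/merge strategy, assuming a hypothetical staging.customers_changes CDC feed keyed by customer_id (MERGE syntax differs slightly across warehouses):

```sql
MERGE INTO analytics.customers AS target
USING staging.customers_changes AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        name         = source.name,
        country      = source.country,
        last_updated = source.last_updated
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, country, last_updated)
    VALUES (source.customer_id, source.name, source.country, source.last_updated);
```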
Handling Deletes
| Method | Description |
|---|---|
| Hard delete | Permanently remove the record - for performance or compliance |
| Soft delete | Mark the record as deleted with a flag column |
| Insert-only with delete flag | Append a new record with a deletion marker |
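For instance, a soft delete driven by the D records in a CDC feed might look like the following (table, column, and flag names are illustrative):

```sql
-- Soft delete: flag the rows rather than removing them.
UPDATE analytics.customers
SET is_deleted = TRUE,
    deleted_at = CURRENT_TIMESTAMP
WHERE customer_id IN (
    SELECT customer_id
    FROM staging.customers_changes
    WHERE change_type = 'D'
);
```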
Insert Performance
Single-row inserts work well for row-oriented OLTP databases but are inefficient for column-oriented OLAP systems - each tiny insert forces the engine to create or rewrite small column files, which adds load and degrades read performance. Micro-batch or batch inserts are the preferred approach for OLAP.
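A micro-batch load accumulates incoming records (for example, in a staging table or in object storage) and writes them to the OLAP target in a single statement rather than row by row; a minimal sketch with hypothetical table names:

```sql
-- One bulk statement per batch instead of thousands of single-row inserts.
INSERT INTO analytics.events
SELECT event_id, event_ts, user_id, payload
FROM staging.events_batch;
```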
3.1.4 Idempotency in Batch Pipelines
An idempotent pipeline produces the same result whether it runs once or multiple times with the same input. Idempotency is one of the most important reliability properties in data engineering - it means you can safely retry a failed run without creating duplicates, corrupting data, or producing inconsistent results.
Why It Matters
Pipelines fail for countless reasons: network timeouts, cloud service outages, schema changes, resource limits. When a pipeline fails partway through, the recovery question is critical. Without idempotency, engineers must manually inspect what succeeded and what didn’t before deciding whether to rerun. With idempotency, the answer is always: just rerun the whole thing.
Common Patterns for Achieving Idempotency
| Pattern | How It Works | Tradeoff |
|---|---|---|
| Truncate and reload | Delete all data in the target partition, then write fresh | Simple and reliable, but expensive for large datasets |
| Upsert (MERGE) | Insert new rows, update existing ones based on a key | Efficient for incremental loads, but requires a reliable unique key |
| Deterministic output paths | Write to a path derived from the input (e.g., /output/date=2025-03-15/) so reruns overwrite the same location | Works well with object storage, relies on consistent naming |
| Deduplication on read | Accept potential duplicates at write time, deduplicate downstream using ROW_NUMBER() or QUALIFY | Shifts complexity to the consumer, but decouples write reliability from correctness |
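As an example of deduplication on read, consumers can be pointed at a view that keeps only the latest load per key, so retried appends never surface as duplicates (QUALIFY is available in warehouses such as Snowflake and BigQuery; elsewhere the same filter goes in a subquery over ROW_NUMBER()):

```sql
-- Reruns may have appended duplicate rows; the view filters them out at read time.
CREATE OR REPLACE VIEW analytics.orders_latest AS
SELECT *
FROM analytics.orders_append
QUALIFY ROW_NUMBER()
    OVER (PARTITION BY order_id ORDER BY loaded_at DESC) = 1;
```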
Anti-patterns to Avoid
- Appending without deduplication - rerunning a pipeline that appends to a table without checking for existing records creates duplicates with every retry
- Using wall-clock timestamps as keys - if a pipeline reruns, the timestamp changes, making it impossible to detect duplicate records
- Non-deterministic file names - writing to randomly named files means reruns create new files instead of overwriting previous attempts