3.1 Batch Transformation Patterns

3.1.1 Batch Transformation Overview

The transformation stage is where raw data is manipulated and enhanced for downstream stakeholders. Transformations serve three purposes:

  1. Manipulate and enhance data for downstream consumers
  2. Leverage massively parallel processing for data modeling (star schemas, data vaults, etc.)
  3. Define zones for stages of transformed data (raw, cleaned, enriched)

Technical Considerations

| Factor | Batch | Streaming |
|---|---|---|
| Primary constraint | Data size, hardware specs, performance requirements | Latency requirements |
| Approach | Single machine or distributed processing framework | Event-driven or micro-batch |
| Logic | SQL or Python | SQL, Python, or framework-specific APIs |

3.1.2 ETL Patterns and Use Cases

ETL vs. ELT vs. EtLT

| Pattern | How It Works | Best For |
|---|---|---|
| ETL | Transform data before loading into the target | Legacy warehouses, limited target compute |
| ELT | Load raw data first, then transform inside the target | Cloud warehouses with strong compute (Redshift, BigQuery) |
| EtLT | Light transforms before load (cleaning), heavy transforms after (modeling) | Hybrid - clean early, model late |
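To make the ELT pattern concrete, the sketch below assumes raw data has already been loaded untouched (the "L") and runs the transform (the "T") inside the warehouse with plain SQL. The table and column names (raw.orders_raw, analytics.orders_clean) are hypothetical placeholders, not from the source.

```sql
-- ELT sketch: raw data is already loaded; the transform runs in the warehouse.
-- Table and column names are hypothetical.
CREATE TABLE analytics.orders_clean AS
SELECT
    order_id,
    customer_id,
    CAST(order_ts AS TIMESTAMP)       AS order_ts,   -- cast after load, not before
    CAST(amount AS DECIMAL(12, 2))    AS amount
FROM raw.orders_raw
WHERE order_id IS NOT NULL;
```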

Transformations for Data Modeling

Convert normalized source data into analytical models - star schemas, data vaults, one-big-table - optimized for downstream queries.
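A minimal star-schema sketch, assuming hypothetical normalized source tables (raw.customers, raw.orders); the dimension and fact names are illustrative, not prescriptive.

```sql
-- Hypothetical star-schema build: one dimension table and one fact table.
CREATE TABLE dim_customer AS
SELECT DISTINCT
    customer_id AS customer_key,
    customer_name,
    customer_region
FROM raw.customers;

CREATE TABLE fact_orders AS
SELECT
    o.order_id,
    o.customer_id              AS customer_key,   -- foreign key into dim_customer
    CAST(o.order_ts AS DATE)   AS order_date,
    o.amount
FROM raw.orders AS o;
```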


Transformations for Data Cleaning (Wrangling)

| Operation | Example |
|---|---|
| Deduplication | Remove duplicate records |
| Type casting | Convert string dates to date types |
| Null handling | Impute or drop missing values |
| Standardization | Normalize formats (phone numbers, addresses) |
| Filtering | Remove invalid or irrelevant rows |
| Validation | Assert business rules (e.g., price > 0) |
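In practice these cleaning operations often combine into a single query. The sketch below shows one possible shape over a hypothetical raw.events table; QUALIFY is available in warehouses such as Snowflake and BigQuery, and elsewhere a subquery over ROW_NUMBER() does the same job.

```sql
-- Hypothetical cleaning query: cast, impute, standardize, filter, and dedup in one pass.
SELECT
    event_id,
    CAST(event_date AS DATE)        AS event_date,   -- type casting
    COALESCE(country, 'unknown')    AS country,       -- null handling
    LOWER(TRIM(email))              AS email,         -- standardization
    price
FROM raw.events
WHERE price > 0                                       -- filtering / validation
QUALIFY ROW_NUMBER() OVER (                            -- deduplication: keep latest copy
    PARTITION BY event_id ORDER BY loaded_at DESC
) = 1;
```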

3.1.3 Data Updating and Change Data Capture

A common use case is keeping the data warehouse in sync with source systems.


Truncate and Reload

Delete all records in the target and reload from source. Works for small datasets or infrequent updates, but becomes expensive at scale.
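A minimal truncate-and-reload sketch, assuming a hypothetical staging table that already holds a fresh copy of the source:

```sql
-- Truncate and reload: wipe the target, then repopulate it from the source copy.
TRUNCATE TABLE warehouse.customers;

INSERT INTO warehouse.customers
SELECT customer_id, customer_name, customer_region, last_updated
FROM staging.customers;
```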


Change Data Capture (CDC)

CDC identifies changes in the source system and applies only those changes to the target. Changes can be detected through a last_updated column in the source or by reading the database transaction log, which records each operation (I for insert, U for update, D for delete).

| Strategy | Behavior |
|---|---|
| Insert-only | Append new records without modifying old ones - new records include metadata to distinguish versions |
| Upsert / Merge | Match by primary key - update on match, insert on no match |
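Assuming a last_updated column and a hypothetical staging table of changed rows, the upsert strategy can be sketched with a standard MERGE; exact syntax varies slightly by warehouse, and the watermark value would normally come from the previous run's state.

```sql
-- Pull only rows changed since the last run (hypothetical watermark),
-- then upsert them into the target by primary key.
MERGE INTO warehouse.customers AS tgt
USING (
    SELECT *
    FROM staging.customers
    WHERE last_updated > '2025-03-15 00:00:00'   -- watermark from the previous run
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name = src.customer_name,
    last_updated  = src.last_updated
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, last_updated)
    VALUES (src.customer_id, src.customer_name, src.last_updated);
```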

Handling Deletes

| Method | Description |
|---|---|
| Hard delete | Permanently remove the record - for performance or compliance |
| Soft delete | Mark the record as deleted with a flag column |
| Insert-only with delete flag | Append a new record with a deletion marker |
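A soft delete, for example, might look like the following sketch (table and column names hypothetical):

```sql
-- Soft delete: keep the row, mark it as deleted so queries can filter it out.
UPDATE warehouse.customers
SET is_deleted = TRUE,
    deleted_at = CURRENT_TIMESTAMP
WHERE customer_id = 42;

-- Downstream queries then exclude soft-deleted rows.
SELECT * FROM warehouse.customers WHERE NOT is_deleted;
```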

Insert Performance

Single-row inserts work well for row-oriented OLTP databases but are inefficient for column-oriented OLAP systems, where frequent small writes create heavy load and degrade read performance. Micro-batch or batch inserts are the preferred approach for OLAP.
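The difference in shape is simple: instead of issuing one INSERT per record, buffer records and write them in a single statement (or use a bulk COPY/LOAD command, depending on the warehouse). A hypothetical example:

```sql
-- Anti-pattern for OLAP: one round trip and one tiny write per row.
-- INSERT INTO events (event_id, event_type) VALUES (1, 'click');
-- INSERT INTO events (event_id, event_type) VALUES (2, 'view');

-- Preferred: accumulate rows and insert them as one micro-batch.
INSERT INTO events (event_id, event_type) VALUES
    (1, 'click'),
    (2, 'view'),
    (3, 'purchase');
```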

3.1.4 Idempotency in Batch Pipelines

An idempotent pipeline produces the same result whether it runs once or multiple times with the same input. Idempotency is one of the most important reliability properties in data engineering - it means you can safely retry a failed run without creating duplicates, corrupting data, or producing inconsistent results.


Why It Matters

Pipelines fail for countless reasons: network timeouts, cloud service outages, schema changes, resource limits. When a pipeline fails partway through, the recovery question is critical. Without idempotency, engineers must manually inspect what succeeded and what didn’t before deciding whether to rerun. With idempotency, the answer is always: just rerun the whole thing.


Common Patterns for Achieving Idempotency

| Pattern | How It Works | Tradeoff |
|---|---|---|
| Truncate and reload | Delete all data in the target partition, then write fresh | Simple and reliable, but expensive for large datasets |
| Upsert (MERGE) | Insert new rows, update existing ones based on a key | Efficient for incremental loads, but requires a reliable unique key |
| Deterministic output paths | Write to a path derived from the input (e.g., /output/date=2025-03-15/) so reruns overwrite the same location | Works well with object storage, relies on consistent naming |
| Deduplication on read | Accept potential duplicates at write time, deduplicate downstream using ROW_NUMBER() or QUALIFY | Shifts complexity to the consumer, but decouples write reliability from correctness |
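Two of these patterns sketched in SQL, with hypothetical table and partition names: an idempotent partition reload (delete-then-insert inside one transaction, so a rerun for the same date never duplicates data) and deduplication on read. Transaction syntax and QUALIFY support vary by warehouse.

```sql
-- Idempotent partition reload: rerunning this for the same date
-- always leaves exactly one copy of that day's data.
BEGIN;
DELETE FROM analytics.daily_sales WHERE sale_date = DATE '2025-03-15';
INSERT INTO analytics.daily_sales
SELECT sale_date, store_id, SUM(amount) AS total_amount
FROM raw.sales
WHERE sale_date = DATE '2025-03-15'
GROUP BY sale_date, store_id;
COMMIT;

-- Deduplication on read: tolerate duplicate writes, keep the latest version per key.
SELECT *
FROM raw.sales
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY sale_id ORDER BY loaded_at DESC
) = 1;
```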

Anti-patterns to Avoid

  • Appending without deduplication - rerunning a pipeline that appends to a table without checking for existing records creates duplicates with every retry
  • Using wall-clock timestamps as keys - if a pipeline reruns, the timestamp changes, making it impossible to detect duplicate records
  • Non-deterministic file names - writing to randomly named files means reruns create new files instead of overwriting previous attempts