4.2 AWS Services for Batch Pipelines

4.2.1 AWS Services for Batch Pipelines

A typical batch ETL pipeline starts with a source system (e.g., Amazon RDS) and needs an extraction and transformation layer. Running this on a raw EC2 instance works, but it means handling software installation, security, and all the other operational complexity of a cloud server yourself.

AWS Lambda offers a serverless alternative - you write a function to extract data from the source system with no infrastructure to manage. However, Lambda has a 15-minute timeout per invocation and limited memory and CPU, so it is best suited for lightweight tasks.
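
As a rough illustration, here is a minimal sketch of such an extraction function, assuming a MySQL-compatible RDS source, connection details passed through environment variables, and placeholder names (the orders table and the raw-pipeline-bucket staging bucket are hypothetical); the pymysql driver would need to be packaged with the function:

```python
import json
import os

import boto3
import pymysql  # assumed MySQL-compatible source; driver must ship in the deployment package

s3 = boto3.client("s3")


def handler(event, context):
    """Extract a small batch of rows from RDS and stage them in S3 as JSON."""
    conn = pymysql.connect(
        host=os.environ["RDS_HOST"],
        user=os.environ["RDS_USER"],
        password=os.environ["RDS_PASSWORD"],
        database=os.environ["RDS_DATABASE"],
    )
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            # Keep the extract small: each Lambda invocation is capped at 15 minutes.
            cur.execute("SELECT * FROM orders WHERE updated_at >= %s", (event["since"],))
            rows = cur.fetchall()
    finally:
        conn.close()

    # Stage the raw extract in S3 for the downstream transform step.
    s3.put_object(
        Bucket="raw-pipeline-bucket",
        Key=f"raw/orders/{event['since']}.json",
        Body=json.dumps(rows, default=str),
    )
    return {"row_count": len(rows)}
```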


Serverless Tools for Batch Processing:

  • Amazon EMR - more control; a managed big data framework that runs engines such as Apache Spark and Hive, with a serverless deployment option in EMR Serverless
  • AWS Glue ETL - more convenience; Glue Crawler automatically discovers and classifies data, the Glue Data Catalog serves as a central metadata repository, and Glue Visual ETL lets you design pipelines graphically. A good starting point for most teams (see the job sketch after this list).
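
The sketch below shows what a minimal Glue ETL job script might look like, assuming a Data Catalog database sales_db and table raw_orders (e.g., populated by a crawler) and an output bucket curated-pipeline-bucket; all of these names are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (metadata populated by a crawler).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Light transformation: drop columns the downstream consumers do not need.
cleaned = orders.drop_fields(["internal_notes"])

# Write the result to S3 as Parquet for the load/serve step.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://curated-pipeline-bucket/orders/"},
    format="parquet",
)
job.commit()
```

Glue runs this as a managed Spark job, so there is no cluster to provision or tear down.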

Load/Serve:

Where you land the processed data depends on the use case (a short loading sketch follows this list):

  • Amazon RDS - if structuring tabular data into a relational model such as a star schema
  • Amazon Redshift - if running complex analytical queries on massive datasets
  • Amazon S3 - for ML model artifacts or any workload that benefits from flexible, scalable, cost-effective object storage
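
As a rough sketch of the load step, the snippet below lands a transformed Parquet file in S3 and, where the data will serve heavy analytical queries, copies it into Redshift via the Redshift Data API; the bucket, cluster, table, and IAM role names are all hypothetical:

```python
import boto3

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

# 1. Land the transformed output in S3 (flexible, cost-effective object storage).
s3.upload_file(
    Filename="output/orders.parquet",
    Bucket="curated-pipeline-bucket",
    Key="curated/orders/orders.parquet",
)

# 2. If the data backs complex analytical queries, COPY it into Redshift.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=(
        "COPY analytics.orders "
        "FROM 's3://curated-pipeline-bucket/curated/orders/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
        "FORMAT AS PARQUET;"
    ),
)
```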

[Figure: AWS Batch ETL Pipeline]