2.3 Practical Examples on AWS

2.3.1 The Data Engineering Lifecycle on AWS

Each stage of the data engineering lifecycle maps to concrete AWS services. Here is how they break down.

[Figure: Data Engineering Lifecycle on AWS]

Source Systems

  • Databases: Amazon RDS - managed relational databases, which reduces operational overhead. Amazon DynamoDB - serverless NoSQL (key-value and document) database with flexible schemas, best suited to low-latency access to large volumes of data.
  • Streaming Sources: Amazon Kinesis Data Streams, Amazon SQS (message queues), Amazon MSK (Managed Streaming for Apache Kafka)

Ingestion

  • From a database: AWS Database Migration Service (DMS), AWS Glue
  • From a streaming source: Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon SQS, Amazon MSK
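As a rough illustration of the streaming path, the sketch below writes one event to a Kinesis data stream. The stream name `orders-stream`, the event fields, and the helper names are hypothetical. The only AWS call assumed is boto3's `put_record` on a Kinesis client; the client is passed in as an argument, and a stand-in is used here so the snippet runs without AWS credentials.

```python
import json


def build_kinesis_record(event: dict, partition_field: str) -> dict:
    """Turn an event dict into the Data/PartitionKey pair Kinesis expects."""
    return {
        # Kinesis stores opaque bytes, so serialize the event to JSON first.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(event[partition_field]),
    }


def send_event(kinesis_client, stream_name: str, event: dict, partition_field: str):
    """Write one record; kinesis_client is assumed to be boto3.client('kinesis')."""
    record = build_kinesis_record(event, partition_field)
    return kinesis_client.put_record(StreamName=stream_name, **record)


class FakeKinesis:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"ShardId": "shardId-000000000000", "SequenceNumber": "1"}


fake = FakeKinesis()
send_event(fake, "orders-stream", {"order_id": 42, "amount": 19.99}, "order_id")
```

In a real pipeline the stand-in would be replaced by `boto3.client("kinesis")`. Choosing the partition key matters because it determines how records are spread across shards.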

Storage

  • Amazon Redshift - traditional cloud data warehouse
  • Amazon S3 - object storage, and the foundation for a data lake or lakehouse architecture that handles both structured and unstructured data

Transformation

  • AWS Glue, Apache Spark, dbt

Serving

  • Analytics and BI: Querying with Amazon Athena and Amazon Redshift. Dashboarding with Amazon QuickSight, Apache Superset, or Metabase.
  • AI/ML: Serve batch data for model training and work with vector databases

2.3.2 The Undercurrents in AWS

The undercurrents also have direct AWS counterparts.

[Figure: AWS Undercurrents]

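Security - AWS uses a shared responsibility model: AWS secures the underlying cloud infrastructure, while customers secure what they run and store on it. IAM (Identity and Access Management) controls access through users, roles, and the policies attached to them.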
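To make the permissions idea concrete, here is a minimal sketch that builds a read-only IAM policy document for a single S3 bucket. The bucket name `example-data-lake`, the `raw/` prefix, and the function name are assumptions made for illustration; the JSON layout (`Version`, `Statement`, `Effect`, `Action`, `Resource`) follows the standard IAM policy grammar.

```python
import json


def read_only_bucket_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading is an object-level action, so it targets object ARNs.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# Hypothetical bucket and prefix, used only for illustration.
policy = read_only_bucket_policy("example-data-lake", prefix="raw/")
print(json.dumps(policy, indent=2))
```

Attaching a document like this to a role, rather than to individual users, is one way to keep grants narrow: a job that assumes the role can read the raw zone and nothing else.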

Data Management - AWS Glue crawlers scan data stored across AWS storage systems and infer its schema; the Glue Data Catalog stores and manages the resulting metadata.
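The following sketch shows one way to read table metadata back out of the Glue Data Catalog. It assumes a boto3 Glue client and its `get_table` call, whose response nests columns under `Table.StorageDescriptor.Columns`. The database name, table name, S3 location, and the stand-in client are hypothetical and exist only so the snippet runs offline.

```python
def table_schema(glue_client, database: str, table: str) -> dict:
    """Summarize a catalog table; glue_client is assumed to be boto3.client('glue')."""
    tbl = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    storage = tbl["StorageDescriptor"]
    return {
        "location": storage.get("Location"),
        # Map each column name to its catalog type.
        "columns": {col["Name"]: col["Type"] for col in storage["Columns"]},
        # Partition keys are listed separately from regular columns.
        "partition_keys": [key["Name"] for key in tbl.get("PartitionKeys", [])],
    }


class FakeGlue:
    """Hypothetical stand-in returning a response shaped like Glue's get_table."""

    def get_table(self, DatabaseName, Name):
        return {
            "Table": {
                "Name": Name,
                "DatabaseName": DatabaseName,
                "StorageDescriptor": {
                    "Location": "s3://example-data-lake/raw/orders/",
                    "Columns": [
                        {"Name": "order_id", "Type": "bigint"},
                        {"Name": "amount", "Type": "double"},
                    ],
                },
                "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
            }
        }


schema = table_schema(FakeGlue(), "sales_db", "orders")
print(schema)
```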

DataOps - Amazon CloudWatch collects metrics and logs and provides monitoring for AWS resources, applications, and on-premises systems. Amazon SNS (Simple Notification Service) handles alerting, for example by delivering CloudWatch alarm notifications to subscribers.
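A minimal sketch of how monitoring and alerting might be wired together: a CloudWatch alarm on a custom metric that notifies an SNS topic when it fires. The namespace `DataPipeline`, the metric `FailedRecords`, the topic ARN, and the helper names are all hypothetical. The keyword arguments are those of boto3's CloudWatch `put_metric_alarm`, and a stand-in client is used so the snippet runs without AWS access.

```python
def failed_records_alarm(pipeline: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a hypothetical custom metric."""
    return {
        "AlarmName": f"{pipeline}-failed-records",
        "Namespace": "DataPipeline",   # hypothetical custom namespace
        "MetricName": "FailedRecords",  # hypothetical custom metric
        "Dimensions": [{"Name": "Pipeline", "Value": pipeline}],
        "Statistic": "Sum",
        "Period": 300,                  # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        # When the alarm fires, CloudWatch publishes to this SNS topic.
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, pipeline: str, sns_topic_arn: str) -> dict:
    """Create the alarm; cloudwatch_client is assumed to be boto3.client('cloudwatch')."""
    params = failed_records_alarm(pipeline, sns_topic_arn)
    cloudwatch_client.put_metric_alarm(**params)
    return params


class FakeCloudWatch:
    """Hypothetical stand-in that captures the call instead of contacting AWS."""

    def __init__(self):
        self.alarms = []

    def put_metric_alarm(self, **kwargs):
        self.alarms.append(kwargs)


fake_cw = FakeCloudWatch()
topic = "arn:aws:sns:us-east-1:123456789012:data-alerts"  # hypothetical ARN
alarm = create_alarm(fake_cw, "orders-ingest", topic)
```

Setting `TreatMissingData` to `notBreaching` is a design choice for this sketch: a pipeline that emits no failure datapoints is treated as healthy rather than as alarming.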

Orchestration - Apache Airflow is a widely adopted orchestrator and is available as a managed service through Amazon MWAA (Managed Workflows for Apache Airflow).

Architecture - The AWS Well-Architected Framework provides the guiding principles.

Software Engineering - AWS Cloud9 IDE (hosted on EC2) for development, AWS CodeDeploy for automated deployments, and Git/GitHub for source code management.