Glossary of Terms

Key concepts and technologies covered across all four courses.

🔄 Data Engineering Lifecycle

Batch Processing

Data Engineering Lifecycle

Data Pipeline

Ingestion

Serving

Stream Processing

Transformation

Undercurrents

📡 Source Systems

ACID Compliance

Amazon DynamoDB

IoT Devices

Logs

NoSQL Databases

OLTP

Relational Databases

REST API

Source Systems

Structured Data

Unstructured Data

📥 Data Ingestion

Airbyte

Amazon Data Firehose

Amazon Kinesis Data Streams

Amazon MSK

AWS DMS

Batch Ingestion

Change Data Capture (CDC)

Dead Letter Queue

Debezium

Event Streaming Platform

Fivetran

JDBC/ODBC

Message Queue

Stream Ingestion

💾 Storage Systems

Amazon EBS

Amazon EFS

Amazon RDS

Amazon S3

Block Storage

Cassandra

Column-Oriented Storage

Distributed Storage Systems

File Storage

HDFS

In-Memory Storage

Memcached

Object Storage

Redis

Row-Oriented Storage

Storage Tiers

🏗️ Storage Abstractions

Apache Hudi

Apache Iceberg

AWS Lake Formation

Data Lake

Data Lakehouse

Data Mart

Data Warehouse

Delta Lake

Medallion Architecture

Open Table Formats

Schema Evolution

Schema-on-Read

Schema-on-Write

Separation of Storage and Compute

⚡ Queries & Performance

Aggregate Queries

Amazon Athena

B-Tree Index

Common Table Expressions (CTEs)

Distribution Styles

Exactly-Once Semantics

EXPLAIN

Hash Join

Index

Joins

Massively Parallel Processing (MPP)

Partition Pruning

Query Optimizer

Redshift Spectrum

Sort Key

SQL

Streaming Queries

VACUUM

Watermark (Streaming)

Window Functions

Windowing (Tumbling, Sliding, Session)

📐 Data Modeling

Conformed Dimension

Data Vault

Denormalization

Dimension Table

Entity-Relationship Diagram

Fact Table

Grain

Inmon Approach

Kimball Approach

Normalization

One Big Table (OBT)

Primary Key / Foreign Key

Slowly Changing Dimension (SCD)

Star Schema

Surrogate Key

🔧 Transformations

Apache Hadoop

Apache Spark

Backfill

dbt

ELT

ETL

Feature Engineering

Idempotency

MapReduce

PySpark

Reverse ETL

Spark DataFrames

Spark Structured Streaming

User-Defined Functions (UDFs)

📊 Serving

Amazon QuickSight

Amazon SageMaker

Business Intelligence

Embedded Analytics

Materialized View

Metabase

Operational Analytics

Semantic Layer

View

🎼 Orchestration

Amazon MWAA

Apache Airflow

Dagster

Directed Acyclic Graph (DAG)

Mage

Prefect

TaskFlow API

XComs

🔁 DataOps & DevOps

Amazon CloudWatch

CI/CD

Data Contract

Data Governance

Data Lineage

Data Observability

Data Quality

DataOps

Great Expectations

Incident Response

Infrastructure as Code (IaC)

Terraform

☁️ Cloud Platforms & Services

Amazon EC2

Amazon EMR

Amazon Neptune

Amazon Redshift

Amazon SQS

Amazon VPC

Apache Flink

Apache Kafka

AWS CloudFormation

AWS Glue

AWS IAM

AWS Lambda

AWS Well-Architected Framework

Boto3

Databricks

Glue Data Catalog

Google BigQuery

MySQL

Neo4j

Oracle

PostgreSQL

Snowflake

📦 Data Formats & Serialization

Apache Avro

Apache Parquet

Compression

CSV

JSON

Serialization

XML

🏛️ Architecture & Design Patterns

Availability Zones

CAP Theorem

Conway's Law

Data Architecture

Data Mesh

FinOps

GDPR

Kappa Architecture

Lambda Architecture

Loosely Coupled Systems

Monolith vs. Modular

Partitioning (Sharding)

Principle of Least Privilege

Replication

Serverless

Shared Responsibility Model

Total Cost of Ownership (TCO)