3.4 PySpark SQL, EMR, Glue, and Streaming
3.4.1 PySpark SQL
PySpark supports three approaches to data transformation: SQL-based manipulation, Python-based DataFrame operations, and a hybrid of both. All three run on the same computation engine and are optimized into the same underlying execution plan.
Temporary Views
A temporary view is a virtual table that exists only while the Spark session is active. It lets you write SQL queries against a DataFrame without physically storing data.
# Register DataFrame as a temporary SQL view
transactions_df.createOrReplaceTempView("orders")
# Query using SQL syntax - returns a DataFrame
spark.sql("SELECT * FROM orders").show()
SQL Queries
# Aggregate: total amount per order
spark.sql("""
SELECT ID, SUM(Amount) AS total
FROM orders
GROUP BY ID
ORDER BY total DESC
""").show()
UDFs in SQL
SQL queries cannot call plain Python functions directly - register the function as a UDF with Spark first:
def to_lower(word):
    return word.lower()
# Register for use in SQL
spark.udf.register("udf_to_lower", to_lower)
spark.sql("SELECT DISTINCT udf_to_lower(Country) FROM orders").show()
Joining Multiple Views
# Register a second DataFrame as a view
product_category_df.createOrReplaceTempView("items")
# Join views using standard SQL
spark.sql("""
SELECT items.Category, AVG(orders.Amount) AS avg_amount
FROM items
LEFT JOIN orders ON items.ItemID = orders.StockCode
GROUP BY items.Category
""").show()
3.4.2 Amazon EMR
Amazon EMR
Amazon EMR is AWS’s managed big data platform for scalable, distributed computing.
| Feature | Description |
|---|---|
| Frameworks | Apache Spark, Hadoop, Hive, Presto, Flink, HBase |
| Integration | Native with S3, DynamoDB, RDS, Redshift |
| Storage | Decouples compute from storage via S3 - multiple clusters can process the same data |
| Scaling | Clusters scale elastically based on workload |
EMR Studio and Serverless
| Variant | Description |
|---|---|
| EMR Studio | Browser-based IDE (built on Jupyter) for interactive analysis and job execution |
| EMR Serverless | Eliminates manual cluster management - automatic scaling, pay only for what you use |
A typical workflow: create an EMR Serverless application with default Spark settings, launch an EMR Studio workspace, attach it to the application, choose the PySpark kernel, and run Spark jobs interactively.
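The same setup can also be driven programmatically. A minimal sketch with boto3, assuming an existing IAM execution role and a PySpark script already uploaded to S3 (the names, ARN, and paths below are placeholders):
import boto3
emr = boto3.client("emr-serverless")
# Create a Spark application (placeholder name and release label)
app = emr.create_application(
    name="demo-spark-app",
    releaseLabel="emr-6.9.0",
    type="SPARK",
)
# Submit a PySpark script stored in S3 (placeholder role ARN and path)
emr.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-role",
    jobDriver={"sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/job.py"}},
)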
3.4.3 AWS Glue
AWS Glue
AWS Glue is a serverless data integration service for ingesting, transforming, and loading data using Apache Spark under the hood.
Glue Jobs
ETL pipelines in AWS Glue are called Glue jobs. Three ways to create them:
| Method | Skill Level | Description |
|---|---|---|
| Glue DataBrew | No-code | Visual spreadsheet-like interface for transforms, powered by Spark |
| Glue Studio | Low-code | Drag-and-drop UI for sources, transforms, and destinations with optional custom code |
| Jupyter Notebooks | Code | Write Spark code from scratch, run via Glue |
Glue jobs can be orchestrated using Glue triggers, blueprints, and workflows.
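For example, a scheduled trigger can start a job on a cron schedule. A sketch with boto3, assuming a job named glue_job already exists (trigger name and schedule are placeholders):
import boto3
glue = boto3.client("glue")
# Run the job every day at 02:00 UTC (placeholder trigger and job names)
glue.create_trigger(
    Name="nightly-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "glue_job"}],
    StartOnCreation=True,
)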
Data Catalog and Governance
AWS Glue includes a centralized data catalog for managing metadata and enabling governance. Glue Crawlers scan data sources to populate the catalog with schema, data types, structure, and partitioning information - essential for data lakes and lakehouses.
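Crawlers can be created and started programmatically as well. A minimal boto3 sketch, assuming an IAM role with access to the S3 path (all names are placeholders):
import boto3
glue = boto3.client("glue")
# Crawl an S3 prefix and write the inferred schemas into a catalog database
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")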
The Glue Data Catalog integrates with Amazon Athena (SQL queries), Amazon QuickSight (BI dashboards), and Amazon SageMaker (ML models).
3.4.4 AWS Glue Visual ETL
Glue Visual ETL
Glue Visual ETL is a low-code interface in AWS Glue Studio for designing ETL pipelines.
Typical Workflow
- Configure sources - add source nodes (e.g., MySQL tables via JDBC) and specify connections
- Define transforms - add SQL Query nodes for each dimension and fact table
- Set targets - add S3 target nodes with Parquet format
- Generate script - preview and customize the auto-generated PySpark script
- Run and verify - configure workers, run the job, monitor until succeeded
Generated Glue Script Structure
The auto-generated glue_job.py follows this pattern:
import sys
from awsglue import DynamicFrame
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
def sparkSqlQuery(glueContext, query, mapping, transformation_ctx):
    """Run a SQL query against DynamicFrames registered as temp views."""
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)  # uses the global SparkSession defined below
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)
# Parse job arguments and initialize context
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "glue_connection", "glue_database", "target_path"]
)
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# --- EXTRACT: read source tables via JDBC ---
customers = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "classicmodels.customers",
        "connectionName": args["glue_connection"],
    },
    transformation_ctx="customers",
)
# --- TRANSFORM: build star schema with SQL ---
dim_customers = sparkSqlQuery(
    glueContext,
    query="""
        SELECT customerNumber, customerName,
               contactLastName, contactFirstName,
               phone, addressLine1, creditLimit
        FROM customers
    """,
    mapping={"customers": customers},
    transformation_ctx="dim_customers",
)
# --- LOAD: write to S3 as Parquet ---
sink = glueContext.getSink(
    path=f"{args['target_path']}/dim_customers/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="dim_customers_to_s3",
)
sink.setCatalogInfo(
    catalogDatabase=args["glue_database"],
    catalogTableName="dim_customers",
)
sink.setFormat("glueparquet")
sink.writeFrame(dim_customers)
job.commit()
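The parameters read by getResolvedOptions are supplied as job arguments (prefixed with --) when the job is started. A sketch with boto3; the connection, database, and S3 path values are placeholders:
import boto3
glue = boto3.client("glue")
# Argument keys must match the names parsed by getResolvedOptions
glue.start_job_run(
    JobName="glue_job",
    Arguments={
        "--glue_connection": "mysql-connection",
        "--glue_database": "analytics_catalog",
        "--target_path": "s3://my-bucket/warehouse",
    },
)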
3.4.5 DataFrames vs. SQL and Streaming
Choosing Between SQL and Python
| Aspect | Spark DataFrames (Python) | Spark SQL |
|---|---|---|
| Syntax | Programmatic (Python API) | Declarative (SQL queries) |
| Complex transforms | Easier to implement (e.g., transpose, custom logic) | Limited for non-standard operations |
| Reusability | High - modular, testable, supports libraries | Low - limited reuse for complex queries |
| Maintainability | Easier with modular code structure | Harder with large, nested queries |
| Best for | Complex logic, reusable components | Simple, declarative data manipulation |
Both compile to the same execution plan - mix and match freely.
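For instance, a pipeline can start in SQL and continue in the DataFrame API; a small sketch, assuming the orders view registered earlier:
from pyspark.sql import functions as F
# SQL for the aggregation, DataFrame API for the follow-up logic
per_customer = spark.sql("SELECT ID, SUM(Amount) AS total FROM orders GROUP BY ID")
(per_customer
    .withColumn("bucket", F.when(F.col("total") > 1000, "high").otherwise("low"))
    .groupBy("bucket")
    .count()
    .show())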
Spark vs. Pandas
| Aspect | Pandas | Spark |
|---|---|---|
| Data size | Fits in memory on a single machine | Exceeds memory - distributed across nodes |
| Execution | Single machine | Distributed cluster |
| Setup | Minimal | Requires cluster (local or cloud) |
| Best for | Prototyping, local exploration, lightweight tasks | Production pipelines, high-volume processing |
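Moving between the two is straightforward; a sketch, assuming the data (or a sample of it) fits in driver memory:
import pandas as pd
# Pandas -> Spark: distribute a local DataFrame across the cluster
pdf = pd.DataFrame({"ID": [1, 2], "Amount": [10.0, 25.0]})
sdf = spark.createDataFrame(pdf)
# Spark -> pandas: collect to the driver (sample or limit first for large data)
local_sample = sdf.limit(1000).toPandas()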
Streaming Transformations
For streaming, the key consideration is latency requirements:
- Micro-batch (e.g., Spark Structured Streaming) - processes small batches at regular intervals, trading a few seconds of latency for simpler semantics (see the sketch below)
- True streaming (e.g., Apache Flink) - processes events individually with millisecond latency; more complex to operate, but with the lowest latency
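A minimal micro-batch sketch with Spark Structured Streaming, using the built-in rate source to generate test data:
# The built-in "rate" source emits (timestamp, value) rows for testing
stream_df = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load())
# Running count of all rows seen so far, refreshed every 5-second micro-batch
query = (stream_df
    .groupBy()
    .count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="5 seconds")
    .start())
query.awaitTermination(30)  # block for up to 30 seconds
query.stop()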