3.4 Streaming Queries

3.4.1 Streaming Query Patterns

When data arrives continuously, query patterns must reflect that real-time nature. Frameworks like Apache Flink, Spark Streaming, and Kafka enable SQL queries over streaming data, typically using windowing techniques to bound the otherwise infinite stream.


Windowing Techniques

Window TypeBehaviorUse Case
Tumbling (fixed-time)Non-overlapping windows of equal duration. Each event belongs to exactly one window.Regular aggregations - e.g., count events per minute
SlidingOverlapping windows that advance by a fixed step smaller than the window size. Events can appear in multiple windows.Moving averages - e.g., 5-minute average computed every 1 minute
SessionVariable-size windows grouped by activity. An inactivity gap triggers a new session.User behavior analysis - e.g., web session tracking
Streaming window types: tumbling, sliding, and session Streaming window types: tumbling, sliding, and session

Joining Data Streams

Streaming joins combine records from two or more streams based on matching keys within a time window. For example, joining a clickstream with a purchase stream to correlate user behavior with transactions within a 30-minute window.

AWS provides a fully managed service for running Apache Flink applications, handling infrastructure provisioning, scaling, and fault tolerance so you can focus on writing streaming logic.

FeatureDescription
Managed infrastructureNo servers to provision - AWS handles cluster sizing, patching, and scaling
Automatic scalingAdjusts parallelism based on incoming data throughput
Fault toleranceBuilt-in checkpointing and state recovery - processing resumes from the last checkpoint after failures
IntegrationReads from Kinesis Data Streams, Amazon MSK (Kafka), and S3; writes to S3, Redshift, OpenSearch, and more

A typical deployment: Kinesis Data Streams ingests real-time events, Managed Apache Flink processes and transforms the stream using SQL or Java/Python applications, and results are written to S3 or Redshift for downstream analytics.

AWS streaming pipeline: Kinesis โ†’ Managed Apache Flink โ†’ S3 and Redshift AWS streaming pipeline: Kinesis โ†’ Managed Apache Flink โ†’ S3 and Redshift

3.4.3 Watermarks and Late-Arriving Data

In real-world streaming systems, events rarely arrive in perfect order. Network delays, device buffering, and retries mean an event with event time 10:00:05 might arrive at the processing engine at 10:00:45 - or even minutes later. Watermarks are the mechanism streaming frameworks use to reason about this disorder.

Watermarks and late data handling in stream processing Watermarks and late data handling in stream processing

Event Time vs. Processing Time

Every streaming event has two timestamps: the event time (when it actually occurred at the source) and the processing time (when the engine received it). Correct analytical results require aggregating by event time, but the engine only sees events in processing-time order. This gap is the core challenge watermarks solve.


How Watermarks Work

A watermark is a declaration by the streaming engine that it believes all events with an event time up to a certain point have arrived. When a watermark advances past the end of a window, the engine considers that window complete and emits its result.

Watermarks are typically set as a fixed offset from the latest observed event time. For example, a watermark of โ€œ10 secondsโ€ means the engine waits 10 seconds of event-time progress before closing a window, allowing for that much lateness.


Handling Late Data

Events that arrive after their windowโ€™s watermark has passed are considered late. Frameworks provide three strategies for handling them:

StrategyBehaviorUse Case
DropDiscard the late event entirelyAcceptable when occasional data loss is tolerable (e.g., click analytics)
Allowed latenessAccept late events within a grace period and update the window resultWhen correctness matters but unbounded waiting is impractical
Side outputRoute late events to a separate stream for manual inspection or reprocessingWhen no data should be silently lost (e.g., financial transactions)