4.1 Serving Data and Analytics

4.1.1 Serving Data for Analytics and ML

The final stage of the data engineering lifecycle is serving: delivering processed data to downstream consumers for analytics and machine learning.


Analytical Use Cases

Use Case | Description
Business Intelligence | Dashboards and reports for strategic decision-making
Operational Analytics | Monitoring data to inform immediate action, served within required latency constraints
Embedded Analytics | Client-facing data products or dashboards

Machine Learning Use Cases

The primary end users are data scientists and ML engineers. The data engineer's role is to acquire, transform, and deliver the data necessary for model training.


Ways to Serve Data

Method | Description
Table | Physical storage of data; queries read directly from disk/memory
View | A saved SQL query that runs on demand; always returns fresh results but recomputes every time
Materialized View | A pre-computed snapshot of a query result stored physically; fast reads but requires periodic refresh
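
The difference between a table and a view can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production setup: the in-memory database and the sample rows are assumptions made for the example, while the fact_orders and daily_sales names follow the SQL in section 4.1.2.

```python
import sqlite3

# In-memory database with invented sample rows (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_orders (order_date TEXT, amount REAL);
    INSERT INTO fact_orders VALUES
        ('2024-01-01', 10.0),
        ('2024-01-01', 5.0),
        ('2024-01-02', 7.5);

    -- A view stores only the query text, not its result.
    CREATE VIEW daily_sales AS
    SELECT order_date, SUM(amount) AS total
    FROM fact_orders
    GROUP BY order_date;
""")


def read_daily_sales(connection):
    """Query the view; the aggregation runs again on every call."""
    return connection.execute(
        "SELECT order_date, total FROM daily_sales ORDER BY order_date"
    ).fetchall()


before = read_daily_sales(conn)

# A new order lands in the base table; the view needs no refresh step.
conn.execute("INSERT INTO fact_orders VALUES ('2024-01-02', 2.5)")
after = read_daily_sales(conn)

print(before)
print(after)
```

The second read reflects the new order immediately, at the cost of re-running the aggregation each time the view is queried.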

How Data Scientists Accept Data

Delivery Method | Details
As files | Text files for language modeling, Parquet/CSV for tabular ML, image formats for computer vision; common for ad hoc requests
From databases / warehouses | Access via SQL queries with schema enforcement, fine-grained permissions, and high query performance
From streaming systems | Real-time data delivery for low-latency analytics across both historical and current data
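
As a rough sketch of the first two delivery methods, the example below hands the same rows over once as a CSV file and once through a SQL query. Everything here is hypothetical: the features table, its columns, and the sample values are invented for illustration, and SQLite stands in for a real warehouse. Streaming delivery is left out because it needs infrastructure beyond a short example.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical feature rows a data scientist might request.
COLUMNS = ["user_id", "age", "total_spend"]
ROWS = [
    ("u1", 34, 120.5),
    ("u2", 27, 80.0),
    ("u3", 45, 310.25),
]


def serve_as_file(directory: Path) -> Path:
    """Write the rows to a CSV file, a typical ad hoc handoff."""
    path = directory / "features.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(ROWS)
    return path


def serve_from_database() -> sqlite3.Connection:
    """Load the rows into a table with declared column types."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features ("
        "user_id TEXT NOT NULL, age INTEGER NOT NULL, total_spend REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", ROWS)
    return conn


# File delivery: the consumer parses the file and gets strings back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = serve_as_file(Path(tmp))
    with csv_path.open(newline="") as handle:
        from_file = list(csv.DictReader(handle))

# Database delivery: the consumer filters with SQL and gets typed values.
conn = serve_from_database()
from_db = conn.execute(
    "SELECT user_id, total_spend FROM features WHERE age > 30 ORDER BY user_id"
).fetchall()

print(from_file[0])
print(from_db)
```

The CSV reader returns every field as a string, so the consumer has to restore types, whereas the query returns numeric values and pushes the filtering to the database.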

Semantic Layer

A semantic layer documents definitions, derives business metrics, and creates a common language for data across the organization. It ensures correctness, consistency, and trustworthiness through clear data definitions (column names) and data logic (formulas for deriving metrics).
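
One way to picture a semantic layer is as a small registry that stores each metric's definition and formula in one place and generates queries from it. The sketch below is hypothetical throughout: the Metric class, the METRICS registry, and build_query are invented names for illustration, not the API of any particular semantic-layer tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A single business metric: its definition and its data logic."""
    name: str
    description: str
    expression: str      # SQL formula used to derive the metric
    source_table: str
    dimensions: tuple    # columns the metric may be grouped by


# Hypothetical registry: one shared definition per metric.
METRICS = {
    "total_revenue": Metric(
        name="total_revenue",
        description="Sum of order amounts",
        expression="SUM(amount)",
        source_table="fact_orders",
        dimensions=("order_date",),
    ),
    "order_count": Metric(
        name="order_count",
        description="Number of orders",
        expression="COUNT(*)",
        source_table="fact_orders",
        dimensions=("order_date",),
    ),
}


def build_query(metric_name: str, group_by: str) -> str:
    """Generate SQL for a registered metric grouped by an allowed dimension."""
    metric = METRICS[metric_name]
    # Only registered dimensions are interpolated into the SQL text.
    if group_by not in metric.dimensions:
        raise ValueError(f"{group_by!r} is not a dimension of {metric_name!r}")
    return (
        f"SELECT {group_by}, {metric.expression} AS {metric.name} "
        f"FROM {metric.source_table} GROUP BY {group_by}"
    )


query = build_query("total_revenue", "order_date")
print(query)
```

Because every dashboard or notebook would call build_query instead of writing its own formula, a change to the definition of total_revenue happens in exactly one place.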

4.1.2 Views and Materialized Views

Property | View | Materialized View
Storage | No physical storage; query runs on demand | Pre-computed result stored on disk
Freshness | Always up-to-date (recomputes on every query) | Stale until refreshed (manual or scheduled)
Performance | Slower for complex queries (recomputes each time) | Fast reads; serves pre-computed results
Use case | Simple transformations, access control layers | Expensive aggregations queried frequently
Maintenance | None; just a saved SQL definition | Requires refresh strategy (incremental or full)
Figure: a view recomputes its SQL on every query, whereas a materialized view reads a cached result.

-- Create a view: recomputes on every query
CREATE VIEW daily_sales AS
SELECT order_date, SUM(amount) AS total
FROM fact_orders
GROUP BY order_date;

-- Create a materialized view: pre-computed, needs refresh
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, SUM(amount) AS total
FROM fact_orders
GROUP BY order_date;

-- Refresh the materialized view when underlying data changes
-- (PostgreSQL syntax; other engines use different refresh mechanisms)
REFRESH MATERIALIZED VIEW daily_sales_mv;
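
The staleness and refresh trade-off can be tried out locally with Python's sqlite3 module. SQLite has no materialized views, so this sketch emulates one with an ordinary table named daily_sales_mv that is refreshed explicitly. The sample rows and the full_refresh, incremental_refresh, and read_snapshot helpers are assumptions made for illustration.

```python
import sqlite3

DAILY_SALES_QUERY = """
    SELECT order_date, SUM(amount) AS total
    FROM fact_orders
    GROUP BY order_date
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_orders (order_date TEXT, amount REAL);
    INSERT INTO fact_orders VALUES ('2024-01-01', 10.0), ('2024-01-02', 7.5);

    -- Plain table standing in for a materialized view (emulation).
    CREATE TABLE daily_sales_mv (order_date TEXT PRIMARY KEY, total REAL);
""")


def full_refresh(connection):
    """Rebuild the whole snapshot from the base table."""
    with connection:  # delete and reload commit together as one transaction
        connection.execute("DELETE FROM daily_sales_mv")
        connection.execute(f"INSERT INTO daily_sales_mv {DAILY_SALES_QUERY}")


def incremental_refresh(connection, changed_dates):
    """Recompute only the dates known to have changed."""
    with connection:
        for day in changed_dates:
            connection.execute(
                "DELETE FROM daily_sales_mv WHERE order_date = ?", (day,)
            )
            connection.execute(
                "INSERT INTO daily_sales_mv "
                "SELECT order_date, SUM(amount) FROM fact_orders "
                "WHERE order_date = ? GROUP BY order_date",
                (day,),
            )


def read_snapshot(connection):
    """Read the pre-computed rows without touching fact_orders."""
    return connection.execute(
        "SELECT order_date, total FROM daily_sales_mv ORDER BY order_date"
    ).fetchall()


full_refresh(conn)

# New data arrives after the snapshot was taken.
conn.execute("INSERT INTO fact_orders VALUES ('2024-01-02', 2.5)")
stale = read_snapshot(conn)   # still shows the old total for 2024-01-02

incremental_refresh(conn, ["2024-01-02"])
fresh = read_snapshot(conn)   # now includes the new order

print(stale)
print(fresh)
```

A full refresh is simpler and always correct but rereads the entire base table. The incremental variant touches less data, but it depends on knowing which dates changed, which in practice requires some form of change tracking.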