3.1 Batch Queries
3.1.1 Query Fundamentals
How data is stored and managed directly affects both query performance and the performance of the storage systems themselves. This week explores those relationships.
Query Definition
A query is a statement written in a specific query language to retrieve or act on data. Examples include SQL queries against an RDBMS, SQL-like queries against Amazon S3, and Cypher queries against a Neo4j graph database.
Query Languages
SQL and its relatives are declarative - you describe what data you want, and the DBMS handles the execution steps. That abstraction does not mean you can ignore what happens behind the scenes, though. This week covers:
- SQL execution behind the scenes
- Techniques to improve query performance (e.g., database indexes)
- How aggregating queries behave differently on columnar vs. row storage
- Queries on streaming data
3.1.2 The Life of a Query
A batch query travels through several stages within the DBMS before returning results.
| Stage | Component | What Happens |
|---|---|---|
| 1. Transport | Network layer | Client sends the SQL query over a network connection to the DBMS |
| 2. Parsing | Query parser | Tokenizes the query, checks syntax, validates access permissions, compiles tokens to bytecode |
| 3. Optimization | Query optimizer | Evaluates candidate execution plans - considers operation types, index availability, data scan size - calculates the cost of each (I/O, computation, memory) and selects the least expensive |
| 4. Execution | Execution engine | Executes the chosen plan against the storage layer |
| 5. Storage | Storage engine | Reads/writes data blocks on disk and returns results back up the chain |