# Life of a Query

_Inspired by [CRDB: Life of a SQL
Query](https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/life_of_a_query.md)._

## Introduction

This document aims to provide an overview of the components involved in
processing a SQL query, following the code paths through the various layers:
the network protocol, session management, the adapter, query parsing, planning
and optimization, the interaction of the adapter layer with clusters, the
storage layer, and the delivery of results from clusters back through the
adapter layer to the client.

We don't discuss design decisions or historical evolution but focus on the
current code. We also only cover the common and most significant parts and
omit details and special cases.

## PostgreSQL Client Protocol

Queries arrive at Materialize through the [PostgreSQL wire
protocol](https://www.postgresql.org/docs/current/protocol.html).

The protocol is implemented in the [pgwire
crate](https://github.com/MaterializeInc/materialize/tree/bea88fb10b02792bad73afd79a6350b381a7dcc3/src/pgwire).
Incoming connections are handled in
[handle_connection](https://github.com/MaterializeInc/materialize/blob/bea88fb10b02792bad73afd79a6350b381a7dcc3/src/pgwire/src/server.rs#L105),
which passes on to
[run](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/pgwire/src/protocol.rs#L127),
and ultimately we create a
[StateMachine](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/pgwire/src/protocol.rs#L478)
whose
[run](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/pgwire/src/protocol.rs#L512)
handles the connection until completion.
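
To make the flow concrete, here is a minimal sketch of the overall shape,
assuming a tokio TCP listener. The type and function names are illustrative,
not the actual pgwire code, which additionally handles TLS negotiation,
startup messages, authentication, and more:

```rust
use tokio::net::{TcpListener, TcpStream};

/// Illustrative stand-in for the per-connection protocol state machine.
struct ConnState {
    conn: TcpStream,
}

impl ConnState {
    async fn run(&mut self) {
        // Read protocol messages, advance the state machine, and write
        // responses, until the client disconnects.
    }
}

/// Accept connections and run one state machine per connection, each in its
/// own task, independently of all other connections.
async fn serve(listener: TcpListener) -> std::io::Result<()> {
    loop {
        let (conn, _addr) = listener.accept().await?;
        tokio::spawn(async move {
            let mut state = ConnState { conn };
            state.run().await;
        });
    }
}
```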

The state machine has an adapter client
([SessionClient](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/adapter/src/client.rs#L481))
that it uses to accomplish most "interesting" things. It is a client to the
adapter layer, whose main component is the Coordinator. The work of processing
queries and maintaining session state is split between the "front end" state
machine that runs the pgwire protocol and the "back end" adapter layer that
handles talking to the other components.
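
Conceptually, a front end call into the adapter amounts to sending a command
over a channel together with a way to send the response back. This is a
simplified sketch assuming tokio channels; `Command`, `Response`, and
`execute` are illustrative names, not the actual SessionClient API:

```rust
use tokio::sync::{mpsc, oneshot};

/// Illustrative command and response types; the real adapter has many more.
enum Command {
    Execute {
        sql: String,
        tx: oneshot::Sender<Response>,
    },
}

struct Response {
    rows: Vec<String>,
}

/// The front end's handle to the adapter back end.
struct FrontEndClient {
    cmd_tx: mpsc::UnboundedSender<Command>,
}

impl FrontEndClient {
    /// Send a command to the back end and wait for its response. Many front
    /// ends can do this concurrently; the back end decides the order in which
    /// commands are actually processed.
    async fn execute(&self, sql: String) -> Response {
        let (tx, rx) = oneshot::channel();
        self.cmd_tx
            .send(Command::Execute { sql, tx })
            .expect("adapter back end is gone");
        rx.await.expect("adapter back end dropped the response")
    }
}
```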

> [!NOTE]
> We will keep using the terms adapter front end and adapter back end below to
> mean, respectively, the state machine that is responsible for a single
> connection and can do work concurrently with other connections, and the
> adapter/Coordinator, which has to serialize its work and is therefore more
> limited in how much it can do concurrently.
>
> If it is clear from context, we will drop the adapter prefix.

## Adapter / Coordinator

The
[adapter](https://github.com/MaterializeInc/materialize/tree/f641a29d4aad9fcfeb2de535ff54706a1f1d38c4/src/adapter)
is so named because it translates user commands into commands that the other
internal systems understand. It currently understands SQL but is intended as a
generic component that isolates the other components from needing to know
details about the front-end commands (SQL).

A core component is the
[Coordinator](https://github.com/MaterializeInc/materialize/blob/9682eb11f31e2eb8005cb3eae687f81a1bce21bb/src/adapter/src/coord.rs#L1633).
It mediates access to durable environment state, kept in a durable catalog and
represented by the
[Catalog](https://github.com/MaterializeInc/materialize/blob/9682eb11f31e2eb8005cb3eae687f81a1bce21bb/src/adapter/src/catalog.rs#L137)
in memory, and to the controllers for the storage and compute components (more
on that later).

It holds on to mutable state and clients to other components. Access and
changes to these are mediated by a "single-threaded" event loop that listens
on internal and external command channels. Commands are worked off
sequentially. Other parts of the system can enqueue commands: for example, the
front end calling into the adapter is implemented as sending a command that
eventually causes a response to be sent back. The Coordinator also
periodically enqueues commands for itself.
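
A simplified sketch of that event loop, assuming tokio; the command variants
and handler names are illustrative, not the real Coordinator implementation:

```rust
use tokio::sync::mpsc;

/// Illustrative commands; see the sketch in the pgwire section for how a
/// front end might send the external ones.
enum ExternalCommand {
    Execute { sql: String },
}

enum InternalCommand {
    AdvanceTimelines,
}

struct Coordinator {
    external_rx: mpsc::UnboundedReceiver<ExternalCommand>,
    internal_rx: mpsc::UnboundedReceiver<InternalCommand>,
    // ... catalog, controller clients, and other mutable state ...
}

impl Coordinator {
    /// The "single-threaded" event loop: one command at a time, which is what
    /// serializes access to the Coordinator's mutable state.
    async fn serve(mut self) {
        loop {
            tokio::select! {
                Some(cmd) = self.external_rx.recv() => self.handle_external(cmd),
                Some(cmd) = self.internal_rx.recv() => self.handle_internal(cmd),
                else => break, // all senders are gone
            }
        }
    }

    fn handle_external(&mut self, _cmd: ExternalCommand) {
        // Plan, sequence, and eventually respond to the front end.
    }

    fn handle_internal(&mut self, _cmd: InternalCommand) {
        // Periodic maintenance work the Coordinator schedules for itself.
    }
}
```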

> [!NOTE]
> There is active work in progress on changing this design because it doesn't
> lend itself well to scaling the amount of work that the adapter can do. We
> want to move more work "out to the edges", that is towards the front end
> (which can do work for different connections/sessions concurrently) and the
> controllers for the other components. See [Design Doc: A Small
> Coordinator](/doc/developer/design/20250717_a_small_coordinator_more_scalable_isolated_materialize.md)
> for details.

## Query Processing

Query processing follows these steps:

parsing & describing → logical planning → timestamp selection → optimization & physical planning → execution

### Parsing & Describing

The Materialize parser is a hand-written recursive-descent parser that we
forked from [sqlparser-rs](https://github.com/andygrove/sqlparser-rs). The main
entry point is
[parser.rs](https://github.com/MaterializeInc/materialize/blob/aba45afb39455b84cd64d63c3af50ffcae46fd83/src/sql-parser/src/parser.rs#L129).

Parsing happens completely in the front end and produces an AST of
[Statement](https://github.com/MaterializeInc/materialize/blob/7fab0d0e12a15799334854dcb3990997bc72e037/src/sql-parser/src/ast/defs/statement.rs#L43)
nodes. The AST only represents the syntax of the query and says nothing about
how, or whether, it can be executed.
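
To give a flavor of the recursive-descent style, here is a toy sketch, not the
actual sql-parser code: each grammar production is a method that consumes
tokens and returns an AST node.

```rust
/// A toy AST for `SELECT <column> FROM <table>`.
#[derive(Debug)]
struct Select {
    column: String,
    table: String,
}

struct Parser {
    tokens: Vec<String>,
    pos: usize,
}

impl Parser {
    fn new(sql: &str) -> Self {
        Parser {
            tokens: sql.split_whitespace().map(str::to_owned).collect(),
            pos: 0,
        }
    }

    /// The production for our one supported statement form.
    fn parse_select(&mut self) -> Result<Select, String> {
        self.expect_keyword("SELECT")?;
        let column = self.next_token()?;
        self.expect_keyword("FROM")?;
        let table = self.next_token()?;
        Ok(Select { column, table })
    }

    fn expect_keyword(&mut self, kw: &str) -> Result<(), String> {
        let tok = self.next_token()?;
        if tok.eq_ignore_ascii_case(kw) {
            Ok(())
        } else {
            Err(format!("expected {kw}, got {tok}"))
        }
    }

    fn next_token(&mut self) -> Result<String, String> {
        let tok = self
            .tokens
            .get(self.pos)
            .cloned()
            .ok_or("unexpected end of input")?;
        self.pos += 1;
        Ok(tok)
    }
}
```

Calling `Parser::new("SELECT a FROM t").parse_select()` would yield
`Select { column: "a", table: "t" }`; the real parser produces `Statement`
nodes for the full SQL grammar.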

Describing is the process of figuring out the result type of a statement. For
this, the front end needs access to the Catalog, which requires calling into
the adapter.

Both parsing and describing happen in the front end, with calls into the
adapter as needed. All further steps are orchestrated by the
adapter/Coordinator: the front end passes the AST as a command and will become
involved again when sending results back to the client.

### Logical Planning

Logical planning generates a
[Plan](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/sql/src/plan.rs#L133)
from the AST. Glossing over some details, this binds referenced names based on
the Catalog as of planning time and determines a logical execution plan. The
entrypoint is
[plan](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/sql/src/plan/statement.rs#L274).

One of the differences between the user (SQL) commands that are input to the
adapter and the commands for the rest of the system is that SQL statements use
user-defined, reusable names rather than immutable, non-reusable IDs. The
binding mentioned above turns those user-defined names into IDs.
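
A hypothetical sketch of what that binding amounts to; the real catalog, name
resolution, and ID types are far richer than this:

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for a stable, non-reusable ID.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GlobalId(u64);

/// Illustrative stand-in for the in-memory catalog.
struct Catalog {
    by_name: BTreeMap<String, GlobalId>,
}

impl Catalog {
    /// Resolve a user-defined name to the ID it refers to *right now*. If the
    /// object is later dropped and another one is created under the same
    /// name, the name binds to a different ID, while plans that already bound
    /// the old ID keep referring to the old object.
    fn resolve(&self, name: &str) -> Option<GlobalId> {
        self.by_name.get(name).copied()
    }
}
```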

As mentioned in the previous section, logical planning happens inside the
Coordinator.

### Timestamp Selection

Another difference between user commands and internal commands is the absence
of explicit timestamps in user (SQL) commands. Quoting from
[formalism.md](/doc/developer/platform/formalism.md#adapter):

> A `SELECT` statement does not indicate *when* it should be run, or against
> which version of its input data. The Adapter layer introduces timestamps to
> these commands, in a way that provides the appearance of sequential
> execution.

The internal interface that we use for determining timestamps is
[TimestampOracle](https://github.com/MaterializeInc/materialize/blob/cbaed5e677317feabd97048595e529bf3770547e/src/timestamp-oracle/src/lib.rs#L41).
The production implementation is an oracle that uses CRDB as the source of
truth. See [Design Doc: Distributed Timestamp
Oracle](/doc/developer/design/20230921_distributed_ts_oracle.md) for more
context.
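
The shape of that interface is roughly the following sketch; the real trait is
async, generic over the timestamp type, and has more operations, so treat the
names here as illustrative:

```rust
/// A simplified, synchronous sketch of a timestamp oracle.
trait TimestampOracle {
    type Timestamp: Ord + Copy;

    /// A timestamp at which reads (e.g. SELECTs) may execute. Monotonically
    /// non-decreasing across calls.
    fn read_ts(&self) -> Self::Timestamp;

    /// A timestamp to assign to a write. Greater than or equal to every
    /// previously handed-out read timestamp, which is what makes reads and
    /// writes appear to execute in a sequential order.
    fn write_ts(&mut self) -> Self::Timestamp;
}
```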

### Optimization & Physical Planning

There are multiple stages to optimization and different internal
representations, and different types of queries or created objects will
instantiate different optimizer pipelines. The optimization pipeline for SELECT
can be seen in [this
snippet](https://github.com/MaterializeInc/materialize/blob/bdf9573960cece48e7811db7ef777b973d355fe6/src/adapter/src/coord/sequencer/inner/peek.rs#L568).

The final result of these stages depends on the type of query we're
optimizing, but for certain types of SELECT and for permanent objects it will
include a
[DataflowDescription](https://github.com/MaterializeInc/materialize/blob/53acc93eee1bbbf418fde681389aec0419db8954/src/compute-types/src/dataflows.rs#L40),
which is a physical execution plan that can be given to the compute layer to
execute on a cluster.
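
Very roughly, optimization lowers the plan through successive representations,
applying transforms at each stage before producing the physical plan. The
following is only a schematic sketch; the stage and type names are
illustrative, not the actual optimizer API:

```rust
/// Illustrative stand-ins for the intermediate representations.
struct LogicalPlan;   // the plan produced by logical planning
struct OptimizedPlan; // after logical transforms (e.g. predicate pushdown)
struct PhysicalPlan;  // something like a DataflowDescription, ready for a cluster

/// Apply a sequence of logical rewrites.
fn optimize_logical(_plan: LogicalPlan) -> OptimizedPlan {
    OptimizedPlan
}

/// Decide physical details: which arrangements to build or reuse, how to
/// render the dataflow, and so on.
fn plan_physically(_plan: OptimizedPlan) -> PhysicalPlan {
    PhysicalPlan
}

fn optimize(plan: LogicalPlan) -> PhysicalPlan {
    plan_physically(optimize_logical(plan))
}
```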

TODO: Expand this section on optimization, if/when needed.

### Execution

For SELECT, which is internally called PEEK, there are three different
execution scenarios (sketched as an enum below):

- Fast-Path Peek: there is an existing arrangement (think index) in the cluster
  that we're targeting for the query. We can read the result from memory with
  minimal massaging and return it to the client (through the adapter).
- Slow-Path Peek: there is no existing arrangement that we can query. We have
  to create a dataflow in the targeted cluster that will ultimately fill an
  arrangement that we can read the result out of.
- Persist Fast-Path Peek: there is no existing arrangement, but the query has a
  shape that allows us to read a result right out of a storage collection (in
  persist).

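The following enum is only an illustrative summary of the three scenarios, not
the actual adapter types, which carry far more information:

```rust
/// Illustrative stand-in for a physical dataflow plan (a DataflowDescription
/// in the real code).
struct DataflowSketch;

/// The three ways a peek can be executed.
enum PeekPlan {
    /// Read directly out of an existing arrangement (index) on the cluster.
    FastPath { index_id: u64 },
    /// Build a temporary dataflow that fills an arrangement, read the result
    /// out of that arrangement, then tear the dataflow down.
    SlowPath { dataflow: DataflowSketch },
    /// Read straight out of a storage collection (in persist), without
    /// creating a dataflow.
    PersistFastPath { shard_name: String },
}
```
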
The result of optimization will indicate which of these scenarios we're in, and
the adapter then has to talk to the compute controller to implement execution
of the query.

Ultimately, for all of these scenarios the adapter will make a call into the
compute controller to read out a peek result. For slow-path peeks it will first
create the dataflow, but the functionality for reading out the result is the
same for fast path and slow path after that. The entrypoint for that is
[peek](https://github.com/MaterializeInc/materialize/blob/4a84902b5bed3bbd605d7d165fa6e0823c88c102/src/compute-client/src/controller.rs#L862).

The adapter passes the sending endpoint of a channel to the compute
controller, for sending back the results, and then sets up an async task that
reads from that channel, massages results, and sends them out as another
stream. The receiving end of that second stream is what the adapter returns in
an
[ExecutionResponse::SendingRowsStreaming](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/adapter/src/command.rs#L362)
to the adapter front end, which handles sending results back out to the pgwire
client.
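
A schematic of that plumbing, assuming tokio channels; the row types and the
massaging step are placeholders, and the real code additionally deals with
batching, errors, cancellation, and row limits:

```rust
use tokio::sync::mpsc;

/// Placeholder row types: what the cluster sends back vs. what we forward to
/// the front end.
struct ClusterRow;
struct ClientRow;

/// Placeholder for turning controller results into client-ready rows.
fn massage(_row: ClusterRow) -> ClientRow {
    ClientRow
}

/// Returns the sender to hand to the compute controller and the receiver that
/// gets wrapped in the response to the adapter front end.
fn wire_up_peek_results() -> (
    mpsc::UnboundedSender<ClusterRow>,
    mpsc::UnboundedReceiver<ClientRow>,
) {
    let (peek_tx, mut peek_rx) = mpsc::unbounded_channel::<ClusterRow>();
    let (client_tx, client_rx) = mpsc::unbounded_channel::<ClientRow>();

    // The async task that sits between the compute controller and the front
    // end: read, massage, forward.
    tokio::spawn(async move {
        while let Some(row) = peek_rx.recv().await {
            if client_tx.send(massage(row)).is_err() {
                break; // the front end / client went away
            }
        }
    });

    (peek_tx, client_rx)
}
```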

### Query Processing: A Flow Chart

TODO: Give a graphical representation of the stages that a query goes through,
likely as a chart that shows what components invoke what other components, in
the style of a Lamport diagram.

### Query Processing: Network Hops

TODO: Describe which of the steps involved in processing a query involve
network hops, talking to "external" systems, talking to the cluster, etc. Then
maybe also give a description of which steps are CPU heavy and which are not.

## Details

The description of query processing above mentioned some names and concepts
that we didn't explain further. We now explain those.

### Compute & Storage Controllers

TODO: Explain the compute and storage protocols, to describe how the commands
flow to the cluster and how the responses come back.

### Arrangements

TODO: Write up something about arrangements, how they are the basis for
sharing and ultimately the thing that can be queried from a cluster.

### Storage

TODO: Both storage and persist are mentioned above, so we should at least give
an overview.

### Persist

TODO: Both storage and persist are mentioned above, so we should at least give
an overview.