# Life of a Query

_Inspired by [CRDB: Life of a SQL
Query](https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/life_of_a_query.md)._

## Introduction

This document aims to provide an overview of the components involved in
processing a SQL query, following the code paths through the various layers:
the network protocol, session management, the adapter, query parsing, planning
and optimization, the interaction between the adapter layer and clusters, the
storage layer, and the delivery of results from clusters back through the
adapter layer to the client.

We don't discuss design decisions or historical evolution but focus on the
current code. We also only cover the common and most significant parts, and
omit details and special cases.

## PostgreSQL Client Protocol

Queries arrive at Materialize through the [PostgreSQL wire
protocol](https://www.postgresql.org/docs/current/protocol.html).

The protocol is implemented in the [pgwire
crate](https://github.com/MaterializeInc/materialize/tree/bea88fb10b02792bad73afd79a6350b381a7dcc3/src/pgwire).
Incoming connections are handled in
[handle_connection](https://github.com/MaterializeInc/materialize/blob/bea88fb10b02792bad73afd79a6350b381a7dcc3/src/pgwire/src/server.rs#L105),
which passes on to
[run](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/pgwire/src/protocol.rs#L127),
and ultimately we create a
[StateMachine](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/pgwire/src/protocol.rs#L478)
whose
[run](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/pgwire/src/protocol.rs#L512)
method handles the connection until completion.

The state machine has an adapter client
([SessionClient](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/adapter/src/client.rs#L481))
that it uses to accomplish most "interesting" things. It is a client to the
adapter layer, whose main component is the Coordinator. The work of processing
queries and maintaining session state is split between the "front end" state
machine that runs the pgwire protocol and the "back end" adapter layer that
handles talking to the other components.
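
To make the shape of this split concrete, here is a minimal sketch using
standard-library channels; all names and types are illustrative stand-ins,
not the actual pgwire or adapter APIs:

```rust
// Illustrative sketch only; these are stand-ins, not Materialize's types.
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// A command the front end sends to the adapter back end.
struct Execute {
    sql: String,
    // The back end answers on this per-request channel.
    response_tx: Sender<String>,
}

/// One `StateMachine` per client connection: it drives the pgwire protocol
/// and forwards "interesting" work to the adapter through its client handle.
struct StateMachine {
    adapter_tx: Sender<Execute>,
}

impl StateMachine {
    fn run(&self, queries: &[&str]) {
        for sql in queries {
            let (response_tx, response_rx) = channel();
            self.adapter_tx
                .send(Execute { sql: sql.to_string(), response_tx })
                .unwrap();
            // In the real system the result would be encoded as pgwire
            // DataRow messages and written back to the socket.
            println!("result: {}", response_rx.recv().unwrap());
        }
    }
}

fn main() {
    let (adapter_tx, adapter_rx) = channel::<Execute>();
    // The "back end": a single loop that owns all shared state.
    thread::spawn(move || {
        for cmd in adapter_rx {
            cmd.response_tx.send(format!("executed: {}", cmd.sql)).unwrap();
        }
    });
    StateMachine { adapter_tx }.run(&["SELECT 1"]);
}
```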

> [!NOTE]
> We will keep using the terms adapter front end and adapter back end below to
> mean, respectively, the state machine that is responsible for a single
> connection and can do work concurrently with other connections, and the
> adapter/Coordinator, which has to serialize work and is therefore more
> limited in the amount of work it can do concurrently.
>
> If it is clear from context, we will drop the adapter prefix.

## Adapter / Coordinator

The
[adapter](https://github.com/MaterializeInc/materialize/tree/f641a29d4aad9fcfeb2de535ff54706a1f1d38c4/src/adapter)
is so named because it translates user commands into commands that the other
internal systems understand. It currently understands SQL but is intended as a
generic component that isolates the other components from needing to know
details about the front-end commands (SQL).

A core component is the
[Coordinator](https://github.com/MaterializeInc/materialize/blob/9682eb11f31e2eb8005cb3eae687f81a1bce21bb/src/adapter/src/coord.rs#L1633).
It mediates access to durable environment state, kept in a durable catalog and
represented in memory by the
[Catalog](https://github.com/MaterializeInc/materialize/blob/9682eb11f31e2eb8005cb3eae687f81a1bce21bb/src/adapter/src/catalog.rs#L137),
and to the controllers for the storage and compute components (more on that
later).

It holds on to mutable state and clients to other components. Access and
changes to these are mediated by a "single-threaded" event loop that listens
to internal and external command channels. Commands are worked off
sequentially. Other parts can put in commands: for example, a front-end call
into the adapter is implemented as sending a command that eventually causes a
response to be sent back. The Coordinator will also periodically put in
commands for itself.
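
A minimal sketch of that loop's shape, with hypothetical names and a
standard-library channel standing in for the real async machinery in
src/adapter:

```rust
// Hypothetical sketch of the Coordinator's command loop, not the real code.
use std::sync::mpsc::Receiver;

enum Command {
    /// Sent by a front end; carries everything needed to answer the session.
    Execute { sql: String },
    /// An internal command the Coordinator periodically enqueues for itself.
    Maintenance,
    Shutdown,
}

struct Coordinator {
    // Catalog, controller clients, and other mutable state live here and
    // are only ever touched from this single loop, so no locking is needed.
}

impl Coordinator {
    fn serve(mut self, commands: Receiver<Command>) {
        // Commands are worked off strictly one at a time: anything slow done
        // here stalls every session, which is why work is being pushed out
        // to the front ends and controllers where possible.
        for command in commands {
            match command {
                Command::Execute { sql } => self.sequence(&sql),
                Command::Maintenance => { /* e.g. periodic upkeep */ }
                Command::Shutdown => break,
            }
        }
    }

    fn sequence(&mut self, _sql: &str) {
        // Plan, pick a timestamp, optimize, and hand off to a cluster.
    }
}

fn main() {
    let (tx, rx) = std::sync::mpsc::channel();
    tx.send(Command::Execute { sql: "SELECT 1".into() }).unwrap();
    tx.send(Command::Shutdown).unwrap();
    Coordinator {}.serve(rx);
}
```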

> [!NOTE]
> There is active work in progress on changing this design because it doesn't
> lend itself well to scaling the amount of work that the adapter can do. We
> want to move more work "out to the edges", that is, towards the front end
> (which can do work for different connections/sessions concurrently) and the
> controllers for the other components. See [Design Doc: A Small
> Coordinator](/doc/developer/design/20250717_a_small_coordinator_more_scalable_isolated_materialize.md)
> for details.

## Query Processing

Query processing follows these steps:

parsing & describing → logical planning → timestamp selection → optimization &
physical planning → execution

### Parsing & Describing

The Materialize parser is a hand-written recursive-descent parser that we
forked from [sqlparser-rs](https://github.com/andygrove/sqlparser-rs). The
main entry point is
[parser.rs](https://github.com/MaterializeInc/materialize/blob/aba45afb39455b84cd64d63c3af50ffcae46fd83/src/sql-parser/src/parser.rs#L129).

Parsing happens completely in the front end; it produces an AST of
[Statement](https://github.com/MaterializeInc/materialize/blob/7fab0d0e12a15799334854dcb3990997bc72e037/src/sql-parser/src/ast/defs/statement.rs#L43)
nodes. The AST only represents the syntax of the query and says nothing about
how, or whether, it can be executed.

Describing is the process of figuring out the result type of a statement. For
this, the front end needs access to the Catalog, for which it has to call into
the adapter.

Both parsing and describing happen in the front end, with calls into the
adapter as needed. All further steps are orchestrated by the
adapter/Coordinator: the front end passes the AST as a command and will become
involved again when sending results back to the client.
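
The key property of this split is that parsing is a pure function of the query
text, while describing needs name resolution and therefore the Catalog. A toy
illustration, with stand-in types rather than the real Statement and Catalog:

```rust
// Toy stand-ins; the real types live in the sql-parser crate and adapter.
use std::collections::HashMap;

struct Statement(String); // syntax only: says nothing about executability

/// Maps object names to their column types.
type Catalog = HashMap<String, Vec<String>>;

/// Parsing needs nothing but the SQL text, so it can run entirely in the
/// front end, concurrently for every connection.
fn parse(sql: &str) -> Result<Statement, String> {
    if sql.trim().is_empty() {
        return Err("empty statement".into());
    }
    Ok(Statement(sql.to_string()))
}

/// Describing the result type requires resolving names against the Catalog,
/// which is why the front end must call into the adapter for this step.
fn describe(stmt: &Statement, catalog: &Catalog) -> Result<Vec<String>, String> {
    // Hugely simplified: pretend the statement is `SELECT * FROM <name>`.
    let name = stmt.0.rsplit(' ').next().unwrap_or_default();
    catalog
        .get(name)
        .cloned()
        .ok_or_else(|| format!("unknown object: {name}"))
}

fn main() {
    let catalog = Catalog::from([("t".into(), vec!["a int".into()])]);
    let stmt = parse("SELECT * FROM t").unwrap();
    println!("{:?}", describe(&stmt, &catalog));
}
```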

### Logical Planning

Logical planning generates a
[Plan](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/sql/src/plan.rs#L133)
from the AST. Glossing over some details, this binds referenced names based on
the Catalog as of planning time and determines a logical execution plan. The
entry point is
[plan](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/sql/src/plan/statement.rs#L274).

One of the differences between the user (SQL) commands that are input to the
adapter and the commands for the rest of the system is that SQL statements use
user-defined, reusable names, while the rest of the system uses immutable,
non-reusable IDs. The binding mentioned above turns those user-defined names
into IDs.
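
A toy example of what binding buys us (the GlobalId-style ID is the real
concept; everything else here is made up): once names are resolved, later
layers are unaffected by objects being renamed or dropped and recreated under
the same name:

```rust
use std::collections::HashMap;

/// Immutable, never-reused identifier for a catalog object.
#[derive(Clone, Copy, Debug, PartialEq)]
struct GlobalId(u64);

/// Resolves a mutable, reusable SQL name against the Catalog as of planning
/// time. Everything downstream of planning refers to the stable ID only.
fn bind(name: &str, catalog: &HashMap<String, GlobalId>) -> Result<GlobalId, String> {
    catalog
        .get(name)
        .copied()
        .ok_or_else(|| format!("unknown catalog item: {name}"))
}

fn main() {
    let mut catalog = HashMap::from([("t".to_string(), GlobalId(1))]);
    let id = bind("t", &catalog).unwrap();
    // Dropping and recreating `t` reuses the name but never the ID.
    catalog.insert("t".to_string(), GlobalId(2));
    assert_eq!(id, GlobalId(1));
}
```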

As mentioned in the previous section, logical planning happens inside the
Coordinator.

### Timestamp Selection

Another difference between user commands and internal commands is the absence
of explicit timestamps in user (SQL) commands. Quoting from
[formalism.md](/doc/developer/platform/formalism.md#adapter):

> A `SELECT` statement does not indicate *when* it should be run, or against
> which version of its input data. The Adapter layer introduces timestamps to
> these commands, in a way that provides the appearance of sequential
> execution.

The internal interface that we use for determining timestamps is
[TimestampOracle](https://github.com/MaterializeInc/materialize/blob/cbaed5e677317feabd97048595e529bf3770547e/src/timestamp-oracle/src/lib.rs#L41).
The production implementation is an oracle that uses CRDB as the source of
truth. See [Design Doc: Distributed Timestamp
Oracle](/doc/developer/design/20230921_distributed_ts_oracle.md) for more
context.
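
A drastically simplified, synchronous sketch of the oracle idea; the real
trait is async and richer, and the monotonic counter below stands in for
state kept in CRDB:

```rust
// Simplified sketch; not the actual TimestampOracle trait.
trait Oracle {
    /// A timestamp at which a read sees all previously completed writes.
    fn read_ts(&self) -> u64;
    /// A fresh timestamp that orders a write after all earlier writes.
    fn write_ts(&mut self) -> u64;
}

/// In-memory stand-in: a single monotonic counter. The production oracle
/// keeps this state in CRDB so that multiple processes agree on it.
struct CounterOracle {
    ts: u64,
}

impl Oracle for CounterOracle {
    fn read_ts(&self) -> u64 {
        self.ts
    }

    fn write_ts(&mut self) -> u64 {
        self.ts += 1;
        self.ts
    }
}

fn main() {
    let mut oracle = CounterOracle { ts: 0 };
    let w = oracle.write_ts();
    // Any read that starts after the write completes sees it: this is what
    // gives SELECTs the appearance of sequential execution.
    assert!(oracle.read_ts() >= w);
}
```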

### Optimization & Physical Planning

There are multiple stages to optimization and different internal
representations, and different types of queries or created objects will
instantiate different optimizer pipelines. The optimization pipeline for
SELECT can be seen in [this
snippet](https://github.com/MaterializeInc/materialize/blob/bdf9573960cece48e7811db7ef777b973d355fe6/src/adapter/src/coord/sequencer/inner/peek.rs#L568).

The final result of these stages will depend on the type of query we're
optimizing, but for certain types of SELECT and for permanent objects it will
include a
[DataflowDescription](https://github.com/MaterializeInc/materialize/blob/53acc93eee1bbbf418fde681389aec0419db8954/src/compute-types/src/dataflows.rs#L40),
which is a physical execution plan that can be given to the compute layer to
execute on a cluster.
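
Schematically, the pipeline lowers the logical plan through a series of
representations before producing that physical plan. The sketch below uses
the HIR/MIR stage names from the codebase, but the types and functions are
empty placeholders:

```rust
// Schematic only: stage names mirror Materialize's HIR/MIR/LIR split, but
// the types and functions here are empty placeholders.
struct Hir;                 // output of logical planning, close to SQL
struct Mir;                 // relational IR that most transforms work on
struct DataflowDescription; // physical plan, shippable to a cluster

fn lower(_hir: Hir) -> Mir {
    Mir // HIR -> MIR lowering
}

fn transform(mir: Mir) -> Mir {
    mir // e.g. predicate pushdown, join planning
}

fn finalize(_mir: Mir) -> DataflowDescription {
    DataflowDescription // MIR -> physical DataflowDescription
}

fn optimize_select(hir: Hir) -> DataflowDescription {
    finalize(transform(lower(hir)))
}

fn main() {
    let _plan = optimize_select(Hir);
}
```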

TODO: Expand this section on optimization, if/when needed.

### Execution

For SELECT, which is internally called PEEK, there are three different
execution scenarios:

- Fast-Path Peek: there is an existing arrangement (think index) in the
  cluster that we're targeting for the query. We can read the result from
  memory with minimal massaging and return it to the client (through the
  adapter).
- Slow-Path Peek: there is no existing arrangement that we can query. We have
  to create a dataflow in the targeted cluster that will ultimately fill an
  arrangement that we can read the result out of.
- Persist Fast-Path Peek: there is no existing arrangement, but the query has
  a shape that allows us to read a result right out of a storage collection
  (in persist).
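
One way to picture the outcome of that decision is as a three-variant plan
type; the enum below is purely illustrative, not the actual compute-client
types:

```rust
// Illustrative only; not the actual Materialize plan types.
struct DataflowDescription; // stand-in for the physical plan from above

enum PeekPlan {
    /// Read from an existing arrangement already maintained on the cluster.
    FastPath { index_id: u64 },
    /// Install a dataflow first; read from the arrangement it fills.
    SlowPath { dataflow: DataflowDescription },
    /// Read the result straight out of a storage collection in persist,
    /// with no dataflow needed.
    PersistFastPath { shard_id: u64 },
}

fn main() {
    // The optimizer's result tells the adapter which variant applies.
    let _plan = PeekPlan::FastPath { index_id: 42 };
}
```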

The result of optimization will indicate which of these scenarios we're in,
and the adapter will then have to talk to the compute controller to implement
execution of the query.

Ultimately, for all of these scenarios the adapter will make a call into the
compute controller to read out a peek result. For slow-path peeks it will
first create the dataflow, but after that the functionality for reading out
the result is the same for fast path and slow path. The entry point for that
is
[peek](https://github.com/MaterializeInc/materialize/blob/4a84902b5bed3bbd605d7d165fa6e0823c88c102/src/compute-client/src/controller.rs#L862).

The adapter will pass the sending endpoint of a channel to the compute
controller, for sending back the results, and then set up an async task that
reads from that channel, massages results, and sends them out as another
stream. The receiving end of that second stream is what the adapter returns,
in an
[ExecutionResponse::SendingRowsStreaming](https://github.com/MaterializeInc/materialize/blob/b3475aaaf96f9fae4a0d68cbb5202c224d9ce15b/src/adapter/src/command.rs#L362),
to the adapter front end, which handles sending results back out to the pgwire
client.
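
A rough sketch of that plumbing, with threads and synchronous channels
standing in for the async tasks and streams, and all names hypothetical:

```rust
// Rough sketch of the result-forwarding plumbing; hypothetical names, with
// threads and sync channels standing in for async tasks and streams.
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

type Row = Vec<String>;

/// The compute controller gets the sender half and pushes raw peek results.
fn compute_controller_peek(raw_tx: Sender<Row>) {
    thread::spawn(move || {
        for i in 0..3 {
            raw_tx.send(vec![format!("row {i}")]).unwrap();
        }
    });
}

/// The adapter wires up a task that reads raw results, massages them, and
/// forwards them on a second stream whose receiving end goes back to the
/// front end (cf. ExecutionResponse::SendingRowsStreaming).
fn execute_peek() -> Receiver<Row> {
    let (raw_tx, raw_rx) = channel::<Row>();
    let (rows_tx, rows_rx) = channel::<Row>();
    compute_controller_peek(raw_tx);
    thread::spawn(move || {
        for mut row in raw_rx {
            row.push("massaged".into()); // stand-in for result massaging
            rows_tx.send(row).unwrap();
        }
    });
    rows_rx // handed to the front end, which encodes rows for the client
}

fn main() {
    for row in execute_peek() {
        println!("{row:?}");
    }
}
```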

### Query Processing: A Flow Chart

TODO: Give a graphical representation of the stages that a query goes through,
likely as a chart that shows which components invoke which other components,
in the style of a Lamport diagram.

### Query Processing: Network Hops

TODO: Describe which of the steps involved in processing a query involve
network hops, talking to "external" systems, talking to the cluster, etc. Then
maybe also give a description of which steps are CPU-heavy and which are not.

## Details

The above description of query processing mentioned some names and concepts
that we didn't explain further. We explain those now.

## Compute & Storage Controllers

TODO: Should maybe explain the compute and storage protocol, to really
describe how the commands flow to the cluster and how the responses come back.

## Arrangements

TODO: Write up something about arrangements, how they are the basis for
sharing and ultimately the thing that can be queried from a cluster.

## Storage

TODO: Both storage and persist are mentioned above, so we should at least give
an overview.

## Persist

TODO: Both storage and persist are mentioned above, so we should at least give
an overview.
