
Commit c6a3a8c: fill out more details
1 parent: 19857bd

1 file changed: doc/developer/life-of-a-query.md (+72, -10)
@@ -204,26 +204,88 @@ maybe also give a description of which steps are CPU heavy and which are not.
 
 ## Details
 
-The above description of query processing has mentioned some names and concepts
-that are involved in query processing that we didn't explain further. We now
-explain those.
+The above description of query processing mentioned some components and
+concepts that we didn't explain further. We now explain those.

 ## Compute & Storage Controllers
 
-Should maybe explain the compute and storage protocol, to really describe how
-the commands flow to the cluster and how the responses come back.
+The adapter interacts with clusters and storage collections through two main
+controllers: the
+[ComputeController](https://github.com/MaterializeInc/materialize/blob/main/src/compute-client/src/controller.rs)
+and the
+[StorageController](https://github.com/MaterializeInc/materialize/blob/main/src/storage-controller/src/lib.rs).
+These controllers act as intermediaries that translate adapter commands into
+cluster-specific operations and manage the lifecycle of compute and storage
+resources.
+
+The ComputeController manages compute instances (clusters) and the dataflows
+running on them. It handles the creation and maintenance of indexes,
+materialized views, and dataflows, talking to cluster replicas via the
+compute protocol.
+
+The StorageController manages storage collections, including sources, tables,
+and sinks. For ingestion from external systems it needs to install computation
+on a cluster. As with the compute controller, communication with the storage
+parts of a cluster replica happens via the storage protocol.
+
+Both controllers maintain read and write capabilities for their respective
+resources, coordinate compaction policies, and ensure that data remains
+accessible as long as needed while allowing garbage collection when possible.
+
+TODO: Talk about the protocols to explain how query execution flows through
+the commands and responses.
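To make the controller's role as an intermediary concrete, here is a hypothetical, heavily simplified sketch of a controller translating an adapter-level request into protocol commands and fanning them out to every replica of a cluster. All names here (`ComputeCommand`, `create_dataflow`, `replica_queues`) are illustrative stand-ins, not the actual ComputeController API.

```rust
/// A command in a (simplified, hypothetical) compute protocol.
#[derive(Debug, Clone, PartialEq)]
enum ComputeCommand {
    CreateDataflow { id: u64, plan: String },
    AllowCompaction { id: u64, frontier: u64 },
}

/// A toy controller that broadcasts commands to every replica of a cluster.
struct ComputeController {
    /// One outgoing command queue per replica.
    replica_queues: Vec<Vec<ComputeCommand>>,
}

impl ComputeController {
    fn new(num_replicas: usize) -> Self {
        Self { replica_queues: vec![Vec::new(); num_replicas] }
    }

    /// Translate an adapter request into a protocol command and send it to
    /// all replicas, which each run the same dataflow for fault tolerance.
    fn create_dataflow(&mut self, id: u64, plan: &str) {
        let cmd = ComputeCommand::CreateDataflow { id, plan: plan.to_string() };
        for queue in &mut self.replica_queues {
            queue.push(cmd.clone());
        }
    }
}

fn main() {
    let mut controller = ComputeController::new(2);
    controller.create_dataflow(1, "materialized view plan");
    // Every replica receives the same command stream.
    assert_eq!(controller.replica_queues[0], controller.replica_queues[1]);
    assert_eq!(controller.replica_queues[0].len(), 1);
}
```

The point of the sketch is only the shape of the flow: one adapter-level intent becomes identical per-replica protocol commands, and responses (not shown) are reconciled by the controller before being reported back to the adapter.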

 ## Arrangements
 
-TODO: Write up something about arrangements, how it's the basis for sharing
-and ultimately the thing that can be queried from a cluster.
+Arrangements are multiversioned indexes that serve as the foundation for data
+sharing and efficient querying in Materialize. As described in the
+[arrangements documentation](/doc/developer/arrangements.md), an arrangement is
+an indexed representation of a stream of update triples `(data, time, diff)`,
+organized by key for efficient lookups.
+
+Arrangements are required by many differential dataflow operators. The `join`
+operator needs both of its inputs to be arrangements indexed by the join keys,
+while the `reduce` operator requires both input and output arrangements. This
+means that a single SQL query can create multiple arrangements as it is
+compiled into a dataflow graph.
+
+The key benefit of arrangements is sharing: multiple operators can reuse the
+same arrangement if they need data indexed by the same key. This sharing is
+especially common with indexes, materialized sources, and materialized views,
+which publish their arrangements for reuse across dataflows.
+
+Arrangements store only distinct `(key, value, time)` combinations and undergo
+both logical compaction (forgetting historical detail that no reader needs)
+and physical compaction (consolidating space). This makes their memory usage
+proportional to the currently accumulated state rather than to the total
+volume of updates processed.
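As a toy illustration (not the differential dataflow implementation), an arrangement can be modeled as a key-indexed map from `(value, time)` to an accumulated `diff`, where updates that cancel out are consolidated away. The struct and method names below are invented for the sketch.

```rust
use std::collections::BTreeMap;

/// Toy arrangement: key -> ((value, time) -> summed diff).
struct Arrangement {
    index: BTreeMap<String, BTreeMap<(String, u64), i64>>,
}

impl Arrangement {
    fn new() -> Self {
        Self { index: BTreeMap::new() }
    }

    /// Insert an update triple, consolidating diffs for identical
    /// (key, value, time) combinations, as arrangements do.
    fn insert(&mut self, key: &str, value: &str, time: u64, diff: i64) {
        let inner = self.index.entry(key.to_string()).or_default();
        let k = (value.to_string(), time);
        let total = {
            let slot = inner.entry(k.clone()).or_insert(0);
            *slot += diff;
            *slot
        };
        if total == 0 {
            // Physical compaction: cancelled updates take no space.
            inner.remove(&k);
        }
    }

    /// Efficient by-key lookup, the operation that indexes and joins rely on.
    fn lookup(&self, key: &str) -> Vec<(&str, u64, i64)> {
        self.index
            .get(key)
            .into_iter()
            .flat_map(|m| m.iter().map(|((v, t), d)| (v.as_str(), *t, *d)))
            .collect()
    }
}

fn main() {
    let mut arr = Arrangement::new();
    arr.insert("k1", "a", 1, 1);
    arr.insert("k1", "a", 2, -1); // retraction at a later time
    arr.insert("k1", "a", 1, -1); // cancels the first update entirely
    // Only the surviving distinct (value, time, diff) remains.
    assert_eq!(arr.lookup("k1"), vec![("a", 2, -1)]);
}
```

This is also why memory usage tracks accumulated state rather than update volume: once updates for a `(key, value, time)` sum to zero, nothing remains to store.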

 ## Storage
 
-TODO: Both storage and persist are mentioned above, so we should at least
-give an overview.
+TODO: Flesh out this section.

 ## Persist
 
-TODO: Both storage and persist are mentioned above, so we should at least
-give an overview.
+Persist is Materialize's durable storage implementation that provides definite
+Time-Varying Collections, as described in the [persist design
+document](/doc/developer/design/20220330_persist.md). It serves as the
+foundation for the storage layer.
+
+The core abstraction is a "shard": a durable
+[TVC](/doc/developer/platform/formalism.md#in-a-nutshell) that can be written
+to and read from concurrently. Persist uses a rich client model in which
+readers and writers interact directly with the underlying blob storage
+(typically S3) while coordinating through a consensus system for metadata
+operations.
+
+Persist is built on two key primitives: `Blob` (a durable key-value store) and
+`Consensus` (a linearizable log). The blob storage holds the actual data in
+immutable batches, while consensus maintains a state machine that tracks
+metadata such as shard frontiers, active readers and writers, and batch
+locations.
+
+Key features include automatic compaction to bound storage costs and
+horizontal read scalability.
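A hypothetical in-memory sketch of the two primitives may help. Here `Blob` is just a key-value map of immutable batches and `Consensus` is a sequence of states advanced by compare-and-set; the field and method names are simplified stand-ins, not the real persist API, which backs these with S3 and a database.

```rust
use std::collections::HashMap;

/// Toy stand-in for persist's `Blob`: a durable key-value store holding
/// immutable, encoded data batches.
struct Blob {
    data: HashMap<String, Vec<u8>>,
}

/// Toy stand-in for persist's `Consensus`: a linearizable sequence of
/// metadata states, advanced by compare-and-set on a sequence number.
struct Consensus {
    seqno: u64,
    state: String,
}

impl Consensus {
    /// Install `new_state` only if the caller saw the latest seqno;
    /// otherwise return the current seqno so the caller can re-read and retry.
    fn compare_and_set(&mut self, expected: u64, new_state: String) -> Result<u64, u64> {
        if self.seqno == expected {
            self.seqno += 1;
            self.state = new_state;
            Ok(self.seqno)
        } else {
            Err(self.seqno)
        }
    }
}

fn main() {
    let mut blob = Blob { data: HashMap::new() };
    let mut consensus = Consensus { seqno: 0, state: String::new() };

    // A writer uploads an immutable batch directly to blob storage...
    blob.data.insert("batch-0001".into(), b"encoded updates".to_vec());

    // ...then links it into the shard's metadata via consensus.
    let seqno = consensus
        .compare_and_set(0, "upper: 10, batches: [batch-0001]".into())
        .expect("no concurrent writer");
    assert_eq!(seqno, 1);

    // A stale writer (still at seqno 0) loses the race and must retry.
    assert_eq!(consensus.compare_and_set(0, "stale".into()), Err(1));
}
```

The design choice this illustrates is that bulk data never flows through the consensus system: batches go straight to cheap blob storage, and only small metadata updates contend on the linearizable log.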
