
Commit af78f2c

Fix few errors - youkaichao review
Signed-off-by: Aleksa Gordic <[email protected]>
1 parent 3476bef commit af78f2c

File tree

1 file changed: +8 -9 lines changed


_posts/2025-09-05-anatomy-of-vllm.md

Lines changed: 8 additions & 9 deletions
@@ -94,7 +94,7 @@ Engine core itself is made up of several sub components:
 <ol type="a">
 <li>policy setting - it can be either <b>FCFS</b> (first come first served) or <b>priority</b> (higher priority requests are served first)</li>
 <li><code>waiting</code> and <code>running</code> queues</li>
-<li>KV cache manager - the heart of paged attention [[3]](#ref-3)</li>
+<li>KV cache manager - the heart of paged attention <a href="#ref-3">[3]</a></li>

 The KV-cache manager maintains a <code>free_block_queue</code> - a pool of available KV-cache blocks (often on the order of hundreds of thousands, depending on VRAM size and block size). During paged attention, the blocks serve as the indexing structure that map tokens to their computed KV cache blocks.

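The free-block pool and block-indexing idea described in this hunk can be sketched in a few lines of Python. This is an illustrative toy, not vLLM's actual KV-cache manager: `FreeBlockPool`, `slot_for_token`, and the constants are made up for the example.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class FreeBlockPool:
    """Toy stand-in for the KV-cache manager's free_block_queue."""
    def __init__(self, num_blocks: int):
        self.free = deque(range(num_blocks))  # pre-allocated physical block IDs

    def allocate(self) -> int:
        return self.free.popleft()            # hand out the next free block

    def release(self, block_id: int) -> None:
        self.free.append(block_id)            # return a block to the pool

def slot_for_token(block_table: list[int], token_pos: int) -> tuple[int, int]:
    """Map a request-local token position to (physical block ID, offset within block)."""
    return block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

pool = FreeBlockPool(num_blocks=4)
block_table = [pool.allocate(), pool.allocate()]  # this request owns 2 blocks (32 tokens)
print(slot_for_token(block_table, 20))            # token 20 -> (block_table[1], offset 4)
```

A real manager also handles prefix-cache reuse, reference counting, and eviction, which the toy above skips.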
@@ -107,7 +107,7 @@ The KV-cache manager maintains a <code>free_block_queue</code> - a pool of avail

 > [!NOTE]
 > Block size for a standard transformer layer (non-MLA [[4]](#ref-4)) is computed as follows:
-> 2 * <code>block_size</code> (default=16) * <code>num_kv_heads</code> * <code>head_size</code> * <code>dtype_num_bytes</code> (2 for bf16)
+> 2 (key/value) * <code>block_size</code> (default=16) * <code>num_kv_heads</code> * <code>head_size</code> * <code>dtype_num_bytes</code> (e.g. 2 for bf16)

 During model executor construction, a <code>Worker</code> object is created, and three key procedures are executed. (Later, with <code>MultiProcExecutor</code>, these same procedures run independently on each worker process across different GPUs.)

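To make the block-size formula in the updated NOTE concrete, here is the arithmetic for one set of assumed example values (num_kv_heads = 8, head_size = 128, bf16); the numbers are illustrative, not taken from the post.

```python
# Per-layer KV-cache bytes for ONE block of a standard (non-MLA) transformer layer:
# 2 (key and value) * block_size * num_kv_heads * head_size * dtype_num_bytes
block_size = 16        # tokens per block (vLLM default)
num_kv_heads = 8       # example value, model-dependent
head_size = 128        # example value, model-dependent
dtype_num_bytes = 2    # bf16

bytes_per_block_per_layer = 2 * block_size * num_kv_heads * head_size * dtype_num_bytes
print(bytes_per_block_per_layer)  # 65536 bytes = 64 KiB per 16-token block, per layer
```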
@@ -129,7 +129,7 @@ During model executor construction, a <code>Worker</code> object is created, and
 * Run a dummy/profiling forward pass and take a GPU memory snapshot to compute how many KV cache blocks fit in available VRAM
 * Allocate, reshape and bind KV cache tensors to attention layers
 * Prepare attention metadata (e.g. set the backend to FlashAttention) later consumed by kernels during the fwd pass
-* Unless <code>--enforce-eager</code> is provided, for each of warmup batch sizes do a dummy run and capture CUDA graphs. CUDA graphs record the whole sequence of GPU work into a DAG. Later during fwd pass we launch/reply pre-baked graphs and cut on kernel launch overhead and thus improve latency.
+* Unless <code>--enforce-eager</code> is provided, for each of warmup batch sizes do a dummy run and capture CUDA graphs. CUDA graphs record the whole sequence of GPU work into a DAG. Later during fwd pass we launch/replay pre-baked graphs and cut on kernel launch overhead and thus improve latency.

 I've abstracted away many low-level details here — but these are the core pieces I'll introduce now, since I'll reference them repeatedly in the following sections.

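The capture/replay bullet above follows the generic PyTorch CUDA-graph pattern. A minimal sketch of that pattern, using plain `torch.cuda.CUDAGraph` with an arbitrary toy model rather than vLLM's own capture code:

```python
import torch

# Requires a CUDA device; this is the generic PyTorch capture/replay recipe.
model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(8, 1024, device="cuda")   # fixed input buffer

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole forward pass into a CUDA graph (the "pre-baked" DAG).
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy fresh data into the static buffer, then launch the baked graph.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()             # static_output now holds results for the new input
torch.cuda.synchronize()
```

The key property is that inputs and outputs live in fixed buffers: you copy new data in, replay the graph, and avoid per-kernel launch overhead.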
@@ -229,7 +229,7 @@ Here are the main steps:
 Forward-pass step itself has two execution modes:

 1. <b>Eager mode</b> — run the standard PyTorch forward pass when eager execution is enabled.
-2. <b>"Captured" mode</b> — execute/reply a pre-captured CUDA Graph when eager is not enforced (remember we captured these during engine construction in the initialize KV cache procedure).
+2. <b>"Captured" mode</b> — execute/replay a pre-captured CUDA Graph when eager is not enforced (remember we captured these during engine construction in the initialize KV cache procedure).

 Here is a concrete example that should make continuous batching and paged attention clear:

@@ -316,7 +316,8 @@ During the first <code>generate</code> call, in the scheduling stage, inside <co

 1. This function splits the <code>long_prefix + prompts[0]</code> into 16-token chunks.
 2. For each complete chunk, it computes a hash (using either the built-in hash or SHA-256, which is slower but has fewer collisions). The hash combines the previous block's hash, the current tokens, and optional metadata.
-> [!NOTE] optional metadata includes: MM hash, LoRA ID, cache salt (injected into hash of the first block ensures only requests with this cache salt can reuse blocks).
+> [!NOTE]
+> optional metadata includes: MM hash, LoRA ID, cache salt (injected into hash of the first block ensures only requests with this cache salt can reuse blocks).
 3. Each result is stored as a <code>BlockHash</code> object containing both the hash and its token IDs. We return a list of block hashes.

 The list is stored in <code>self.req_to_block_hashes[request_id]</code>.
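A simplified sketch of the chunk-and-chain hashing described in steps 1-3, assuming SHA-256 and a made-up `BlockHash` dataclass; vLLM's real `BlockHash` and hashing helpers are more involved (e.g. they fold in the optional metadata listed in the NOTE).

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per chunk, matching the post's example

@dataclass
class BlockHash:
    hash_value: bytes
    token_ids: tuple[int, ...]

def hash_blocks(token_ids: list[int], extra_keys: tuple = ()) -> list[BlockHash]:
    """Hash each full 16-token chunk, chaining in the previous block's hash."""
    block_hashes: list[BlockHash] = []
    parent = b""
    for start in range(0, len(token_ids) // BLOCK_SIZE * BLOCK_SIZE, BLOCK_SIZE):
        chunk = tuple(token_ids[start:start + BLOCK_SIZE])
        h = hashlib.sha256(repr((parent, chunk, extra_keys)).encode()).digest()
        block_hashes.append(BlockHash(h, chunk))
        parent = h  # chaining makes each hash depend on the whole prefix
    return block_hashes

# Two requests sharing the same prefix produce identical leading block hashes,
# which is what lets the KV-cache manager reuse those blocks.
print(len(hash_blocks(list(range(40)))))  # 2 full blocks; the trailing 8 tokens are not hashed
```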
@@ -423,7 +424,7 @@ Here is an even simpler example with vocab_size = 8 and 8-bit integers (for thos

 <p align="center">
 <picture>
-<img src="/assets/figures/2025-vllm-anatomy/fsm.png" width="100%">
+<img src="/assets/figures/2025-vllm-anatomy/fsm2.png" width="100%">
 </picture><br>
 <b>Figure 10</b>: Toy example
 </p>
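For readers who want the figure's toy setup in code: with vocab_size = 8, a single 8-bit integer per FSM state is enough to mark which tokens are allowed next. An illustrative sketch of applying such a bitmask to logits, not xgrammar's or vLLM's implementation:

```python
import torch

VOCAB_SIZE = 8

def apply_bitmask(logits: torch.Tensor, mask: int) -> torch.Tensor:
    """Set logits of disallowed tokens to -inf; bit i of `mask` marks token i as allowed."""
    allowed = torch.tensor([(mask >> i) & 1 for i in range(VOCAB_SIZE)], dtype=torch.bool)
    return logits.masked_fill(~allowed, float("-inf"))

logits = torch.randn(VOCAB_SIZE)
masked = apply_bitmask(logits, mask=0b00000110)  # only tokens 1 and 2 are allowed
print(masked.argmax())  # sampling/argmax can now only pick token 1 or 2
```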
@@ -650,7 +651,6 @@ Here is a visual example:

 ## From UniprocExecutor to MultiProcExecutor

-From UniprocExecutor to MultiProcExecutor
 With the core techniques in place, we can now talk about scaling up.

 Suppose your model weights no longer fit into a single GPU's VRAM.
@@ -944,7 +944,6 @@ vllm bench latency
 --input-tokens 32
 --output-tokens 128
 --batch-size 8
-}'
 ```

 > [!NOTE]
@@ -959,7 +958,7 @@ We began with the basic engine core (<code>UniprocExecutor</code>), added advanc

 vLLM also includes specialized handling that I've skipped. E.g.:

-* <b>Custom hardware backends</b>: TPUs, AWS Neuron (Trainium/Inferentia), etc.
+* <b>Diverse hardware backends</b>: TPUs, AWS Neuron (Trainium/Inferentia), etc.
 * <b>Architectures/techniques</b>: <code>MLA</code>, <code>MoE</code>, encoder-decoder (e.g., Whisper), pooling/embedding models, <code>EPLB</code>, <code>m-RoPE</code>, <code>LoRA</code>, <code>ALiBi</code>, attention-free variants, sliding-window attention, multimodal LMs, and state-space models (e.g., Mamba/Mamba-2, Jamba)
 * <b>TP/PP/SP</b>
 * <b>Hybrid KV-cache logic</b> (Jenga), more complex sampling methods like beam sampling, and more
