_posts/2025-09-05-anatomy-of-vllm.md
+8 -9 (8 additions, 9 deletions)
@@ -94,7 +94,7 @@ Engine core itself is made up of several sub components:
<ol type="a">
<li>policy setting - it can be either <b>FCFS</b> (first come first served) or <b>priority</b> (higher priority requests are served first)</li>
<li><code>waiting</code> and <code>running</code> queues</li>
-<li>KV cache manager - the heart of paged attention [[3]](#ref-3)</li>
+<li>KV cache manager - the heart of paged attention <a href="#ref-3">[3]</a></li>
The KV-cache manager maintains a <code>free_block_queue</code> - a pool of available KV-cache blocks (often on the order of hundreds of thousands, depending on VRAM size and block size). During paged attention, the blocks serve as the indexing structure that maps tokens to their computed KV-cache blocks.
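To make that indexing concrete, here is a minimal Python sketch of a free-block pool plus a per-request block table. The names (<code>KVBlockPool</code>, <code>allocate</code>, <code>free</code>, <code>block_table</code>) are illustrative rather than vLLM's actual API; only <code>free_block_queue</code> mirrors the name above.

```python
# Minimal sketch of a free-block pool and per-request block table
# (illustrative names, not vLLM's actual classes).
from collections import deque

BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative)

class KVBlockPool:
    def __init__(self, num_blocks: int):
        # Every physical block starts out free.
        self.free_block_queue = deque(range(num_blocks))
        # request_id -> list of physical block ids, in token order.
        self.block_table: dict[str, list[int]] = {}

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        """Reserve enough blocks to hold num_tokens KV entries."""
        num_needed = -(-num_tokens // BLOCK_SIZE)  # ceil division
        blocks = [self.free_block_queue.popleft() for _ in range(num_needed)]
        self.block_table.setdefault(request_id, []).extend(blocks)
        return blocks

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        for block in self.block_table.pop(request_id, []):
            self.free_block_queue.append(block)

pool = KVBlockPool(num_blocks=200_000)
print(pool.allocate("req-0", num_tokens=45))  # 3 blocks for a 45-token prompt
pool.free("req-0")
```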
@@ -107,7 +107,7 @@ The KV-cache manager maintains a <code>free_block_queue</code> - a pool of avail
> [!NOTE]
> Block size for a standard transformer layer (non-MLA [[4]](#ref-4)) is computed as follows:
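A minimal sketch of that computation, assuming the usual layout of one K and one V tensor per attention layer; the concrete sizes below are only an example:

```python
# Illustrative per-layer KV-cache block size in bytes, assuming the
# standard K/V layout. All concrete numbers here are example values.
block_size = 16        # tokens stored per block
num_kv_heads = 8       # KV heads (after GQA)
head_size = 128        # dimension per head
dtype_num_bytes = 2    # bf16 / fp16

bytes_per_block = 2 * block_size * num_kv_heads * head_size * dtype_num_bytes
#                 ^ factor 2: one K tensor and one V tensor
print(bytes_per_block)  # 65536 bytes = 64 KiB per layer per block
```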
During model executor construction, a <code>Worker</code> object is created, and three key procedures are executed. (Later, with <code>MultiProcExecutor</code>, these same procedures run independently on each worker process across different GPUs.)
@@ -129,7 +129,7 @@ During model executor construction, a <code>Worker</code> object is created, and
* Run a dummy/profiling forward pass and take a GPU memory snapshot to compute how many KV cache blocks fit in available VRAM
* Allocate, reshape and bind KV cache tensors to attention layers
* Prepare attention metadata (e.g. set the backend to FlashAttention) later consumed by kernels during the fwd pass
-* Unless <code>--enforce-eager</code> is provided, for each of warmup batch sizes do a dummy run and capture CUDA graphs. CUDA graphs record the whole sequence of GPU work into a DAG. Later during fwd pass we launch/reply pre-baked graphs and cut on kernel launch overhead and thus improve latency.
+* Unless <code>--enforce-eager</code> is provided, for each of warmup batch sizes do a dummy run and capture CUDA graphs. CUDA graphs record the whole sequence of GPU work into a DAG. Later during fwd pass we launch/replay pre-baked graphs and cut on kernel launch overhead and thus improve latency.
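To make the capture/replay idea from the last bullet concrete, here is a generic PyTorch sketch (not vLLM's warmup code): record a forward pass into a CUDA graph once, then replay it after copying fresh data into the same static buffers.

```python
# Generic CUDA-graph capture/replay sketch in plain PyTorch.
import torch

model = torch.nn.Linear(4096, 4096).cuda().half()
static_input = torch.zeros(8, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream before capture (standard PyTorch pattern).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record the whole forward pass into a graph once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Later: copy new activations into the static buffer and replay the graph,
# paying one graph launch instead of per-kernel launch overhead.
static_input.copy_(torch.randn_like(static_input))
graph.replay()
print(static_output.shape)
```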
I've abstracted away many low-level details here — but these are the core pieces I'll introduce now, since I'll reference them repeatedly in the following sections.
@@ -229,7 +229,7 @@ Here are the main steps:
Forward-pass step itself has two execution modes:
1. <b>Eager mode</b> — run the standard PyTorch forward pass when eager execution is enabled.
-2. <b>"Captured" mode</b> — execute/reply a pre-captured CUDA Graph when eager is not enforced (remember we captured these during engine construction in the initialize KV cache procedure).
+2. <b>"Captured" mode</b> — execute/replay a pre-captured CUDA Graph when eager is not enforced (remember we captured these during engine construction in the initialize KV cache procedure).
Here is a concrete example that should make continuous batching and paged attention clear:
@@ -316,7 +316,8 @@ During the first <code>generate</code> call, in the scheduling stage, inside <co
1. This function splits the <code>long_prefix + prompts[0]</code> into 16-token chunks.
2. For each complete chunk, it computes a hash (using either the built-in hash or SHA-256, which is slower but has fewer collisions). The hash combines the previous block's hash, the current tokens, and optional metadata.
-> [!NOTE] optional metadata includes: MM hash, LoRA ID, cache salt (injected into hash of the first block ensures only requests with this cache salt can reuse blocks).
+> [!NOTE]
+> optional metadata includes: MM hash, LoRA ID, cache salt (injected into hash of the first block ensures only requests with this cache salt can reuse blocks).
3. Each result is stored as a <code>BlockHash</code> object containing both the hash and its token IDs. We return a list of block hashes.
The list is stored in <code>self.req_to_block_hashes[request_id]</code>.
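Steps 1-3 can be condensed into a short sketch. The <code>BlockHash</code> name follows the description above, but the function and the exact hash inputs are illustrative, not vLLM's implementation:

```python
# Illustrative sketch of steps 1-3: chunk the tokens, chain-hash each chunk,
# wrap the result in a BlockHash. Mirrors the description, not vLLM's code.
from dataclasses import dataclass

BLOCK_SIZE = 16

@dataclass(frozen=True)
class BlockHash:
    hash_value: int
    token_ids: tuple[int, ...]

def hash_blocks(token_ids: list[int], extra: tuple = ()) -> list[BlockHash]:
    block_hashes: list[BlockHash] = []
    prev_hash = None
    # Only complete 16-token chunks are hashed; a trailing partial chunk is skipped.
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = tuple(token_ids[start:start + BLOCK_SIZE])
        # The hash chains the previous block's hash with the current tokens
        # (plus optional metadata such as an MM hash, LoRA ID, or cache salt).
        h = hash((prev_hash, chunk, extra))
        block_hashes.append(BlockHash(h, chunk))
        prev_hash = h
    return block_hashes

req_to_block_hashes = {}
req_to_block_hashes["req-0"] = hash_blocks(list(range(40)))  # 40 tokens
print(len(req_to_block_hashes["req-0"]))  # 2 full blocks, partial third skipped
```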
@@ -423,7 +424,7 @@ Here is an even simpler example with vocab_size = 8 and 8-bit integers (for thos