### From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale
In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular, I'll break down how vLLM [[1]](#ref-1) works.
This post is the first in a series. It starts broad and then layers in detail (following an inverse-pyramid approach) so you can form an accurate high-level mental model of the complete system without drowning in minutiae.
This configuration is:
* offline (no web/distributed system scaffolding)
* synchronous (all execution happens in a single blocking process)
* single-GPU (no data/model/pipeline/expert parallelism; DP/TP/PP/EP = 1)
* using standard transformer [[2]](#ref-2) (supporting hybrid models like Jamba requires a more complex hybrid KV-cache memory allocator)
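
Concretely, this kind of offline, synchronous, single-GPU setup boils down to something like the following sketch (the model name and sampling parameters are illustrative placeholders, not the exact ones used in the post):

```python
from vllm import LLM, SamplingParams

# Offline + synchronous: a single blocking process drives the whole engine.
llm = LLM(model="facebook/opt-125m")  # any HF model id; placeholder choice

prompts = ["The capital of France is", "An LLM inference engine works by"]
params = SamplingParams(temperature=0.8, max_tokens=32)

# generate() only returns once every request has finished decoding.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```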
From here, we'll gradually build up to an online, async, multi-GPU, multi-node inference system - but still serving a standard transformer.
The KV-cache manager maintains a <code>free_block_queue</code> - a pool of available KV-cache blocks.
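
A toy sketch of the idea behind that pool (illustrative only; the real manager tracks richer block objects with reference counts and hashes for prefix caching):

```python
from collections import deque

# Toy paged-KV allocator: block IDs cycle between "free" and "in use".
NUM_BLOCKS = 8  # in reality, derived from available VRAM (see below)
free_block_queue = deque(range(NUM_BLOCKS))

def allocate_block() -> int:
    if not free_block_queue:
        raise RuntimeError("out of KV-cache blocks -> request must be preempted")
    return free_block_queue.popleft()

def free_block(block_id: int) -> None:
    free_block_queue.append(block_id)

request_blocks = [allocate_block() for _ in range(3)]  # e.g. a 3-block prompt
for b in request_blocks:                                # request finished
    free_block(b)
```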
> [!NOTE]
> Block size for a standard transformer layer (non-MLA [[4]](#ref-4)) is computed as follows:
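> A rough sketch of that computation (variable names are mine; the factor 2 accounts for keys and values):
>
> ```python
> # bytes needed for one KV-cache block of a single attention layer
> block_size_bytes = (
>     2                  # K and V
>     * block_size       # tokens per block (vLLM's default is 16)
>     * num_kv_heads     # KV heads in the layer
>     * head_size        # dimension per head
>     * dtype_num_bytes  # e.g. 2 for bf16
> )
> ```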
During model executor construction, a <code>Worker</code> object is created, and three key procedures are executed. (Later, with <code>MultiProcExecutor</code>, these same procedures run independently on each worker process across different GPUs.)
* Optional: call <code>torch.compile()</code> on the model
3. Initialize KV cache
* Get per-layer KV-cache spec. Historically this was always <code>FullAttentionSpec</code> (homogeneous transformer), but with hybrid models (sliding window, Transformer/SSM like Jamba) it became more complex (see Jenga [[5]](#ref-5))
* Run a dummy/profiling forward pass and take a GPU memory snapshot to compute how many KV cache blocks fit in available VRAM (see the sketch after this list)
* Allocate, reshape and bind KV cache tensors to attention layers
* Prepare attention metadata (e.g. set the backend to FlashAttention) that kernels later consume during the forward pass
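
The profiling step above essentially boils down to one division. A simplified sketch with made-up numbers (the real code also accounts for non-torch allocations and the activation peak measured during the dummy pass):

```python
# Simplified: how many KV-cache blocks fit into the VRAM we're allowed to use.
# All numbers are made up for illustration (roughly an 8B model on an 80 GB GPU).
total_vram_bytes    = 80 * 1024**3   # 80 GB GPU
gpu_mem_utilization = 0.90           # fraction of VRAM vLLM may use
model_and_peak_act  = 18 * 1024**3   # weights + peak activations from the dummy pass

# per-block bytes across ALL layers: 2 (K and V) * 16 tokens/block
# * 8 KV heads * head_size 128 * 2 bytes (bf16) * 32 layers
bytes_per_block = 2 * 16 * 8 * 128 * 2 * 32

usable = total_vram_bytes * gpu_mem_utilization - model_and_peak_act
num_gpu_blocks = int(usable // bytes_per_block)
print(num_gpu_blocks)
```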
The first step is to validate and feed requests into the engine. For each prompt:
3. Pack this info into an <code>EngineCoreRequest</code>, adding priority, sampling params, and other metadata
4. Pass the request into the engine core, which wraps it in a <code>Request</code> object and sets its status to <code>WAITING</code>. This request is then added to the scheduler's <code>waiting</code> queue (append if FCFS, or heap-push if priority)
At this point the engine has been fed and execution can begin. In the synchronous engine example, these initial prompts are the only ones we'll process — there's no mechanism to inject new requests mid-run. In contrast, the asynchronous engine supports this (aka <b>continuous batching</b> [[6]](#ref-6)): after each step, both new and old requests are considered.
> [!NOTE]
> Because the forward pass flattens the batch into a single sequence and custom kernels handle it efficiently, continuous batching is fundamentally supported even in the synchronous engine.
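
To make that concrete, here's a rough illustration of what "flattening" means (illustrative tensors only; the real path builds equivalent metadata inside the model runner and hands it to a varlen attention kernel such as FlashAttention):

```python
import torch

# Three requests at different stages: two decoding (1 new token each),
# one prefilling (5 prompt tokens). They are packed into one flat batch.
tokens_per_request = [
    torch.tensor([101]),                 # request A: decode step
    torch.tensor([7, 8, 9, 10, 11]),     # request B: prefill
    torch.tensor([42]),                  # request C: decode step
]

input_ids = torch.cat(tokens_per_request)   # shape [7], no padding anywhere
seq_lens = torch.tensor([len(t) for t in tokens_per_request])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                        seq_lens.cumsum(0).to(torch.int32)])  # [0, 1, 6, 7]

# A varlen attention kernel uses cu_seqlens to know where each request starts
# and ends, so no request attends to another request's tokens.
print(input_ids.shape, cu_seqlens.tolist())
```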
How this works in vLLM:
1. At LLM engine construction, a <code>StructuredOutputManager</code> is created; it has access to the tokenizer and maintains a <code>_grammar_bitmask</code> tensor.
2. When adding a request, its status is set to <code>WAITING_FOR_FSM</code> and <code>grammar_init</code> selects the backend compiler (e.g., <code>xgrammar</code> [[7]](#ref-7); note that backends are 3rd party code).
3. The grammar for this request is compiled asynchronously.
4. During scheduling, if the async compile has completed, the status switches to <code>WAITING</code> and <code>request_id</code> is added to <code>structured_output_request_ids</code>; otherwise it's placed in <code>skipped_waiting_requests</code> to retry on next engine step.
5. After the scheduling loop (still inside scheduling), if there are FSM requests, the <code>StructuredOutputManager</code> asks the backend to prepare/update <code>_grammar_bitmask</code>.
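
The net effect of the bitmask, once it reaches the sampler, is roughly the following toy sketch (not vLLM's kernel, which operates on packed int32 bitmasks across the whole batch):

```python
import torch

vocab_size = 8
logits = torch.randn(vocab_size)

# One flag per vocab entry: True = allowed by the grammar at this step.
# xgrammar packs these flags into int32 words; here we keep it readable.
allowed = torch.tensor([1, 0, 0, 1, 1, 0, 0, 0], dtype=torch.bool)

masked_logits = logits.masked_fill(~allowed, float("-inf"))
next_token = torch.argmax(masked_logits)  # sampling can only pick allowed tokens
```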
You can enable this in vLLM by passing in a desired <code>guided_decoding</code> config.
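
For example, something along these lines (the exact parameter surface may differ between vLLM versions, so treat this as a sketch):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # model choice is arbitrary

# Constrain the output to one of a fixed set of choices.
params = SamplingParams(
    temperature=0.0,
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
out = llm.generate(["Classify the sentiment: 'I loved this movie!' ->"], params)
print(out[0].outputs[0].text)
```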
In autoregressive generation, each new token requires a forward pass of the large LM. This is expensive — every step reloads and applies all model weights just to compute a single token! (assuming batch size == 1; in general each step computes <code>B</code> tokens)
Speculative decoding [[8]](#ref-8) speeds this up by introducing a smaller draft LM. The draft proposes <code>k</code> tokens cheaply. But we don't ultimately want to sample from the smaller model — it's only there to guess candidate continuations. The large model still decides what's valid.
Here are the steps:
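
Conceptually, one round goes: the draft proposes <code>k</code> tokens, the target scores the whole proposal in a single forward pass, and tokens are accepted until the first disagreement. A rough greedy-acceptance sketch (illustrative, not vLLM's implementation; the paper's rejection-sampling rule is what preserves the target distribution):

```python
import torch

def speculative_step(target_lm, draft_lm, tokens: list[int], k: int) -> list[int]:
    """One propose-and-verify round. target_lm/draft_lm are illustrative callables
    mapping a list of token ids to logits of shape [len(tokens), vocab_size]."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(int(torch.argmax(draft_lm(proposal)[-1])))

    # 2) Target model scores the prefix plus all k proposed tokens in ONE pass.
    target_logits = target_lm(proposal)

    # 3) Walk the proposal left to right: keep the target's choice at each new
    #    position, and stop at the first place the draft and target disagree.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        target_choice = int(torch.argmax(target_logits[i - 1]))
        accepted.append(target_choice)
        if target_choice != proposal[i]:
            break  # everything after a mismatch is discarded
    return accepted
```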
> [!NOTE]
> I recommend looking at [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for a simple implementation, and the [original paper](https://arxiv.org/abs/2302.01318) for the math details and the proof of equivalence to sampling from the full model.
vLLM V1 does not support the LLM draft-model method; instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
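
The simplest of these to picture is n-gram (prompt-lookup) proposing, which needs no extra model at all. A rough sketch (illustrative, not vLLM's implementation):

```python
def ngram_propose(context: list[int], n: int = 3, k: int = 5) -> list[int]:
    """If the last n tokens already appeared earlier in the context, propose the
    k tokens that followed that earlier occurrence."""
    if len(context) < n:
        return []
    suffix = context[-n:]
    # Search right-to-left for the most recent earlier occurrence of the suffix.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + k]
    return []

# Repetitive text makes lookups hit: suffix [1, 2, 3] was seen before,
# so the tokens that followed it ([4, 1, 2]) are proposed for verification.
print(ngram_propose([1, 2, 3, 4, 1, 2, 3], n=3, k=3))  # [4, 1, 2]
```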
One-liners on each:
> [!NOTE]
> I've also experimented with <code>LMCache</code> [[11]](#ref-11), the fastest production-ready connector (uses NVIDIA's NIXL as the backend), but it's still at the bleeding edge and I ran into some bugs. Since much of its complexity lives in an external repo, <code>SharedStorageConnector</code> is a better choice for explanation.
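
For reference, enabling the <code>SharedStorageConnector</code> looks roughly like this (based on vLLM's offline examples; field names may vary between versions, so treat this as a sketch):

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# KV-cache connector: persist/restore KV blocks via a shared filesystem path.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # arbitrary choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="SharedStorageConnector",
        kv_role="kv_both",
        kv_connector_extra_config={"shared_storage_path": "/tmp/vllm_kv_cache"},
    ),
)
out = llm.generate(["A long shared prefix..."], SamplingParams(max_tokens=8))
```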
These are the steps in vLLM:
A huge thank you to [Hyperstack](https://www.hyperstack.cloud/) for providing me with GPUs!
Thanks to [Nick Hill](https://www.linkedin.com/in/nickhillprofile/) (core vLLM contributor, RedHat), [Mark Saroufim](https://x.com/marksaroufim) (PyTorch), [Kyle Kranen](https://www.linkedin.com/in/kyle-kranen/) (NVIDIA, Dynamo), and [Ashish Vaswani](https://www.linkedin.com/in/ashish-vaswani-99892181/) for reading a pre-release version of this blog post and providing feedback!
2. "Attention Is All You Need", [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)
984
-
3. "Efficient Memory Management for Large Language Model Serving with PagedAttention", [https://arxiv.org/abs/2309.06180](https://arxiv.org/abs/2309.06180)
985
-
4. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model", [https://arxiv.org/abs/2405.04434](https://arxiv.org/abs/2405.04434)
986
-
5. "Jenga: Effective Memory Management for Serving LLM with Heterogeneity", [https://arxiv.org/abs/2503.18292](https://arxiv.org/abs/2503.18292)
987
-
6. "Orca: A Distributed Serving System for Transformer-Based Generative Models", [https://www.usenix.org/conference/osdi22/presentation/yu](https://www.usenix.org/conference/osdi22/presentation/yu)
988
-
7. "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models", [https://arxiv.org/abs/2411.15100](https://arxiv.org/abs/2411.15100)
989
-
8. "Accelerating Large Language Model Decoding with Speculative Sampling", [https://arxiv.org/abs/2302.01318](https://arxiv.org/abs/2302.01318)
2. <div href="ref-2">"Attention Is All You Need"<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></div>
984
+
3. <div href="ref-3">"Efficient Memory Management for Large Language Model Serving with PagedAttention"<a href="https://arxiv.org/abs/2309.06180">https://arxiv.org/abs/2309.06180</a></div>
985
+
4. <div href="ref-4">"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"<a href="https://arxiv.org/abs/2405.04434">https://arxiv.org/abs/2405.04434</a></div>
986
+
5. <div href="ref-5">"Jenga: Effective Memory Management for Serving LLM with Heterogeneity"<a href="https://arxiv.org/abs/2503.18292">https://arxiv.org/abs/2503.18292</a></div>
987
+
6. <div href="ref-6">"Orca: A Distributed Serving System for Transformer-Based Generative Models"<a href="https://www.usenix.org/conference/osdi22/presentation/yu">https://www.usenix.org/conference/osdi22/presentation/yu</a></div>
988
+
7. <div href="ref-7">"XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models"<a href="https://arxiv.org/abs/2411.15100">https://arxiv.org/abs/2411.15100</a></div>
989
+
8. <div href="ref-8">"Accelerating Large Language Model Decoding with Speculative Sampling"<a href="https://arxiv.org/abs/2302.01318">https://arxiv.org/abs/2302.01318</a></div>