_posts/2025-09-05-anatomy-of-vllm.md
@@ -461,14 +461,14 @@ vLLM V1 does not support the LLM draft model method, instead it implements faste
One-liners on each:
- 1. <b>n-gram</b>: take the last <code>prompt_lookup_max</code> tokens; find a prior match in the sequence; if found, propose the <code>k</code> tokens that followed that match; otherwise decrement the window and retry down to <code>prompt_lookup_min</code>
+ * <b>n-gram</b>: take the last <code>prompt_lookup_max</code> tokens; find a prior match in the sequence; if found, propose the <code>k</code> tokens that followed that match; otherwise decrement the window and retry down to <code>prompt_lookup_min</code>
> [!NOTE]
> The current implementation returns <code>k</code> tokens after the first match. Wouldn't it be more natural to introduce a recency bias and reverse the search direction (i.e. use the last match)?
- 2. <b>Eagle</b>: perform "model surgery" on the large LM—keep embeddings and LM head, replace the transformer stack with a lightweight MLP; fine-tune that as a cheap draft
+ * <b>Eagle</b>: perform "model surgery" on the large LM—keep embeddings and LM head, replace the transformer stack with a lightweight MLP; fine-tune that as a cheap draft
- 3. <b>Medusa</b>: train auxiliary linear heads on top (embeddings before LM head) of the large model to predict the next <code>k</code> tokens in parallel; use these heads to propose tokens more efficiently than running a separate small LM
+ * <b>Medusa</b>: train auxiliary linear heads on top (embeddings before LM head) of the large model to predict the next <code>k</code> tokens in parallel; use these heads to propose tokens more efficiently than running a separate small LM
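
To make the n-gram proposer described in the list above concrete, here is a minimal Python sketch of the lookup logic. It is an illustration only, not vLLM's actual proposer; the function name, the list-based scan, and the toy token sequence are all assumptions of this sketch.

```python
def ngram_propose(tokens: list[int], k: int,
                  prompt_lookup_max: int, prompt_lookup_min: int) -> list[int]:
    """Propose up to k draft tokens by matching the trailing n-gram
    against an earlier occurrence in the sequence."""
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(tokens) <= n:
            continue
        pattern = tokens[-n:]
        # Left-to-right scan: the *first* earlier match wins (the note above
        # asks whether scanning for the *last* match would be more natural).
        for start in range(len(tokens) - n):
            if tokens[start:start + n] == pattern:
                return tokens[start + n:start + n + k]
    return []  # no match at any window size: no draft tokens this step


# Toy example: the trailing bigram [5, 8] first occurred at position 0,
# so the k tokens that followed it there are proposed as the draft.
print(ngram_propose([5, 8, 9, 3, 5, 8, 9, 7, 2, 5, 8],
                    k=3, prompt_lookup_max=3, prompt_lookup_min=1))  # -> [9, 3, 5]
```
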
Here's how to invoke speculative decoding in vLLM using <code>ngram</code> as the draft method:
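
As a hedged sketch of such an invocation through vLLM's offline <code>LLM</code> API (the model name and the exact <code>speculative_config</code> keys below are illustrative assumptions and may vary across vLLM versions):

```python
from vllm import LLM, SamplingParams

# Illustrative only: model name and config keys are assumptions; check your vLLM version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # k: draft tokens proposed per step
        "prompt_lookup_max": 4,        # largest n-gram window to match
        "prompt_lookup_min": 2,        # smallest window before giving up
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```
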
@@ -979,14 +979,14 @@ A huge thank you to [Hyperstack](https://www.hyperstack.cloud/) for providing me
Thanks to [Nick Hill](https://www.linkedin.com/in/nickhillprofile/) (core vLLM contributor, RedHat), [Mark Saroufim](https://x.com/marksaroufim) (PyTorch), [Kyle Kranen](https://www.linkedin.com/in/kyle-kranen/) (NVIDIA, Dynamo), and [Ashish Vaswani](https://www.linkedin.com/in/ashish-vaswani-99892181/) for reading a pre-release version of this blog post and providing feedback!
References
- 1. vLLM https://github.com/vllm-project/vllm
- 2. "Attention Is All You Need", https://arxiv.org/abs/1706.03762
- 3. "Efficient Memory Management for Large Language Model Serving with PagedAttention", https://arxiv.org/abs/2309.06180
- 4. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model", https://arxiv.org/abs/2405.04434
- 5. "Jenga: Effective Memory Management for Serving LLM with Heterogeneity", https://arxiv.org/abs/2503.18292
- 6. "Orca: A Distributed Serving System for Transformer-Based Generative Models", https://www.usenix.org/conference/osdi22/presentation/yu
- 7. "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models", https://arxiv.org/abs/2411.15100
- 8. "Accelerating Large Language Model Decoding with Speculative Sampling", https://arxiv.org/abs/2302.01318
+ 1. vLLM [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
+ 2. "Attention Is All You Need", [https://arxiv.org/abs/1706.03762](https://arxiv.org/abs/1706.03762)
+ 3. "Efficient Memory Management for Large Language Model Serving with PagedAttention", [https://arxiv.org/abs/2309.06180](https://arxiv.org/abs/2309.06180)
+ 4. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model", [https://arxiv.org/abs/2405.04434](https://arxiv.org/abs/2405.04434)
+ 5. "Jenga: Effective Memory Management for Serving LLM with Heterogeneity", [https://arxiv.org/abs/2503.18292](https://arxiv.org/abs/2503.18292)
+ 6. "Orca: A Distributed Serving System for Transformer-Based Generative Models", [https://www.usenix.org/conference/osdi22/presentation/yu](https://www.usenix.org/conference/osdi22/presentation/yu)
+ 7. "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models", [https://arxiv.org/abs/2411.15100](https://arxiv.org/abs/2411.15100)
+ 8. "Accelerating Large Language Model Decoding with Speculative Sampling", [https://arxiv.org/abs/2302.01318](https://arxiv.org/abs/2302.01318)