_posts/2025-09-05-anatomy-of-vllm.md (11 additions, 11 deletions)
@@ -979,14 +979,14 @@ A huge thank you to [Hyperstack](https://www.hyperstack.cloud/) for providing me
Thanks to [Nick Hill](https://www.linkedin.com/in/nickhillprofile/) (core vLLM contributor, RedHat), [Mark Saroufim](https://x.com/marksaroufim) (PyTorch), [Kyle Kranen](https://www.linkedin.com/in/kyle-kranen/) (NVIDIA, Dynamo), and [Ashish Vaswani](https://www.linkedin.com/in/ashish-vaswani-99892181/) for reading a pre-release version of this blog post and providing feedback!
-2. <div href="ref-2">"Attention Is All You Need"<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></div>
-3. <div href="ref-3">"Efficient Memory Management for Large Language Model Serving with PagedAttention"<a href="https://arxiv.org/abs/2309.06180">https://arxiv.org/abs/2309.06180</a></div>
-4. <div href="ref-4">"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"<a href="https://arxiv.org/abs/2405.04434">https://arxiv.org/abs/2405.04434</a></div>
-5. <div href="ref-5">"Jenga: Effective Memory Management for Serving LLM with Heterogeneity"<a href="https://arxiv.org/abs/2503.18292">https://arxiv.org/abs/2503.18292</a></div>
-6. <div href="ref-6">"Orca: A Distributed Serving System for Transformer-Based Generative Models"<a href="https://www.usenix.org/conference/osdi22/presentation/yu">https://www.usenix.org/conference/osdi22/presentation/yu</a></div>
-7. <div href="ref-7">"XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models"<a href="https://arxiv.org/abs/2411.15100">https://arxiv.org/abs/2411.15100</a></div>
-8. <div href="ref-8">"Accelerating Large Language Model Decoding with Speculative Sampling"<a href="https://arxiv.org/abs/2302.01318">https://arxiv.org/abs/2302.01318</a></div>
+2. <div id="ref-2">"Attention Is All You Need"<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></div>
+3. <div id="ref-3">"Efficient Memory Management for Large Language Model Serving with PagedAttention"<a href="https://arxiv.org/abs/2309.06180">https://arxiv.org/abs/2309.06180</a></div>
+4. <div id="ref-4">"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"<a href="https://arxiv.org/abs/2405.04434">https://arxiv.org/abs/2405.04434</a></div>
+5. <div id="ref-5">"Jenga: Effective Memory Management for Serving LLM with Heterogeneity"<a href="https://arxiv.org/abs/2503.18292">https://arxiv.org/abs/2503.18292</a></div>
+6. <div id="ref-6">"Orca: A Distributed Serving System for Transformer-Based Generative Models"<a href="https://www.usenix.org/conference/osdi22/presentation/yu">https://www.usenix.org/conference/osdi22/presentation/yu</a></div>
+7. <div id="ref-7">"XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models"<a href="https://arxiv.org/abs/2411.15100">https://arxiv.org/abs/2411.15100</a></div>
+8. <div id="ref-8">"Accelerating Large Language Model Decoding with Speculative Sampling"<a href="https://arxiv.org/abs/2302.01318">https://arxiv.org/abs/2302.01318</a></div>
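Why the change matters: `href` is not a valid attribute on `<div>`, so the old markup never produced link targets; swapping it for `id` turns each reference entry into a fragment anchor that in-page citation links can resolve to. A minimal sketch of the mechanism, assuming the post's inline citations use fragment links (the `<p>` markup below is hypothetical, not taken from the post):

```html
<!-- Hypothetical inline citation in the post body: the "#ref-3" fragment
     scrolls the page to the element whose id attribute is "ref-3". -->
<p>vLLM builds on PagedAttention <a href="#ref-3">[3]</a>.</p>

<!-- Reference entry after the fix: id="ref-3" makes this div a valid
     anchor target. With the old href="ref-3", browsers ignore the
     attribute and the citation link has nothing to jump to. -->
<div id="ref-3">"Efficient Memory Management for Large Language Model Serving with PagedAttention"<a href="https://arxiv.org/abs/2309.06180">https://arxiv.org/abs/2309.06180</a></div>
```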