[New blog] Inside vLLM: Anatomy of a High-Throughput LLM Inference System #80
Conversation
Deploying vllm-blog-source with Cloudflare Pages
Latest commit: beab729
Status: ✅ Deploy successful!
Preview URL: https://6cc3ed89.vllm-blog-source.pages.dev
Branch Preview URL: https://gordicaleksa-anatomy-vllm.vllm-blog-source.pages.dev
Ok i see i haven't done the DCO thing, let me fix that.
Force-pushed from 76ff8b4 to c16a69e
Very nice!! Any way we can get footnotes working properly?
@simon-mo which footnotes? not sure i understood?
Ah sorry I meant citations [1] [2]... they don't link to the actual references.
Force-pushed from 293aea1 to 66b6d03
oh ok, added!
@simon-mo lmk what you think now and happy to merge!
my nit comments are not blocking. we can publish first and then fix them.
for the decode step, `input_ids` only contains new tokens, same for `positions` and `slot_mapping`. we use `block_table` to keep track of the existing kv cache.
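To make that concrete, here is a toy illustration of how decode-step inputs differ from the prefill step. The token ids, block size, and dict layout are hypothetical, chosen only to mirror the field names in the comment above, not vLLM's actual data structures:

```python
# Toy example only: values and layout are illustrative, not vLLM's real tensors.
# One request, 5-token prompt, KV-cache block size of 4.

BLOCK_SIZE = 4

# Prefill step: the whole prompt goes through the model at once.
prefill_inputs = {
    "input_ids":    [101, 7592, 2088, 2003, 1037],  # all 5 prompt tokens
    "positions":    [0, 1, 2, 3, 4],
    # slot = block_id * BLOCK_SIZE + offset; blocks 0 and 1 are allocated.
    "slot_mapping": [0, 1, 2, 3, 4],
}

# Decode step: only the newly sampled token is fed; the KV cache of the first
# 5 tokens is located through the block table instead of being recomputed.
decode_inputs = {
    "input_ids":    [2307],     # just the new token
    "positions":    [5],        # its absolute position in the sequence
    "slot_mapping": [5],        # where its KV entry will be written
    "block_table":  [[0, 1]],   # per-request list of allocated KV blocks
}

print(len(prefill_inputs["input_ids"]), len(decode_inputs["input_ids"]))  # 5 1
```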
yeah, i'm aware of that, i made some simplifications so that i have to explain less hah
i'd cover those details if i covered the fwd pass kernel!
hope we can have some disclaimer for this.
I doubt the relevant attention metadata, input ids, etc. for decode are right :( you should be able to just print them after preparing the input.
same comment as above, i made a few simplifications in that drawing/explanation as i believe it's ok at that level of abstraction
@youkaichao thanks for the review! will address all a bit later tonight!
2. <b>Verify</b>: run the large model once on context + <code>k</code> draft tokens. This produces probabilities for those <code>k</code> positions plus one extra (so we get <code>k+1</code> candidates)
3. <b>Accept/reject</b>: going from left to right over the <code>k</code> draft tokens:
<ul>
<li>If the large model's probability for the draft token ≥ the draft's probability, accept it</li>
this looks incorrect for rejection sampling. we first sample a uniform distribution, and then compare `p_large(token)/p_draft(token)` with that uniform sample.
that's an implementation detail that doesn't contradict the explanation?
it is true that we sample ahead of time vs step by step but i didn't explicitly mention that above
i think this is quite different. Rejection sampling is random sampling, so there's a chance a draft token will be accepted or not.
Your description makes it deterministic, i.e. we just need to compare whether the large model's probability for the draft token ≥ the draft's probability.
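For reference, a minimal sketch of the accept/reject rule being discussed here, for a single draft token; the resample-on-rejection step is only noted in a comment:

```python
import random

def accept_draft_token(p_large: float, p_draft: float) -> bool:
    """Rejection-sampling acceptance for one draft token.

    p_large: probability the target (large) model assigns to the draft token
    p_draft: probability the draft model assigned when proposing it (> 0)
    """
    u = random.random()               # uniform sample in [0, 1)
    # Always accepted when p_large >= p_draft; otherwise accepted with
    # probability p_large / p_draft. This randomness is what keeps the
    # procedure equivalent to sampling from the large model.
    return u < p_large / p_draft

# On rejection, a replacement token is sampled from the normalized residual
# distribution max(0, p_large(x) - p_draft(x)); omitted here for brevity.
```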
> [!NOTE]
> I recommend looking at [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for a simple implementation, and the [original paper](https://arxiv.org/abs/2302.01318) for the math details and the proof of equivalence to sampling from the full model.

vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10). |
Suggested change:
- vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
+ vLLM V1 does not support the LLM draft model method, instead it implements faster proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
valid point, it is, strictly speaking, mathematically equivalent as i mentioned
what i meant by "less accurate" is that in expectation you'd expect a good LLM draft model to be better / "more accurate" (loosely speaking) at predicting k tokens than an n-gram model
1. <b>Instantiation</b> — During engine construction, connectors are created in two places:
* Inside the worker's init device procedure (under init worker distributed environment function), with role "worker".
* Inside the scheduler constructor, with role "scheduler".
2. <b>Cache lookup</b> — When the scheduler processes prefill requests from the <code>waiting</code> queue (after local prefix-cache checks), it calls the connector's <code>get_num_new_matched_tokens</code>. This checks for externally cached tokens in the KV-cache server. Prefill always sees 0 here; decode may have a cache hit. The result is added to the local count before calling <code>allocate_slots</code>.
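A heavily simplified, hypothetical sketch of that lookup path: only `get_num_new_matched_tokens` and `allocate_slots` are names from the text above; the stub classes, signatures, and numbers are made up for illustration and are not vLLM's real API.

```python
# Hypothetical sketch, NOT vLLM's real scheduler/connector code.

class LocalKVCacheManager:
    def get_num_cached_tokens(self, request):
        return 16  # pretend 16 leading tokens hit the local prefix cache

    def allocate_slots(self, request, num_computed_tokens):
        print(f"{request}: allocating blocks, {num_computed_tokens} tokens already computed")

class Connector:
    def get_num_new_matched_tokens(self, request, num_local_hit):
        # Ask the external KV-cache server for tokens beyond the local hit.
        # In the flow described above this is 0 on prefill and can be > 0 on decode.
        return 48

def schedule_waiting_request(request, kv_cache_manager, connector):
    num_local = kv_cache_manager.get_num_cached_tokens(request)
    num_external = connector.get_num_new_matched_tokens(request, num_local)
    # Externally matched tokens count as already computed before allocating slots.
    kv_cache_manager.allocate_slots(request, num_local + num_external)

schedule_waiting_request("req-0", LocalKVCacheManager(), Connector())
```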
technically prefill can also have cache hit, but i'm not sure if we implement it. cc @njhill @robertgshaw2-redhat might know more.
for the `SharedStorageConnector` implementation this is how it works
@gordicaleksa sorry I think I missed this too before. I'm not sure that it's a good idea to use `SharedStorageConnector` for a P/D example.
I'll try to summarize here:
- The KV Connector interface is for generically plugging in KV cache distribution / sharing capability.
- The two main categories for this are (1) P/D disaggregation and (2) offloading / tiered caching.
- These are kind of separate but could overlap, in that P/D disagg can be achieved by P offloading to shared storage which is subsequently consumed by D.
- We also support using multiple connectors at the same time, so that you could have one doing offloading to CPU memory and another one handling P/D (in this case it would usually be the prefill worker that benefits from an offloaded cache hit, i.e. `get_num_new_matched_tokens > 0`).
- Our "native" connector for P/D is `NixlConnector`... for this you are right that it's only decode that will get `get_num_new_matched_tokens > 0`, since prefill is the "producer" and decode is the "consumer". There is an out-of-band piece to this where it's expected that an external router will pass metadata returned from the P worker to the D worker.
- `SharedStorageConnector` is intended as an example implementation, but is in the (2) category... as such it doesn't really have a prefill / decode worker distinction; it just writes KV cache to disk as it's computed, and then for new requests it checks against that disk cache to see if any can be loaded.
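For orientation only, here is roughly what wiring up the `NixlConnector` through vLLM's KV-transfer config can look like. The import path, field names, and role value below are written from memory of the disaggregated-prefill examples and should be treated as assumptions that may not match every version:

```python
# Hedged sketch: KVTransferConfig's import path and field values are assumptions
# recalled from vLLM's P/D examples; check the docs for your version.
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill ("producer") and decode ("consumer") would each run their own instance
# like this; an external router then forwards the KV-transfer metadata returned
# by the prefill instance to the decode instance.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",
        kv_role="kv_both",  # assumed role value; see the P/D docs for exact settings
    ),
)
```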
<picture>
<img src="/assets/figures/2025-vllm-anatomy/multiprocexecutor.png" width="100%">
</picture><br>
<b>Figure 13</b>: MultiProcExecutor in a TP=8 setting (driver worker being rank 0) |
In v1, the executor actually lives in another process and it is not tp rank 0.
2. The constructor loops over <code>world_size</code> (e.g. <code>TP=8 ⇒ world_size=8</code>) and spawns a daemon process for each rank via <code>WorkerProc.make_worker_process</code>.
3. For each worker, the parent first creates a reader and writer pipe.
4. The new process runs <code>WorkerProc.worker_main</code>, which instantiates a worker (going through the same "init device", "load model", etc. as in <code>UniprocExecutor</code>).
5. Each worker determines whether it is the driver (rank 0 in the TP group) or a regular worker. Every worker sets up two queues: |
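As a toy illustration of steps 2–5 (generic `multiprocessing` code, not vLLM's actual `WorkerProc` implementation), spawning one daemon process per rank with a pipe back to the parent looks roughly like this:

```python
# Generic illustration of the spawn pattern, not vLLM's WorkerProc code.
import multiprocessing as mp

def worker_main(rank: int, conn) -> None:
    # In vLLM this is where "init device", "load model", etc. would happen.
    conn.send({"rank": rank, "status": "ready"})
    while True:
        msg = conn.recv()                      # wait for work from the executor
        if msg == "shutdown":
            break
        conn.send(f"rank {rank} handled {msg}")

if __name__ == "__main__":
    world_size = 8                             # e.g. TP=8
    procs, conns = [], []
    for rank in range(world_size):
        parent_conn, child_conn = mp.Pipe()    # reader/writer pair per worker
        p = mp.Process(target=worker_main, args=(rank, child_conn), daemon=True)
        p.start()
        procs.append(p)
        conns.append(parent_conn)

    print([c.recv()["rank"] for c in conns])   # wait until every rank is ready
    for c in conns:
        c.send("execute_model")                # broadcast a step to all ranks
    print([c.recv() for c in conns])           # collect per-rank results
    for c in conns:
        c.send("shutdown")
    for p in procs:
        p.join()
```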
cc @njhill to confirm, v1 does not have the concept of driver worker iirc. all workers are the same, and communicate with the parent (the executor process).
i'm fairly sure this is correct, a) i can double check b) nick did review it as well
Sorry, I think I missed this in my original review.
Yes @youkaichao is right that there isn't a concept of a driver worker anymore. There is a small optimization that we usually only return the per-step outputs from one of the workers to the scheduler (usually rank 0). When using kv connectors though this isn't the case, but it's probably too in-the-weeds detail to mention here anyhow.
hmm, are we taking into account that i'm using an early August commit for the analysis?
will address your new comments by tomorrow EOD!
Yes, this is a major V0 vs V1 difference, and V0 is essentially gone now.
* <b>Dummy steps for lockstep</b> — if any DP replica has work, all replicas execute a forward step; replicas without requests perform a dummy step to participate in required synchronization points (avoids blocking the active replica).

> [!NOTE]
> Lockstep clarification: this is actually only required for MoE models where the expert layers form an EP or TP group while attention layers are still DP. It's currently always done with DP - this is just because there's limited use for "built-in" non-MoE DP since you could just run multiple independent vLLMs and load-balance between them in a normal way.
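A tiny self-contained simulation (plain Python, nothing vLLM-specific) of the lockstep rule described above; the cross-rank all-reduce that real DP replicas would perform is stood in for by a plain `any()`:

```python
# Simulation only: "any()" stands in for the all-reduce real DP ranks would use
# to agree on whether anyone has work this step.

def lockstep_round(work_per_replica):
    """work_per_replica[i] = number of scheduled requests on DP replica i."""
    any_work = any(n > 0 for n in work_per_replica)
    actions = []
    for rank, n in enumerate(work_per_replica):
        if not any_work:
            actions.append(f"rank {rank}: idle, nobody steps")
        elif n > 0:
            actions.append(f"rank {rank}: forward pass on {n} request(s)")
        else:
            actions.append(f"rank {rank}: dummy forward pass (lockstep only)")
    return actions

print(*lockstep_round([3, 0, 0, 1]), sep="\n")  # ranks 1 and 2 run dummy steps
print(*lockstep_round([0, 0, 0, 0]), sep="\n")  # all idle, no one enters forward
```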
although there's clarification here, I'm afraid that this can be misleading. I already hear many complaints from users that using the `-dp` argument is much slower than spinning up independent vLLM instances. this is expected due to extra per-step synchronization, but not intended. vLLM's dp is only meant for MoE models with EP and DP > 1. Using dp in cases where vLLM instances can be totally independent is very dangerous (slow).
@njhill do you think we should raise an error if `-dp` is larger than 1 while people do not enable expert parallel?
I observed the above behavior in the codebase and Nick was the one who gave me this clarification during his review before i published (the one that's now in the note)
Discussed with @youkaichao separately but just adding here for completeness:
- Expert layers use TP with DP if EP isn't enabled, so we would potentially want to raise an error / warn for non-MoE models rather than when EP isn't enabled.
- It should be straightforward though to just disable the additional coordination in the DP + non-MoE case, so that `-dp` can be used for convenient general scale-out. I will open an issue for this.
This is amazing work! I took a full read and learned a lot from your post as well :)
Left some comments, some are small and some might be important. Feel free to DM me if I'm slow to respond.
@youkaichao thanks again for the review! I added you to the acknowledgment section! A few points you raised above are potentially valid, and we should correct those post-merge if it turns out they are mistakes.
Porting the original blog: https://www.aleksagordic.com/blog/vllm