
Conversation

gordicaleksa
Collaborator

Porting the original blog: https://www.aleksagordic.com/blog/vllm


cloudflare-workers-and-pages bot commented Sep 5, 2025

Deploying vllm-blog-source with Cloudflare Pages

Latest commit: beab729
Status: ✅  Deploy successful!
Preview URL: https://6cc3ed89.vllm-blog-source.pages.dev
Branch Preview URL: https://gordicaleksa-anatomy-vllm.vllm-blog-source.pages.dev


@gordicaleksa
Collaborator Author

Ok, I see I haven't done the DCO thing; let me fix that.

@gordicaleksa force-pushed the gordicaleksa/anatomy-vllm branch from 76ff8b4 to c16a69e on September 5, 2025 at 18:08
@gordicaleksa changed the title from "[New blog] Inside vLLM: Anatomy of a High-Throughput LLM Inference System - NEW" to "[New blog] Inside vLLM: Anatomy of a High-Throughput LLM Inference System" on Sep 5, 2025
@simon-mo
Contributor

simon-mo commented Sep 5, 2025

Very nice!! Any way we can get the footnote working properly?

@gordicaleksa
Collaborator Author

@simon-mo which footnote? Not sure I understood.

@simon-mo
Contributor

simon-mo commented Sep 6, 2025

Ah sorry, I meant citations [1] [2]... they don't link to the actual references.

@gordicaleksa force-pushed the gordicaleksa/anatomy-vllm branch from 293aea1 to 66b6d03 on September 6, 2025 at 04:59
@gordicaleksa
Collaborator Author

gordicaleksa commented Sep 6, 2025

oh ok, added!

@gordicaleksa
Collaborator Author

@simon-mo lmk what you think now and happy to merge!

Member

@youkaichao left a comment

My nit comments are not blocking; we can publish first and then fix them.

Member

For the decode step, input_ids only contains the new tokens, and the same goes for positions and slot_mapping. We use block_table to keep track of the existing KV cache.
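
To make that concrete, here's a toy illustration of what those decode-step inputs might look like (hand-written, simplified values with a tiny block size; not printed from vLLM):

```python
# Toy decode-step inputs for a single sequence (hypothetical values, block_size = 4).
# The sequence already has 5 tokens in the KV cache and is generating token 6.

input_ids = [9012]      # only the newly sampled token is fed in during decode
positions = [5]         # its (0-indexed) position in the sequence

# slot_mapping: flat KV-cache slot where this token's K/V vectors get written.
# Position 5 lives in logical block 1 (positions 4-7); if that maps to physical
# block 7, the slot is 7 * 4 + (5 % 4) = 29.
slot_mapping = [29]

# block_table: physical KV-cache blocks holding the *existing* context
# (logical block 0 -> physical 3, logical block 1 -> physical 7); the attention
# kernel uses this to gather the cached keys/values.
block_table = [[3, 7]]
```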

Collaborator Author

Yeah, I'm aware of that; I made some simplifications so that I have to explain less, hah.

I'd cover those details if I covered the forward-pass kernel!

Member

Hope we can have some disclaimer for this.

Member

I doubt that the relevant attention metadata, input ids, etc. for decode are right :( You should be able to just print them after preparing the input.

Collaborator Author

Same comment as above: I made a few simplifications in that drawing/explanation, as I believe it's OK at that level of abstraction.

@gordicaleksa
Collaborator Author

@youkaichao thanks for the review! Will address everything a bit later tonight!

2. <b>Verify</b>: run the large model once on context + <code>k</code> draft tokens. This produces probabilities for those <code>k</code> positions plus one extra (so we get <code>k+1</code> candidates)
3. <b>Accept/reject</b>: going from left to right over the <code>k</code> draft tokens:
<ul>
<li>If the large model's probability for the draft token ≥ the draft's probability, accept it</li>
Member

This looks incorrect for rejection sampling. We first sample from a uniform distribution, and then compare p_large(token)/p_draft(token) with that uniform sample.

Collaborator Author

@gordicaleksa commented Sep 8, 2025

That's an implementation detail that doesn't contradict the explanation?

It is true that we sample ahead of time vs. step by step, but I didn't explicitly mention that above.

Member

I think this is quite different. Rejection sampling is random sampling, so there's a chance a draft token will be accepted or not.

Your description makes it deterministic, i.e. we just compare whether the large model's probability for the draft token ≥ the draft's probability.
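
For reference, here is a minimal sketch of the accept rule being described, in the spirit of the speculative sampling paper linked in the note below (simplified; the adjusted resampling of a replacement token on rejection is only mentioned in a comment):

```python
import random

def accept_draft_token(p_large: float, p_draft: float) -> bool:
    """Rejection-sampling accept rule for one draft token.

    p_large: probability the large (target) model assigns to the draft token
    p_draft: probability the draft model assigned to it when proposing

    The token is accepted with probability min(1, p_large / p_draft): always
    accepted when p_large >= p_draft, otherwise accepted only sometimes. This
    randomness is what keeps the overall procedure equivalent to sampling
    directly from the large model.
    """
    u = random.random()  # u ~ Uniform(0, 1)
    return u < min(1.0, p_large / p_draft)

# On rejection, the full method resamples a replacement token from the normalized
# residual distribution max(0, p_large(.) - p_draft(.)); omitted here for brevity.
```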

> [!NOTE]
> I recommend looking at [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for a simple implementation, and the [original paper](https://arxiv.org/abs/2302.01318) for the math details and the proof of equivalence to sampling from the full model.

vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
Member

Suggested change
vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
vLLM V1 does not support the LLM draft model method, instead it implements faster proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).

Collaborator Author

@gordicaleksa commented Sep 8, 2025

Valid point; it is, strictly speaking, mathematically equivalent, as I mentioned.

What I meant by "less accurate" is that, in expectation, you'd expect a good LLM draft model to be better / "more accurate" (loosely speaking) at predicting k tokens than an n-gram model.
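
As an aside on the n-gram scheme mentioned in the excerpt: here is a minimal sketch of the general idea (prompt-lookup-style matching; an illustration only, not vLLM's actual implementation):

```python
def ngram_propose(token_ids: list[int], n: int = 3, k: int = 4) -> list[int]:
    """Propose up to k draft tokens by matching the last n tokens against earlier context.

    If the most recent n-gram appeared earlier in the sequence, speculate that the
    tokens which followed it then will follow it again now. Returns [] on no match.
    """
    if len(token_ids) < n + 1:
        return []
    tail = token_ids[-n:]
    # Scan right-to-left (excluding the tail itself) for the most recent earlier match.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == tail:
            return token_ids[start + n:start + n + k]
    return []

# The trigram (5, 6, 7) appeared earlier followed by 8, 9, so with k=2 we draft [8, 9].
print(ngram_propose([5, 6, 7, 8, 9, 1, 5, 6, 7], n=3, k=2))  # -> [8, 9]
```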

1. <b>Instantiation</b> — During engine construction, connectors are created in two places:
* Inside the worker's init device procedure (under init worker distributed environment function), with role "worker".
* Inside the scheduler constructor, with role "scheduler".
2. <b>Cache lookup</b> — When the scheduler processes prefill requests from the <code>waiting</code> queue (after local prefix-cache checks), it calls connector's <code>get_num_new_matched_tokens</code>. This checks for externally cached tokens in the KV-cache server. Prefill always sees 0 here; decode may have a cache hit. The result is added to the local count before calling <code>allocate_slots</code>.
Member

Technically prefill can also have a cache hit, but I'm not sure if we implement it. cc @njhill @robertgshaw2-redhat, who might know more.

Collaborator Author

For the SharedStorageConnector implementation, this is how it works.

Member

@gordicaleksa sorry, I think I missed this too before. I'm not sure that it's a good idea to use SharedStorageConnector for a P/D example.

I'll try to summarize here:

  • The KV Connector interface is for generically plugging in KV cache distribution / sharing capability.
  • The two main categories for this are (1) P/D disaggregation and (2) offloading / tiered caching.
  • These are kind of separate but could overlap in that P/D disagg can be achieved by P offloading to shared storage which is subsequently consumed by D.
  • We also support using multiple connectors at the same time, so that you could have one doing offloading to CPU memory and another one handling P/D (in this case it would usually be the prefill worker that benefits from an offloaded cache hit, i.e. get_num_new_matched_tokens > 0).
  • Our "native" connector for P/D is NixlConnector ... for this you are right that it's only decode that will get get_num_new_matched_tokens > 0, since prefill is the "producer" and decode is the "consumer". There is an out-of-band piece to this where it's expected that an external router will pass metadata returned from the P worker to the D worker.
  • SharedStorageConnector is intended as an example implementation, but is in the (2) category ... as such it doesn't really have a prefill / decode worker distinction; it just writes KV cache to disk as it's computed, and then for new requests it checks against that disk cache to see if any can be loaded.
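
To anchor the "cache lookup" step from the excerpt above: a rough, heavily simplified sketch of the scheduler-side flow. Everything here is a hypothetical stand-in (the real get_num_new_matched_tokens / allocate_slots signatures and return shapes differ):

```python
# Heavily simplified sketch of the scheduler-side lookup; all names and shapes
# are stand-ins, not vLLM's real interfaces.

class FakeConnector:
    """Stand-in for a KV connector backed by an external KV-cache store."""
    def __init__(self, external_hits: int):
        self.external_hits = external_hits

    def get_num_new_matched_tokens(self, prompt_len: int, num_local_hit: int) -> int:
        # How many tokens beyond the local prefix-cache hit can be loaded externally.
        return min(self.external_hits, prompt_len - num_local_hit)

def plan_prefill(prompt_len: int, num_local_hit: int, connector: FakeConnector,
                 block_size: int = 16) -> dict:
    num_external_hit = connector.get_num_new_matched_tokens(prompt_len, num_local_hit)
    num_computed = num_local_hit + num_external_hit   # tokens whose KV we can reuse/load
    num_new_tokens = prompt_len - num_computed        # tokens that still need a forward pass
    # allocate_slots (not shown) would reserve KV blocks covering the whole prompt.
    blocks_needed = -(-prompt_len // block_size)      # ceil division
    return {"computed": num_computed, "to_compute": num_new_tokens, "blocks": blocks_needed}

print(plan_prefill(prompt_len=100, num_local_hit=32, connector=FakeConnector(48)))
# -> {'computed': 80, 'to_compute': 20, 'blocks': 7}
```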

<picture>
<img src="/assets/figures/2025-vllm-anatomy/multiprocexecutor.png" width="100%">
</picture><br>
<b>Figure 13</b>: MultiProcExecutor in a TP=8 setting (driver worker being rank 0)
Member

In V1, the executor actually lives in another process, and it is not TP rank 0.

2. The constructor loops over <code>world_size</code> (e.g. <code>TP=8 ⇒ world_size=8</code>) and spawns a daemon process for each rank via <code>WorkerProc.make_worker_process</code>.
3. For each worker, the parent first creates a reader and writer pipe.
4. The new process runs <code>WorkerProc.worker_main</code>, which instantiates a worker (going through the same "init device", "load model", etc. as in <code>UniprocExecutor</code>).
5. Each worker determines whether it is the driver (rank 0 in the TP group) or a regular worker. Every worker sets up two queues:
Member

cc @njhill to confirm: V1 does not have the concept of a driver worker, IIRC. All workers are the same, and they communicate with the parent (the executor process).

Collaborator Author

I'm fairly sure this is correct: a) I can double check, b) Nick did review it as well.

Member

Sorry, I think I missed this in my original review.

Yes, @youkaichao is right that there isn't a concept of a driver worker anymore. There is a small optimization where we usually only return the per-step outputs from one of the workers to the scheduler (usually rank 0). When using KV connectors this isn't the case, but it's probably too in-the-weeds a detail to mention here anyhow.

Collaborator Author

Hmm, are we taking into account that I'm using an early-August commit for the analysis?

Will address your new comments by tomorrow EOD!

Member

Yes, this is a major V0 vs V1 difference, and V0 is essentially gone now.
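
For a feel for the "one daemon process per rank, connected to the parent via pipes" pattern from the excerpt above, here is a generic multiprocessing sketch; it is not vLLM's actual MultiProcExecutor/WorkerProc code, and all names are made up:

```python
# Generic sketch of spawning one daemon worker process per rank, each connected
# to the parent by a pipe. Not vLLM's actual code; names are made up.
import multiprocessing as mp

def worker_main(rank: int, conn) -> None:
    # A real worker would "init device", "load model", etc., then loop on work items.
    while True:
        msg = conn.recv()
        if msg == "shutdown":
            break
        conn.send(f"rank {rank} handled: {msg}")

def spawn_workers(world_size: int):
    procs, parent_conns = [], []
    for rank in range(world_size):
        parent_conn, child_conn = mp.Pipe()  # one reader/writer pair per worker
        p = mp.Process(target=worker_main, args=(rank, child_conn), daemon=True)
        p.start()
        procs.append(p)
        parent_conns.append(parent_conn)
    return procs, parent_conns

if __name__ == "__main__":
    procs, conns = spawn_workers(world_size=2)
    for c in conns:
        c.send("execute_model(batch_0)")
    print([c.recv() for c in conns])   # collect one result per rank
    for c in conns:
        c.send("shutdown")
    for p in procs:
        p.join()
```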

* <b>Dummy steps for lockstep</b> — if any DP replica has work, all replicas execute a forward step; replicas without requests perform a dummy step to participate in required synchronization points (avoids blocking the active replica).

> [!NOTE]
> Lockstep clarification: this is actually only required for MoE models where the expert layers form an EP or TP group while attention layers are still DP. It's currently always done with DP - this is just because there's limited use for "built-in" non-MoE DP since you could just run multiple independent vLLMs and load-balance between them in a normal way.
Member

Although there's a clarification here, I'm afraid this can be misleading. I already hear many complaints from users that using the -dp argument is much slower than spinning up independent vLLM instances. This is expected due to the extra per-step synchronization, but not intended. vLLM's DP is only meant for MoE models with EP and DP > 1. Using DP in cases where vLLM instances can be totally independent is very dangerous (slow).

@njhill do you think we should raise an error if -dp is larger than 1 while people do not enable expert parallelism?

Collaborator Author

@gordicaleksa commented Sep 8, 2025

I observed the above behavior in the codebase, and Nick was the one who gave me this clarification during his review before I published (it's the one that's now in the note).

Member

@njhill commented Sep 8, 2025

Discussed with @youkaichao separately but just adding here for completeness:

  • Expert layers use TP with DP if EP isn't enabled, so we would potentially want to raise an error / warn for non-MoE models rather than when EP isn't enabled.
  • It should be straightforward, though, to just disable the additional coordination in the DP + non-MoE case, so that -dp can be used for convenient general scale-out. I will open an issue for this.
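
To illustrate the lockstep rule under discussion, here's a toy model of just the decision logic (not vLLM's actual DP coordination code):

```python
# Toy model of the DP "lockstep" rule discussed above; not vLLM's coordination code.
from dataclasses import dataclass, field

@dataclass
class Replica:
    rank: int
    waiting: list = field(default_factory=list)  # queued requests on this DP replica

    def step(self, dummy: bool) -> str:
        # A real step runs the scheduled batch; a dummy step runs a minimal batch
        # purely so the MoE expert layers' collective ops (EP/TP group) line up.
        return f"rank {self.rank}: {'dummy' if dummy else 'real'} step"

def lockstep(replicas: list[Replica]) -> list[str]:
    # Stand-in for the cross-replica "does anyone have work?" synchronization point.
    if not any(r.waiting for r in replicas):
        return []  # no replica has requests, so nobody steps
    # If any replica has work, every replica steps (dummy if its queue is empty).
    return [r.step(dummy=not r.waiting) for r in replicas]

print(lockstep([Replica(0, ["req-a"]), Replica(1)]))
# -> ['rank 0: real step', 'rank 1: dummy step']
```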

Member

@youkaichao left a comment

This is amazing work! I took a full read and learned a lot from your post as well :)

Left some comments, some are small and some might be important. Feel free to DM me if I'm slow to respond.

@gordicaleksa
Collaborator Author

gordicaleksa commented Sep 8, 2025

@youkaichao thanks again for the review! I added you to the acknowledgment section!

A few points you raised above are potentially valid, and we should correct those post-merge if it turns out they are mistakes.

@gordicaleksa merged commit 920be5e into main on Sep 8, 2025
2 checks passed
@hmellor deleted the gordicaleksa/anatomy-vllm branch on September 8, 2025 at 07:14