[New blog] Inside vLLM: Anatomy of a High-Throughput LLM Inference System #80
Conversation
Deploying vllm-blog-source with Cloudflare Pages
Latest commit: beab729
Status: ✅ Deploy successful!
Preview URL: https://6cc3ed89.vllm-blog-source.pages.dev
Branch Preview URL: https://gordicaleksa-anatomy-vllm.vllm-blog-source.pages.dev
Ok i see i haven't done the DCO thing, let me fix that.
Force-pushed from 76ff8b4 to c16a69e
Very nice!! Any way we can get footnotes working properly?
@simon-mo which footnotes? not sure i understood?
Ah sorry I meant citations [1] [2]... they don't link to the actual references.
Force-pushed from 293aea1 to 66b6d03
oh ok, added!
@simon-mo lmk what you think now and happy to merge!
my nit comments are not blocking. we can publish first and then fix them.
for the decode step, `input_ids` only contains new tokens, same for `positions` and `slot_mapping`. we use `block_table` to keep track of the existing kv cache.
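To make that concrete, here is a toy illustration of how decode-step inputs differ from the prefill step. The token ids, block size, and dict layout are hypothetical, chosen only to mirror the field names in the comment above, not vLLM's actual data structures:

```python
# Toy example only: values and layout are illustrative, not vLLM's real tensors.
# One request, 5-token prompt, KV-cache block size of 4.

BLOCK_SIZE = 4

# Prefill step: the whole prompt goes through the model at once.
prefill_inputs = {
    "input_ids":    [101, 7592, 2088, 2003, 1037],  # all 5 prompt tokens
    "positions":    [0, 1, 2, 3, 4],
    # slot = block_id * BLOCK_SIZE + offset; blocks 0 and 1 are allocated.
    "slot_mapping": [0, 1, 2, 3, 4],
}

# Decode step: only the newly sampled token is fed; the KV cache of the first
# 5 tokens is located through the block table instead of being recomputed.
decode_inputs = {
    "input_ids":    [2307],     # just the new token
    "positions":    [5],        # its absolute position in the sequence
    "slot_mapping": [5],        # where its KV entry will be written
    "block_table":  [[0, 1]],   # per-request list of allocated KV blocks
}

print(len(prefill_inputs["input_ids"]), len(decode_inputs["input_ids"]))  # 5 1
```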
yeah, i'm aware of that, i made some simplifications so that i have to explain less hah
i'd cover those details if i covered the fwd pass kernel!
hope we can have some disclaimer for this.
I doubt the relevant attention metadata, input ids, etc. for decode are right :( you should be able to just print them after preparing the input.
same comment as above, i made a few simplifications in that drawing/explanation as i believe it's ok at that level of abstraction
@youkaichao thanks for the review! will address all a bit later tonight!
2. <b>Verify</b>: run the large model once on context + <code>k</code> draft tokens. This produces probabilities for those <code>k</code> positions plus one extra (so we get <code>k+1</code> candidates)
3. <b>Accept/reject</b>: going from left to right over the <code>k</code> draft tokens:
<ul>
<li>If the large model's probability for the draft token ≥ the draft's probability, accept it</li>
this looks incorrect for rejection sampling. we first sample a uniform distribution, and then compare `p_large(token)/p_draft(token)` with that uniform sample.
that's an implementation detail that doesn't contradict the explanation?
it is true that we sample ahead of time vs step by step but i didn't explicitly mention that above
i think this is quite different. Rejection sampling is random sampling, so there's a chance a draft token will be accepted or not.
Your description makes it deterministic, i.e. we just need to compare whether the large model's probability for the draft token ≥ the draft's probability.
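For reference, a minimal sketch of the accept/reject rule being discussed here, for a single draft token; the resample-on-rejection step is only noted in a comment:

```python
import random

def accept_draft_token(p_large: float, p_draft: float) -> bool:
    """Rejection-sampling acceptance for one draft token.

    p_large: probability the target (large) model assigns to the draft token
    p_draft: probability the draft model assigned when proposing it (> 0)
    """
    u = random.random()               # uniform sample in [0, 1)
    # Always accepted when p_large >= p_draft; otherwise accepted with
    # probability p_large / p_draft. This randomness is what keeps the
    # procedure equivalent to sampling from the large model.
    return u < p_large / p_draft

# On rejection, a replacement token is sampled from the normalized residual
# distribution max(0, p_large(x) - p_draft(x)); omitted here for brevity.
```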
> [!NOTE]
> I recommend looking at [gpt-fast](https://github.com/meta-pytorch/gpt-fast) for a simple implementation, and the [original paper](https://arxiv.org/abs/2302.01318) for the math details and the proof of equivalence to sampling from the full model.

vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10). |
Suggested change:
- vLLM V1 does not support the LLM draft model method, instead it implements faster—but less accurate—proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
+ vLLM V1 does not support the LLM draft model method, instead it implements faster proposal schemes: n-gram, EAGLE [[9]](#ref-9), and Medusa [[10]](#ref-10).
valid point, it is, strictly speaking, mathematically equivalent as i mentioned
what i meant by "less accurate" is that in expectation you'd expect a good LLM draft model to be better / "more accurate" (loosely speaking) at predicting k tokens than an n-gram model
1. <b>Instantiation</b> — During engine construction, connectors are created in two places:
* Inside the worker's init device procedure (under init worker distributed environment function), with role "worker".
* Inside the scheduler constructor, with role "scheduler".
2. <b>Cache lookup</b> — When the scheduler processes prefill requests from the <code>waiting</code> queue (after local prefix-cache checks), it calls the connector's <code>get_num_new_matched_tokens</code>. This checks for externally cached tokens in the KV-cache server. Prefill always sees 0 here; decode may have a cache hit. The result is added to the local count before calling <code>allocate_slots</code>.
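A heavily simplified, hypothetical sketch of that lookup path: only `get_num_new_matched_tokens` and `allocate_slots` are names from the text above; the stub classes, signatures, and numbers are made up for illustration and are not vLLM's real API.

```python
# Hypothetical sketch, NOT vLLM's real scheduler/connector code.

class LocalKVCacheManager:
    def get_num_cached_tokens(self, request):
        return 16  # pretend 16 leading tokens hit the local prefix cache

    def allocate_slots(self, request, num_computed_tokens):
        print(f"{request}: allocating blocks, {num_computed_tokens} tokens already computed")

class Connector:
    def get_num_new_matched_tokens(self, request, num_local_hit):
        # Ask the external KV-cache server for tokens beyond the local hit.
        # In the flow described above this is 0 on prefill and can be > 0 on decode.
        return 48

def schedule_waiting_request(request, kv_cache_manager, connector):
    num_local = kv_cache_manager.get_num_cached_tokens(request)
    num_external = connector.get_num_new_matched_tokens(request, num_local)
    # Externally matched tokens count as already computed before allocating slots.
    kv_cache_manager.allocate_slots(request, num_local + num_external)

schedule_waiting_request("req-0", LocalKVCacheManager(), Connector())
```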
technically prefill can also have cache hit, but i'm not sure if we implement it. cc @njhill @robertgshaw2-redhat might know more.
for the `SharedStorageConnector` implementation this is how it works
@gordicaleksa sorry I think I missed this too before. I'm not sure that it's a good idea to use `SharedStorageConnector` for a P/D example.
I'll try to summarize here:
- The KV Connector interface is for generically plugging in KV cache distribution / sharing capability.
- The two main categories for this are (1) P/D disaggregation and (2) offloading / tiered caching.
- These are kind of separate but could overlap, in that P/D disagg can be achieved by P offloading to shared storage which is subsequently consumed by D.
- We also support using multiple connectors at the same time, so that you could have one doing offloading to CPU memory and another one handling P/D (in this case it would usually be the prefill worker that benefits from an offloaded cache hit, i.e. `get_num_new_matched_tokens > 0`).
- Our "native" connector for P/D is `NixlConnector`... for this you are right that it's only decode that will get `get_num_new_matched_tokens > 0`, since prefill is the "producer" and decode is the "consumer". There is an out-of-band piece to this where it's expected that an external router will pass metadata returned from the P worker to the D worker.
- `SharedStorageConnector` is intended as an example implementation, but is in the (2) category... as such it doesn't really have a prefill / decode worker distinction; it just writes KV cache to disk as it's computed, and then for new requests it checks against that disk cache to see if any can be loaded.
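For orientation only, here is roughly what wiring up the `NixlConnector` through vLLM's KV-transfer config can look like. The import path, field names, and role value below are written from memory of the disaggregated-prefill examples and should be treated as assumptions that may not match every version:

```python
# Hedged sketch: KVTransferConfig's import path and field values are assumptions
# recalled from vLLM's P/D examples; check the docs for your version.
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill ("producer") and decode ("consumer") would each run their own instance
# like this; an external router then forwards the KV-transfer metadata returned
# by the prefill instance to the decode instance.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",
        kv_role="kv_both",  # assumed role value; see the P/D docs for exact settings
    ),
)
```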
<picture>
<img src="/assets/figures/2025-vllm-anatomy/multiprocexecutor.png" width="100%">
</picture><br>
<b>Figure 13</b>: MultiProcExecutor in a TP=8 setting (driver worker being rank 0) |
In v1, the executor actually lives in another process and it is not tp rank 0.
2. The constructor loops over <code>world_size</code> (e.g. <code>TP=8 ⇒ world_size=8</code>) and spawns a daemon process for each rank via <code>WorkerProc.make_worker_process</code>.
3. For each worker, the parent first creates a reader and writer pipe.
4. The new process runs <code>WorkerProc.worker_main</code>, which instantiates a worker (going through the same "init device", "load model", etc. as in <code>UniprocExecutor</code>).
5. Each worker determines whether it is the driver (rank 0 in the TP group) or a regular worker. Every worker sets up two queues: |
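As a toy illustration of steps 2–5 (generic `multiprocessing` code, not vLLM's actual `WorkerProc` implementation), spawning one daemon process per rank with a pipe back to the parent looks roughly like this:

```python
# Generic illustration of the spawn pattern, not vLLM's WorkerProc code.
import multiprocessing as mp

def worker_main(rank: int, conn) -> None:
    # In vLLM this is where "init device", "load model", etc. would happen.
    conn.send({"rank": rank, "status": "ready"})
    while True:
        msg = conn.recv()                      # wait for work from the executor
        if msg == "shutdown":
            break
        conn.send(f"rank {rank} handled {msg}")

if __name__ == "__main__":
    world_size = 8                             # e.g. TP=8
    procs, conns = [], []
    for rank in range(world_size):
        parent_conn, child_conn = mp.Pipe()    # reader/writer pair per worker
        p = mp.Process(target=worker_main, args=(rank, child_conn), daemon=True)
        p.start()
        procs.append(p)
        conns.append(parent_conn)

    print([c.recv()["rank"] for c in conns])   # wait until every rank is ready
    for c in conns:
        c.send("execute_model")                # broadcast a step to all ranks
    print([c.recv() for c in conns])           # collect per-rank results
    for c in conns:
        c.send("shutdown")
    for p in procs:
        p.join()
```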
cc @njhill to confirm, v1 does not have the concept of driver worker iirc. all workers are the same, and communicate with the parent (the executor process).
i'm fairly sure this is correct, a) i can double check b) nick did review it as well
Sorry, I think I missed this in my original review.
Yes @youkaichao is right that there isn't a concept of a driver worker anymore. There is a small optimization that we usually only return the per-step outputs from one of the workers to the scheduler (usually rank 0). When using kv connectors though this isn't the case, but it's probably too in-the-weeds detail to mention here anyhow.
hmm, are we taking into account that i'm using an early August commit for the analysis?
will address your new comments by tomorrow EOD!
Yes, this is a major V0 vs V1 difference, and V0 is essentially gone now.
* <b>Dummy steps for lockstep</b> — if any DP replica has work, all replicas execute a forward step; replicas without requests perform a dummy step to participate in required synchronization points (avoids blocking the active replica).

> [!NOTE]
> Lockstep clarification: this is actually only required for MoE models where the expert layers form an EP or TP group while attention layers are still DP. It's currently always done with DP - this is just because there's limited use for "built-in" non-MoE DP since you could just run multiple independent vLLMs and load-balance between them in a normal way.
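A tiny self-contained simulation (plain Python, nothing vLLM-specific) of the lockstep rule described above; the cross-rank all-reduce that real DP replicas would perform is stood in for by a plain `any()`:

```python
# Simulation only: "any()" stands in for the all-reduce real DP ranks would use
# to agree on whether anyone has work this step.

def lockstep_round(work_per_replica):
    """work_per_replica[i] = number of scheduled requests on DP replica i."""
    any_work = any(n > 0 for n in work_per_replica)
    actions = []
    for rank, n in enumerate(work_per_replica):
        if not any_work:
            actions.append(f"rank {rank}: idle, nobody steps")
        elif n > 0:
            actions.append(f"rank {rank}: forward pass on {n} request(s)")
        else:
            actions.append(f"rank {rank}: dummy forward pass (lockstep only)")
    return actions

print(*lockstep_round([3, 0, 0, 1]), sep="\n")  # ranks 1 and 2 run dummy steps
print(*lockstep_round([0, 0, 0, 0]), sep="\n")  # all idle, no one enters forward
```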
although there's clarification here, I'm afraid that this can be misleading. I already hear many complaints from users that using the `-dp` argument is much slower than spinning up independent vLLM instances. this is expected due to extra per-step synchronization, but not intended. vLLM's dp is only meant for MoE models with EP and DP > 1. Using dp in cases where vLLM instances can be totally independent is very dangerous (slow).
@njhill do you think we should raise an error if `-dp` is larger than 1 while people do not enable expert parallel?
I observed the above behavior in the codebase and Nick was the one who gave me this clarification during his review before i published (the one that's now in the note)
Discussed with @youkaichao separately but just adding here for completeness:
- Expert layers use TP with DP if EP isn't enabled, so we would potentially want to raise an error / warn for non-MoE models rather than when EP isn't enabled.
- It should be straightforward though to just disable the additional coordination in the DP + non-MoE case, so that `-dp` can be used for convenient general scale-out. I will open an issue for this.
This is amazing work! I took a full read and learned a lot from your post as well :)
Left some comments, some are small and some might be important. Feel free to DM me if I'm slow to respond.
@youkaichao thanks again for the review! I added you to the acknowledgment section! A few points you raised above are potentially valid, and we should correct those post-merge if it turns out they are mistakes.
Porting the original blog: https://www.aleksagordic.com/blog/vllm