Conversation

ksmusz
Contributor

@ksmusz ksmusz commented Sep 3, 2025

Warming up the sampler with different configurations eliminates the recompilations of larger sampler graphs that would otherwise occur during actual execution. As tested with example workloads and batch sizes, the only remaining sampler recompilations come from minor graphs, which have minimal influence on execution time.

Sampler warmup takes around 1-3 seconds, depending on the batch sizes of the buckets being warmed up.
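
For illustration, here is a minimal sketch of what such a warmup loop could look like. The helper name, the sampler call signature, and the parameter combinations are hypothetical, chosen only to show the idea of pre-compiling one sampler graph per (batch size, sampling configuration) pair; the actual implementation lives in hpu_model_runner.py.

```python
import itertools

import torch

def warmup_sampler(sampler, bucket_batch_sizes, vocab_size, device="hpu"):
    # Representative sampling configurations: greedy plus a couple of
    # random-sampling variants exercising the temperature, top-p and top-k paths.
    sampling_configs = [
        dict(temperature=0.0, top_p=1.0, top_k=0),   # greedy
        dict(temperature=1.0, top_p=1.0, top_k=0),   # plain random sampling
        dict(temperature=0.7, top_p=0.9, top_k=50),  # temperature + top-p + top-k
    ]
    for batch_size, cfg in itertools.product(bucket_batch_sizes, sampling_configs):
        # Dummy logits are sufficient: only the tensor shape and the sampling
        # parameters determine which sampler graph gets compiled.
        dummy_logits = torch.zeros(batch_size, vocab_size, device=device)
        sampler(dummy_logits, **cfg)
```

The point is simply that every combination of bucketed batch size and sampling mode the sampler will see at runtime is exercised once up front, so compilation happens before serving rather than during it.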

Additionally, this removes the case where the warmup method was called twice (previously visible as duplicated prints during the warmup phase, but with empty lists of warmed-up buckets, since all of them had already been warmed up).
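
A minimal sketch of that guard, assuming a simple boolean flag on the worker (attribute and method names here are hypothetical, not the exact hpu_worker.py code):

```python
class HPUWorker:
    def __init__(self):
        # Hypothetical flag: remembers that warmup already ran, so a second
        # call becomes a no-op instead of re-printing empty bucket lists.
        self._graphs_compiled = False

    def warmup_model(self):
        if self._graphs_compiled:
            return
        # ... run sampler warmup, then model graph warmup ...
        self._graphs_compiled = True
```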

Signed-off-by: Krzysztof Smusz <[email protected]>
@ksmusz
Contributor Author

ksmusz commented Sep 3, 2025

/run-gaudi-tests

@sys-hab-pt-service
Collaborator

Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn

@adobrzyn
Collaborator

adobrzyn commented Sep 4, 2025

/run-gaudi-tests

Signed-off-by: Krzysztof Smusz <[email protected]>
@adobrzyn
Collaborator

adobrzyn commented Sep 4, 2025

/run-gaudi-tests

@adobrzyn adobrzyn requested a review from Copilot September 5, 2025 07:50

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces sampler warmup as a separate warmup step to reduce graph recompilations during model execution. The warmup tests various sampling configurations with different batch sizes and temperature/top-p/top-k values to pre-compile sampler graphs, reducing compilation overhead during actual inference.

Key changes:

  • Added separate sampler warmup phase before model graph warmup
  • Refactored sampling code to extract common functionality into reusable methods
  • Modified worker warmup conditions to prevent redundant warmups

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
vllm_gaudi/v1/worker/hpu_worker.py | Added condition to prevent redundant model warmup when graphs are already compiled
vllm_gaudi/v1/worker/hpu_model_runner.py | Added comprehensive sampler warmup functionality and refactored sampling code

@adobrzyn
Collaborator

adobrzyn commented Sep 5, 2025

/run-gaudi-tests

@adobrzyn
Collaborator

adobrzyn commented Sep 9, 2025

/run-gaudi-tests

@kzawora-intel kzawora-intel enabled auto-merge (squash) September 9, 2025 11:44
@kzawora-intel
Collaborator

/run-gaudi-tests

1 similar comment
@kzawora-intel
Collaborator

/run-gaudi-tests

@kzawora-intel kzawora-intel merged commit a9702fe into vllm-project:main Sep 9, 2025
7 checks passed
kfojcik-intel pushed a commit to kfojcik-intel/vllm-gaudi that referenced this pull request Sep 12, 2025