
Conversation

@sarckk (Collaborator) commented Dec 2, 2025

Purpose

Add logging related to CUDA graph dispatch, namely the distribution of padded/unpadded token counts and the runtime modes observed over a period of time. This information is useful for tuning CUDA graph capture sizes.

Adds a user-facing `--cudagraph-metrics` flag to enable this.

Test Plan

```shell
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.1-8B --cudagraph-metrics
```

Test Result

Logs:

```
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] **CUDAGraph Config Settings:**
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] 
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] - Mode: FULL_AND_PIECEWISE
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] - Capture sizes: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] 
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] **CUDAGraph Stats:**
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] 
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | Unpadded Tokens | Padded Tokens | Num Paddings | Runtime Mode | Count |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] |-----------------|---------------|--------------|--------------|-------|
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1               | 1             | 0            | FULL         | 1687  |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1200            | 1200          | 0            | NONE         | 2     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1275            | 1275          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1189            | 1189          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1226            | 1226          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1179            | 1179          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1300            | 1300          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1247            | 1247          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1254            | 1254          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1202            | 1202          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1230            | 1230          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1112            | 1112          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1255            | 1255          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1267            | 1267          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1151            | 1151          | 0            | NONE         | 1     |
```
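For readers interpreting the table: a rough sketch of the bookkeeping behind it (not vLLM's actual dispatch code; the capture-size list is an abbreviated stand-in). A batch is padded up to the smallest capture size that fits, and `Num Paddings` is the difference; batches larger than the biggest capture size fall back to eager execution (runtime mode `NONE`) with no padding, which is why the >512-token rows above show zero paddings.

```python
import bisect

# Illustrative subset of capture sizes (sorted ascending), not the full list.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

def dispatch(num_unpadded_tokens: int) -> tuple[int, int, str]:
    """Return (padded tokens, num paddings, runtime mode) for one batch."""
    i = bisect.bisect_left(CAPTURE_SIZES, num_unpadded_tokens)
    if i == len(CAPTURE_SIZES):
        # Too large for any captured graph: run eagerly, no padding.
        return num_unpadded_tokens, 0, "NONE"
    padded = CAPTURE_SIZES[i]
    return padded, padded - num_unpadded_tokens, "FULL"
```

This matches the first two table rows: 1 token runs a FULL graph with no padding, while 1200 tokens exceeds 512 and runs uncaptured. (The FULL vs. PIECEWISE distinction within captured modes is omitted here.)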

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces logging for CUDA graph related information, which is valuable for performance tuning. The changes are well-structured, adding a new configuration flag and plumbing the necessary statistics through the model execution path to the logger. I've identified a critical bug in the logging output format that needs to be addressed. Otherwise, the implementation looks good.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces logging for CUDA graph-related information, which is a valuable addition for performance tuning and debugging. The implementation is well-structured, adding a new configuration flag and plumbing the statistics through the model execution pipeline. I've found one minor issue in the formatting of the log output, where two columns are swapped, which could be misleading. The fix is straightforward. Overall, this is a good contribution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable logging feature for CUDA graph-related information, which will be helpful for performance tuning. The changes are well-structured, adding a new configuration flag and plumbing the necessary statistics through the system. I've identified a minor bug in the new logging class where the columns for padded and unpadded tokens were swapped in the output. I've provided a fix for this. Overall, this is a good addition to the project.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 2, 2025
@zhuohan123 zhuohan123 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 2, 2025
```python
class CUDAGraphStats:
    num_unpadded_tokens: int
    num_padded_tokens: int
    runtime_mode: str
```
Collaborator

Why not use the `CUDAGraphMode` enum?

Collaborator Author

@sarckk sarckk Dec 2, 2025

I didn't use an enum because `CUDAGraphStats` needs to be serializable, and `CUDAGraphMode` is not serializable (without more intrusive changes or writing a custom serializer for msgpack), since some of its values are tuples of other enum members (`FULL_DECODE_ONLY` and `FULL_AND_PIECEWISE`). The error is: `Enums must contain either all str or all int values - type <enum 'CUDAGraphMode'> is not supported`
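To illustrate the serialization issue described in this comment, here is a minimal stdlib sketch. The enum values are a guess reconstructed from the comment (the real definition lives in vLLM), and `json` stands in for the msgpack-based encoder that produces the quoted error; storing the mode's name as a plain `str` is consistent with the `runtime_mode: str` field in the diff.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum

# Rough reconstruction of the shape of CUDAGraphMode: two members hold tuples
# of other members, so the enum is neither all-int nor all-str and the
# msgpack-based encoder rejects it.
class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)         # tuple value -> not serializable
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)  # tuple value -> not serializable

# Workaround: store the mode's *name* as a plain str, so the stats record
# contains only ints and strs and serializes without custom hooks.
@dataclass
class CUDAGraphStats:
    num_unpadded_tokens: int
    num_padded_tokens: int
    runtime_mode: str

stats = CUDAGraphStats(1189, 1189, CUDAGraphMode.NONE.name)
payload = json.dumps(asdict(stats))
```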

```python
row_counts = Counter(self.rows)

# Create header
header = (
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we generate a markdown table instead?

Collaborator Author

I updated the PR to print a markdown table, but it needed extra logic to keep the columns aligned.
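The alignment logic mentioned here can be sketched roughly as follows (function name and signature are illustrative, not the PR's actual code); aggregating repeated observations with `collections.Counter` mirrors the `row_counts = Counter(self.rows)` line in the diff:

```python
from collections import Counter

def render_markdown_table(headers: list[str], rows: list[tuple]) -> str:
    """Render rows as a markdown table with pipe-aligned columns."""
    str_rows = [[str(cell) for cell in row] for row in rows]
    # Pad every cell to the widest entry in its column so the pipes line up.
    widths = [
        max(len(cells[i]) for cells in [headers, *str_rows])
        for i in range(len(headers))
    ]

    def fmt(cells: list[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(headers), separator, *map(fmt, str_rows)])

# Aggregate duplicate observations, then render one row per distinct tuple.
observed = Counter([(1, 1, 0, "FULL")] * 3 + [(1200, 1200, 0, "NONE")])
table = render_markdown_table(
    ["Unpadded Tokens", "Padded Tokens", "Num Paddings", "Runtime Mode", "Count"],
    [(*key, count) for key, count in observed.items()],
)
```

Every rendered line ends up the same length, which is the property the extra logic buys.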

Signed-off-by: Yong Hoon Shin <[email protected]>
@22quinn 22quinn merged commit 69520bc into vllm-project:main Dec 3, 2025
54 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 3, 2025

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Done

4 participants