
Conversation

@sarckk (Collaborator) commented Dec 2, 2025

Purpose

Add logging related to CUDA graph dispatch, namely the distribution of padded/unpadded token counts and the runtime modes observed over a period of time. This information is useful for tuning CUDA graph capture sizes.

Adds a user-facing `--cudagraph-metrics` flag to enable this.

Test Plan

```shell
VLLM_LOGGING_LEVEL=DEBUG vllm serve meta-llama/Llama-3.1-8B --cudagraph-metrics
```

Test Result

Logs:

```
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] **CUDAGraph Config Settings:**
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] 
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] - Mode: FULL_AND_PIECEWISE
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] - Capture sizes: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] 
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] **CUDAGraph Stats:**
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] 
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | Unpadded Tokens | Padded Tokens | Num Paddings | Runtime Mode | Count |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] |-----------------|---------------|--------------|--------------|-------|
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1               | 1             | 0            | FULL         | 1687  |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1200            | 1200          | 0            | NONE         | 2     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1275            | 1275          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1189            | 1189          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1226            | 1226          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1179            | 1179          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1300            | 1300          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1247            | 1247          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1254            | 1254          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1202            | 1202          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1230            | 1230          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1112            | 1112          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1255            | 1255          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1267            | 1267          | 0            | NONE         | 1     |
(APIServer pid=3606348) INFO 12-02 16:51:04 [compilation/cuda_graph.py:108] | 1151            | 1151          | 0            | NONE         | 1     |
```
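For readers interpreting the table: a rough sketch of the bookkeeping behind it (not vLLM's actual dispatch code; the capture-size list is an abbreviated stand-in). A batch is padded up to the smallest capture size that fits, and `Num Paddings` is the difference; batches larger than the biggest capture size fall back to eager execution (runtime mode `NONE`) with no padding, which is why the >512-token rows above show zero paddings.

```python
import bisect

# Illustrative subset of capture sizes (sorted ascending), not the full list.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

def dispatch(num_unpadded_tokens: int) -> tuple[int, int, str]:
    """Return (padded tokens, num paddings, runtime mode) for one batch."""
    i = bisect.bisect_left(CAPTURE_SIZES, num_unpadded_tokens)
    if i == len(CAPTURE_SIZES):
        # Too large for any captured graph: run eagerly, no padding.
        return num_unpadded_tokens, 0, "NONE"
    padded = CAPTURE_SIZES[i]
    return padded, padded - num_unpadded_tokens, "FULL"
```

This matches the first two table rows: 1 token runs a FULL graph with no padding, while 1200 tokens exceeds 512 and runs uncaptured. (The FULL vs. PIECEWISE distinction within captured modes is omitted here.)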

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces logging for CUDA graph related information, which is valuable for performance tuning. The changes are well-structured, adding a new configuration flag and plumbing the necessary statistics through the model execution path to the logger. I've identified a critical bug in the logging output format that needs to be addressed. Otherwise, the implementation looks good.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces logging for CUDA graph-related information, which is a valuable addition for performance tuning and debugging. The implementation is well-structured, adding a new configuration flag and plumbing the statistics through the model execution pipeline. I've found one minor issue in the formatting of the log output, where two columns are swapped, which could be misleading. The fix is straightforward. Overall, this is a good contribution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable logging feature for CUDA graph-related information, which will be helpful for performance tuning. The changes are well-structured, adding a new configuration flag and plumbing the necessary statistics through the system. I've identified a minor bug in the new logging class where the columns for padded and unpadded tokens were swapped in the output. I've provided a fix for this. Overall, this is a good addition to the project.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 2, 2025
@zhuohan123 zhuohan123 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 2, 2025
```python
class CUDAGraphStats:
    num_unpadded_tokens: int
    num_padded_tokens: int
    runtime_mode: str
```
Collaborator

Why not use the `CUDAGraphMode` enum?

Collaborator Author

@sarckk sarckk Dec 2, 2025

I didn't use an enum because `CUDAGraphStats` needs to be serializable, and `CUDAGraphMode` is not serializable (without more intrusive changes or writing a custom serializer for msgpack), since some of its values are tuples of other enum members (`FULL_DECODE_ONLY` and `FULL_AND_PIECEWISE`). The error is: `Enums must contain either all str or all int values - type <enum 'CUDAGraphMode'> is not supported`
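To illustrate the serialization issue described in this comment, here is a minimal stdlib sketch. The enum values are a guess reconstructed from the comment (the real definition lives in vLLM), and `json` stands in for the msgpack-based encoder that produces the quoted error; storing the mode's name as a plain `str` is consistent with the `runtime_mode: str` field in the diff.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum

# Rough reconstruction of the shape of CUDAGraphMode: two members hold tuples
# of other members, so the enum is neither all-int nor all-str and the
# msgpack-based encoder rejects it.
class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = (FULL, NONE)         # tuple value -> not serializable
    FULL_AND_PIECEWISE = (FULL, PIECEWISE)  # tuple value -> not serializable

# Workaround: store the mode's *name* as a plain str, so the stats record
# contains only ints and strs and serializes without custom hooks.
@dataclass
class CUDAGraphStats:
    num_unpadded_tokens: int
    num_padded_tokens: int
    runtime_mode: str

stats = CUDAGraphStats(1189, 1189, CUDAGraphMode.NONE.name)
payload = json.dumps(asdict(stats))
```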

```python
row_counts = Counter(self.rows)

# Create header
header = (
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we generate a markdown table instead?

Collaborator Author

I updated the PR to print a markdown table, but it needed extra logic to keep the columns aligned.
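The alignment logic mentioned here can be sketched roughly as follows (function name and signature are illustrative, not the PR's actual code); aggregating repeated observations with `collections.Counter` mirrors the `row_counts = Counter(self.rows)` line in the diff:

```python
from collections import Counter

def render_markdown_table(headers: list[str], rows: list[tuple]) -> str:
    """Render rows as a markdown table with pipe-aligned columns."""
    str_rows = [[str(cell) for cell in row] for row in rows]
    # Pad every cell to the widest entry in its column so the pipes line up.
    widths = [
        max(len(cells[i]) for cells in [headers, *str_rows])
        for i in range(len(headers))
    ]

    def fmt(cells: list[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(headers), separator, *map(fmt, str_rows)])

# Aggregate duplicate observations, then render one row per distinct tuple.
observed = Counter([(1, 1, 0, "FULL")] * 3 + [(1200, 1200, 0, "NONE")])
table = render_markdown_table(
    ["Unpadded Tokens", "Padded Tokens", "Num Paddings", "Runtime Mode", "Count"],
    [(*key, count) for key, count in observed.items()],
)
```

Every rendered line ends up the same length, which is the property the extra logic buys.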

Signed-off-by: Yong Hoon Shin <[email protected]>
@22quinn 22quinn merged commit 69520bc into vllm-project:main Dec 3, 2025
54 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 3, 2025

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Done

4 participants