Commit 3d54a1a

[None] [feat] nsys profile output kernel classifier (NVIDIA#7020)
Signed-off-by: Grace Ho <[email protected]>

6 files changed: +585 −0 lines
# gputrc2graph.py

This script processes NVIDIA Nsight Systems (`nsys`) GPU trace files
(`.nsys-rep`) collected with `-t cuda` tracing enabled, and generates
kernel-level summaries and visualizations of GPU and non-GPU time. It is
useful for profiling and analyzing nsys profile output.

## Usage

### Command-line Arguments

- `--in_file` **(required)**
  List of input files and their metadata. Each entry should be in the format
  `<nsys-rep>,<engine>,<model>,<elapsed_nonprofiled_sec>`:
  - `nsys-rep`: Path to the `.nsys-rep` file.
  - `engine`: Engine name (e.g., `trtllm`).
  - `model`: Model name (e.g., `llama`, `gpt-oss`, `ds`).
  - `elapsed_nonprofiled_sec`: Wall-clock runtime (in seconds) without
    profiling. Specify `0` to use the elapsed GPU time calculated from the
    nsys-rep file (this may inflate non-GPU time if the actual runtime
    without profiling is shorter).

  Multiple entries can be provided, separated by spaces.

- `--out_dir`
  Output directory for the generated CSV and HTML files.
  If not specified, results are saved in the current directory.

- `--title`
  Title for the HTML chart/visualization.

- `--nsys_cmd`
  Path to the `nsys` command.
  Default: `nsys` (assumes it is in your PATH).
  Use this if `nsys` is not in your system PATH.
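Each `--in_file` entry is a plain comma-separated string, so it decomposes as sketched below (this parsing helper is illustrative only, not the script's actual code):

```python
# Decompose one --in_file entry of the form
# <nsys-rep>,<engine>,<model>,<elapsed_nonprofiled_sec>.
# Hypothetical helper for illustration only.
entry = "run1.nsys-rep,trtllm,llama,0.35"
rep, engine, model, elapsed = entry.split(",")
elapsed_sec = float(elapsed)  # 0 would mean "derive elapsed time from the trace"
print(rep, engine, model, elapsed_sec)
```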
## Notes

- Make sure the pandas and plotly Python packages are installed.
- Make sure [nsys](https://developer.nvidia.com/nsight-systems/get-started) is
  installed, and specify the path to the `nsys` command with `--nsys_cmd` if it
  is not in your PATH.
- For more details on available engines and models, see the help string in
  the script or run:

```bash
python3 gputrc2graph.py --help
```
## Example 1: analyze a single profile

To analyze the GPU cycles of, for example, a llama-3.1-8B model with trtllm:

1. Run the following command to collect an nsys profile for a trtllm-serve
   configuration:

   ```bash
   nsys profile -t cuda -o nsys_res -f true --trace-fork-before-exec=true \
     --cuda-graph-trace=node --delay <DELAY> --duration <DURATION> \
     python3 -m trtllm-serve meta-llama/Llama-4-Scout-17B-16E-Instruct ...
   ```

   where:

   - DELAY: how many seconds to delay nsys profile collection, needed so that
     profiles aren't captured until the trtllm server has come up and load
     generation starts.
   - DURATION: how many seconds the nsys profile runs before the profile is
     generated. This should be greater than the duration of the run.

2. Run again, this time without collecting the profile, and record the total
   run time in seconds. The script uses this value to calculate the CPU
   (non-GPU) seconds for the analysis.

3. Say the elapsed time from step 2 is .35 seconds. Run the script to
   analyze:

   ```bash
   python3 gputrc2graph.py \
     --in_file run1.nsys-rep,trtllm,llama,.35
   ```
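The role of the step-2 measurement can be shown with simple arithmetic; the GPU-busy number below is a made-up placeholder, not a value from any real trace:

```python
# Non-GPU (CPU) seconds = wall-clock runtime without profiling,
# minus the GPU-busy time summed from the trace.
elapsed_nonprofiled_sec = 0.35  # measured in step 2, without profiling
gpu_busy_sec = 0.21             # placeholder: sum of kernel durations
non_gpu_sec = elapsed_nonprofiled_sec - gpu_busy_sec
print(f"non-GPU: {non_gpu_sec:.2f} s")
```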
The command will produce 2 files for analysis:

- result.html: categorizes kernel names into different categories in a
  stacked bar chart.
- result.csv: shows how the kernel names are mapped to the different
  categories.
### HTML visualization with result.html

The html file shows the number of elapsed seconds spent in the different GPU
substages, or categories. Here moe_gemm is the biggest category, at .14
seconds, followed by the "attn" kernels. This lets the user prioritize which
kernels to focus on for performance optimizations.

![Example GPU Trace Visualization](images/html.png)

There's also a data table appended underneath the bar chart for copying out to
other post-processing tools.

![Example GPU Trace Visualization Table](images/html_tbl.png)
### Kernel to category mapping with result.csv

Suppose the user would like to focus on reducing the calls to nccl kernels.
The next step is to use result.csv to dig into which kernels compose the nccl
GPU cycles. The following image shows that the ar_fusion all-reduce kernel is
the biggest contributor to GPU cycles for nccl, followed by AllGather.

![Example GPU Trace csv](images/csv.png)
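Because result.csv is a plain kernel-to-category table, a few lines of pandas can rank the kernels inside one category; the rows and column names below are invented stand-ins, not the script's actual schema:

```python
import pandas as pd

# Invented rows mimicking a result.csv kernel/category mapping.
df = pd.DataFrame({
    "kernel": ["ar_fusion_allreduce", "AllGather", "moe_gemm_kernel"],
    "category": ["nccl", "nccl", "moe_gemm"],
    "sec": [0.030, 0.010, 0.140],
})
# Rank the nccl kernels by GPU seconds, largest first.
nccl = df[df["category"] == "nccl"].sort_values("sec", ascending=False)
print(nccl)
```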
## Example 2: analyze multiple profiles

Suppose the user has multiple nsys trace files, captured for different models
(say llama and gpt-oss in this case), and wishes to compare their GPU/non-GPU
time. A command like the following can be used:

```bash
python3 gputrc2graph.py \
  --in_file run1.nsys-rep,trtllm,llama,100 run2.nsys-rep,trtllm,gpt-oss,102 \
  --out_dir results
```

The analysis process is similar to example 1, but now there will be multiple
stacked bar charts that can be compared. The categories for the different
kernels remain the same, so it's easy to compare the GPU cycles for the same
categories.

Once a category is shown to have more cycles for one configuration than
another, the next step is to use the csv file to see which kernels are mapped
into that category, and which of them take the largest amount of time and so
drive the difference for the overall category.
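The same comparison can be done numerically from the csv output; here is a pandas sketch with invented per-category seconds (not real measurements):

```python
import pandas as pd

# Invented per-category GPU seconds for two runs.
rows = [
    ("llama",   "moe_gemm", 0.14), ("llama",   "attn", 0.08),
    ("gpt-oss", "moe_gemm", 0.10), ("gpt-oss", "attn", 0.11),
]
df = pd.DataFrame(rows, columns=["model", "category", "sec"])
# Pivot so each model is a column, then diff the two configurations.
pivot = df.pivot(index="category", columns="model", values="sec")
pivot["delta"] = pivot["llama"] - pivot["gpt-oss"]
print(pivot)
```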
## Example 3: add new classification for a new model

The script automatically picks up all the json files in the same directory as
gputrc2graph.py as engine/model specifications. To create a new engine DEF
with a model ABC, just add another json file in that directory with the same
format as the other json files.

For example, suppose this new model has 4 kernels to be classified into
"gemm" and "attn", where the gemm kernels have names containing "H" or "I",
and the attn kernels have names containing "J" or "K". The json file would
look like the following:

```json
{
    "DEF": {
        "ABC": {
            "H|I": "gemm",
            "J|K": "attn",
            "CUDA mem": "non-gpu-H_D_memops",
            ".*": "misc"
        }
    }
}
```
Each entry in the dictionary consists of:

- key: a regex used to classify the kernels
- value: the category to classify the kernels into.

The last 2 entries are common to all engines/models, consisting of CUDA memory
operations and a 'misc' catch-all for anything left over that can't be
classified.

When invoking gputrc2graph.py, specify a trace file with this new
engine/model like the following:

```bash
--in_file new.nsys-rep,DEF,ABC,<runtime>
```

If the engine_DEF.json file already exists, just add the model as a new node
in the existing engine file, after the other models.
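A regex-to-category mapping like the one above can be applied in order with Python's `re` module; this classifier is a sketch of the idea, not the script's actual implementation:

```python
import re

# Mapping from the hypothetical engine DEF / model ABC example above.
mapping = {
    "H|I": "gemm",
    "J|K": "attn",
    "CUDA mem": "non-gpu-H_D_memops",
    ".*": "misc",
}

def classify(kernel_name: str) -> str:
    # Patterns are tried in insertion order; ".*" guarantees a match.
    for pattern, category in mapping.items():
        if re.search(pattern, kernel_name):
            return category
    return "misc"

print(classify("kernel_H_fused"))  # gemm
print(classify("J_flash_fwd"))     # attn
print(classify("zzz"))             # misc
```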
