# gputrc2graph.py

This script processes NVIDIA Nsight Systems (`nsys`) GPU trace files
(`.nsys-rep`) captured with `-t cuda` tracing enabled, and generates
kernel-level summaries and visualizations of GPU and non-GPU time. It is
useful for profiling and analyzing `nsys profile` output.

## Usage

### Command-line Arguments

- `--in_file`
  **(required)**
  List of input files and their metadata. Each entry should be in the format:
  `<nsys-rep>,<engine>,<model>,<elapsed_nonprofiled_sec>`
  - `nsys-rep`: Path to the `.nsys-rep` file.
  - `engine`: Engine name (e.g., `trtllm`).
  - `model`: Model name (e.g., `llama`, `gpt-oss`, `ds`).
  - `elapsed_nonprofiled_sec`: Wall-clock runtime (in seconds) without
    profiling. Specify `0` to use the elapsed GPU time calculated from the
    `.nsys-rep` file (this may inflate non-GPU time if the actual runtime
    without profiling is shorter).

  Multiple entries can be provided, separated by spaces.

- `--out_dir`
  Output directory for the generated CSV and HTML files.
  If not specified, results are saved in the current directory.

- `--title`
  Title for the HTML chart/visualization.

- `--nsys_cmd`
  Path to the `nsys` command.
  Default: `nsys` (assumes it is in your PATH).
  Use this if `nsys` is not in your system PATH.

## Notes

- Make sure the `pandas` and `plotly` Python packages are installed.
- Make sure [nsys](https://developer.nvidia.com/nsight-systems/get-started) is
  installed, and specify the path to the `nsys` command with `--nsys_cmd` if it
  is not in your PATH.
- For more details on available engines and models, see the help string in
  the script or run:

```bash
python3 gputrc2graph.py --help
```

## Example 1: analyze a single profile

To analyze the GPU cycles of, for example, a llama-3.1-8B model with trtllm:

1. Run the following command to collect an nsys profile for a `trtllm-serve`
   configuration:

   ```bash
   nsys profile -t cuda -o nsys_res -f true --trace-fork-before-exec=true \
     --cuda-graph-trace=node --delay <DELAY> --duration <DURATION> \
     python3 -m trtllm-serve meta-llama/Llama-4-Scout-17B-16E-Instruct ...
   ```

   where:

   - DELAY: how many seconds to delay nsys before it starts collecting the
     profile; needed so that the capture doesn't begin until the trtllm server
     has come up and load generation starts.
   - DURATION: how many seconds the nsys profile runs before the profile is
     generated. This should be longer than the duration of the run.

2. Run again, this time without collecting the profile, and record the total
   run time in seconds. The script uses this value to calculate the
   CPU (non-GPU) seconds for the analysis.

3. Say the elapsed run time from step 2 is 0.35 seconds. Run the script to
   analyze:

   ```bash
   python3 gputrc2graph.py \
     --in_file run1.nsys-rep,trtllm,llama,.35
   ```

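Step 2 above can be sketched in Python as a minimal timing harness. This is
only a sketch: the `sleep` is a placeholder standing in for the actual
(non-profiled) serve/benchmark command.

```python
import subprocess
import time

# Placeholder workload: replace the sleep with the actual command from
# step 1, run WITHOUT nsys profiling.
start = time.monotonic()
subprocess.run(["sleep", "0.2"], check=True)
elapsed = time.monotonic() - start

# Pass this value as <elapsed_nonprofiled_sec> in --in_file.
print(f"{elapsed:.2f}")
```

The shell's `time` builtin works equally well; the only thing the script
needs is the wall-clock seconds of the non-profiled run.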
Running the script produces two files for analysis:

- `result.html`: categorizes kernel names into different categories in a
  stacked bar chart.
- `result.csv`: shows how the kernel names are mapped to the different
  categories.

### HTML visualization with result.html

The HTML file shows the number of elapsed seconds spent in the different GPU
substages, or categories; here `moe_gemm` is the biggest category, at 0.14
seconds, followed by the `attn` kernels. This lets the user prioritize which
kernels to focus on for performance optimizations.

There's also an appended data table underneath the bar chart for copying out
to other post-processing tools.

### Kernel to category mapping with result.csv

Suppose the user would like to focus on reducing the time spent in nccl
kernels. The next step is to use result.csv to dive into which kernels
compose the nccl GPU cycles. The following image shows that the ar_fusion
all-reduce kernel is the biggest contributor to GPU cycles for nccl,
followed by AllGather.

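The same drill-down can be scripted with pandas. A minimal sketch, with a toy
DataFrame standing in for `pd.read_csv("result.csv")`; the column names
(`Kernel`, `Category`, `GPU_sec`) are hypothetical, so check them against the
actual CSV header:

```python
import pandas as pd

# Toy stand-in for pd.read_csv("result.csv"); column names are
# hypothetical, not necessarily what the script emits.
df = pd.DataFrame({
    "Kernel": ["ar_fusion_allreduce", "AllGather", "moe_gemm_kernel"],
    "Category": ["nccl", "nccl", "moe_gemm"],
    "GPU_sec": [0.05, 0.02, 0.14],
})

# Rank the kernels within the "nccl" category by GPU time.
nccl = (df[df["Category"] == "nccl"]
        .sort_values("GPU_sec", ascending=False))
print(nccl)
```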
## Example 2: analyze multiple profiles

Suppose the user has multiple nsys trace files captured for different models,
say llama and gpt-oss in this case, and wishes to compare their GPU/non-GPU
time. A command like the following can be used:

```bash
python3 gputrc2graph.py \
  --in_file run1.nsys-rep,trtllm,llama,100 run2.nsys-rep,trtllm,gpt-oss,102 \
  --out_dir results
```

The analysis process is similar to example 1, but now there will be multiple
stacked bar charts that can be compared. The categories for the different
kernels remain the same, so it's easy to compare the GPU cycles for the same
categories.

Once a category is shown to have more cycles for one configuration than
another, the next step is to use the csv file to see which kernels are mapped
into that category, and which kernels take the largest amount of time and
therefore drive the difference for the overall category.

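This per-category comparison can also be done programmatically. A sketch with
made-up numbers and hypothetical column names (`Model`, `Category`,
`GPU_sec`), standing in for the per-run CSV output:

```python
import pandas as pd

# Made-up per-category totals for two runs; in practice these rows would
# come from the generated CSVs (column names here are hypothetical).
df = pd.DataFrame({
    "Model": ["llama", "llama", "gpt-oss", "gpt-oss"],
    "Category": ["attn", "nccl", "attn", "nccl"],
    "GPU_sec": [30.0, 10.0, 25.0, 18.0],
})

# One column per model makes category-by-category differences obvious.
pivot = df.pivot_table(index="Category", columns="Model",
                       values="GPU_sec", aggfunc="sum")
print(pivot)
```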
## Example 3: add new classification for a new model

To create a new engine DEF with model ABC, add another JSON file in the same
directory as gputrc2graph.py, with the same format as the other JSON files.
The script automatically picks up all the JSON files in that directory as
engine/model specifications.

For example, suppose the new model has 4 kernels to be classified into "gemm"
and "attn", where the gemm kernels have names containing "H" or "I", and the
attn kernels have names containing "J" or "K". The JSON file would look like
the following:

```json
{
  "DEF": {
    "ABC": {
      "H|I": "gemm",
      "J|K": "attn",
      "CUDA mem": "non-gpu-H_D_memops",
      ".*": "misc"
    }
  }
}
```

Each entry in the dictionary consists of:

- key: a regex used to classify the kernels
- value: the category to classify the kernels into.

The last 2 entries are common to all engines/models: CUDA memory operations,
and a 'misc' catch-all for anything left over that can't be classified.

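A sketch of how such a regex-to-category map can be applied, under the
assumption that patterns are tried in insertion order and the first match
wins (which is why the `.*` catch-all comes last; the script's actual
matching order may differ):

```python
import re

# The regex -> category map from the JSON example above.
mapping = {
    "H|I": "gemm",
    "J|K": "attn",
    "CUDA mem": "non-gpu-H_D_memops",
    ".*": "misc",
}

def classify(kernel_name: str) -> str:
    # Assumption: first matching pattern wins.
    for pattern, category in mapping.items():
        if re.search(pattern, kernel_name):
            return category
    return "misc"

print(classify("kernel_H_fp16"))  # gemm
print(classify("flash_attn_J"))   # attn
print(classify("softmax_warp"))   # misc
```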
When invoking gputrc2graph.py, specify a trace file with this new
model/engine like the following:

```bash
--in_file new.nsys-rep,DEF,ABC,<runtime>
```

If the engine_DEF.json file already exists, just add the model as a new node
in the existing engine file, after the other models.
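For instance, if `engine_DEF.json` already maps model ABC, a second model
(here a hypothetical `XYZ`, with placeholder patterns) would be added as a
sibling node:

```json
{
  "DEF": {
    "ABC": {
      "H|I": "gemm",
      "J|K": "attn",
      "CUDA mem": "non-gpu-H_D_memops",
      ".*": "misc"
    },
    "XYZ": {
      "L|M": "gemm",
      "CUDA mem": "non-gpu-H_D_memops",
      ".*": "misc"
    }
  }
}
```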