different results between --cudagraph and do_bench_cudagraph

`--cudagraph` runs `triton.testing.do_bench_cudagraph`, so the two should report the same latency. They do not.
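For reference, a direct call to `triton.testing.do_bench_cudagraph` might look like the sketch below. It is only an illustration: the function `bench_standin` and its elementwise workload are hypothetical stand-ins, not the rope kernel or the actual `repro.py`, and the side-stream wrapper is an assumption added because some Triton releases refuse to capture a CUDA graph on the default stream.

```python
def bench_standin(rep=20):
    """Time a stand-in elementwise op with do_bench_cudagraph.

    Returns the latency in milliseconds, or None if torch, triton,
    or a CUDA device is unavailable.
    """
    try:
        import torch
        import triton.testing
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    # Hypothetical stand-in workload; NOT the rope kernel from this issue.
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()

    def fn():
        return torch.sin(x) * torch.cos(x)

    # Benchmark from a side stream (assumption: some Triton releases
    # reject graph capture on the default stream).
    with torch.cuda.stream(torch.cuda.Stream()):
        return triton.testing.do_bench_cudagraph(fn, rep=rep)


latency_ms = bench_standin()
print("latency (ms):", latency_ms)
```

Swapping `fn` for the operator under test would give a number to compare against what `--cudagraph` reports for the same inputs.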

To repro, please check out #345 which updates the input settings.


`python3 run.py --op rope --cudagraph` gives 0.0020 for liger and 0.0027 for inductor. 

<img width="833" height="451" alt="Image" src="https://github.com/user-attachments/assets/263cc38e-5899-4040-96db-6f850f2dd331" />


However, `python repro.py` gives 0.0017 for liger and 0.0014 for inductor. `repro.py` copies the code from `tritonbench/operators/rope/operator.py` and benchmarks it with `triton.testing.do_bench_cudagraph`.
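To make the gap concrete, the short sketch below compares the four numbers quoted above. The variable names are illustrative only, and the values are taken as reported, without assuming a unit.

```python
# Latencies as quoted in this issue (units as reported by each tool).
run_py = {"liger": 0.0020, "inductor": 0.0027}  # python3 run.py --op rope --cudagraph
repro = {"liger": 0.0017, "inductor": 0.0014}   # python repro.py

# How much slower each backend looks under run.py than under repro.py.
ratios = {name: run_py[name] / repro[name] for name in run_py}

# Which backend each measurement path ranks as fastest.
fastest_run_py = min(run_py, key=run_py.get)
fastest_repro = min(repro, key=repro.get)

for name, ratio in ratios.items():
    print(f"{name}: run.py / repro.py = {ratio:.2f}x")
print("fastest under run.py:", fastest_run_py)
print("fastest under repro.py:", fastest_repro)
```

Besides the absolute difference (roughly 1.18x for liger and 1.93x for inductor), the ranking flips: `run.py` puts liger ahead, while `repro.py` puts inductor ahead.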









