--cudagraph runs triton.testing.do_bench_cudagraph, so running an operator with --cudagraph and calling triton.testing.do_bench_cudagraph directly on the same code should give the same latency. However, they do not.
To repro, please check out #345 which updates the input settings.
python3 run.py --op rope --cudagraph gives 0.0020 for liger and 0.0027 for inductor.
However, python repro.py gives 0.0017 for liger and 0.0014 for inductor. repro.py copies the code from tritonbench/operators/rope/operator.py and benchmarks it with triton.testing.do_bench_cudagraph.
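
To make the size of the gap concrete, here is a small sketch of the arithmetic on the four numbers reported above. The dictionary and function names are made up for illustration and are not part of tritonbench; the values are copied as printed, in whatever unit the two scripts report.

```python
# Latencies copied from the report above (unit as printed by each script).
tritonbench = {"liger": 0.0020, "inductor": 0.0027}  # python3 run.py --op rope --cudagraph
standalone = {"liger": 0.0017, "inductor": 0.0014}   # python repro.py


def liger_over_inductor(latency):
    """Ratio of liger latency to inductor latency within one measurement setup."""
    return latency["liger"] / latency["inductor"]


def harness_over_standalone(backend):
    """How much slower one backend looks under run.py than under repro.py."""
    return tritonbench[backend] / standalone[backend]


print(f"run.py   liger/inductor: {liger_over_inductor(tritonbench):.2f}")
print(f"repro.py liger/inductor: {liger_over_inductor(standalone):.2f}")
for backend in ("liger", "inductor"):
    print(f"{backend}: run.py/repro.py = {harness_over_standalone(backend):.2f}")
```

Two things stand out: the ranking flips (liger is faster than inductor under run.py but slower under repro.py), and inductor's reported latency nearly doubles under run.py while liger's grows by less than 20%.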