### Suggestion Description

Add an example similar to the [`test_sp_decode_attn.py`](https://github.com/ByteDance-Seed/Triton-distributed/blob/main/python/triton_dist/test/nvidia/test_sp_decode_attn.py) test in triton-distributed, but implemented with fused kernels using Iris.

### Operating System

_No response_

### GPU

_No response_

### ROCm Component

_No response_