[mlir][amdgpu] Add `rocdl.s.waitcnt` wrapper #149670

Hardcode84 · 2025-07-19T18:34:16Z

The main motivation is to pass vmcnt/expcnt/lgkmcnt values directly (similar to the asm format) and delegate architecture-dependent bitpacking to the amdgpu->rocdl lowering. Only gfx9 bitpacking support is added as part of this commit.

llvmbot · 2025-07-19T18:34:45Z

@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-backend-amdgpu

Author: Ivan Butygin (Hardcode84)

Changes

The main motivation is to pass vmcnt/expcnt/lgkmcnt values directly (similar to the asm format) and delegate architecture-dependent bitpacking to the amdgpu->rocdl lowering. Only gfx9 support is added as part of this commit.

Full diff: https://github.com/llvm/llvm-project/pull/149670.diff

4 Files Affected:

(modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+20)
(modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+49-3)
(added) mlir/test/Conversion/AMDGPUToROCDL/waitcnt.mlir (+20)
(modified) mlir/test/Dialect/AMDGPU/ops.mlir (+13)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index 80959ffbaf426..cecb936e18ae3 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -717,6 +717,26 @@ def AMDGPU_SchedBarrierOp :
   }];
 }
 
+def AMDGPU_WaitcntOp :
+  AMDGPU_Op<"waitcnt">,
+  Arguments<(ins
+      OptionalAttr<I32Attr>:$vmcnt,
+      OptionalAttr<I32Attr>:$expcnt,
+      OptionalAttr<I32Attr>:$lgkmcnt
+    )>
+  {
+  let summary = "Wrapper on ROCDL SWaitcntOp";
+  let description = [{
+    Convenience wrapper on `rocdl.s.waitcnt`. Hides the architecture specific
+    bitpacking from user. Missing values will be assumed maximum values supported
+    by the architecture. Large values will also be clamped to the maximum
+    supported values.
+  }];
+  let assemblyFormat = [{
+    (`vmcnt` `(` $vmcnt^ `)` )? (`expcnt` `(` $expcnt^ `)` )? (`lgkmcnt` `(` $lgkmcnt^ `)`)? attr-dict
+  }];
+}
+
 def AMDGPU_MFMAPermB : I32EnumAttr<"MFMAPermB",
     "The possible permutations of the lanes storing B available in an MFMA",
     [
diff --git a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
index ef35ee208f002..af588d5b70a45 100644
--- a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
+++ b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
@@ -419,6 +419,52 @@ struct RawBufferOpLowering : public ConvertOpToLLVMPattern<GpuOp> {
   }
 };
 
+// TODO: AMDGPU backend already has all this bitpacking logic, we should move
+// it to some common place.
+static FailureOr<unsigned> encodeWaitcnt(Chipset chipset, unsigned vmcnt,
+                                         unsigned expcnt, unsigned lgkmcnt) {
+  if (chipset.majorVersion == 9) {
+    vmcnt = std::min(63u, vmcnt);
+    expcnt = std::min(7u, expcnt);
+    lgkmcnt = std::min(15u, lgkmcnt);
+    unsigned lowBits = vmcnt & 0xF;
+    unsigned highBits = (vmcnt >> 4) << 14;
+    unsigned otherCnts = (expcnt << 4) | (lgkmcnt << 8);
+    return lowBits | highBits | otherCnts;
+  }
+  return failure();
+}
+
+struct WaitcntOpLowering : public ConvertOpToLLVMPattern<WaitcntOp> {
+  WaitcntOpLowering(const LLVMTypeConverter &converter, Chipset chipset)
+      : ConvertOpToLLVMPattern<WaitcntOp>(converter), chipset(chipset) {}
+
+  Chipset chipset;
+
+  LogicalResult
+  matchAndRewrite(WaitcntOp op, OpAdaptor adaptor,
+                  ConversionPatternRewriter &rewriter) const override {
+    auto getVal = [](Attribute attr) -> unsigned {
+      if (attr)
+        return cast<IntegerAttr>(attr).getInt();
+
+      // This value will be clamped to the maximum value for the chipset.
+      return 1024 * 1024;
+    };
+    unsigned vmcnt = getVal(adaptor.getVmcntAttr());
+    unsigned expcnt = getVal(adaptor.getExpcntAttr());
+    unsigned lgkmcnt = getVal(adaptor.getLgkmcntAttr());
+
+    FailureOr<unsigned> waitcnt =
+        encodeWaitcnt(chipset, vmcnt, expcnt, lgkmcnt);
+    if (failed(waitcnt))
+      return op.emitOpError("unsupported chipset");
+
+    rewriter.replaceOpWithNewOp<ROCDL::SWaitcntOp>(op, *waitcnt);
+    return success();
+  }
+};
+
 struct LDSBarrierOpLowering : public ConvertOpToLLVMPattern<LDSBarrierOp> {
   LDSBarrierOpLowering(const LLVMTypeConverter &converter, Chipset chipset)
       : ConvertOpToLLVMPattern<LDSBarrierOp>(converter), chipset(chipset) {}
@@ -1825,9 +1871,9 @@ void mlir::populateAMDGPUToROCDLConversionPatterns(LLVMTypeConverter &converter,
                                ROCDL::RawPtrBufferAtomicUminOp>,
            RawBufferOpLowering<RawBufferAtomicCmpswapOp,
                                ROCDL::RawPtrBufferAtomicCmpSwap>,
-           AMDGPUDPPLowering, LDSBarrierOpLowering, SchedBarrierOpLowering,
-           MFMAOpLowering, ScaledMFMAOpLowering, WMMAOpLowering,
-           ExtPackedFp8OpLowering, ScaledExtPackedOpLowering,
+           AMDGPUDPPLowering, WaitcntOpLowering, LDSBarrierOpLowering,
+           SchedBarrierOpLowering, MFMAOpLowering, ScaledMFMAOpLowering,
+           WMMAOpLowering, ExtPackedFp8OpLowering, ScaledExtPackedOpLowering,
            PackedScaledTruncOpLowering, PackedTrunc2xFp8OpLowering,
            PackedStochRoundFp8OpLowering, GatherToLDSOpLowering,
            TransposeLoadOpLowering>(converter, chipset);
diff --git a/mlir/test/Conversion/AMDGPUToROCDL/waitcnt.mlir b/mlir/test/Conversion/AMDGPUToROCDL/waitcnt.mlir
new file mode 100644
index 0000000000000..9c785670198ae
--- /dev/null
+++ b/mlir/test/Conversion/AMDGPUToROCDL/waitcnt.mlir
@@ -0,0 +1,20 @@
+// RUN: mlir-opt %s -convert-amdgpu-to-rocdl=chipset=gfx942 | FileCheck %s --check-prefixes=CHECK,GFX9
+// TODO: Add more chipsets support
+
+
+// CHECK-LABEL: func @waitcnt
+func.func @waitcnt() {
+  // GFX9: rocdl.s.waitcnt 53119
+  amdgpu.waitcnt
+
+  // GFX9: rocdl.s.waitcnt 3952
+  amdgpu.waitcnt vmcnt(0)
+
+  // GFX9: rocdl.s.waitcnt 53007
+  amdgpu.waitcnt expcnt(0)
+
+  // GFX9: rocdl.s.waitcnt 49279
+  amdgpu.waitcnt lgkmcnt(0)
+
+  return
+}
diff --git a/mlir/test/Dialect/AMDGPU/ops.mlir b/mlir/test/Dialect/AMDGPU/ops.mlir
index 5559ac8f1a5c3..b126b23cb8156 100644
--- a/mlir/test/Dialect/AMDGPU/ops.mlir
+++ b/mlir/test/Dialect/AMDGPU/ops.mlir
@@ -504,3 +504,16 @@ func.func @gather_to_lds(%idx1 : index, %idx2 : index, %mem1 : memref<32xf16>, %
   amdgpu.gather_to_lds %mem1[%idx1],        %smem2[%idx1, %idx2] : vector<2xf16>, memref<32xf16>,    memref<32x32xf16, #gpu.address_space<workgroup>>
   func.return
 }
+
+// CHECK-LABEL: func @waitcnt
+func.func @waitcnt() {
+  // CHECK: amdgpu.waitcnt vmcnt(1) expcnt(2) lgkmcnt(3)
+  // CHECK: amdgpu.waitcnt vmcnt(1)
+  // CHECK: amdgpu.waitcnt expcnt(2)
+  // CHECK: amdgpu.waitcnt lgkmcnt(3)
+  amdgpu.waitcnt vmcnt(1) expcnt(2) lgkmcnt(3)
+  amdgpu.waitcnt vmcnt(1)
+  amdgpu.waitcnt expcnt(2)
+  amdgpu.waitcnt lgkmcnt(3)
+  func.return
+}
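
The constants expected in `waitcnt.mlir` can be cross-checked by unpacking them with the inverse of the gfx9 branch of `encodeWaitcnt`. The following is a minimal sketch, not part of the PR: the names `Gfx9Waitcnt`, `decodeGfx9Waitcnt`, and `checkGfx9TestConstants` are hypothetical, and the bit positions are taken from the shifts and masks in the diff above.

```cpp
#include <cassert>

// Hypothetical helper type for illustration; not part of the PR.
struct Gfx9Waitcnt {
  unsigned vmcnt;
  unsigned expcnt;
  unsigned lgkmcnt;
};

// Inverse of the gfx9 packing used by encodeWaitcnt in the diff:
//   vmcnt   = bits [15:14] (high part) and bits [3:0] (low part)
//   expcnt  = bits [6:4]
//   lgkmcnt = bits [11:8]
constexpr Gfx9Waitcnt decodeGfx9Waitcnt(unsigned encoded) {
  return Gfx9Waitcnt{(encoded & 0xF) | (((encoded >> 14) & 0x3) << 4),
                     (encoded >> 4) & 0x7,
                     (encoded >> 8) & 0xF};
}

// Unpacks the four immediates checked by the GFX9 lines in waitcnt.mlir.
inline void checkGfx9TestConstants() {
  // `amdgpu.waitcnt` with no counters: every field at its gfx9 maximum.
  Gfx9Waitcnt none = decodeGfx9Waitcnt(53119);
  assert(none.vmcnt == 63 && none.expcnt == 7 && none.lgkmcnt == 15);

  // `amdgpu.waitcnt vmcnt(0)`: only vmcnt is zero.
  Gfx9Waitcnt vm = decodeGfx9Waitcnt(3952);
  assert(vm.vmcnt == 0 && vm.expcnt == 7 && vm.lgkmcnt == 15);

  // `amdgpu.waitcnt expcnt(0)`: only expcnt is zero.
  Gfx9Waitcnt exp = decodeGfx9Waitcnt(53007);
  assert(exp.vmcnt == 63 && exp.expcnt == 0 && exp.lgkmcnt == 15);

  // `amdgpu.waitcnt lgkmcnt(0)`: only lgkmcnt is zero.
  Gfx9Waitcnt lgkm = decodeGfx9Waitcnt(49279);
  assert(lgkm.vmcnt == 63 && lgkm.expcnt == 7 && lgkm.lgkmcnt == 0);
}
```

Each omitted counter decodes to its field maximum (63, 7, 15), which matches the op description: missing values are treated as the largest value the architecture supports.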

raikonenfnu · 2025-07-19T23:32:24Z

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

+// it to some common place.
+static FailureOr<unsigned> encodeWaitcnt(Chipset chipset, unsigned vmcnt,
+                                         unsigned expcnt, unsigned lgkmcnt) {
+  if (chipset.majorVersion == 9) {


NIT: Thoughts on adding some doc based on

llvm-project/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h

Lines 1128 to 1135 in 54492c2

/// \p Vmcnt = \p Waitcnt[3:0] (pre-gfx9)
/// \p Vmcnt = \p Waitcnt[15:14,3:0] (gfx9,10)
/// \p Vmcnt = \p Waitcnt[15:10] (gfx11)
/// \p Expcnt = \p Waitcnt[6:4] (pre-gfx11)
/// \p Expcnt = \p Waitcnt[2:0] (gfx11)
/// \p Lgkmcnt = \p Waitcnt[11:8] (pre-gfx10)
/// \p Lgkmcnt = \p Waitcnt[13:8] (gfx10)
/// \p Lgkmcnt = \p Waitcnt[9:4] (gfx11)

here too? Something like:

/// \p Vmcnt = \p Waitcnt[15:14,3:0]
/// \p Expcnt = \p Waitcnt[6:4]
/// \p Lgkmcnt = \p Waitcnt[11:8]
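
For reference, the per-generation layout quoted from `AMDGPUBaseInfo.h` could be turned into an encoder along the following lines. This is only a sketch under stated assumptions, not the code that was merged: `encodeWaitcntSketch` is a hypothetical name, a plain `unsigned major` stands in for the MLIR `Chipset` type, the clamping maxima are derived from the field widths in the quoted comment, and anything after gfx11 is simply reported as unsupported.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical sketch: packs the three counters according to the bit layout
// quoted from AMDGPUBaseInfo.h. Values above a field's maximum are clamped.
inline std::optional<unsigned> encodeWaitcntSketch(unsigned major,
                                                   unsigned vmcnt,
                                                   unsigned expcnt,
                                                   unsigned lgkmcnt) {
  if (major < 9) {
    // Vmcnt[3:0], Expcnt[6:4], Lgkmcnt[11:8]
    vmcnt = std::min(15u, vmcnt);
    expcnt = std::min(7u, expcnt);
    lgkmcnt = std::min(15u, lgkmcnt);
    return vmcnt | (expcnt << 4) | (lgkmcnt << 8);
  }
  if (major == 9 || major == 10) {
    // Vmcnt[15:14,3:0], Expcnt[6:4],
    // Lgkmcnt[11:8] on gfx9 and Lgkmcnt[13:8] on gfx10
    vmcnt = std::min(63u, vmcnt);
    expcnt = std::min(7u, expcnt);
    lgkmcnt = std::min(major == 9 ? 15u : 63u, lgkmcnt);
    unsigned lowBits = vmcnt & 0xF;
    unsigned highBits = (vmcnt >> 4) << 14;
    return lowBits | highBits | (expcnt << 4) | (lgkmcnt << 8);
  }
  if (major == 11) {
    // Vmcnt[15:10], Expcnt[2:0], Lgkmcnt[9:4]
    vmcnt = std::min(63u, vmcnt);
    expcnt = std::min(7u, expcnt);
    lgkmcnt = std::min(63u, lgkmcnt);
    return (vmcnt << 10) | expcnt | (lgkmcnt << 4);
  }
  // Later generations are outside the quoted layout; left unsupported here.
  return std::nullopt;
}
```

With every counter passed as a large value such as `1u << 20`, the gfx9 branch yields 53119, the same immediate the conversion test expects for a bare `amdgpu.waitcnt`.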

Hardcode84 · 2025-07-20

Done (and also added all other chipsets).

Signed-off-by: Ivan Butygin <[email protected]>

kuhar

LGTM but let's wait for an approval from @krzysz00

[mlir][amdgpu] Add amdgpu.waitcnt wrapper

442ed16

The main motivation is to pass vmcnt/expcnt/lgkmcnt values directly and delegate architecture-dependent bitpacking to the amdgpu->rocdl lowering. Only gfx9 bitpacking support is added as part of this commit.

Hardcode84 requested review from krzysz00, kuhar, qedawkins and raikonenfnu July 19, 2025 18:34

llvmbot added backend:AMDGPU mlir:gpu mlir mlir:amdgpu labels Jul 19, 2025

raikonenfnu reviewed Jul 19, 2025


more chisets

bfacb4d

Signed-off-by: Ivan Butygin <[email protected]>

kuhar approved these changes Jul 20, 2025
