[CK_TILE] B matrix 2D block scale gemm #3074

samremes · 2025-10-22T11:13:30Z

Proposed changes

Introduces 2D block scale support for the B matrix (grouping on both the N and K axes). The tile distribution for the scale matrix has different options depending on the group size.
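To illustrate what grouping on both axes means, here is a minimal sketch of the scale lookup for a single element of B. `GroupShape2D`, `scale_index`, and the row-major `[K groups][N groups]` scale layout are assumptions made for illustration; they are not the ck_tile types or the layout used by the kernel.

```cpp
#include <cstddef>

// Hypothetical stand-in for a quantization group shape (not the ck_tile type).
struct GroupShape2D
{
    std::size_t n; // number of consecutive N indices sharing one scale
    std::size_t k; // number of consecutive K indices sharing one scale
};

// Flat index of the scale that covers element (k, n) of B.
// Assumption: scales are stored row-major as [ceil(K / g.k)][ceil(N / g.n)].
constexpr std::size_t scale_index(std::size_t k, std::size_t n, std::size_t N, GroupShape2D g)
{
    const std::size_t groups_along_n = (N + g.n - 1) / g.n;
    return (k / g.k) * groups_along_n + (n / g.n);
}

// Dequantize one stored value with the scale of its group.
constexpr float dequantize(float quantized, float scale) { return quantized * scale; }
```

With `g.n == 1` this reduces to the 1D case, where every column of B has its own scale per K group.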

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

ThomasNing

Great PR!

ThomasNing · 2025-10-28T10:19:34Z

include/ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp

+        static constexpr index_t KPerBlock = BlockGemmShape::kK;
+
+        static constexpr index_t NQPerBlock = NPerBlock / QuantGroupSize::kN;
+        static constexpr index_t BQPerBlock = KPerBlock / QuantGroupSize::kK;


The name should be KQPerBlock in that case.
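A minimal sketch of the naming suggested here, with the number of scales per block tile derived separately for N and K. `GroupShape` and `BScalePerBlock` are hypothetical stand-ins; only the `kM`/`kN`/`kK` member names and the `NQPerBlock` formula come from the diff above.

```cpp
// Hypothetical mirror of the structured group size; only kM/kN/kK are from the PR.
template <int M, int N, int K>
struct GroupShape
{
    static constexpr int kM = M;
    static constexpr int kN = N;
    static constexpr int kK = K;
};

// Hypothetical helper: how many scales one block tile of B touches per axis.
template <int NPerBlock, int KPerBlock, typename QuantGroupSize>
struct BScalePerBlock
{
    static_assert(KPerBlock % QuantGroupSize::kK == 0,
                  "KPerBlock must be a multiple of QuantGroupSize in K dim.");

    static constexpr int NQPerBlock = NPerBlock / QuantGroupSize::kN; // scales along N
    static constexpr int KQPerBlock = KPerBlock / QuantGroupSize::kK; // scales along K
};
```

For example, a 128 x 256 (N x K) tile with a `GroupShape<1, 16, 128>` group would give `NQPerBlock == 8` and `KQPerBlock == 2`; these tile sizes are illustrative only.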

ThomasNing · 2025-10-28T10:22:03Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_aquant_pipeline_ag_bg_cr_base.hpp


-    static_assert(KPerBlock % QuantGroupSize == 0,
+    static_assert(KPerBlock % QuantGroupSize::kK == 0,
                  "KPerBlock must be a multiple of QuantGroupSize");


The static_assert message should be "KPerBlock must be a multiple of QuantGroupSize in K dim."

ThomasNing · 2025-10-28T10:42:17Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp

-    static constexpr index_t XR = 2;
+    CK_TILE_HOST_DEVICE static constexpr auto make_2d_static_tile_distribution()
+    {
+        if constexpr(YPerQ == 1)


YPerQ == 1 is the 1D blockscale case.

ThomasNing · 2025-10-28T10:43:20Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp

+        {
+            // YPerQ == 1 implementation - each row of B has independent scale
+            constexpr index_t X  = XPerTile;
+            constexpr index_t XR = 2;


XR could be set to 1 in that case.

ThomasNing · 2025-10-28T10:47:27Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp

          index_t XPerTile,
+          index_t YPerQ,
          index_t VecSize>
 struct tile_distribution_encoding_pattern_bq : public tile_distribution_encoding_pattern


This function looks good to me. @CongMa13 Could you also take a look at it and help me confirm?

ThomasNing · 2025-10-28T10:52:30Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp

          index_t KPerBlockAQ,
          index_t VecSize,
          bool PreshuffleQuant>
 struct tile_distribution_encoding_pattern_aq : public tile_distribution_encoding_pattern


Do we also have a similar change for this function, as for Bq?

Not yet for aquant. I think it should work much like bquant. I did some initial refactoring to support group sizes along all of M/N/K, but the kernel implementation covers only bquant.

ThomasNing · 2025-10-28T10:54:38Z

Please also address the merge conflict.

Copilot

Pull Request Overview

This PR introduces 2D block scale support for the B matrix in GEMM operations, enabling quantization grouping on both the N and K axes. Previously, only 1D grouping (along the K axis) was supported for B matrix quantization.

- Refactored QuantGroupSize from a simple integer constant to a structured type containing M, N, and K dimensions
- Added multiple 2D block size configurations (8N, 16N, 64N, 128N) for testing B matrix quantization
- Updated tile distribution logic to handle different group sizes with specialized patterns

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| test/ck_tile/gemm_block_scale/test_gemm_quant_typed.cpp | Adds 2D block size test configurations and updates GroupSize definitions |
| test/ck_tile/gemm_block_scale/test_gemm_quant_fixtures.hpp | Updates test fixtures to use structured QuantGroupSize type and handle 2D blocks |
| test/ck_tile/gemm_block_scale/test_gemm_quant_base.hpp | Changes QuantGroupSize from integer constant to type alias |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp | Introduces QuantGroupShape struct and implements conditional tile distribution patterns |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_quant_pipeline_problem.hpp | Updates pipeline problem definition to use QuantGroupSize type |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_bquant_pipeline_ag_bg_cr_v3.hpp | Updates BQuant pipeline to calculate NPerBlockBQ and use structured QuantGroupSize |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_wp_bquant_pipeline_ag_bg_cr_v2.hpp | Updates weight preshuffle pipeline for new QuantGroupSize structure |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_wp_bquant_pipeline_ag_bg_cr_base_policy.hpp | Adds NPerBlockBQ calculation in policy |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_bquant_pipeline_ag_bg_cr_policy.hpp | Updates policy to use QuantGroupSize::kN and ::kK |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_bquant_pipeline_ag_bg_cr_base.hpp | Adds NPerBlockBQ and validates block dimensions |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_aquant_pipeline_ag_bg_cr_v3.hpp | Updates AQuant pipeline for structured QuantGroupSize |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_aquant_pipeline_ag_bg_cr_policy.hpp | Updates AQuant policy to use QuantGroupSize::kK |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_aquant_pipeline_ag_bg_cr_mem.hpp | Updates memory pipeline for structured QuantGroupSize |
| include/ck_tile/ops/gemm_quant/pipeline/gemm_aquant_pipeline_ag_bg_cr_base.hpp | Updates base implementation to use QuantGroupSize::kK |
| include/ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp | Updates kernel to handle 2D block scales with proper indexing |
| include/ck_tile/ops/gemm_quant/block/block_universal_gemm_as_bs_bquant_cr.hpp | Implements conditional scale indexing based on NQPerBlock |
| include/ck_tile/ops/gemm_quant/block/block_universal_gemm_as_aquant_bs_cr.hpp | Updates AQuant block to use QuantGroupSize::kK |
| include/ck_tile/ops/gemm_quant/block/block_universal_gemm_ar_flatbr_bquant_cr.hpp | Updates preshuffle block for structured QuantGroupSize |
| include/ck_tile/host/reference/reference_gemm.hpp | Updates reference implementation to handle 2D quantization |
| example/ck_tile/38_block_scale_gemm/run_gemm_quant_example.inc | Updates example to use QuantGroupSize type |
| example/ck_tile/38_block_scale_gemm/gemm_quant_basic.cpp | Updates example instantiations with structured QuantGroupSize |

Copilot · 2025-10-28T11:42:14Z

test/ck_tile/gemm_block_scale/test_gemm_quant_fixtures.hpp

            K,   // M, N, K
            0,   // QK_A (not used for BQuant)
-            BQK, // QK_B
+            BQK, // QK_B - TODO: we can remove BQK and BQN from args later?


TODO comment suggests incomplete refactoring. Consider creating a tracked issue for this technical debt or addressing it in this PR if feasible.

Suggested change

-            BQK, // QK_B - TODO: we can remove BQK and BQN from args later?
+            BQK, // QK_B - FIXME: See issue #1234 - refactor to remove BQK and BQN from args if not needed

Copilot · 2025-10-28T11:42:15Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_group_quant_utils.hpp

+    CK_TILE_HOST_DEVICE static constexpr auto make_2d_static_tile_distribution()
+    {


This complex function with multiple conditional branches handling different YPerQ values would benefit from a documentation comment explaining the different cases and when each branch is selected.

Suggested change

-    CK_TILE_HOST_DEVICE static constexpr auto make_2d_static_tile_distribution()
-    {
+    /**
+     * Returns a static tile distribution encoding for quantized GEMM, handling different cases
+     * based on YPerQ (number of rows per quantization scale) and YPerTile (tile size along Y).
+     *
+     * Branches:
+     * 1. YPerQ == 1:
+     *    - Each row of B has an independent scale.
+     *    - Distribution splits Y into (NIterPerWarp, NWarps, WarpGemm::kN).
+     *    - Used when the finest granularity of quantization is required.
+     *
+     * 2. YPerTile >= NIterPerWarp * NWarps:
+     *    - All warps in the block use the same scale.
+     *    - Distribution replicates the scale across warps.
+     *    - Used when quantization scale covers the entire block tile along Y.
+     *
+     * 3. YPerTile >= NIterPerWarp:
+     *    - All NWarps have the same scale, replicated across warps.
+     *    - Used when quantization scale covers multiple iterations per warp.
+     *
+     * 4. Otherwise:
+     *    - Larger NQ block size, multiple iters/warps use the same scale, replicated to all threads.
+     *    - Used when quantization scale is coarser than the block tile.
+     *
+     * This function ensures the correct distribution of quantization scales for each tile,
+     * optimizing memory access and computation based on the quantization granularity.
+     */
+    CK_TILE_HOST_DEVICE static constexpr auto make_2d_static_tile_distribution()

This is actually nice documentation! Can we add it?

Copilot · 2025-10-28T11:42:15Z

include/ck_tile/ops/gemm_quant/block/block_universal_gemm_ar_flatbr_bquant_cr.hpp

+    using QuantGroupSize  = remove_cvref_t<typename Problem::QuantGroupSize>;
+
+    static_assert(QuantGroupSize::kM == 1, "only N/K blocks for BQuant preshuffle kernel!");
+    static_assert(QuantGroupSize::kN == 1, "no block for N supported yet!");


The second assertion prevents N-axis blocking in the preshuffle kernel, which conflicts with the PR's goal of supporting 2D block scales. This should be relaxed or the preshuffle kernel should be updated to support N-axis blocking.

Suggested change

-    static_assert(QuantGroupSize::kN == 1, "no block for N supported yet!");
+    // static_assert(QuantGroupSize::kN == 1, "no block for N supported yet!");

illsilin · 2025-10-28T17:18:35Z

Hi @samremes, could you please resolve the merge conflicts?

CongMa13 · 2025-10-24T18:04:01Z

test/ck_tile/gemm_block_scale/test_gemm_quant_typed.cpp

+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, PkInt4, BF8,   Half, BQuantGrouped, GemmConfigBase, GroupSize>,
+
+    std::tuple<RowMajor, ColumnMajor, RowMajor, FP8, FP8,    float, Half, BQuantGrouped, GemmConfigBase, GroupSize64>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, BF8,    float, Half, BQuantGrouped, GemmConfigBase, GroupSize64>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, FP8, PkInt4, FP8,   Half, BQuantGrouped, GemmConfigBase, GroupSize64>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, PkInt4, BF8,   Half, BQuantGrouped, GemmConfigBase, GroupSize64>,
+
+    // 2d cases with grouping also on the n axis
+    std::tuple<RowMajor, ColumnMajor, RowMajor, FP8, FP8,    float, Half, BQuantGrouped, GemmConfigBase, GroupSize2D>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, BF8,    float, Half, BQuantGrouped, GemmConfigBase, GroupSize2D>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, FP8, PkInt4, FP8,   Half, BQuantGrouped, GemmConfigBase, GroupSize2D>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, PkInt4, BF8,   Half, BQuantGrouped, GemmConfigBase, GroupSize2D>


It is awesome to have these unit tests 👍

CongMa13 · 2025-10-29T03:28:45Z

test/ck_tile/gemm_block_scale/test_gemm_quant_typed.cpp

+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, PkInt4, BF8,   Half, BQuantGrouped, GemmConfigBase, GroupSize64>,
+
+    // 2d cases with grouping also on the n axis
+    // FIXME: why is group size 8 not working?


The group size cannot be smaller than GemmWarp::kN, since the warp tile is the smallest compute tile.
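The constraint could be written as a compile-time check along these lines. This is only a sketch: `n_group_size_is_plausible` is a hypothetical helper, the warp tile width of 16 in the usage note is an assumed value, and the check is a necessary condition rather than a sufficient one (group size 32 is reported as failing elsewhere in this review for other reasons).

```cpp
// Hypothetical helper, not part of ck_tile.
// group_n: quantization group extent along N (QuantGroupSize::kN in the PR).
// warp_n:  N extent of the warp-level GEMM tile.
constexpr bool n_group_size_is_plausible(int group_n, int warp_n)
{
    // group_n == 1 is the 1D case with a separate scale per column of B.
    // Otherwise one scale must cover at least a full warp tile along N.
    return group_n == 1 || group_n >= warp_n;
}
```

Assuming a warp tile width of 16, `n_group_size_is_plausible(8, 16)` is false while `n_group_size_is_plausible(16, 16)` is true, which would match the observation that group size 8 does not work.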

CongMa13 · 2025-10-29T03:28:52Z

test/ck_tile/gemm_block_scale/test_gemm_quant_typed.cpp

+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, BF8,    float, Half, BQuantGrouped, GemmConfigBase, GroupSize2D16N>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, FP8, PkInt4, FP8,   Half, BQuantGrouped, GemmConfigBase, GroupSize2D16N>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, BF8, PkInt4, BF8,   Half, BQuantGrouped, GemmConfigBase, GroupSize2D16N>,
+    std::tuple<RowMajor, ColumnMajor, RowMajor, FP8, FP8,    float, Half, BQuantGrouped, GemmConfigBase, GroupSize2D64N>,


Tried to add group size 32 and the test case failed. Could you please try it?

I didn't get it to work either. I think the current tile distributions won't be able to handle it. I made the conditions more specific: the split into NWarps/NIterPerWarp has to be exact.

ThomasNing · 2025-10-29T03:40:44Z

example/ck_tile/38_block_scale_gemm/gemm_quant_basic.cpp


    std::string quant_mode = arg_parser.get_str("quant_mode");

+    using QuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;


Could we expose the quant group size through an interface? Currently, we need to set the quant dimension sizes manually.
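One possible reading of this request is a pair of convenience aliases, so that call sites name only the dimensions they care about. The sketch below uses minimal stand-ins for `sequence` and `QuantGroupShape` so that it compiles on its own; the real ck_tile definitions differ, and `BQuantGroup1D` / `BQuantGroup2D` are hypothetical names that do not exist in the PR.

```cpp
#include <type_traits>

// Minimal stand-ins for illustration only; not the ck_tile definitions.
template <int... Is>
struct sequence
{
};

template <typename Seq>
struct QuantGroupShape;

template <int M, int N, int K>
struct QuantGroupShape<sequence<M, N, K>>
{
    static constexpr int kM = M;
    static constexpr int kN = N;
    static constexpr int kK = K;
};

// Hypothetical aliases: B-matrix grouping along K only, or along N and K.
template <int K>
using BQuantGroup1D = QuantGroupShape<sequence<1, 1, K>>;

template <int N, int K>
using BQuantGroup2D = QuantGroupShape<sequence<1, N, K>>;

// The 1D alias names exactly the type spelled out by hand in the example.
static_assert(std::is_same_v<BQuantGroup1D<128>, QuantGroupShape<sequence<1, 1, 128>>>);
```

With such aliases the example line would read `using QuantGroupSize = BQuantGroup1D<128>;`. A runtime option would need a dispatch over a fixed set of instantiations, which this sketch does not cover.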

ThomasNing · 2025-10-29T03:42:37Z

@CongMa13 Please try the tile distribution solution we discussed today and check the performance difference.

samremes · 2025-10-29T17:36:37Z

@ThomasNing @CongMa13 Did you have some ideas for the tile distribution? I think the current versions require that it exactly splits with NWarps and/or NIterPerWarp.

samremes added 14 commits October 13, 2025 14:05

Refactor quant group size to be configurable for M/N/K, not just K

8bb5255

add some asserts for configurations not implemented

98365f5

start setting of group size for N dimension

f6b07dc

enable 2d for reference quant gemm

22362f2

WIP: trying to figure out tile dstr and/or indexing for scale matrix

9988a46

WIP

36b88c6

Fix handling of n dim blocks in tile windows etc

bb52cd9

remove commented code and enable all tests again

f179a8a

fix formatting

d100ab6

Add more specialized tile distributions

37738e4

Enable NWarps replication for bquant tile dstr

98deefa

fix formatting

2d86cd0

Merge remote-tracking branch 'origin/develop' into samremes/bmatrix_2…

470d6e4

…d_blockscale

fix format

1f13003

samremes marked this pull request as ready for review October 27, 2025 15:24

samremes requested review from ThomasNing, afagaj, andriy-ca, aosewski, aska-0096, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz and tenpercent as code owners October 27, 2025 15:24

samremes requested review from shumway and vidyasagar-amd as code owners October 27, 2025 15:24

ThomasNing reviewed Oct 28, 2025

aosewski requested a review from Copilot October 28, 2025 11:40

Copilot AI reviewed Oct 28, 2025

samremes added 4 commits October 28, 2025 17:49

Merge remote-tracking branch 'origin/develop' into samremes/bmatrix_2…

a449728

…d_blockscale

Fix some issues from the merge

e12ab56

fix formatting

7c93551

one more fix to tile dstr, and revert debug initialization

e1475d4

CongMa13 reviewed Oct 28, 2025

CongMa13 approved these changes Oct 29, 2025

CongMa13 requested changes Oct 29, 2025

ThomasNing requested changes Oct 29, 2025

samremes and others added 3 commits October 29, 2025 11:19

Remove commented code

5e0a356

Co-authored-by: Copilot <[email protected]>

simplify conditions that are needed for tile distributions

1290b1b

only enable the working group sizes in tests

306e25a

fix formatting

68e41da

