
Conversation

Collaborator

@ecamartins ecamartins commented Sep 4, 2025

Proposed changes

This PR, made in collaboration with @arai713, contains the first implementation of Stream-K in CK Tile. Our implementation is based on old CK's initial Stream-K implementation (see example/01_gemm/gemm_xdl_streamk.cpp). Stream-K is a GEMM algorithm that helps avoid imbalanced workloads across CUs. When more than one WG is assigned to the same C macro tile, the work is split along the K dimension, and the WGs either atomically add their partial results to C or designate one WG to combine the partial results via a reduction. Since this is our first implementation of Stream-K in CK Tile, we have opted to implement the atomic strategy, with the reduction strategy coming in future work.
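
As a rough, non-authoritative sketch of the decomposition idea (plain C++ with simplified, illustrative names; the real StreamKTilePartitioner is more involved), Stream-K flattens the GEMM into a single iteration space of (#C macro tiles) x (K iterations per tile) and hands each WG a contiguous slice of it:

```cpp
// Illustration only: a slice can cross tile boundaries, which is why several
// WGs may contribute to the same C macro tile and must combine partial
// results (atomically in this PR).
#include <cstdint>

struct IterRange { uint32_t begin; uint32_t end; };

inline IterRange wg_iter_range(uint32_t wg_id, uint32_t num_wgs,
                               uint32_t num_tiles, uint32_t iters_per_tile)
{
    const uint32_t total = num_tiles * iters_per_tile;
    const uint32_t base  = total / num_wgs;
    const uint32_t extra = total % num_wgs; // first `extra` WGs take one more
    const uint32_t begin = wg_id * base + (wg_id < extra ? wg_id : extra);
    return {begin, begin + base + (wg_id < extra ? 1u : 0u)};
}

inline void iter_to_tile(uint32_t iter, uint32_t iters_per_tile,
                         uint32_t& tile_idx, uint32_t& k_iter)
{
    tile_idx = iter / iters_per_tile; // which C macro tile
    k_iter   = iter % iters_per_tile; // which K iteration within that tile
}
```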

For our implementation, this PR makes 3 main additions/alterations:

1) Updating Universal GEMM's MakeGemmTensorViews interface: MakeGemmTensorViews takes a parameter of type SplitKBatchOffset, but only its integer member splitted_k is actually used in the function. Since Stream-K is not, in general, split-K, we updated the MakeGemmTensorViews interface to require only an integer rather than a SplitKBatchOffset. The same functionality is preserved while removing the constraint of requiring a SplitKBatchOffset.
2) Adding a RunGemm function to the StreamKKernel class: The Universal GEMM kernel's RunGemm function assumes that (a) the kernel uses a SplitKBatchOffset and (b) the tile partitioner can statically determine the number of iterations along the K dimension for a given WG (i.e., num_loop). Neither assumption holds for Stream-K: we do not use a SplitKBatchOffset, and an instance method of the tile partitioner is needed to determine num_loop. In the Stream-K RunGemm, we pass num_loop as a function parameter so that the main Stream-K logic stays isolated from RunGemm.
3) Implementing the Stream-K kernel's operator() function: This function contains the main Stream-K looping logic. In each iteration of the main Stream-K while loop, the tile partitioner determines the offsets for the A, B, and C tensors for the current iteration, and then RunGemm is called (see the sketch following this list).
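
The control flow described in (2) and (3), reduced to a host-side walk-through (an illustrative sketch under simplifying assumptions; the even per-WG split and the printout stand in for the tile partitioner and RunGemm):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t num_tiles = 4, iters_per_tile = 8, num_wgs = 3;
    const uint32_t total_iters = num_tiles * iters_per_tile;

    for(uint32_t wg = 0; wg < num_wgs; ++wg)
    {
        // Contiguous slice of the global iteration space owned by this WG.
        uint32_t iter_start     = wg * total_iters / num_wgs;
        const uint32_t iter_end = (wg + 1) * total_iters / num_wgs;

        while(iter_start < iter_end)
        {
            const uint32_t tile_idx = iter_start / iters_per_tile;
            const uint32_t k_iter   = iter_start % iters_per_tile;
            // Work on the current C tile up to its K boundary or this WG's end.
            const uint32_t num_loop =
                std::min(iters_per_tile - k_iter, iter_end - iter_start);

            // In the kernel: compute A/B/C offsets from (tile_idx, k_iter),
            // then call RunGemm with this num_loop and accumulate into C
            // (atomically in this PR).
            std::printf("WG %u -> tile %u, k_iter %u, num_loop %u\n",
                        wg, tile_idx, k_iter, num_loop);

            iter_start += num_loop; // travel forward through the range
        }
    }
    return 0;
}
```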

Some other important notes:

  • As this is our first implementation, we are focused on correctly implementing the algorithm. Algorithmic optimization will come later.
  • There are currently 2 bugs in the StreamKTilePartitioner (a ticket has already been logged, with plans to work on them in upcoming sprints). These bugs cause certain instances of the Stream-K algorithm to fail on certain architectures. Thus, for this work, our test suite skips tests that have sk_blocks > 0. In other words, we only run cases where each macro tile in C is worked on by exactly one WG; all other tests are skipped. Update: commit 7328bc7 lifts this limitation.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Contributor

zjing14 commented Sep 5, 2025

@ecamartins Hi, it is great to see Stream-K in CK Tile. How does the performance of Stream-K compare to Split-K for random shapes?

Contributor

arai713 commented Sep 5, 2025

@ecamartins Hi, it is great to see Stream-K in CK Tile. How does the performance of Stream-K compare to Split-K for random shapes?

@zjing14 This is still the initial implementation; we have not yet had the chance to benchmark Stream-K vs Split-K.

@ecamartins ecamartins force-pushed the sk_operator_cktile branch 2 times, most recently from b2a7adf to 016fd56 on September 10, 2025 at 01:31
CongMa13 previously approved these changes Sep 11, 2025
cgmillette previously approved these changes Sep 12, 2025
Collaborator

@cgmillette cgmillette left a comment

LGTM

Contributor

@vidyasagar-amd vidyasagar-amd left a comment

LGTM

Collaborator

Do we have to use atomic_add for both StreamKReductionStrategy::Atomic and StreamKReductionStrategy::Reduction?

Collaborator Author

Thanks for your comment! StreamKReductionStrategy::Atomic will use atomic_add, whereas StreamKReductionStrategy::Reduction will use a set operation. StreamKReductionStrategy::Reduction is currently not implemented, so there is a guard in IsSupportedArgument to prevent users from selecting it. We will need to update the tests and the IsSupportedArgument function once the reduction is implemented :)
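
A minimal sketch of that kind of guard (illustrative only; the actual IsSupportedArgument in this PR checks more than this):

```cpp
// Reject the not-yet-implemented reduction path so callers fail fast.
enum class StreamKReductionStrategy { Atomic, Reduction };

template <StreamKReductionStrategy ReductionStrategy>
struct StreamKKernelSketch
{
    template <typename KernelArgs>
    static bool IsSupportedArgument(const KernelArgs& /*kargs*/)
    {
        if constexpr(ReductionStrategy == StreamKReductionStrategy::Reduction)
        {
            return false; // only the atomic strategy is supported for now
        }
        return true; // remaining checks omitted from this sketch
    }
};
```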

ecamartins and others added 9 commits September 16, 2025 14:31
@ecamartins ecamartins merged commit dee185d into develop Sep 16, 2025
44 of 52 checks passed
@ecamartins ecamartins deleted the sk_operator_cktile branch September 16, 2025 22:21
AviralGoelAMD pushed a commit that referenced this pull request Oct 16, 2025
* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function

Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset, but the only member
variable of the SplitKBatchOffset class used in the MakeGemmTensorViews
function was splitted_k (an int32_t). The splitted_k value was used as
part of defining the dimensions of the tensor view. For Stream-K, we do
not need the SplitKBatchOffset class since we are not using split-K.
Thus, this commit changes the splitk_batch_offset parameter to an
int32_t called k_size. This avoids the constraint of requiring a caller
of MakeGemmTensorViews to use the SplitKBatchOffset class while still
providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
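
Roughly, the change has this shape (placeholder types for illustration; the real signatures carry many more parameters):

```cpp
#include <cstdint>

struct SplitKBatchOffset { int32_t splitted_k; /* other members unused here */ };
struct TensorViews { int32_t k_extent; /* stands in for the real tensor views */ };

// Before: the whole struct was required even though only splitted_k was read.
TensorViews MakeGemmTensorViews_before(const SplitKBatchOffset& offset)
{
    return TensorViews{offset.splitted_k};
}

// After: callers (e.g. Stream-K) pass the K extent directly.
TensorViews MakeGemmTensorViews_after(int32_t k_size)
{
    return TensorViews{k_size};
}
```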

* StreamK Kernel RunGemm Implementation

Stream-K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:

1. The UniversalGemmKernel::RunGemm function computes num_loop based on
   a static function of the TilePartitioner, whereas for Stream-K,
   num_loop must be computed using a member function (namely
   GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a
   SplitKBatchOffset object, which is not used for Stream-K since we are
   not using split-K.

Thus, this change adds a RunGemm function in the StreamKKernel class.
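
The distinction in miniature (hypothetical types; the real GetCurrentIterLength lives in the StreamKTilePartitioner from PR #2708):

```cpp
#include <cstdint>

struct UniversalStylePartitioner
{
    // Split-K style: num_loop is derivable from launch constants alone.
    static int32_t GetLoopNum(int32_t k, int32_t k_batch) { return k / k_batch; }
};

struct StreamKPartitionerSketch
{
    int32_t iter_start, iter_end, iters_per_tile;
    // Stream-K style: depends on where this WG currently is in its iteration
    // range, so it has to be an instance method.
    int32_t GetCurrentIterLength() const
    {
        const int32_t left_in_tile = iters_per_tile - (iter_start % iters_per_tile);
        const int32_t remaining    = iter_end - iter_start;
        return left_in_tile < remaining ? left_in_tile : remaining;
    }
};
```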

* initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm

* Fix indexing and offset issues for StreamK

These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
  MPerBlock or NPerBlock, respectively. This ensures tile window origins
  are at the correct locations.
- Fix a bug in the tile partitioner's GetTileIdxWithOffset: we now apply
  divmod to the given references to ensure correct values are available
  to the caller.
- Add documentation in the Stream-K operator()
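
The offset fix amounts to scaling tile indices by the block-tile extents before using them as tile-window origins, and the partitioner fix is a plain divmod written through the output references (illustrative snippet; MPerBlock/NPerBlock stand for the kernel's compile-time tile sizes):

```cpp
#include <cstdint>

constexpr int32_t MPerBlock = 256; // illustrative values
constexpr int32_t NPerBlock = 256;

// Tile indices -> element offsets used as tile-window origins.
inline void tile_origin(int32_t tile_m_idx, int32_t tile_n_idx,
                        int32_t& i_m, int32_t& i_n)
{
    i_m = tile_m_idx * MPerBlock;
    i_n = tile_n_idx * NPerBlock;
}

// GetTileIdxWithOffset-style divmod: both quotient and remainder are written
// through the given references so the caller sees correct values.
inline void divmod(uint32_t iter, uint32_t iters_per_tile,
                   uint32_t& tile_idx, uint32_t& iter_in_tile)
{
    tile_idx     = iter / iters_per_tile;
    iter_in_tile = iter % iters_per_tile;
}
```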

* Initial gtests for Stream-K

These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which will
be handled in a future PR), there are validation issues for certain
cases, which may differ across architectures. Thus, we opted to run only
the fully data-parallel cases (skipping the others). A guard was added
to Stream-K's IsSupportedArgument method to ensure that callers are
aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and occupancy
were added to MakeKernelArgs.
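
Pinning the CU count and occupancy makes the Stream-K partition independent of the device a test happens to run on; a hedged sketch of the idea (the helper and the MakeKernelArgs parameters shown here are hypothetical):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Default to the device's CU count, but let tests pass an explicit value so
// the Stream-K partition is identical on every machine.
inline uint32_t num_cu_or_override(int override_num_cu = -1)
{
    if(override_num_cu > 0)
        return static_cast<uint32_t>(override_num_cu);
    hipDeviceProp_t prop{};
    (void)hipGetDeviceProperties(&prop, 0);
    return static_cast<uint32_t>(prop.multiProcessorCount);
}

// e.g. in a test (hypothetical signature):
//   auto kargs = Kernel::MakeKernelArgs(problem,
//                                       /*num_cu=*/num_cu_or_override(80),
//                                       /*occupancy=*/1);
```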

* Use GemmPipeline operator() variant that takes hot loop and tail num

In Stream-K, the num_loop value varies per WG and per iteration of the
Stream-K loop. We therefore use the version of the GemmPipeline's
operator() function that takes has_hot_loop and tail_num, similar to
what is done in Grouped GEMM.
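
The resulting call pattern mirrors the grouped GEMM kernels (paraphrased; helper names follow the grouped GEMM convention and may not match this PR exactly):

```cpp
// Inside the Stream-K RunGemm, per Stream-K iteration:
//   const index_t num_loop     = /* from the tile partitioner instance */;
//   const bool    has_hot_loop = GemmPipeline::BlockHasHotloop(num_loop);
//   const auto    tail_num     = GemmPipeline::GetBlockLoopTailNum(num_loop);
//   auto c_block_tile = GemmPipeline{}(a_block_window, b_block_window,
//                                      num_loop, has_hot_loop, tail_num,
//                                      smem_ptr);
```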

* changes from review: comments, move readfirstlane, remove ifndef

* Switch direction of C tensor traversal & add padding guard

Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 and then tile 0. This means
that iter_end decrements in each iteration of the Stream-K while loop.

Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes WGs travel forward, so their
iter_start is incremented and their iter_end remains fixed.

Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.

Together, these changes make the algorithm correct when sk_blocks is
greater than zero.
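
The hazard being avoided is plain unsigned wrap-around; a minimal standalone illustration (not the kernel code):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Subtracting the larger unsigned value from the smaller one wraps around
    // instead of going negative, which makes decrement-based bookkeeping on
    // iter_end easy to get wrong.
    const uint32_t iter_start = 5, iter_end = 3;
    std::printf("wrapped difference: %u\n", iter_end - iter_start); // 4294967294

    // Forward traversal keeps iter_end fixed and only increments iter_start,
    // so the loop condition terminates cleanly.
    for(uint32_t it = 0; it < iter_end; ++it)
        std::printf("visit tile %u\n", it);
    return 0;
}
```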

* Disable StreamK_M256_N256_K256_SKBlocks12 test case

This instance involves >=3 WGs contributing to each macro tile in C.
Due to the use of atomics, this results in precision errors. These
errors will not persist once the reduction strategy is implemented; we
will re-enable this test then.

---------

Co-authored-by: Astha Rai <[email protected]>