
Conversation

Collaborator

@ecamartins ecamartins commented Sep 4, 2025

Proposed changes

This PR, made in collaboration with @arai713, contains the first implementation of Stream-K in CK Tile. Our implementation is based on old CK's initial Stream-K implementation (see example/01_gemm/gemm_xdl_streamk.cpp). Stream-K is a GEMM algorithm that helps avoid imbalanced workloads across CUs. When more than one WG is assigned to the same C macro tile, the work is split along the K dimension, and the WGs either atomically add their partial results to C or designate one WG to combine the partial results via a reduction. Since this is our first implementation of Stream-K in CK Tile, we have opted to implement the atomic strategy, with the reduction strategy coming in future work.
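
As a rough, non-authoritative sketch of the decomposition idea (plain C++ with simplified, illustrative names; the real StreamKTilePartitioner is more involved), Stream-K flattens the GEMM into a single iteration space of (#C macro tiles) x (K iterations per tile) and hands each WG a contiguous slice of it:

```cpp
// Illustration only: a slice can cross tile boundaries, which is why several
// WGs may contribute to the same C macro tile and must combine partial
// results (atomically in this PR).
#include <cstdint>

struct IterRange { uint32_t begin; uint32_t end; };

inline IterRange wg_iter_range(uint32_t wg_id, uint32_t num_wgs,
                               uint32_t num_tiles, uint32_t iters_per_tile)
{
    const uint32_t total = num_tiles * iters_per_tile;
    const uint32_t base  = total / num_wgs;
    const uint32_t extra = total % num_wgs; // first `extra` WGs take one more
    const uint32_t begin = wg_id * base + (wg_id < extra ? wg_id : extra);
    return {begin, begin + base + (wg_id < extra ? 1u : 0u)};
}

inline void iter_to_tile(uint32_t iter, uint32_t iters_per_tile,
                         uint32_t& tile_idx, uint32_t& k_iter)
{
    tile_idx = iter / iters_per_tile; // which C macro tile
    k_iter   = iter % iters_per_tile; // which K iteration within that tile
}
```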

For our implementation, this PR makes 3 main additions/alterations:

1) Updating Universal GEMM's MakeGemmTensorViews interface: MakeGemmTensorViews takes a parameter of type SplitKBatchOffset, but only its integer member splitted_k is actually used in the function. Since Stream-K is not, in general, split-K, we updated the MakeGemmTensorViews interface to require only an integer rather than a SplitKBatchOffset. The same functionality is preserved while removing the constraint of requiring a SplitKBatchOffset.
2) Adding a RunGemm function to the StreamKKernel class: The Universal GEMM kernel's RunGemm function assumes that (a) the kernel uses a SplitKBatchOffset and (b) the tile partitioner can statically determine the number of iterations along the K dimension for a given WG (i.e., num_loop). Neither assumption holds for Stream-K: we do not use a SplitKBatchOffset, and an instance method of the tile partitioner is needed to determine num_loop. In the Stream-K RunGemm, we pass num_loop as a function parameter so that the main Stream-K logic stays isolated from RunGemm.
3) Implementing the Stream-K kernel's operator() function: This function contains the main Stream-K looping logic. In each iteration of the main Stream-K while loop, the tile partitioner determines the offsets for the A, B, and C tensors for the current iteration, and then RunGemm is called (see the sketch following this list).
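
The control flow described in (2) and (3), reduced to a host-side walk-through (an illustrative sketch under simplifying assumptions; the even per-WG split and the printout stand in for the tile partitioner and RunGemm):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t num_tiles = 4, iters_per_tile = 8, num_wgs = 3;
    const uint32_t total_iters = num_tiles * iters_per_tile;

    for(uint32_t wg = 0; wg < num_wgs; ++wg)
    {
        // Contiguous slice of the global iteration space owned by this WG.
        uint32_t iter_start     = wg * total_iters / num_wgs;
        const uint32_t iter_end = (wg + 1) * total_iters / num_wgs;

        while(iter_start < iter_end)
        {
            const uint32_t tile_idx = iter_start / iters_per_tile;
            const uint32_t k_iter   = iter_start % iters_per_tile;
            // Work on the current C tile up to its K boundary or this WG's end.
            const uint32_t num_loop =
                std::min(iters_per_tile - k_iter, iter_end - iter_start);

            // In the kernel: compute A/B/C offsets from (tile_idx, k_iter),
            // then call RunGemm with this num_loop and accumulate into C
            // (atomically in this PR).
            std::printf("WG %u -> tile %u, k_iter %u, num_loop %u\n",
                        wg, tile_idx, k_iter, num_loop);

            iter_start += num_loop; // travel forward through the range
        }
    }
    return 0;
}
```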

Some other important notes:

  • As this is our first implementation, we are focused on correctly implementing the algorithm. Algorithmic optimization will come later.
  • There are currently 2 bugs in the StreamKTilePartitioner (a ticket has already been logged, with plans to work on them in upcoming sprints). These bugs cause certain instances of the Stream-K algorithm to fail on certain architectures. Thus, for this work, our test suite skips tests that have sk_blocks > 0. In other words, we only run cases where each macro tile in C is worked on by exactly one WG; all other tests are skipped. Update: commit 7328bc7 lifts this limitation.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Contributor

zjing14 commented Sep 5, 2025

@ecamartins Hi, it is great to see Stream-K in CK Tile. How does the performance of Stream-K compare to Split-K for random shapes?

Contributor

arai713 commented Sep 5, 2025

@ecamartins Hi, it is great to see Stream-K in CK Tile. How does the performance of Stream-K compare to Split-K for random shapes?

@zjing14 This is still the initial implementation; we have not yet had the chance to benchmark Stream-K vs Split-K.

@ecamartins ecamartins force-pushed the sk_operator_cktile branch 2 times, most recently from b2a7adf to 016fd56 on September 10, 2025 at 01:31
CongMa13 previously approved these changes Sep 11, 2025
cgmillette previously approved these changes Sep 12, 2025
Collaborator

@cgmillette cgmillette left a comment

LGTM

Contributor

@vidyasagar-amd vidyasagar-amd left a comment

LGTM

Collaborator

Do we have to use atomic_add for both StreamKReductionStrategy::Atomic and StreamKReductionStrategy::Reduction?

Collaborator Author

Thanks for your comment! StreamKReductionStrategy::Atomic will use atomic_add, whereas StreamKReductionStrategy::Reduction will use a set operation. StreamKReductionStrategy::Reduction is currently not implemented, so there is a guard in IsSupportedArgument to prevent users from selecting it. We will need to update the tests and the IsSupportedArgument function once the reduction is implemented :)
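
A minimal sketch of that kind of guard (illustrative only; the actual IsSupportedArgument in this PR checks more than this):

```cpp
// Reject the not-yet-implemented reduction path so callers fail fast.
enum class StreamKReductionStrategy { Atomic, Reduction };

template <StreamKReductionStrategy ReductionStrategy>
struct StreamKKernelSketch
{
    template <typename KernelArgs>
    static bool IsSupportedArgument(const KernelArgs& /*kargs*/)
    {
        if constexpr(ReductionStrategy == StreamKReductionStrategy::Reduction)
        {
            return false; // only the atomic strategy is supported for now
        }
        return true; // remaining checks omitted from this sketch
    }
};
```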

ecamartins and others added 9 commits September 16, 2025 14:31
@ecamartins ecamartins merged commit dee185d into develop Sep 16, 2025
44 of 52 checks passed
@ecamartins ecamartins deleted the sk_operator_cktile branch September 16, 2025 22:21
AviralGoelAMD pushed a commit that referenced this pull request Oct 16, 2025
* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function

Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset, but the only member
variable of the SplitKBatchOffset class used in the MakeGemmTensorViews
function was splitted_k (an int32_t). The splitted_k value was used as
part of defining the dimensions of the tensor view. For Stream-K, we do
not need the SplitKBatchOffset class since we are not using split-K.
Thus, this commit changes the splitk_batch_offset parameter to an
int32_t called k_size. This avoids the constraint of requiring a caller
of MakeGemmTensorViews to use the SplitKBatchOffset class while still
providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
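
Roughly, the change has this shape (placeholder types for illustration; the real signatures carry many more parameters):

```cpp
#include <cstdint>

struct SplitKBatchOffset { int32_t splitted_k; /* other members unused here */ };
struct TensorViews { int32_t k_extent; /* stands in for the real tensor views */ };

// Before: the whole struct was required even though only splitted_k was read.
TensorViews MakeGemmTensorViews_before(const SplitKBatchOffset& offset)
{
    return TensorViews{offset.splitted_k};
}

// After: callers (e.g. Stream-K) pass the K extent directly.
TensorViews MakeGemmTensorViews_after(int32_t k_size)
{
    return TensorViews{k_size};
}
```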

* StreamK Kernel RunGemm Implementation

Stream-K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:

1. The UniversalGemmKernel::RunGemm function computes num_loop based on
   a static function of the TilePartitioner, whereas for Stream-K,
   num_loop must be computed using a member function (namely
   GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a
   SplitKBatchOffset object, which is not used for Stream-K since we are
   not using split-K.

Thus, this change adds a RunGemm function in the StreamKKernel class.
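
The distinction in miniature (hypothetical types; the real GetCurrentIterLength lives in the StreamKTilePartitioner from PR #2708):

```cpp
#include <cstdint>

struct UniversalStylePartitioner
{
    // Split-K style: num_loop is derivable from launch constants alone.
    static int32_t GetLoopNum(int32_t k, int32_t k_batch) { return k / k_batch; }
};

struct StreamKPartitionerSketch
{
    int32_t iter_start, iter_end, iters_per_tile;
    // Stream-K style: depends on where this WG currently is in its iteration
    // range, so it has to be an instance method.
    int32_t GetCurrentIterLength() const
    {
        const int32_t left_in_tile = iters_per_tile - (iter_start % iters_per_tile);
        const int32_t remaining    = iter_end - iter_start;
        return left_in_tile < remaining ? left_in_tile : remaining;
    }
};
```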

* initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm

* Fix indexing and offset issues for StreamK

These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
  MPerBlock or NPerBlock, respectively. This ensures tile window origins
  are at the correct locations.
- Fix a bug in the tile partitioner's GetTileIdxWithOffset: we now apply
  divmod to the given references to ensure correct values are available
  to the caller.
- Add documentation in the Stream-K operator()
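
The offset fix amounts to scaling tile indices by the block-tile extents before using them as tile-window origins, and the partitioner fix is a plain divmod written through the output references (illustrative snippet; MPerBlock/NPerBlock stand for the kernel's compile-time tile sizes):

```cpp
#include <cstdint>

constexpr int32_t MPerBlock = 256; // illustrative values
constexpr int32_t NPerBlock = 256;

// Tile indices -> element offsets used as tile-window origins.
inline void tile_origin(int32_t tile_m_idx, int32_t tile_n_idx,
                        int32_t& i_m, int32_t& i_n)
{
    i_m = tile_m_idx * MPerBlock;
    i_n = tile_n_idx * NPerBlock;
}

// GetTileIdxWithOffset-style divmod: both quotient and remainder are written
// through the given references so the caller sees correct values.
inline void divmod(uint32_t iter, uint32_t iters_per_tile,
                   uint32_t& tile_idx, uint32_t& iter_in_tile)
{
    tile_idx     = iter / iters_per_tile;
    iter_in_tile = iter % iters_per_tile;
}
```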

* Initial gtests for Stream-K

These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which will
be handled in a future PR), there are validation issues for certain
cases, which may differ across architectures. Thus, we opted to run only
the fully data-parallel cases (skipping the others). A guard was added
to Stream-K's IsSupportedArgument method to ensure that callers are
aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and occupancy
were added to MakeKernelArgs.
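
Pinning the CU count and occupancy makes the Stream-K partition independent of the device a test happens to run on; a hedged sketch of the idea (the helper and the MakeKernelArgs parameters shown here are hypothetical):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Default to the device's CU count, but let tests pass an explicit value so
// the Stream-K partition is identical on every machine.
inline uint32_t num_cu_or_override(int override_num_cu = -1)
{
    if(override_num_cu > 0)
        return static_cast<uint32_t>(override_num_cu);
    hipDeviceProp_t prop{};
    (void)hipGetDeviceProperties(&prop, 0);
    return static_cast<uint32_t>(prop.multiProcessorCount);
}

// e.g. in a test (hypothetical signature):
//   auto kargs = Kernel::MakeKernelArgs(problem,
//                                       /*num_cu=*/num_cu_or_override(80),
//                                       /*occupancy=*/1);
```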

* Use GemmPipeline operator() variant that takes hot loop and tail num

In Stream-K, the num_loop value varies per WG and per iteration of the
Stream-K loop. We therefore use the version of the GemmPipeline's
operator() function that takes has_hot_loop and tail_num, similar to
what is done in Grouped GEMM.
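
The resulting call pattern mirrors the grouped GEMM kernels (paraphrased; helper names follow the grouped GEMM convention and may not match this PR exactly):

```cpp
// Inside the Stream-K RunGemm, per Stream-K iteration:
//   const index_t num_loop     = /* from the tile partitioner instance */;
//   const bool    has_hot_loop = GemmPipeline::BlockHasHotloop(num_loop);
//   const auto    tail_num     = GemmPipeline::GetBlockLoopTailNum(num_loop);
//   auto c_block_tile = GemmPipeline{}(a_block_window, b_block_window,
//                                      num_loop, has_hot_loop, tail_num,
//                                      smem_ptr);
```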

* changes from review: comments, move readfirstlane, remove ifndef

* Switch direction of C tensor traversal & add padding guard

Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 and then tile 0. This means
that iter_end decrements in each iteration of the Stream-K while loop.

Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes WGs travel forward, so their
iter_start is incremented and their iter_end remains fixed.

Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.

Together, these changes make the algorithm correct when sk_blocks is
greater than zero.
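
The hazard being avoided is plain unsigned wrap-around; a minimal standalone illustration (not the kernel code):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Subtracting the larger unsigned value from the smaller one wraps around
    // instead of going negative, which makes decrement-based bookkeeping on
    // iter_end easy to get wrong.
    const uint32_t iter_start = 5, iter_end = 3;
    std::printf("wrapped difference: %u\n", iter_end - iter_start); // 4294967294

    // Forward traversal keeps iter_end fixed and only increments iter_start,
    // so the loop condition terminates cleanly.
    for(uint32_t it = 0; it < iter_end; ++it)
        std::printf("visit tile %u\n", it);
    return 0;
}
```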

* Disable StreamK_M256_N256_K256_SKBlocks12 test case

This instance involves >=3 WGs contributing to each macro tile in C.
Due to the use of atomics, this results in precision errors. These
errors will not persist once the reduction strategy is implemented; we
will re-enable this test then.

---------

Co-authored-by: Astha Rai <[email protected]>