
Conversation


@ecamartins ecamartins commented Sep 16, 2025

Proposed changes

NOTE: This PR is left as a draft because it depends on PR #2781. Once PR #2781 is merged, I will update this PR and mark it as ready for review.

Recently, an initial implementation of Stream-K was added to CK Tile (see PR #2781) with a gtest suite covering initial cases for fp16 and bf16. This PR therefore adds a minimal example for Stream-K in example/ck_tile. Specifically, the example allows users to run Stream-K with either bf16 or fp16, with options to configure the number of Stream-K blocks, the reduction strategy, etc. At this time, only the atomic reduction strategy is supported (see the guard in IsSupportedArgument in PR #2781 for details).

There are plans to add more gtest cases to the Stream-K unit tests, covering more variants, including additional pipelines. These examples will be expanded once those unit tests are in place.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

ecamartins and others added 10 commits September 16, 2025 14:31
…::MakeGemmTensorViews function

Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset. However, the only
member variable of the SplitKBatchOffset class used in the
MakeGemmTensorViews function was splitted_k (an int32_t), which was
used when defining the dimensions of the tensor view. For Stream-K, we
do not need the SplitKBatchOffset class since we are not using
Split-K. Thus, this commit changes the splitk_batch_offset parameter
to an int32_t called k_size. This avoids requiring a caller of
MakeGemmTensorViews to use the SplitKBatchOffset class while still
providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
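
For illustration, a minimal sketch of the signature change (types and
the body are simplified here; the real function is a member of
UniversalGemmKernel and takes additional parameters):

```cpp
#include <cstdint>

// Before (simplified): callers had to construct a SplitKBatchOffset even
// though only its splitted_k member was read by MakeGemmTensorViews.
struct SplitKBatchOffset
{
    int32_t splitted_k; // portion of K handled by this Split-K batch
};

// After (simplified): take the K extent directly, so Stream-K callers do
// not need a SplitKBatchOffset at all. The body here is a placeholder;
// the real function builds the A/B/C tensor views, using k_size to size
// the K dimension.
auto MakeGemmTensorViews(int32_t k_size)
{
    return k_size;
}
```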
Stream-K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:

1. The UniversalGemmKernel::RunGemm function computes num_loop using a
   static function of the TilePartitioner. For Stream-K, however,
   num_loop must be computed using a member function (namely
   GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires a
   SplitKBatchOffset object, which is not used for Stream-K since we
   are not using Split-K.

Thus, this change adds a RunGemm function to the StreamKKernel class.
A sketch of the distinction follows.
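
An illustrative sketch of why the static interface does not fit (the
partitioner types and GetLoopNum below are simplified stand-ins, not
the exact CK Tile API; GetCurrentIterLength is the name used in
PR #2708):

```cpp
#include <cstdint>

// Universal GEMM (illustrative): the loop count is a pure function of the
// problem shape, identical for every workgroup, so a static function works.
struct UniversalTilePartitionerSketch
{
    static constexpr int32_t GetLoopNum(int32_t K, int32_t KPerBlock)
    {
        return (K + KPerBlock - 1) / KPerBlock;
    }
};

// Stream-K (illustrative): each workgroup owns its own slice of the K
// iteration space, so the loop count depends on per-WG state and must
// come from a member function.
struct StreamKTilePartitionerSketch
{
    uint32_t iter_start; // first iteration owned by this WG
    uint32_t iter_end;   // one past the last iteration owned by this WG

    uint32_t GetCurrentIterLength() const { return iter_end - iter_start; }
};
```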
These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
  MPerBlock or NPerBlock, respectively. This ensures tile window
  origins are at the correct locations.
- Fix a bug in the tile partitioner's GetTileIdxWithOffset. Now, we
  apply divmod to the given references to ensure correct values are
  available to the caller (see the sketch after this list).
- Added documentation in the Stream-K operator()
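
An illustrative sketch of these two fixes (the tile extents and the
quantities being divided are chosen for illustration only; the real
code operates on the partitioner's own fields):

```cpp
#include <cstdint>

// Fix 1 (illustrative): tile indices must be scaled by the tile extents
// to become element offsets, so tile window origins land on tile
// boundaries rather than on raw tile indices.
constexpr int32_t MPerBlock = 128; // example tile extents
constexpr int32_t NPerBlock = 128;

void GetOutputTileOrigin(int32_t tile_m, int32_t tile_n,
                         int32_t& i_m, int32_t& i_n)
{
    i_m = tile_m * MPerBlock;
    i_n = tile_n * NPerBlock;
}

// Fix 2 (illustrative): results are returned through references, so the
// divmod must be applied to those references. Before the fix, the
// quotient/remainder never reached the caller's variables.
void GetTileIdxWithOffset(uint32_t iter, uint32_t iters_per_tile,
                          uint32_t& tile_idx, uint32_t& iter_offset)
{
    tile_idx    = iter / iters_per_tile; // written to the caller's variable
    iter_offset = iter % iters_per_tile; // written to the caller's variable
}
```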
These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which
will be handled in a future PR), there are validation issues for
certain cases, which may differ across architectures. Thus, we opted
to run only cases that are fully data-parallel (skipping the others).
A guard was added to Stream-K's IsSupportedArgument method to ensure
that callers are aware of this constraint. Additionally, to ensure
testing reproducibility, options for setting the number of CUs and the
occupancy were added to MakeKernelArgs, as sketched below.
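
A hedged sketch of what such overrides might look like (the parameter
names and the surrounding signature are hypothetical; the actual
interface is defined in PR #2781):

```cpp
#include <cstdint>

// Hypothetical sketch: letting tests pin the grid configuration instead
// of querying the device, so runs are reproducible across machines.
struct StreamKKernelArgsSketch
{
    uint32_t num_cu;    // hypothetical: overrides the detected CU count
    uint32_t occupancy; // hypothetical: workgroups resident per CU
    // ... problem sizes, strides, reduction strategy, etc. ...
};

StreamKKernelArgsSketch MakeKernelArgs(uint32_t num_cu, uint32_t occupancy)
{
    return StreamKKernelArgsSketch{num_cu, occupancy};
}
```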
In Stream-K, the num_loop value varies per WG and per iteration of the
Stream-K loop. So instead, we use the version of the GemmPipeline's
operator() function that takes has_hot_loop and tail_num. This is
similar to what is done in Grouped GEMM.
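
An illustrative sketch of the two call styles (the signatures below are
heavily simplified; the real operator() also takes tile windows and
other state):

```cpp
#include <cstdint>

enum class TailNumber { Full, Odd, Even }; // illustrative subset

struct GemmPipelineSketch
{
    // Illustrative: a variant where hot-loop/tail handling is derived
    // from a compile-time loop count, suitable when every WG runs the
    // same number of iterations.
    template <int32_t NumLoop>
    void operator()() const {}

    // Illustrative: the variant used here (as in Grouped GEMM), where
    // the caller supplies has_hot_loop and tail_num at runtime because
    // num_loop differs per WG and per Stream-K iteration.
    void operator()(int32_t num_loop, bool has_hot_loop, TailNumber tail_num) const
    {
        (void)num_loop; (void)has_hot_loop; (void)tail_num;
    }
};
```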
Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 and then tile 0. This means
that iter_end decrements in each iteration of the Stream-K while loop.

Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes it such that WGs travel
forward, so that their iter_start is incremented and their iter_end
remains fixed.

Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.

Together, these changes make it such that the algorithm is correct
when sk_blocks is greater than zero. A sketch of the traversal change
follows.
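
An illustrative sketch of the forward traversal (the field and variable
names follow the commit text; the clamping logic is an assumption of
this sketch):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative: iter_end stays fixed and iter_start advances, so the
// loop only adds to an unsigned counter; the iter_end - iter_start
// subtraction is safe because the loop condition guarantees
// iter_start < iter_end.
void StreamKLoopSketch(uint32_t iter_start, uint32_t iter_end,
                       uint32_t iters_per_tile)
{
    while(iter_start < iter_end)
    {
        // Iterations contributed to the current C tile, clamped to the
        // WG's remaining range (clamping is an assumption of this sketch).
        const uint32_t iter_length =
            std::min(iters_per_tile, iter_end - iter_start);

        // ... run the GEMM pipeline for iter_length iterations and
        //     accumulate into the current C macro tile ...

        iter_start += iter_length; // travel forward through assigned tiles
    }
}
```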
This instance involves >=3 WGs contributing to each macro tile in C.
Due to the use of atomics, this results in precision errors. These
errors will not persist once the reduction strategy is implemented. We
will re-enable this test then.
Addition of an initial CK Tile Stream-K example for bf16 and fp16.
These examples are minimal. As more functionality and gtests are added
for Stream-K (coming in future PRs), these examples will be expanded.
@ecamartins ecamartins self-assigned this Sep 16, 2025
@ecamartins ecamartins closed this Sep 16, 2025
@ecamartins ecamartins deleted the emimarti/ck_tile/streamk_example branch September 29, 2025 22:25
