[CK_TILE] Stream-K GEMM Implementation #2781
Conversation
Force-pushed from 3f20e3c to 8956bbd
@ecamartins Hi, it is great to see StreamK in CKTile. How about performance of StreamK vs SplitK for random shapes?
@zjing14 This is still the initial implementation; we have not yet had the chance to benchmark StreamK vs SplitK.
Force-pushed from b2a7adf to 016fd56
Force-pushed from 00f76d7 to 7328bc7
cgmillette left a comment: LGTM
cgmillette left a comment: LGTM
vidyasagar-amd left a comment: LGTM
Do we have to use atomic_add for both StreamKReductionStrategy::Atomic and StreamKReductionStrategy::Reduction?
Thanks for your comment! StreamKReductionStrategy::Atomic will use atomic_add, whereas StreamKReductionStrategy::Reduction will use a set operation. StreamKReductionStrategy::Reduction is not yet implemented, so there is a guard in IsSupportedArgument to prevent users from selecting it. We will need to update the tests and the IsSupportedArgument function once the reduction is implemented :)
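To illustrate the distinction, here is a minimal sketch of how an epilogue store could dispatch on the reduction strategy. Only the enum values come from this PR; the function name and structure below are hypothetical and do not reflect the actual CK Tile epilogue API.

```cpp
// Local stand-in for the real CK Tile enum; everything else in this sketch is
// a hypothetical illustration, not the CK Tile implementation.
enum class StreamKReductionStrategy { Atomic, Reduction };

template <StreamKReductionStrategy Strategy>
__device__ void store_partial_result(float* c_ptr, float partial)
{
    if constexpr(Strategy == StreamKReductionStrategy::Atomic)
    {
        // Every WG that touches this C macro tile accumulates atomically.
        atomicAdd(c_ptr, partial);
    }
    else
    {
        // Reduction strategy (not implemented in this PR): partial results would
        // go to a workspace, and one WG would combine them and write the final
        // value with a plain set/store rather than an atomic add.
        *c_ptr = partial;
    }
}
```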
Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function
Prior to this change, the splitk_batch_offset parameter of MakeGemmTensorViews had type SplitKBatchOffset. But the only member variable of the SplitKBatchOffset class used in the MakeGemmTensorViews function was splitted_k (an int32_t). The splitted_k value was used as part of defining the dimensions of the tensor view. However, for Stream-K we do not need to use the SplitKBatchOffset class since we are not using Split-K. Thus, this commit changes the splitk_batch_offset parameter to an int32_t called k_size. This avoids the constraint of requiring a caller of MakeGemmTensorViews to use the SplitKBatchOffset class while still providing the same functionality. Calls to UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
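A rough before/after sketch of this interface change; the surrounding parameters are elided and the exact signatures are assumptions rather than the real CK Tile declarations.

```cpp
#include <cstdint>

// Simplified stand-in for the real CK Tile struct: splitted_k was the only
// member that MakeGemmTensorViews actually read.
struct SplitKBatchOffset
{
    std::int32_t splitted_k;
    // ... per-operand pointer offsets used elsewhere by Split-K ...
};

// Before (sketch): callers had to construct a whole SplitKBatchOffset.
//   MakeGemmTensorViews(a_ptr, b_ptr, c_ptr, kargs, splitk_batch_offset);
// After (sketch): only the K extent is passed, so Stream-K callers can supply
// an integer directly without touching any Split-K machinery.
//   MakeGemmTensorViews(a_ptr, b_ptr, c_ptr, kargs, /*k_size=*/k_size);
```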
Stream-K cannot simply use UniversalGemmKernel's RunGemm for the following reasons:
1. The UniversalGemmKernel::RunGemm function computes num_loop based on a static function of the TilePartitioner. For Stream-K, however, num_loop must be computed using a member function (namely GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a SplitKBatchOffset object, which is not used for Stream-K since we are not using Split-K.
Thus, this change adds a RunGemm function in the StreamKKernel class.
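A hedged sketch of how the Stream-K RunGemm could be shaped so that num_loop arrives as a parameter and the Stream-K control flow stays outside of it; the parameter list and names here are illustrative assumptions, not the actual declaration.

```cpp
// Sketch only: the real CK Tile declaration may differ; ck_tile::index_t and
// CK_TILE_DEVICE assume the CK Tile headers are available.
template <typename KernelArgs>
CK_TILE_DEVICE static void RunGemm(const KernelArgs& kargs,
                                   void* smem_ptr,            // LDS scratch
                                   ck_tile::index_t i_m,      // C-tile origin along M
                                   ck_tile::index_t i_n,      // C-tile origin along N
                                   ck_tile::index_t k_offset, // start of this WG's K slice
                                   ck_tile::index_t num_loop) // K iterations for this slice,
                                                              // from GetCurrentIterLength (PR #2708)
{
    // Build the A/B/C tensor views with a k_size derived from num_loop, place
    // tile windows at (i_m, i_n, k_offset), run the GEMM pipeline, and hand the
    // accumulator to the epilogue (atomic add in this PR).
}
```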
initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm
These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by MPerBlock or NPerBlock, respectively. This ensures tile window origins are at the correct locations.
- Fix a bug in the tile partitioner's GetTileIdxWithOffset: we now apply divmod to the given references to ensure correct values are available to the caller.
- Add documentation in the Stream-K operator().
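An illustrative sketch of the two indexing fixes described above; the exact signature, semantics, and the iters_per_tile member are assumptions based on this description.

```cpp
// Sketch of the offset fix: tile indices must be scaled by the macro-tile sizes
// to obtain the tile window origins (MPerBlock/NPerBlock are compile-time sizes).
const auto i_m = tile_idx_m * MPerBlock;
const auto i_n = tile_idx_n * NPerBlock;

// Sketch of the GetTileIdxWithOffset fix (shown free-standing; in CK Tile it is
// a partitioner member): the divide/modulo ("divmod") results must be written
// through the references the caller passes in.
CK_TILE_DEVICE void GetTileIdxWithOffset(uint32_t iter,
                                         uint32_t& tile_idx,
                                         uint32_t& iter_offset)
{
    tile_idx    = iter / iters_per_tile; // which C macro tile this iteration belongs to
    iter_offset = iter % iters_per_tile; // position within that tile's K range
}
```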
These changes add an initial gtest suite for the CK Tile Stream-K kernel. Currently, due to bugs in the StreamKTilePartitioner (which will be handled in a future PR), there are validation issues for certain cases, and these may differ across architectures. Thus, we opted to run only the cases that are fully data-parallel (skipping the others). A guard was added to Stream-K's IsSupportedArgument method to ensure that callers are aware of this constraint. Additionally, to ensure testing reproducibility, options for setting the number of CUs and occupancy were added to MakeKernelArgs.
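A hedged usage sketch of those reproducibility hooks: the test pins the CU count and occupancy rather than querying the device, so the Stream-K decomposition is identical on every machine. The values and the exact MakeKernelArgs parameter list shown here are assumptions, and host_args is a hypothetical host-side GEMM description.

```cpp
// Sketch: parameter order and values are assumptions, not the real API.
constexpr uint32_t kNumCu     = 80; // fixed CU count so the partitioning is reproducible
constexpr uint32_t kOccupancy = 1;  // resident WGs per CU assumed by the partitioner

const auto kargs = StreamKKernel::MakeKernelArgs(
    host_args,                        // hypothetical host-side GEMM description
    StreamKReductionStrategy::Atomic, // strategy under test
    kNumCu,
    kOccupancy);
```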
In Stream-K, the num_loop value varies per WG and per iteration of the Stream-K loop. So instead, we use the version of the GemmPipeline's operator() function that takes in has_hot_loop and tail_num. This is similar to what is done in Grouped GEMM.
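A sketch of that runtime dispatch: the loop shape is derived from num_loop and passed to the pipeline explicitly. The helper names follow the pattern used by other CK Tile GEMM kernels, but treat the exact spellings here as assumptions.

```cpp
// Sketch: num_loop is a runtime quantity per WG and per Stream-K iteration, so
// the pipeline overload that takes the loop shape explicitly is used.
const auto num_loop     = tile_partitioner.GetCurrentIterLength(iter_start, iter_end, iters_per_tile);
const bool has_hot_loop = GemmPipeline::BlockHasHotloop(num_loop);
const auto tail_num     = GemmPipeline::GetBlockLoopTailNum(num_loop);

// Same idea as Grouped GEMM: hand the loop shape to the pipeline instead of
// letting it derive num_loop from a compile-time partitioner function.
const auto c_block_tile = GemmPipeline{}(
    a_block_window, b_block_window, num_loop, has_hot_loop, tail_num, smem_ptr);
```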
Prior to this change, WGs travelled backwards through their assigned macro tiles in the C tensor. For instance, if WG0 is responsible for C tiles 0 and 1, it would first visit tile 1 and then tile 0. This means that iter_end decrements in each iteration of the Stream-K while loop. Since we are working with unsigned integers, the subtraction operation may not be safe. Thus, this change makes it such that WGs travel forward, so that their iter_start is incremented and their iter_end remains fixed. Additionally, we added a guard against WGs that are neither sk_blocks nor dp_blocks to ensure such WGs do not participate in the GEMM. Together, these changes make the algorithm correct when sk_blocks is greater than zero.
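A small illustration of the unsigned-arithmetic motivation: the old backward walk subtracts from an unsigned counter (which wraps around instead of going negative), while the new forward walk only adds. Variable names mirror the commit message; the loop body is a placeholder.

```cpp
#include <cstdint>

void traverse_forward(std::uint32_t iter_start, const std::uint32_t iter_end,
                      std::uint32_t iters_per_tile)
{
    // Old direction (sketch): iter_end -= ... on an unsigned value can wrap to a
    // huge number if a bound is off by one, silently corrupting the loop.
    //
    // New direction: iter_end stays fixed and iter_start only ever increases,
    // so no unsigned subtraction is needed.
    while(iter_start < iter_end)
    {
        // ... process the C macro tile covered by iter_start ...
        iter_start += iters_per_tile; // advance by the iterations just consumed
    }
}
```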
This instance involves >=3 WGs contributing to each macro tile in C. Due to the use of atomics, this is resulting in precision errors. These errors will not persist once the reduction strategy is implemented. We will re-enable this test then.
Force-pushed from e4975f2 to f932ebb
Proposed changes
This PR, made in collaboration with @arai713, contains the first implementation of Stream-K in CK Tile. Our implementation is based on old CK's initial Stream-K implementation (see example/01_gemm/gemm_xdl_streamk.cpp). Stream-K is a GEMM algorithm that helps avoid imbalanced workloads across CUs. When more than one WG is assigned to the same C macro tile, the work is split along the K dimension, where WGs can either atomically add their results to C or select one WG to combine partial results via reduction. Since this is our first implementation of Stream-K in CK Tile, we have opted to implement the atomic strategy with the reduction strategy coming in future work.
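To make the work split concrete, here is a small conceptual sketch (not the CK Tile partitioner itself) of how Stream-K hands each workgroup a contiguous slice of the global K-iteration space, which is why several WGs can end up contributing to the same C macro tile:

```cpp
#include <algorithm>
#include <cstdint>

// Conceptual sketch only. total_iters = (#C macro tiles) * (K iterations per tile).
void stream_k_slice(std::uint32_t wg_id, std::uint32_t num_wgs, std::uint32_t total_iters,
                    std::uint32_t& iter_start, std::uint32_t& iter_end)
{
    const std::uint32_t iters_per_wg = (total_iters + num_wgs - 1) / num_wgs; // ceil-divide
    iter_start = std::min(wg_id * iters_per_wg, total_iters);
    iter_end   = std::min(iter_start + iters_per_wg, total_iters);
    // A slice may begin or end in the middle of a tile's K range, so a C macro
    // tile can receive partial results from more than one WG; with the atomic
    // strategy each WG simply atomic-adds its partial tile into C.
}
```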
For our implementation, this PR makes 3 main additions/alterations:
1) Updating Universal GEMM's `MakeGemmTensorViews` interface: The `MakeGemmTensorViews` function has a parameter of type `SplitKBatchOffset`, where only the integer member variable `splitted_k` is used in `MakeGemmTensorViews`. Since Stream-K, in general, is not split-K, we have updated the `MakeGemmTensorViews` interface to require only an integer rather than a `SplitKBatchOffset`. This ensures that the same functionality is supported in `MakeGemmTensorViews` while avoiding the constraint of requiring a `SplitKBatchOffset`.
2) Adding a `RunGemm` function to the StreamKKernel class: The Universal GEMM Kernel's `RunGemm` function assumes that (a) the kernel uses a `SplitKBatchOffset` and (b) the tile partitioner can statically determine the number of iterations along the K dimension for a given WG (i.e., `num_loop`). These assumptions do not hold for Stream-K: we do not use `SplitKBatchOffset`, and an instance method of the tile partitioner is needed to determine `num_loop`. In the Stream-K `RunGemm`, we opted to have `num_loop` as a function parameter to ensure that the main Stream-K logic is isolated from `RunGemm`.
3) Implementing the Stream-K kernel's operator() function: This function contains the main Stream-K looping logic. In each iteration of the main Stream-K while loop, the tile partitioner determines offsets into the A, B, and C tensors for the current iteration, and then `RunGemm` is called; a skeleton of this loop is sketched just below.
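As referenced in item 3, a skeleton of the main Stream-K loop might look like the following. GetTileIdxWithOffset and GetCurrentIterLength come from the descriptions in this PR, while GetWorkgroupIterRange, iters_per_tile, and the surrounding structure are illustrative assumptions, not the literal CK Tile code.

```cpp
// Skeleton sketch of the Stream-K kernel's operator() loop (member of the
// kernel class); names not mentioned in the PR are assumed for illustration.
CK_TILE_DEVICE void operator()(KernelArgs kargs) const
{
    uint32_t iter_start = 0, iter_end = 0;
    tile_partitioner.GetWorkgroupIterRange(blockIdx.x, iter_start, iter_end); // assumed helper

    while(iter_start < iter_end)
    {
        // Map the current global iteration to a C macro tile and a K offset.
        uint32_t tile_idx = 0, iter_offset = 0;
        tile_partitioner.GetTileIdxWithOffset(iter_start, tile_idx, iter_offset);

        // Number of K iterations of that tile owned by this WG on this pass.
        const uint32_t num_loop =
            tile_partitioner.GetCurrentIterLength(iter_start, iter_end, iters_per_tile);

        // Per-tile GEMM; with the atomic strategy the partial result is
        // atomically accumulated into the corresponding tile of C.
        RunGemm(kargs, tile_idx, iter_offset, num_loop);

        iter_start += num_loop; // WGs travel forward through the iteration space
    }
}
```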
Some other important notes:
- There are currently 2 bugs in the `StreamKTilePartitioner` (a ticket has already been logged, with plans to work on them in upcoming sprints). These bugs mean that, on certain architectures, certain instances of the Stream-K algorithm fail. Thus, for this work, our test suite skips tests that have sk_blocks > 0; in other words, we only run cases where each macro tile in C is worked on by only one WG, and all other tests are skipped. Update: commit 7328bc7 lifts this limitation.
Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- `clang-format` on all changed files