
Conversation

@JH-Leon-KIM-AMD (Contributor) commented Sep 4, 2025

Proposed changes

This PR adds Split-N functionality to the grouped convolution forward operation in CK Tile, enabling efficient processing of large tensors (>2GB) that would otherwise face memory addressing limitations.

Problem solved:

  • GPU tensors exceeding 2GB (2^31 bytes) face memory addressing limitations
  • Split-N automatically partitions the batch dimension when tensors exceed this threshold
  • Enables handling of production workloads with large batch sizes

Implementation Details

Core Split-N logic:

  • grouped_convolution_forward_kernel.hpp: Added Split-N support with grid.z dimension for batch parallelization
  • transform_conv_fwd_to_gemm.hpp: Added GetSplitedNSize() to detect 2GB threshold and calculate optimal number of splits
  • The algorithm picks the smallest split count that divides N evenly while keeping each split under 2GB (see the sketch after this list)
  • Kernel uses blockIdx.z to index into split batches
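
A minimal sketch of that split-count selection, with stand-in types for the ck_tile integer aliases; the real implementation is GetSplitedNSize() in transform_conv_fwd_to_gemm.hpp and may differ in name, signature, and details:

#include <cstdint>

using index_t      = std::int32_t; // stand-in for ck_tile::index_t
using long_index_t = std::int64_t; // stand-in for ck_tile::long_index_t

// Sketch (hypothetical helper): pick the smallest split count that divides N
// evenly while keeping each split's tensor under the 2 GB (2^31 byte) limit.
index_t get_n_per_split(index_t N, long_index_t bytes_per_batch)
{
    const long_index_t two_gb = long_index_t{1} << 31;

    if(static_cast<long_index_t>(N) * bytes_per_batch < two_gb)
        return N; // fits as-is, no split needed

    for(index_t splits = 2; splits <= N; ++splits)
    {
        if(N % splits == 0 &&
           static_cast<long_index_t>(N / splits) * bytes_per_batch < two_gb)
            return N / splits; // n_per_split for the smallest valid split count
    }
    return 1; // indivisible (e.g. prime) N: one batch element per split
}

The split count then follows as n_splits = N / n_per_split, and the kernel indexes the resulting splits through blockIdx.z.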

Testing:

  • Verified locally with tile_example_grouped_conv_fwd
  • Split-N activates correctly for tensors >2GB (grid.z > 1)
  • Tested up to 200 splits successfully with modified threshold
  • Unit tests removed from PR due to Jenkins environment differences (tests pass locally)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Test Results (MI250)

  • Local testing with tile_example_grouped_conv_fwd:

100 MB threshold (lowered to force Split-N) with CPU reference:

# N=32, H=W=112 (~200MB) - Split-N active
./bin/tile_example_grouped_conv_fwd -prec=fp16 -g=1 -n=32 -c=256 -k=256 -y=3 -x=3 -h=112 -w=112 -v=0
grid: {6052, 1, 4}, blocks: {256, 1, 1} # Z=4 splits
44.0296 ms, 10.3739 TFlops, ✓ Validation: PASSED

# N=48 (~300MB) - Split-N with 6 splits
grid: {6052, 1, 6}, blocks: {256, 1, 1} # Z=6 splits
66.0836 ms, 10.3678 TFlops, ✓ Validation: PASSED

# N=200 (~1.2GB with 100MB threshold) - Split-N with 200 splits
grid: {1208, 1, 200}, blocks: {256, 1, 1} # Z=200 splits
869.812 ms, 10.42 TFlops, ✓ CPU Validation: PASSED

2 GB threshold (the default) without CPU reference:

# N=128, C=512, H=W=96 (~1.2GB) - No Split-N
grid: {70688, 1, 2}, blocks: {256, 1, 1} # Z=2 (just under threshold)
510.706 ms, 10.4498 TFlops, ✓ Validation: PASSED

# N=256 (~2.5GB) - Split-N active
grid: {76832, 1, 4}, blocks: {256, 1, 1} # Z=4 splits
1109.5 ms, 10.4563 TFlops, ✓ Validation: PASSED

# N=320 (~3.1GB) - Split-N with 4 splits
grid: {96040, 1, 4}, blocks: {256, 1, 1} # Z=4 splits
1386.74 ms, 10.4573 TFlops, ✓ Validation: PASSED

# N=400 (~3.9GB) - Split-N with 4 splits
grid: {115248, 1, 4}, blocks: {256, 1, 1} # Z=4 splits
1664.27 ms, 10.4562 TFlops, ✓ Validation: PASSED

Discussion

Important Split-N behavior for odd/prime batch sizes

Currently, when N cannot be evenly divided, the implementation falls back to n_per_split=1, meaning each batch element is processed separately:

Example:

  • N=128: Splits evenly into 2×64 (n_splits=2, n_per_split=64) ✓
  • N=127: Cannot split evenly, falls back to 127×1 (n_splits=127, n_per_split=1) ⚠️

This fallback ensures correctness but can lead to inefficient parallelization for prime numbers. Potential improvements for future work:

  1. Padding approach: Add dummy batches to make N divisible
  2. Uneven split handling: Allow splits of different sizes
  3. Relaxed divisibility: Find nearest good divisor and handle remainder

Split-K conflict prevention

Both Split-K and Split-N use blockIdx.z for parallelization. The current implementation prevents using both simultaneously by checking in the kernel arguments. This ensures correctness but may limit optimization opportunities in some scenarios.
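
A rough sketch of that guard, with argument names (k_batch, n_splits) borrowed from the snippets quoted later in this thread; the actual check lives in the kernel-argument validation and may differ:

// Illustrative only, not the PR's exact code: reject configurations that
// request both Split-K and Split-N, since each would need to own blockIdx.z.
if(kargs.k_batch > 1 && kargs.n_splits > 1)
{
    return false; // arguments not supported; the caller must choose one
}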

@JH-Leon-KIM-AMD force-pushed the LWPCK-3657-splitn-support branch 2 times, most recently from f189102 to e64b01c on September 5, 2025 09:14
@JH-Leon-KIM-AMD force-pushed the LWPCK-3657-splitn-support branch from e64b01c to 75589b5 on September 5, 2025 09:20
The Split-N implementation remains functional in:
- include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
- include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Tests pass locally and Split-N is verified working with tile_example_grouped_conv_fwd.
@JH-Leon-KIM-AMD force-pushed the LWPCK-3657-splitn-support branch from 315518e to 950f77d on September 6, 2025 20:35
jakpiase previously approved these changes Sep 10, 2025

@jakpiase (Contributor) left a comment:

Awesome, glad you have tested it! LGTM

@bartekxk changed the title from "Lwpck 3657 splitn support" to "[CK Tile] Grouped conv fwd splitn support" on Sep 10, 2025
@bartekxk (Contributor) left a comment:

Nice


// Note: GemmM will be updated after Split-N calculation
// Initially set to full size, will be adjusted after transformer creation
GemmM = args.N_ * args.output_spatial_lengths_[0];

Reviewer (Contributor): So what is the reason to initialize GemmM here?

@JH-Leon-KIM-AMD (Author), Sep 11, 2025: You're right. This initialization is redundant.

return dim3(
TilePartitioner::GridSize(kargs.GemmM, kargs.GemmN), kargs.GemmBatch, kargs.k_batch);
// Ensure n_splits is at least 1 (defensive programming)
const index_t grid_z = (kargs.n_splits == 0 || kargs.n_splits > 1024) ? 1 : kargs.n_splits;

Reviewer (Contributor): why > 1024?

@JH-Leon-KIM-AMD (Author): I removed this line. It looks overly defensive.

  • n_splits can't be 0: Default is 1, and ceiling division always returns ≥1
  • 1024 is arbitrary: No technical reason for this limit, and it could block valid large batch cases


// Check if this split is valid
// With exact divisors, this should never happen, but keep as safety check
if(batch_offset >= kargs.original_n)

Reviewer (Contributor): I think this extra condition is not needed and can impact the perf.

@JH-Leon-KIM-AMD (Author): You are right! I removed it; it's an unnecessary validation check.

// Get the actual split N from transformer
n_per_split = conv_to_gemm_transformer.GetN();
original_n = conv_to_gemm_transformer.GetOriginalN();
n_splits = (original_n + n_per_split - 1) / n_per_split; // Calculate number of splits

Reviewer (Contributor): Use integer_divide_ceil function

@JH-Leon-KIM-AMD (Author): Thank you! I updated it.
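
For reference, the updated line presumably reads something like the following sketch (the exact code is in the PR diff):

n_splits = ck_tile::integer_divide_ceil(original_n, n_per_split);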

Comment on lines 851 to 852
const index_t input_batch_offset = batch_offset * kargs.input_batch_stride;
const index_t output_batch_offset = batch_offset * kargs.output_batch_stride;

Reviewer (Contributor): Please use __builtin_amdgcn_readfirstlane

@JH-Leon-KIM-AMD (Author): Thank you, I updated it.

const index_t split_id = blockIdZ;

// Calculate batch offset for this split
const index_t batch_offset = split_id * kargs.n_per_split;

Reviewer (Contributor): __builtin_amdgcn_readfirstlane

@JH-Leon-KIM-AMD (Author): Thank you! I updated it.

- Remove redundant GemmM initialization
- Add __builtin_amdgcn_readfirstlane for batch offsets
- Remove unnecessary validation checks
- Use ck_tile::integer_divide_ceil instead of manual ceiling division
k_batch = args.k_batch;

GemmM = args.N_ * args.output_spatial_lengths_[0];
// GemmM will be set after Split-N calculation

Reviewer (Contributor): Can you apply the same approach for each constructor?

- Fix 32-bit integer overflow in batch offset calculation by using long_index_t
- Remove redundant GemmM assignments in 2D/3D constructors for consistency
- This fixes crashes when using 6+ splits with large tensors (>2GB)
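
Taken together, the batch-offset computation after these fixes presumably looks roughly like the sketch below (variable names follow the snippets quoted above; the exact final code is in the diff). __builtin_amdgcn_readfirstlane keeps the per-split offset in a scalar register, and widening to long_index_t before the stride multiply avoids the 32-bit overflow:

// Sketch, not the verbatim PR code.
const index_t split_id     = __builtin_amdgcn_readfirstlane(blockIdx.z);
const index_t batch_offset = __builtin_amdgcn_readfirstlane(split_id * kargs.n_per_split);

// Widen to 64-bit before multiplying by the batch strides so tensors larger
// than 2 GB do not overflow a 32-bit element offset.
const long_index_t input_batch_offset =
    static_cast<long_index_t>(batch_offset) * kargs.input_batch_stride;
const long_index_t output_batch_offset =
    static_cast<long_index_t>(batch_offset) * kargs.output_batch_stride;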

@bartekxk (Contributor) left a comment:

LGTM

@JH-Leon-KIM-AMD merged commit 804065a into develop Sep 16, 2025
37 of 49 checks passed
@JH-Leon-KIM-AMD deleted the LWPCK-3657-splitn-support branch on September 16, 2025 13:56
AviralGoelAMD pushed a commit that referenced this pull request Sep 21, 2025
## What's New
Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension.

## Bug Fix
Fixed 32-bit integer overflow that caused crashes with 6+ splits:
- Use `long_index_t` for batch offset calculations
- Remove redundant GemmM initialization in constructors

## How It Works
- Automatically splits batch dimension when tensor exceeds 2GB
- Uses grid.z dimension for parallel processing of splits
- Each split processes a subset of batches independently

## Testing
Verified with tile_example_grouped_conv_fwd:
- n=3000 (6 splits) ✓
- n=3500 (7 splits) ✓
- n=10480 (40 splits) ✓