TF32 POC in Conv3d on MI30x platform #2763

lymAMD · 2025-09-01T07:21:05Z

Proposed changes

Demonstrate a TF32 (XF32 in the CDNA3 ISA) kernel in conv3d. Also add many instances for MIOpen.
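For context, TF32 keeps the sign bit and 8-bit exponent of FP32 but only 10 mantissa bits. The sketch below emulates that precision loss on the host. It assumes plain truncation, whereas the hardware conversion may round differently, and `truncate_to_tf32` is a hypothetical helper name, not part of CK.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of CK): emulate TF32 precision on the host by
// keeping the sign bit, the 8 exponent bits and the top 10 mantissa bits of an
// FP32 value. Assumption: plain truncation; real hardware may round instead.
inline float truncate_to_tf32(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits)); // bit-cast without aliasing issues
    bits &= 0xFFFFE000u;                  // clear the low 13 mantissa bits (23 - 10)
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```

Under this assumption, a value such as `1 + 2^-10` survives unchanged, while `1 + 2^-11` collapses to `1`, which is why TF32 results need looser verification tolerances than FP32.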

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

...de/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

example/01_gemm/gemm_xdl_lds_direct_load_fp32_tf32.cpp

bartekxk · 2025-09-05T11:33:17Z

.gitignore

@@ -70,4 +70,3 @@ build*/
 __pycache__/



Yes. It seems VSCode automatically deleted a space. I will restore it.

bartekxk · 2025-09-05T11:33:45Z

example/01_gemm/common.hpp

@@ -310,10 +310,14 @@ bool parse_cmd_args<ProblemSizeSplitK>(int argc,
    return true;
 }

-template <typename DataType>
+template <typename DataType, typename GemmType = DataType>


Can you change it to ComputeType to keep the naming convention?

Sure. I will use ComputeDataType to align with device_gemm_xdl_cshuffle_lds_direct_load.hpp#L61
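The pattern under discussion, a second template parameter that defaults to the data type, can be sketched as follows. This is an illustration only: `tf32_tag`, `example_rtol` and the tolerance values are hypothetical and are not taken from the PR.

```cpp
#include <type_traits>

// Hypothetical tag standing in for a TF32 compute type.
struct tf32_tag
{
};

// ComputeDataType defaults to DataType, so existing call sites that pass only
// DataType keep compiling and keep their previous behavior.
template <typename DataType, typename ComputeDataType = DataType>
constexpr double example_rtol()
{
    if constexpr(std::is_same_v<ComputeDataType, tf32_tag>)
        return 5e-3; // illustrative value: looser tolerance for reduced mantissa
    else if constexpr(std::is_same_v<DataType, float>)
        return 1e-3; // illustrative value
    else
        return 1e-6; // illustrative value, e.g. for double
}
```

A caller would write `example_rtol<float>()` for plain FP32 and `example_rtol<float, tf32_tag>()` when the kernel computes in TF32.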

bartekxk · 2025-09-05T11:34:41Z

example/01_gemm/run_gemm_example.inc

@@ -4,6 +4,11 @@
 #pragma once
 #include "ck/library/utility/validation_common.hpp"

+// use macro to minimize code change
+#ifndef EXAMPLE_WITH_GEMM_DATATYPE
+using GemmDataType = AccDataType;


ComputeType

bartekxk · 2025-09-05T11:35:00Z

example/09_convnd_fwd/convnd_fwd_common.hpp

@@ -68,10 +72,14 @@ inline __host__ __device__ constexpr double get_rtol()
    }
 }

-template <typename DataType>
+template <typename DataType, typename GemmType = DataType>


ComputeType

bartekxk · 2025-09-05T11:36:19Z

example/09_convnd_fwd/run_convnd_fwd_example.inc

+#ifndef EXAMPLE_WITH_GEMM_DATATYPE
+using GemmDataType = AccDataType;
+#endif


ComputeDataType
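With the requested rename applied, the macro-guarded default could look roughly like the sketch below. The macro name `EXAMPLE_WITH_COMPUTE_DATATYPE` and the choice of `float` for `AccDataType` are assumptions made for illustration; the PR may use different names.

```cpp
#include <type_traits>

using AccDataType = float; // assumed accumulator type for this sketch

// An example that wants a different compute type would define the
// (hypothetical) macro and its own ComputeDataType alias before this point.
#ifndef EXAMPLE_WITH_COMPUTE_DATATYPE
using ComputeDataType = AccDataType; // default: compute in the accumulator type
#endif

// Without the macro, the default alias is in effect.
static_assert(std::is_same_v<ComputeDataType, AccDataType>,
              "default compute type should match the accumulator type");
```

The guard keeps existing examples unchanged while letting a TF32 example override the alias in a single place.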

bartekxk · 2025-09-05T11:44:03Z

library/include/ck/library/reference_tensor_operation/gpu/reference_gemm.hpp

@@ -111,8 +111,9 @@ template <typename ALayout,
          typename AElementwiseOperation,
          typename BElementwiseOperation,
          typename CElementwiseOperation,
-          typename ComputeTypeA = CDataType,
-          typename ComputeTypeB = ComputeTypeA>
+          typename ComputeTypeA    = CDataType,


bartekxk · 2025-09-05T11:44:36Z

..._operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_xdl_dynamic_op_instance.hpp

+          typename DsLayout,
+          typename ELayout,
+          ConvolutionForwardSpecialization ConvSpec>
+using device_grouped_conv_fwd_xdl_dynamic_op_f32_tf32_instances = std::tuple<


We probably don't need the dynamic op instances, since they have not been integrated with MIOpen.

bartekxk · 2025-09-05T11:45:57Z

library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward.hpp

@@ -553,6 +565,12 @@ struct DeviceOperationInstanceFactory<ck::tensor_operation::device::DeviceGroupe
                add_device_grouped_conv3d_fwd_xdl_ngcdhw_gkczyx_ngkdhw_f32_mem_inter_instances(
                    op_ptrs);
            }
+            if constexpr(is_same_v<InDataType, float> && is_same_v<WeiDataType, float> &&


Do we need something like CK_ENABLE_TF32?

The CK API uses separate template parameters, ComputeDataTypeA/B, to distinguish TF32 from FP32 compute, so no incorrect usage can occur.
MIOpen uses MIOPEN_TF32_OVERRIDE (the counterpart of NVIDIA_TF32_OVERRIDE) to disable TF32 mode, in which case MIOpen selects a different CK kernel. That should be enough.
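The argument above, that the compute type alone selects the kernel family at compile time, can be sketched as follows. `tf32_tag`, `collect_instances` and the instance names are hypothetical stand-ins and do not reproduce the CK factory.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical tag standing in for a TF32 compute type.
struct tf32_tag
{
};

// Hypothetical factory: the instance list is chosen purely from the template
// arguments, so an FP32 caller can never receive a TF32 kernel by accident.
template <typename InDataType, typename WeiDataType, typename ComputeDataType = InDataType>
std::vector<std::string> collect_instances()
{
    std::vector<std::string> ops;
    if constexpr(std::is_same_v<InDataType, float> && std::is_same_v<WeiDataType, float>)
    {
        if constexpr(std::is_same_v<ComputeDataType, float>)
            ops.push_back("conv3d_fwd_f32"); // full FP32 compute
        else if constexpr(std::is_same_v<ComputeDataType, tf32_tag>)
            ops.push_back("conv3d_fwd_f32_tf32"); // FP32 storage, TF32 compute
    }
    return ops;
}
```

In this sketch a library such as MIOpen would pick between `collect_instances<float, float>()` and `collect_instances<float, float, tf32_tag>()` based on its own runtime setting, so no extra build-time switch is involved.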

bartekxk · 2025-09-05T11:48:34Z

...uped_conv3d_fwd/xdl/device_grouped_conv3d_fwd_xdl_gndhwc_gkzyxc_gndhwk_f32_tf32_instance.cpp

+namespace ck {
+namespace tensor_operation {
+namespace device {
+namespace instance {


Please don't extend the gndhwc layout, since it is not widely used.

...uped_conv3d_fwd/xdl/device_grouped_conv3d_fwd_xdl_ngcdhw_gkczyx_ngkdhw_f32_tf32_instance.cpp

linqun

Looks good to me.

lymAMD requested review from illsilin, carlushuang, qianfengz, aosewski, poyenc, geyyer, bartekxk, andriy-ca, afagaj, asleepzzz, ThomasNing, coderfeli, shumway, vidyasagar-amd, a team and tenpercent as code owners September 1, 2025 07:21

lymAMD mentioned this pull request Sep 1, 2025

MIOpen:feature:tf32:demonstrate tf32 in conv3d on MI30X platform ROCm/rocm-libraries#1414

lymAMD force-pushed the xf32_0814 branch 4 times, most recently from a2506cc to 1187441 Compare September 4, 2025 08:09

feature:tf32:add initial conv3d fwd kernel support

30f2193

lymAMD force-pushed the xf32_0814 branch from 1187441 to 30f2193 Compare September 5, 2025 01:42

lymAMD added 2 commits September 5, 2025 14:16

remove more GemmDataTypes

8370a17

rename xf32 to tf32

e5492d0

lymAMD force-pushed the xf32_0814 branch from 57c955b to e5492d0 Compare September 5, 2025 09:48

linqun reviewed Sep 5, 2025

bartekxk reviewed Sep 5, 2025

refine according to comments

b55c64b

lymAMD requested a review from bartekxk September 8, 2025 02:00

lymAMD requested a review from linqun September 8, 2025 02:00

linqun reviewed Sep 9, 2025

lymAMD self-assigned this Sep 9, 2025

illsilin and others added 2 commits September 8, 2025 19:19

Merge branch 'develop' into xf32_0814

6a16460

fix clang format

12c4b63
