[ARM CPU] SVE support for Sgemm kernel #26027
base: main
Conversation
@microsoft-github-policy-service agree company="Fujitsu Research of India Private Ltd"
Force-pushed from 7ebfed8 to ff2a6a4.
CC: @edgchen1, @hariharans29
Code rebase
Force-pushed from acf16b6 to 2493ad2.
Commits: code rebase, Code update, code cleanup, code cleanup, cleanup, Cleanup, Cleanup.
Force-pushed from 8e968e8 to c508e83.
Force-pushed from 6ee2ffc to 5dd2161.
Please address the build failures, run onnxruntime_mlas_test in your env with your change, and report any failures here.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
cmake/CMakeLists.txt (Outdated)

    if(TARGET onnxruntime_providers)
      include("${CMAKE_CURRENT_SOURCE_DIR}/onnxruntime_providers_pch.cmake")
    endif()
    endif()
Nit: Stale change
onnxruntime/core/mlas/lib/mlasi.h (Outdated)

    bool HasArmNeon_BF16() const { return has_arm_neon_bf16_; }
Nit: Stale change
onnxruntime/core/mlas/lib/sgemm.cpp (Outdated)

    SVE_LOAD_STORE(D, b);
    D += 16;
Is the mixed usage of the macro MLAS_SGEMM_STRIDEN_THREAD_ALIGN and the hard-coded value of 16 intentional? Can we just go back to the earlier usage of 16 in the while condition?
We can go back to the usage of 16 in the while condition. Will update the same.
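For clarity, a small sketch of what that revert would look like, based on the diff context above (SVE_LOAD_STORE is the wrapper this PR introduces; row-stride handling is elided):

```cpp
// Sketch only: use the literal 16 consistently in both the loop condition
// and the pointer arithmetic, instead of mixing it with
// MLAS_SGEMM_STRIDEN_THREAD_ALIGN (which happens to have the same value).
while (CountX >= 16) {
    SVE_LOAD_STORE(D, b);  // SVE copy wrapper introduced by this PR
    D += 16;
    b += 16;
    CountX -= 16;
}
```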
onnxruntime/core/mlas/lib/sgemm.cpp (Outdated)

    #if defined(MLAS_USE_SVE) || defined(MLAS_NEON_INTRINSICS)
    if (MLAS_CPUIDINFO::GetCPUIDInfo().HasArmSve())
    {
    #if defined(MLAS_USE_SVE)
I feel there is something wrong in the code/design here: what happens when MLAS_USE_SVE is false, MLAS_NEON_INTRINSICS is true, and HasArmSve() evaluates to true?
HasArmSve() was added as a runtime check for SVE; I will update the code to handle the above condition properly.
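One possible shape for that fix (a sketch, not this PR's final code): only consult HasArmSve() when the SVE path was actually compiled in, and fall through to the existing NEON store otherwise. The identifiers below are taken from the diff context above.

```cpp
#if defined(MLAS_USE_SVE)
    // The runtime check is only meaningful when the SVE path was compiled in;
    // this avoids taking the SVE branch in a NEON-only build even when the
    // CPU reports SVE support.
    if (MLAS_CPUIDINFO::GetCPUIDInfo().HasArmSve()) {
        SVE_LOAD_STORE(D, b);
    } else
#endif
    {
        // NEON fallback: the pre-existing interleaved 16-float load/store.
        vst4q_f32(D, vld4q_f32(b));
    }
```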
onnxruntime/core/mlas/lib/sgemm.cpp (Outdated)

    vst4q_f32(D, vld4q_f32(b));
    while (CountX >= MLAS_SGEMM_STRIDEN_THREAD_ALIGN) {
    const float* b = B;
Isn't there a lot of overlap between the code here? Can the duplication be avoided and the code be rewritten so that it differs only in the store instruction?
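One hypothetical way to do that: factor the shared loop into a small template parameterized on the copy primitive, so the SVE and NEON variants differ only in that one call. PackLoop and CopyPanel are illustrative names, not part of this PR, and the real packing code also has per-row stride handling that is elided here.

```cpp
#include <cstddef>

// Shared packing loop; only the 16-float copy primitive varies per target.
template <typename CopyPanelFn>
void PackLoop(float* D, const float* B, size_t CountX, CopyPanelFn CopyPanel) {
    const float* b = B;
    while (CountX >= 16) {
        CopyPanel(D, b);  // the only step that differs between variants
        D += 16;
        b += 16;
        CountX -= 16;
    }
}

// NEON variant: PackLoop(D, B, CountX,
//     [](float* d, const float* s) { vst4q_f32(d, vld4q_f32(s)); });
// SVE variant:  PackLoop(D, B, CountX,
//     [](float* d, const float* s) { SVE_LOAD_STORE(d, s); });
```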
    const float* b = B;

    #if defined(MLAS_USE_SVE) || defined(MLAS_NEON_INTRINSICS)
    if (MLAS_CPUIDINFO::GetCPUIDInfo().HasArmSve()) {
Same comment as before: what happens when MLAS_USE_SVE is false, MLAS_NEON_INTRINSICS is true, and HasArmSve() evaluates to true?
    void MLAS_SVE_TARGET MLASCALL
    SCATTER_STORE(float* d, const float* b);
    void MLAS_SVE_TARGET
All the files have formatting issues. Can you please address them?
Have you tried this on multiple GEMM problem shapes? Please submit comprehensive micro-benchmarks for all SVE platforms. I tried taking this change on a Graviton4, and for a Conv-heavy model (which uses Im2Col + SGemm for the Conv implementation), it slows down the model very much when I build with
Yes, I have tried different shapes, and for large shapes tuning is required. This PR is intended as a patch to enable SVE for the GEMM kernel; for larger shapes this GEMM implementation has to be tuned, and we are currently working on that.
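For anyone reproducing the shape-dependent behavior, a minimal timing sketch along these lines can stand in until proper micro-benchmarks are submitted. It assumes the flat float MlasGemm convenience overload declared in mlas.h (if your tree only exposes the MLAS_SGEMM_DATA_PARAMS form, adapt the call accordingly); the shapes below are illustrative, not the ones from the Graviton4 experiment.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>
#include "mlas.h"  // adjust the include path to your build layout

// Times one SGEMM shape and returns seconds per call (single-threaded:
// a null MLAS_THREADPOOL* runs the kernel on the calling thread).
static double TimeSgemm(size_t M, size_t N, size_t K, int iters) {
    std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);
    // Warm-up so page faults and clock ramp-up are excluded from timing.
    MlasGemm(CblasNoTrans, CblasNoTrans, M, N, K, 1.0f,
             A.data(), K, B.data(), N, 0.0f, C.data(), N, nullptr);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        MlasGemm(CblasNoTrans, CblasNoTrans, M, N, K, 1.0f,
                 A.data(), K, B.data(), N, 0.0f, C.data(), N, nullptr);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / iters;
}

int main() {
    // Mix of Im2Col-style (tall/skinny) and square shapes; extend as needed.
    const size_t shapes[][3] = {
        {64, 3136, 576}, {256, 784, 1152}, {512, 512, 512}, {2048, 2048, 2048}};
    for (const auto& s : shapes) {
        double sec = TimeSgemm(s[0], s[1], s[2], 10);
        double gflops = 2.0 * s[0] * s[1] * s[2] / sec / 1e9;
        std::printf("M=%zu N=%zu K=%zu: %.3f ms, %.1f GFLOP/s\n",
                    s[0], s[1], s[2], sec * 1e3, gflops);
    }
    return 0;
}
```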
Hi @akote123: any plan to resume this work? Or at the very least, can you please guide on what remains to be done for perf tuning?
@hariharans29, thank you. We will resume this work. We planned this PR as the SVE SGEMM enablement, with performance tuning to follow as a future contribution.
Force-pushed from 903f766 to 7f3180f.
Force-pushed from 7f3180f to 2e2c6db.
This PR ports the SGEMM kernel and associated packing kernels to the ARM SVE (Scalable Vector Extension) backend. Specifically, it introduces new wrapper implementations for SVE functions in lib/sve/sgemm_sve.cpp and integrates these wrappers within the existing kernel implementations.
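As a rough illustration of the approach (a sketch, not the PR's actual sgemm_sve.cpp), an SVE copy wrapper is vector-length agnostic, so the same binary adapts to SVE-128, SVE-256, and SVE-512 hardware; SveCopyRow is a hypothetical name:

```cpp
#include <arm_sve.h>  // requires building with SVE enabled, e.g. -march=armv8-a+sve
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper in the spirit of lib/sve/sgemm_sve.cpp: copy `count`
// floats using whatever vector length the hardware implements.
void SveCopyRow(float* d, const float* s, size_t count) {
    // svcntw() = number of 32-bit lanes per vector (8 on SVE-256, 16 on SVE-512).
    for (size_t i = 0; i < count; i += svcntw()) {
        // The predicate masks off the tail when count is not a multiple of the VL.
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)count);
        svst1_f32(pg, d + i, svld1_f32(pg, s + i));
    }
}
```

Because the loop is fully predicated, it needs no scalar tail and makes no assumption about the vector length, which is what distinguishes the SVE wrappers from the fixed 16-wide NEON code paths they stand in for.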
Motivation and Context
This work is part of an ongoing effort to enhance ONNX Runtime's performance and architecture-awareness on ARM platforms. By leveraging ARM SVE, we aim to unlock better computational efficiency and scalability on modern ARM hardware.
This PR builds upon and extends the SVE work introduced in PR #25238.
Performance Analysis:

Results were captured on machines supporting SVE-128, SVE-256, and SVE-512.
This PR is a joint contribution by: