@chilo-ms chilo-ms commented Sep 9, 2025

Description

The package size limits for PyPI and NuGet are:

  • Python package size under 300 MB
  • NuGet package size under 250 MB

To meet these limits, this PR first removes support for some older GPU architectures from CMAKE_CUDA_ARCHITECTURES.
Second, it removes FPA_INTB_GEMM support from the Linux Python wheel.

Python wheel

| OS | CMAKE_CUDA_ARCHITECTURES | CUDA kernel removal | Package size | Under 300 MB |
|---|---|---|---|---|
| Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual | | 341 MB | No (original) |
| Linux | 70-real;75-real;80-real;86-real;90a-real;90a-virtual | | 329 MB | No |
| Linux | 75-real;80-real;86-real;90a-real;90a-virtual | | 319 MB | No |
| Linux | 80-real;86-real;90a-real;90a-virtual | | 304 MB | No |
| Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual | FPA_INTB_GEMM | 287 MB | Yes |
| Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual | | 272 MB | Yes (original) |
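
The Linux rows imply a rough per-architecture cost. The following is a minimal Python sketch of that arithmetic, using only the sizes reported in the table above; the variable and function names are illustrative and are not part of the build scripts.

```python
# Sizes (MB) of the Linux Python wheel, copied from the table above.
# Each entry: (architecture dropped relative to the previous row, wheel size).
LIMIT_MB = 300

linux_wheel_sizes = [
    (None, 341),        # original: 60;70;75;80;86;90a
    ("60-real", 329),
    ("70-real", 319),
    ("75-real", 304),
]

# Original architecture list with FPA_INTB_GEMM removed (from the table).
SIZE_WITHOUT_FPA_INTB_GEMM = 287


def marginal_savings(rows):
    """Return {dropped_arch: MB saved} for each consecutive pair of rows."""
    result = {}
    for (_, prev_size), (arch, size) in zip(rows, rows[1:]):
        result[arch] = prev_size - size
    return result


savings = marginal_savings(linux_wheel_sizes)
for arch, saved in savings.items():
    print(f"dropping {arch}: saves {saved} MB")

smallest = linux_wheel_sizes[-1][1]
print(f"arch removal only: {smallest} MB, under limit: {smallest < LIMIT_MB}")

fpa_saving = linux_wheel_sizes[0][1] - SIZE_WITHOUT_FPA_INTB_GEMM
print(f"removing FPA_INTB_GEMM: saves {fpa_saving} MB, "
      f"under limit: {SIZE_WITHOUT_FPA_INTB_GEMM < LIMIT_MB}")
```

On these numbers, dropping all three older architectures recovers less space than removing FPA_INTB_GEMM alone, which is consistent with the approach the PR ends up taking for the Linux wheel.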

Nuget

| OS | CMAKE_CUDA_ARCHITECTURES | CUDA kernel removal | Package size | Under 250 MB |
|---|---|---|---|---|
| Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual | | 276 MB | No (original) |
| Linux | 75-real;80-real;90a-real;90a-virtual | | 253 MB | No |
| Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual | FPA_INTB_GEMM | 230 MB | Yes |
| Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual | | 264 MB | No (original) |
| Windows | 61-real;75-real;86-real;89-real;90a-virtual | | 254 MB | No |
| Windows | 75-real;86-real;89-real;90a-virtual | | 242 MB | Yes |

Motivation and Context

@snnn snnn requested a review from tianleiwu September 10, 2025 15:50

chilo-ms commented Sep 11, 2025

The Linux Python wheel is 304 MB with the "80-real;86-real;90a-real;90a-virtual" CUDA architectures enabled, which is still slightly over the 300 MB limit.
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=936299&view=artifacts&pathAsName=false&type=publishedArtifacts

SM80 (A100), SM86 (A10), and SM90 (H100) seem to be the main GPUs our customers use, so we can't remove them from the ORT support list.
Another option is to trade some performance for size: remove 86-real or 90-real (SASS) and add the corresponding virtual target (PTX) for compatibility.
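
To make the SASS/PTX trade-off concrete, here is a small Python sketch that classifies the entries of a CMAKE_CUDA_ARCHITECTURES value. It relies on CMake's documented suffix rules (`NN-real` emits SASS only, `NN-virtual` emits PTX only, a bare `NN` emits both). The helper names are hypothetical, and swapping 86-real for 86-virtual is only one possible reading of the option described above.

```python
def parse_cuda_architectures(value):
    """Split a CMAKE_CUDA_ARCHITECTURES string into (sass_archs, ptx_archs)."""
    sass, ptx = [], []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        arch, _, kind = entry.partition("-")
        if kind not in ("", "real", "virtual"):
            raise ValueError(f"unrecognized entry: {entry!r}")
        if kind in ("", "real"):      # device code (SASS) is generated
            sass.append(arch)
        if kind in ("", "virtual"):   # PTX is embedded for JIT compilation
            ptx.append(arch)
    return sass, ptx


def swap_real_for_virtual(value, arch):
    """Replace '<arch>-real' with '<arch>-virtual', dropping duplicates."""
    out = []
    for entry in value.split(";"):
        if entry == f"{arch}-real":
            entry = f"{arch}-virtual"
        if entry not in out:
            out.append(entry)
    return ";".join(out)


current = "80-real;86-real;90a-real;90a-virtual"
sass, ptx = parse_cuda_architectures(current)
print("SASS for:", sass, "PTX for:", ptx)

candidate = swap_real_for_virtual(current, "86")
print("candidate:", candidate)
print("parsed:", parse_cuda_architectures(candidate))
```

With the candidate list, an SM86 device would have no prebuilt SASS and the driver would have to JIT-compile PTX when the library is loaded. Whether the resulting package is actually smaller would need to be measured in a build.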

chilo-ms commented Sep 15, 2025

We discussed offline that we might also want to trim some of the heaviest CUDA kernels, e.g. beam_search_topk,
by keeping only the branch for k <= 8:

  if (k <= 8) {
    TopKLauncher(8)
  } else {
    ORT_THROW("K>8 is not supported for beam search");
  }
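
The idea behind this proposal can be sketched in Python as a bucket dispatch: each bucket stands for a separately compiled kernel variant, so keeping fewer buckets means fewer variants in the binary. This is an illustration only. The bucket sizes in `FULL_BUCKETS` are hypothetical (the real set is not listed in this discussion), and `select_bucket` is not an ONNX Runtime function.

```python
# Hypothetical bucket sizes, for illustration only.
FULL_BUCKETS = (4, 8, 16, 32, 64)
# The proposal above keeps a single branch.
REDUCED_BUCKETS = (8,)


def select_bucket(k, buckets):
    """Return the smallest bucket that can serve a top-k request of size k."""
    for max_k in sorted(buckets):
        if k <= max_k:
            return max_k
    raise ValueError(f"K>{max(buckets)} is not supported for beam search")


print(select_bucket(3, REDUCED_BUCKETS))   # served by the k <= 8 variant
print(select_bucket(3, FULL_BUCKETS))      # served by a smaller variant
print(len(FULL_BUCKETS), "variants vs", len(REDUCED_BUCKETS))
```

The cost of the reduced set is visible in the sketch: small requests are rounded up to the single remaining variant, and any request with k above 8 fails instead of being served.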

For Linux, the CUDA EP library is now 448 MB and the wheel is ~302 MB.

Update: We don't need to modify beam_search_topk, since removing FPA_INTB_GEMM gives us the space back.
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=942363&view=artifacts&pathAsName=false&type=publishedArtifacts

@chilo-ms chilo-ms changed the title from "Remove old CUDA arch in CMAKE_CUDA_ARCHITECTURES to reduce package size" to "Reduce Python and Nuget GPU package size" Sep 16, 2025
tianleiwu previously approved these changes Sep 17, 2025
@chilo-ms chilo-ms merged commit fd35afb into main Sep 18, 2025
105 of 115 checks passed
@chilo-ms chilo-ms deleted the chi/remove_cuda_arch branch September 18, 2025 21:01
snnn pushed a commit that referenced this pull request Sep 18, 2025
snnn added a commit that referenced this pull request Sep 19, 2025
Reduce Python and Nuget GPU package size (#26002)
[CUDA] Add build flag onnxruntime_USE_FPA_INTB_GEMM (#25802)

snnn commented Sep 19, 2025

This PR has been cherry-picked into the rel-1.23.0 branch in PR #26087. Removing the release:1.23.0 label.
