vulkan: fuse adds #15252

Merged: 3 commits into ggml-org:master, Aug 16, 2025
Conversation

jeffbolznv (Collaborator):

Fuse adds that have the same shape, which are common in MoE models. It currently fuses up to 6 adds, because we assume no more than 8 descriptors per dispatch, but this limit could be changed.
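For context, a chain of same-shape adds can be found with a simple walk over the compute graph. Below is a minimal sketch of such a check, assuming direct access to ggml's graph fields; `ggml_are_same_shape` and `GGML_OP_ADD` are real ggml identifiers, but the walk itself is illustrative, not the PR's actual code:

```cpp
#include "ggml.h"

// 6 chained adds read 7 distinct inputs and write 1 output, which fills
// the assumed budget of 8 descriptors per dispatch.
static const int MAX_FUSED_ADDS = 6;

// Count how many consecutive adds starting at graph->nodes[start] (assumed
// to be an ADD) could be fused into one dispatch. Illustrative sketch only.
static int count_fusable_adds(const struct ggml_cgraph * graph, int start) {
    int count = 1;
    for (int i = start; i + 1 < graph->n_nodes && count < MAX_FUSED_ADDS; i++) {
        const struct ggml_tensor * cur  = graph->nodes[i];
        const struct ggml_tensor * next = graph->nodes[i + 1];
        if (cur->op != GGML_OP_ADD || next->op != GGML_OP_ADD) {
            break;
        }
        // The next add must consume this add's result directly. A real
        // implementation must also verify that `cur` has no other
        // consumers before eliding its intermediate write.
        if (next->src[0] != cur) {
            break;
        }
        // All operands in the chain must have the same shape.
        if (!ggml_are_same_shape(next->src[1], cur)) {
            break;
        }
        count++;
    }
    return count;
}
```

Fusing the chain saves a kernel launch and a round trip through memory for each intermediate sum, presumably why it helps the MoE token-generation numbers below.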

5090 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 10 --prio 1 -m c:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      7544.08 ± 70.50 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        249.36 ± 1.58 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      3971.06 ± 37.61 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        175.16 ± 0.48 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     6264.83 ± 251.65 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        212.95 ± 0.73 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |     6936.34 ± 183.22 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        233.17 ± 0.66 |

5090 after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |     7530.23 ± 127.67 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        263.83 ± 0.93 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      3999.28 ± 41.87 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        188.21 ± 0.95 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     6327.27 ± 161.58 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        218.28 ± 1.83 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |     6916.37 ± 206.58 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        244.11 ± 0.78 |

4070 before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 512 -r 10 --prio 1 -m c:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      2777.76 ± 10.68 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        181.91 ± 0.37 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      2550.84 ± 25.51 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        120.20 ± 0.20 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |       1983.36 ± 9.40 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        162.10 ± 0.29 |

4070 after:

ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |       2790.31 ± 7.72 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        188.98 ± 0.25 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      2562.82 ± 28.59 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        121.95 ± 0.28 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |       1984.36 ± 9.35 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        166.07 ± 1.04 |

jeffbolznv requested a review from 0cc4m as a code owner, August 11, 2025 21:14
github-actions bot added labels: testing (everything test related), Vulkan (issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning), Aug 11, 2025
0cc4m (Collaborator) left a comment:

Looks good on AMD and Nvidia, but I can't get it to run on Intel.

terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost

I'll investigate further later.

jeffbolznv (Collaborator, Author):

> Looks good on AMD and Nvidia, but I can't get it to run on Intel.

Strange. Any validation failures? Does the backend test fail, or just in real models?

0cc4m (Collaborator) commented Aug 15, 2025:

> > Looks good on AMD and Nvidia, but I can't get it to run on Intel.
>
> Strange. Any validation failures? Does the backend test fail, or just in real models?

Yeah, the test fails too on Intel:

[ADD] NMSE = 18.017009325 > 0.000000100 ADD(type=f32,ne=[16,5,4,3],nr=[1,1,1,1],nf=16): FAIL

Edit: No validation failures. Probably a driver bug.
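
(For reference, the NMSE metric in the backend tests is, roughly, the sum of squared differences normalized by the energy of the reference output. A hedged sketch of that computation, not the exact test-backend-ops code:)

```cpp
#include <cstddef>

// Normalized mean squared error: squared error between the backend output
// and the reference, divided by the reference's total squared magnitude.
static double nmse(const float * out, const float * ref, size_t n) {
    double err = 0.0, ref_energy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err        += d * d;
        ref_energy += (double) ref[i] * (double) ref[i];
    }
    return err / ref_energy;
}
```

An NMSE of 18 against a 1e-7 threshold means the output is structurally wrong rather than off by rounding, which fits a driver bug better than a precision issue.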

jeffbolznv (Collaborator, Author):

> Edit: No validation failures. Probably a driver bug.

Shall I just disable the optimization for Intel?

0cc4m (Collaborator) commented Aug 16, 2025:

> > Edit: No validation failures. Probably a driver bug.
>
> Shall I just disable the optimization for Intel?

Yeah, I don't see why it's failing.
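
(A hedged sketch of what such a vendor gate typically looks like: key the workaround off the physical device's vendor ID, which is 0x8086 for Intel. The function name and placement are illustrative; the PR's actual check may differ:)

```cpp
#include <vulkan/vulkan.hpp>

// Illustrative gate for the fused-add path. 0x8086 is Intel's PCI vendor
// ID as reported in VkPhysicalDeviceProperties::vendorID.
static bool device_supports_fused_adds(vk::PhysicalDevice dev) {
    const vk::PhysicalDeviceProperties props = dev.getProperties();
    if (props.vendorID == 0x8086) {
        // Known-bad on current Intel drivers: fused adds produce wrong
        // results (NMSE ~18) with no validation errors, so fall back to
        // unfused adds here.
        return false;
    }
    return true;
}
```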

jeffbolznv merged commit 1fe0029 into ggml-org:master on Aug 16, 2025. 50 of 51 checks passed.