vulkan: optimizations for deepseek prompt processing #14555

jeffbolznv · 2025-07-06T22:36:19Z

Some optimizations for mul_mat_id, and flash attention with large head size. See commit messages for more detail.

before:

coopmat2:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |      4061.98 ± 17.67 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        210.15 ± 0.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2392.49 ± 16.93 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        116.90 ± 0.26 |

coopmat1:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |       1013.83 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        203.44 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        159.70 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.71 ± 0.00 |

scalar:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |        868.16 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        202.75 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        161.14 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.54 ± 0.00 |

after:

coopmat2:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |      6419.30 ± 17.34 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        209.36 ± 0.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3606.39 ± 10.05 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        116.51 ± 0.28 |

coopmat1:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |      5412.19 ± 58.07 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        202.60 ± 0.77 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1630.19 ± 3.38 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.86 ± 0.46 |

scalar:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |       1833.69 ± 7.71 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        204.14 ± 0.45 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1034.34 ± 1.45 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.88 ± 0.27 |
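For a quick read of the tables above, here is a minimal sketch that turns the pp512 figures into before/after ratios. The numbers are copied from the llama-bench output in this description; the dictionary layout and the `speedups` helper are only illustrative. Note that the "before" coopmat1 and scalar runs used `-r 1`, so those baselines are single samples.

```python
# pp512 throughput (t/s) copied from the llama-bench tables above.
# Keys "d512" / "d8192" are the -d depth values; the layout is illustrative only.
BEFORE = {
    "coopmat2": {"d512": 4061.98, "d8192": 2392.49},
    "coopmat1": {"d512": 1013.83, "d8192": 159.70},
    "scalar":   {"d512": 868.16,  "d8192": 161.14},
}
AFTER = {
    "coopmat2": {"d512": 6419.30, "d8192": 3606.39},
    "coopmat1": {"d512": 5412.19, "d8192": 1630.19},
    "scalar":   {"d512": 1833.69, "d8192": 1034.34},
}


def speedups(before, after):
    """Return after/before throughput ratios for every path and depth."""
    return {
        path: {depth: after[path][depth] / before[path][depth] for depth in depths}
        for path, depths in before.items()
    }


ratios = speedups(BEFORE, AFTER)
for path, depths in ratios.items():
    for depth, ratio in depths.items():
        print(f"{path:9s} pp512 @ {depth:5s}: {ratio:.2f}x")
```

The tg128 rows are left out because they are essentially unchanged between the two sets of runs.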

0cc4m

LGTM

ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp

jeffbolznv added 4 commits July 6, 2025 16:57

vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

129a0f1

vulkan: increase coopmat2 mul_mat_id tile size

b54ddba

vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

2b54086

vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)

bd8e0bf
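The commit titles above don't show the shader code, so as a rough illustration of what a "row_ids search" in mul_mat_id involves, here is a CPU-side sketch. It assumes `ids[token][slot]` holds the expert index chosen for each token and slot, and that the search collects the (token, slot) pairs routed to one expert. The function name, the `batch` parameter, and the chunked scan are hypothetical stand-ins for the idea of reading several ids per load; this is not the actual GLSL in `mul_mm.comp`.

```python
def row_ids_for_expert(ids, expert, batch=4):
    """Collect (token, slot) pairs whose routed expert equals `expert`.

    `ids[token][slot]` is the expert index for that token and slot.
    The scan walks the flattened ids in chunks of `batch` entries to mimic
    reading several ids per load. This is a hypothetical illustration,
    not the shader's real layout or loop structure.
    """
    n_slots = len(ids[0])
    flat = [e for row in ids for e in row]
    matches = []
    for start in range(0, len(flat), batch):
        chunk = flat[start:start + batch]  # one "batched load" of ids
        for offset, e in enumerate(chunk):
            if e == expert:
                idx = start + offset
                matches.append((idx // n_slots, idx % n_slots))
    return matches


# Three tokens, two expert slots each.
ids = [[0, 2], [1, 0], [2, 2]]
print(row_ids_for_expert(ids, 2))
```

Each expert's matrix multiply then only needs the rows returned for that expert, which is why the cost of this search matters for prompt processing on MoE models.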

jeffbolznv requested a review from 0cc4m July 6, 2025 22:36

github-actions bot added the labels "Vulkan" (Issues specific to the Vulkan backend) and "ggml" (changes relating to the ggml tensor library for machine learning) Jul 6, 2025

0cc4m approved these changes Jul 12, 2025

0cc4m merged commit 98197e5 into ggml-org:master Jul 12, 2025
47 of 48 checks passed
