ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp #15619

ggerganov · 2025-08-27T17:36:48Z

fix #15274, #15517, #15516

Force F32 accumulators for attention and ffn output matrix multiplications.

ggml-ci

jeffbolznv · 2025-08-27T19:47:40Z

How broadly is this enabled? F32 accumulators are half speed of F16 on geforce.

ggerganov · 2025-08-28T05:33:18Z

On master we force F32 accumulators for:

K*Q multiplication in the attention
Output multiplication from the attention (only for GLM4)
Output multiplication (down) from the FFN (only for GLM4)

This PR updates to use F32 accumulators for:

K*Q multiplication in the attention
Output multiplication from the attention (all models)
Output multiplication (down) from the FFN (all models)
Output multiplication (down_exps) from the MoE FFN (all models)

jeffbolznv · 2025-08-28T12:57:42Z

Here's a quick before/after, it is a significant slowdown across a lot of models:

4070 before

ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      4103.29 ± 10.14 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3623.11 ± 99.69 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       2279.02 ± 3.35 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    20453.96 ± 541.00 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    17877.62 ± 634.47 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |      7665.58 ± 38.24 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2530.48 ± 23.09 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           pp512 |      3443.31 ± 57.53 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      3497.16 ± 17.36 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      3440.03 ± 33.63 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |     6527.42 ± 421.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |      4334.88 ± 24.50 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     8148.65 ± 249.08 |

4070 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 20 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3350.81 ± 15.48 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3044.60 ± 10.45 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       1907.57 ± 2.89 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    17519.24 ± 626.24 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    15264.31 ± 486.02 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     6322.62 ± 121.66 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2357.07 ± 19.95 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           pp512 |       2885.22 ± 4.28 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      3276.08 ± 17.98 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      3182.48 ± 24.47 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |     5451.61 ± 125.99 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |      3741.65 ± 44.63 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     6682.32 ± 270.79 |

I'd prefer to apply this only to models that need it. Or, we could try other fixes like clamping to finite values, or scaling/unscaling to avoid infinities.

ggerganov · 2025-08-28T14:37:42Z

I'd prefer to apply this only to models that need it.

The main problem is that we don't really have a way to reliably determine which models/tensors need F32 accumulators. Sometimes the FP range issues occur at large contexts with specific content. I am pretty confident that the existing checks for GLM4 are not enough and there are other models that need F32 accumulators and flew under the radar - it's just nobody has reported that yet, so we don't know.

We can also consider enabling F32 accumulators as proposed and whitelisting F16 acc only when we run enough tests at large context to have some confidence that it does not break?

For now I'll update the PR to just whitelist F32 acc only for GLM and GPT-OSS and we can decide later how to improve this. Just waiting on #15274 (comment) to confirm that they used the correct branch. Otherwise it would mean there are more tensors that need F32 to make GPT-OSS run with Vulkan.

jeffbolznv · 2025-08-28T16:48:18Z

I'd also be curious to know whether we really need F32 precision for these models, or if we're just running into infinities and we might be able to get away with clamping to the max value.

0cc4m · 2025-08-28T17:25:54Z

Maybe it's possible to build a specific test that runs a full-size prompt through the model and checks each mul_mat, mul_mat_id, flash attention, conv2d and whatever else could use a fp16 accumulator for NaNs/infinities. Then tuning each model would be simple.

jeffbolznv · 2025-08-28T19:47:00Z

I was able to reproduce the failure from #15274. I verified that clamping infinites to +/-max fp16 in the mul_mat and mul_mat_id shaders was sufficient to fix it (I didn't do it in FA, but that should be possible too). The quality of the output seems fine (see below). This is something we could enable all the time and it has negligible effect on performance.

llama-cli -m c:\models\gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 32768 --n-gpu-layers 37 --temp 0 -fa -f ..\..\..\tools\main\main.cpp --seed 0

prec_f32:

}<|end|><|start|>assistant<|channel|>analysis<|message|>We need to add a new command line flag `--ignore-eos`. It should be a boolean flag that, when set, disables the check for end-of-generation token (EOS) during generation. The check is currently in the main loop:

`` `
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
    LOG(" [end of text]\n");
    break;
}
`` `

Also there is a check for EOG token earlier:

`` `
if (!waiting_for_first_input && llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
    LOG_DBG("found an EOG token\n");
    if (params.interactive) {
        if (!params.antiprompt.empty()) {
            // tokenize and inject first reverse prompt
            const auto first_antiprompt = common_tokenize(ctx, params.antiprompt.front(), false, true);
            embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
            is_antiprompt = true;
        }

        if (params.enable_chat_template) {
            chat_add_and_format("assistant", assistant_ss.str());
        }
        is_interacting = true;
        LOG("\n");
    }
}
`` `

But the requirement says "disable the check for end-of-generation token (EOS) during generation". Likely they refer to the check that stops generation when EOS token is encountered. That is the first check after generation loop. Also maybe the second check for EOG token triggers interactive behavior; but ignoring EOS likely means we don't stop generation when EOS token appears. So we need to add a flag to params, parse it, and modify the condition to also check that flag is not set.

Where are command line flags parsed? There's a function `common_params_parse` in common.h/cpp. Let's search.<|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` struct.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search in repository.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>We need to add a new boolean field to `common_params` struct, maybe `ignore_eos`. Then modify parsing to accept `--ignore-eos`. Then modify the condition.

Let's locate `common_params` definition.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `struct common_params`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Probably in `common.h`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Let's open common.h.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` in repository.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Probably in `common.h` or `common.cpp`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `struct common_params`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Let's open `common.h`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` again.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Maybe it's defined in `common.h` as `struct common_params`. Let's open that file.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `struct common_params` in the repository.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` variable usage.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `params.` usage. In main, they have `common_params params;`. So struct is defined somewhere.

Search for `common_params` in `common.h`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>
>

f16 with clamping:


}<|end|><|start|>assistant<|channel|>analysis<|message|>We need to add a new command line flag `--ignore-eos`. It should be a boolean flag (no argument). When set, the program should ignore the EOS token during generation. That means that the generation loop should not stop when encountering EOS token. Currently, the code checks for EOS token in two places:

1. In the generation loop, after processing tokens, there is a check:

```cpp
// deal with end of generation tokens in interactive mode
if (!waiting_for_first_input && llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
    LOG_DBG("found an EOG token
");
    if (params.interactive) {
        if (!params.antiprompt.empty()) {
            // tokenize and inject first reverse prompt
            const auto first_antiprompt = common_tokenize(ctx, params.antiprompt.front(), false, true);
            embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
            is_antiprompt = true;
        }

        if (params.enable_chat_template) {
            chat_add_and_format("assistant", assistant_ss.str());
        }
        is_interacting = true;
        LOG("
");
    }
}
`` `

2. Later, after generation loop, there is:

```cpp
// end of generation
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
    LOG(" [end of text]
");
    break;
}
`` `

Also there is a check for `llama_vocab_is_eog(vocab, common_sampler_last(smpl))` earlier for interactive mode.

We need to add a flag `ignore_eos` to `common_params` (or maybe to `params`). Let's search for `common_params` definition.<

jeffbolznv · 2025-08-29T03:45:06Z

I pushed a draft change to clamp values to finite in #15652.

ggerganov · 2025-08-29T08:58:35Z

I'm OK to proceed with the proposed #15652 if you prefer it, although I continue to have some doubts about the usage of F16 accumulators. What I plan to do is to run the AIME25 benchmark and compare the results between the CUDA and Vulkan backends. I recently got a RTX 5090 so I should be able to run the eval with gpt-oss-20b. Probably will have the results sometime next week (it takes a very long time to run this eval multiple times to get enough statistics).

Ideally it would be better to run the AIME25 eval for the large gpt-oss-120b and confirm that the clamped-f16 Vulkan achieves the 93.3% success rate. But this requires much more VRAM than I currently have access to in order to do this in a reasonable amount of time.

ggerganov · 2025-08-31T17:14:32Z

Replaced with #15652

ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp

ef2650a

ggml-ci

ggerganov requested a review from 0cc4m as a code owner August 27, 2025 17:36

ggerganov mentioned this pull request Aug 27, 2025

Eval bug: Jinja fails on gpt-oss-120b when using Vulkan #15274

Open

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Aug 27, 2025

ggerganov requested a review from jeffbolznv August 27, 2025 18:43

jeffbolznv mentioned this pull request Aug 29, 2025

vulkan: clamp matmul and FA results to the max finite value #15652

Merged

ggerganov closed this Aug 31, 2025

ggerganov deleted the gg/mmid-set-prec branch August 31, 2025 17:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp #15619

ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp #15619

ggerganov commented Aug 27, 2025

Uh oh!

jeffbolznv commented Aug 27, 2025

Uh oh!

ggerganov commented Aug 28, 2025

Uh oh!

jeffbolznv commented Aug 28, 2025

Uh oh!

ggerganov commented Aug 28, 2025 •

edited

Loading

Uh oh!

jeffbolznv commented Aug 28, 2025

Uh oh!

0cc4m commented Aug 28, 2025

Uh oh!

jeffbolznv commented Aug 28, 2025

Uh oh!

jeffbolznv commented Aug 29, 2025

Uh oh!

ggerganov commented Aug 29, 2025

Uh oh!

ggerganov commented Aug 31, 2025

Uh oh!

Uh oh!

ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp #15619

ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp #15619

Conversation

ggerganov commented Aug 27, 2025

Uh oh!

jeffbolznv commented Aug 27, 2025

Uh oh!

ggerganov commented Aug 28, 2025

Uh oh!

jeffbolznv commented Aug 28, 2025

Uh oh!

ggerganov commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffbolznv commented Aug 28, 2025

Uh oh!

0cc4m commented Aug 28, 2025

Uh oh!

jeffbolznv commented Aug 28, 2025

Uh oh!

jeffbolznv commented Aug 29, 2025

Uh oh!

ggerganov commented Aug 29, 2025

Uh oh!

ggerganov commented Aug 31, 2025

Uh oh!

Uh oh!

ggerganov commented Aug 28, 2025 •

edited

Loading