ggerganov
Member

fix #15274, #15517, #15516

Force F32 accumulators for attention and ffn output matrix multiplications.

@ggerganov ggerganov requested a review from 0cc4m as a code owner August 27, 2025 17:36
@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Aug 27, 2025
@ggerganov ggerganov requested a review from jeffbolznv August 27, 2025 18:43
@jeffbolznv
Collaborator

How broadly is this enabled? F32 accumulators run at half the speed of F16 on GeForce.

@ggerganov
Member Author

On master we force F32 accumulators for:

  • K*Q multiplication in the attention
  • Output multiplication from the attention (only for GLM4)
  • Output multiplication (down) from the FFN (only for GLM4)

This PR updates to use F32 accumulators for:

  • K*Q multiplication in the attention
  • Output multiplication from the attention (all models)
  • Output multiplication (down) from the FFN (all models)
  • Output multiplication (down_exps) from the MoE FFN (all models)
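
To illustrate why the accumulator type matters for these multiplications, here is a minimal sketch. It is not ggml code: it assumes a simplified model of an F16 accumulator that only captures the range limit (the largest finite fp16 value is 65504) and ignores mantissa rounding. The names `fp16_range_only` and `dot` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest finite IEEE 754 binary16 (fp16) value.
constexpr float FP16_MAX = 65504.0f;

// Simplified fp16 storage model (assumption): values beyond the finite
// range become +/-infinity. Mantissa rounding is deliberately ignored.
inline float fp16_range_only(float x) {
    if (std::isnan(x)) return x;
    if (std::fabs(x) > FP16_MAX) {
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    }
    return x;
}

// Dot product whose running sum is kept either in F32 or in the
// range-limited fp16 model above.
inline float dot(const std::vector<float>& a, const std::vector<float>& b, bool f16_acc) {
    assert(a.size() == b.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += a[i] * b[i];
        if (f16_acc) acc = fp16_range_only(acc);
    }
    return acc;
}
```

With 1024 elements of value 16 on each side, the F32 accumulator yields 262144, while the range-limited accumulator overflows to infinity partway through the sum.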

@jeffbolznv
Collaborator

Here's a quick before/after; it is a significant slowdown across a lot of models:

4070 before

ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      4103.29 ± 10.14 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3623.11 ± 99.69 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       2279.02 ± 3.35 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    20453.96 ± 541.00 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    17877.62 ± 634.47 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |      7665.58 ± 38.24 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2530.48 ± 23.09 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           pp512 |      3443.31 ± 57.53 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      3497.16 ± 17.36 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      3440.03 ± 33.63 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |     6527.42 ± 421.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |      4334.88 ± 24.50 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     8148.65 ± 249.08 |

4070 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 20 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3350.81 ± 15.48 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3044.60 ± 10.45 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           pp512 |       1907.57 ± 2.89 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    17519.24 ± 626.24 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           pp512 |    15264.31 ± 486.02 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     6322.62 ± 121.66 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2357.07 ± 19.95 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           pp512 |       2885.22 ± 4.28 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      3276.08 ± 17.98 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      3182.48 ± 24.47 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |     5451.61 ± 125.99 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |      3741.65 ± 44.63 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           pp512 |     6682.32 ± 270.79 |
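
The size of the slowdown can be read off the two tables with a little arithmetic. The helper below is a minimal sketch (the name `slowdown_percent` is hypothetical); the inputs are the pp512 t/s means copied from the tables above.

```cpp
#include <cassert>

// Relative slowdown in percent when throughput drops from `before`
// to `after` (both in tokens per second).
inline double slowdown_percent(double before, double after) {
    assert(before > 0.0);
    return (1.0 - after / before) * 100.0;
}
```

For example, `slowdown_percent(4103.29, 3350.81)` (llama 8B Q4_K - Medium) is roughly 18%, while `slowdown_percent(2530.48, 2357.07)` (qwen3moe 30B.A3B Q2_K - Medium) is roughly 7%.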

I'd prefer to apply this only to models that need it. Or, we could try other fixes like clamping to finite values, or scaling/unscaling to avoid infinities.
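
The scaling/unscaling idea can be sketched as follows. This is an illustration under stated assumptions, not the Vulkan shader code: the accumulator is modeled only by its fp16 range limit (65504), mantissa rounding is ignored, and `fp16_range_only` and `scaled_dot` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

constexpr float FP16_MAX = 65504.0f;  // largest finite fp16 value

// Range-only fp16 model (assumption): out-of-range values become +/-inf.
inline float fp16_range_only(float x) {
    if (std::isnan(x)) return x;
    if (std::fabs(x) > FP16_MAX) {
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    }
    return x;
}

// Dot product with a range-limited accumulator. One operand is divided by
// `scale` before accumulation so partial sums stay in range; the final
// result is multiplied back by `scale` in F32.
inline float scaled_dot(const std::vector<float>& a, const std::vector<float>& b, float scale) {
    assert(a.size() == b.size());
    assert(scale > 0.0f);
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc = fp16_range_only(acc + (a[i] / scale) * b[i]);
    }
    return acc * scale;
}
```

With 1024 elements of value 16 on each side, `scale = 1` overflows to infinity, whereas `scale = 16` keeps every partial sum at or below 16384 and recovers 262144 after unscaling. In real fp16, a larger scale also costs precision for small values, which this range-only model does not show.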

@ggerganov
Member Author

ggerganov commented Aug 28, 2025

> I'd prefer to apply this only to models that need it.

The main problem is that we don't have a reliable way to determine which models/tensors need F32 accumulators. Sometimes the FP range issues occur only at large contexts with specific content. I am pretty confident that the existing checks for GLM4 are not enough and that other models also need F32 accumulators but have flown under the radar: nobody has reported them yet, so we don't know.

We could also consider enabling F32 accumulators as proposed and whitelisting F16 accumulators only once we have run enough tests at large context to be reasonably confident that they do not break.

For now I'll update the PR to just whitelist F32 acc only for GLM and GPT-OSS and we can decide later how to improve this. Just waiting on #15274 (comment) to confirm that they used the correct branch. Otherwise it would mean there are more tensors that need F32 to make GPT-OSS run with Vulkan.
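
A per-model whitelist of this kind could look roughly like the sketch below. Everything in it is hypothetical: the `AccPrec` enum, the `accumulator_precision` function, and the architecture strings `"glm4"` and `"gpt-oss"` are illustrative stand-ins, not the identifiers used in llama.cpp.

```cpp
#include <array>
#include <string_view>

enum class AccPrec { F16, F32 };

// Hypothetical lookup: architectures on the list get F32 accumulators,
// everything else keeps the faster F16 accumulators.
constexpr AccPrec accumulator_precision(std::string_view arch) {
    constexpr std::array<std::string_view, 2> needs_f32 = {"glm4", "gpt-oss"};
    for (std::string_view name : needs_f32) {
        if (arch == name) return AccPrec::F32;
    }
    return AccPrec::F16;
}
```

The weakness discussed above applies directly: any architecture missing from the list silently falls back to F16, so the list is only as good as the failure reports behind it.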

@jeffbolznv
Collaborator

I'd also be curious to know whether we really need F32 precision for these models, or if we're just running into infinities and we might be able to get away with clamping to the max value.

@0cc4m
Collaborator

0cc4m commented Aug 28, 2025

Maybe it's possible to build a specific test that runs a full-size prompt through the model and checks the output of each mul_mat, mul_mat_id, flash attention, conv2d and whatever else could use an fp16 accumulator for NaNs/infinities. Then tuning each model would be simple.
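
The per-operation check in such a test could be as simple as scanning each output buffer for non-finite values. The sketch below assumes the op output has already been copied into a host-side `float` buffer; `NonFiniteReport` and `count_non_finite` are hypothetical names, not part of ggml.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Counts of problematic values found in one op output.
struct NonFiniteReport {
    std::size_t nan_count = 0;
    std::size_t inf_count = 0;
    bool ok() const { return nan_count == 0 && inf_count == 0; }
};

// Scan `n` floats and count NaNs and infinities.
inline NonFiniteReport count_non_finite(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    NonFiniteReport report;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i])) {
            ++report.nan_count;
        } else if (std::isinf(data[i])) {
            ++report.inf_count;
        }
    }
    return report;
}
```

A test harness would call this after every candidate op and record which ops first produce a non-finite value for a given model and prompt.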

@jeffbolznv
Collaborator

I was able to reproduce the failure from #15274. I verified that clamping infinities to +/-max fp16 in the mul_mat and mul_mat_id shaders was sufficient to fix it (I didn't do it in FA, but that should be possible too). The quality of the output seems fine (see below). This is something we could enable all the time and it has negligible effect on performance.
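
The clamping step described here can be sketched on the CPU as below. This is an assumption-laden illustration, not the shader change itself: `clamp_inf_to_fp16_max` and `clamp_buffer` are hypothetical names, and the choice to leave NaN untouched is this sketch's own, since the comment only mentions infinities.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr float FP16_MAX = 65504.0f;  // largest finite fp16 value

// Replace +/-infinity with +/-FP16_MAX. Finite values and NaN pass
// through unchanged.
inline float clamp_inf_to_fp16_max(float x) {
    if (std::isinf(x)) return std::copysign(FP16_MAX, x);
    return x;
}

// Apply the clamp in place to a buffer of `n` results.
inline void clamp_buffer(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = clamp_inf_to_fp16_max(data[i]);
    }
}
```

Clamping keeps later operations from turning an infinity into NaN (for example through `inf - inf`), at the cost of replacing the true out-of-range value with the largest representable one.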

llama-cli -m c:\models\gpt-oss-120b-mxfp4-00001-of-00003.gguf --ctx-size 32768 --n-gpu-layers 37 --temp 0 -fa -f ..\..\..\tools\main\main.cpp --seed 0

prec_f32:

}<|end|><|start|>assistant<|channel|>analysis<|message|>We need to add a new command line flag `--ignore-eos`. It should be a boolean flag that, when set, disables the check for end-of-generation token (EOS) during generation. The check is currently in the main loop:

```
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
    LOG(" [end of text]\n");
    break;
}
```

Also there is a check for EOG token earlier:

```
if (!waiting_for_first_input && llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
    LOG_DBG("found an EOG token\n");
    if (params.interactive) {
        if (!params.antiprompt.empty()) {
            // tokenize and inject first reverse prompt
            const auto first_antiprompt = common_tokenize(ctx, params.antiprompt.front(), false, true);
            embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
            is_antiprompt = true;
        }

        if (params.enable_chat_template) {
            chat_add_and_format("assistant", assistant_ss.str());
        }
        is_interacting = true;
        LOG("\n");
    }
}
```

But the requirement says "disable the check for end-of-generation token (EOS) during generation". Likely they refer to the check that stops generation when EOS token is encountered. That is the first check after generation loop. Also maybe the second check for EOG token triggers interactive behavior; but ignoring EOS likely means we don't stop generation when EOS token appears. So we need to add a flag to params, parse it, and modify the condition to also check that flag is not set.

Where are command line flags parsed? There's a function `common_params_parse` in common.h/cpp. Let's search.<|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` struct.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search in repository.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>We need to add a new boolean field to `common_params` struct, maybe `ignore_eos`. Then modify parsing to accept `--ignore-eos`. Then modify the condition.

Let's locate `common_params` definition.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `struct common_params`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Probably in `common.h`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Let's open common.h.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` in repository.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Probably in `common.h` or `common.cpp`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `struct common_params`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Let's open 
`common.h`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` again.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Maybe it's defined in `common.h` as `struct common_params`. Let's open that file.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `struct common_params` in the repository.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `common_params` variable usage.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|>Search for `params.` usage. In main, they have `common_params params;`. So struct is defined somewhere.

Search for `common_params` in `common.h`.<|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>
>

f16 with clamping:


}<|end|><|start|>assistant<|channel|>analysis<|message|>We need to add a new command line flag `--ignore-eos`. It should be a boolean flag (no argument). When set, the program should ignore the EOS token during generation. That means that the generation loop should not stop when encountering EOS token. Currently, the code checks for EOS token in two places:

1. In the generation loop, after processing tokens, there is a check:

```cpp
// deal with end of generation tokens in interactive mode
if (!waiting_for_first_input && llama_vocab_is_eog(vocab, common_sampler_last(smpl))) {
    LOG_DBG("found an EOG token\n");
    if (params.interactive) {
        if (!params.antiprompt.empty()) {
            // tokenize and inject first reverse prompt
            const auto first_antiprompt = common_tokenize(ctx, params.antiprompt.front(), false, true);
            embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
            is_antiprompt = true;
        }

        if (params.enable_chat_template) {
            chat_add_and_format("assistant", assistant_ss.str());
        }
        is_interacting = true;
        LOG("\n");
    }
}
```

2. Later, after generation loop, there is:

```cpp
// end of generation
if (!embd.empty() && llama_vocab_is_eog(vocab, embd.back()) && !(params.interactive)) {
    LOG(" [end of text]\n");
    break;
}
```

Also there is a check for `llama_vocab_is_eog(vocab, common_sampler_last(smpl))` earlier for interactive mode.

We need to add a flag `ignore_eos` to `common_params` (or maybe to `params`). Let's search for `common_params` definition.<

@jeffbolznv
Collaborator

I pushed a draft change to clamp values to finite in #15652.

@ggerganov
Member Author

I'm OK with proceeding with the proposed #15652 if you prefer it, although I continue to have some doubts about the use of F16 accumulators. What I plan to do is run the AIME25 benchmark and compare the results between the CUDA and Vulkan backends. I recently got an RTX 5090, so I should be able to run the eval with gpt-oss-20b. I will probably have the results sometime next week (it takes a very long time to run this eval enough times to get meaningful statistics).

Ideally it would be better to run the AIME25 eval for the large gpt-oss-120b and confirm that the clamped-f16 Vulkan achieves the 93.3% success rate. But this requires much more VRAM than I currently have access to in order to do this in a reasonable amount of time.

@ggerganov
Member Author

Replaced with #15652

@ggerganov ggerganov closed this Aug 31, 2025
@ggerganov ggerganov deleted the gg/mmid-set-prec branch August 31, 2025 17:14
Development

Successfully merging this pull request may close these issues.

Eval bug: Jinja fails on gpt-oss-120b when using Vulkan