Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
version: 6294 (bcbddcd)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./build/bin/llama-server -ngl 999 -fa -m GLM-4.5-Air-Q5_K_M-00001-of-00002.gguf -c 131071 --jinja --reasoning-format deepseek --slots --n-predict 131071 --no-context-shift -ctk q8_0 -ctv q8_0 --parallel 1
The GGUFs are here: `https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main/Q5_K_M`
Problem description & steps to reproduce
When using tool calling, the server sometimes crashes (see below for the stack trace).
I first encountered this issue while working on PR #15248.
I found a way to reproduce the error 100% of the time on master HEAD, though the model is large.
I cannot reproduce the issue with smaller examples (the simple tool-calling example provided in the llama.cpp documentation works fine).
A Node.js script, run with `node nodetest.js`, triggers the crash every single time it is called.
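The reproduction script itself is not reproduced here. As a hedged sketch only, assuming llama-server's OpenAI-compatible `/v1/chat/completions` endpoint on the default port 8080, a tool-calling request of the kind involved might look like this (the tool definition and prompt are placeholders, not the original reproduction):

```javascript
// Hypothetical sketch, not the original nodetest.js: sends one
// tool-calling chat completion request to a locally running llama-server.
function buildRequest() {
  return {
    messages: [
      { role: "user", content: "What is the weather in Paris?" },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather", // placeholder tool, not from the report
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
  };
}

async function main() {
  try {
    // Assumes llama-server is listening on its default port 8080.
    const res = await fetch("http://localhost:8080/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildRequest()),
    });
    console.log(await res.json());
  } catch (err) {
    console.error("request failed (is llama-server running?):", err.message);
  }
}

main();
```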
First Bad Commit
No response
Relevant log output
srv params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 4244
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.482564
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4096, n_tokens = 2048, progress = 0.965127
slot update_slots: id 0 | task 0 | kv cache rm [4096, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4244, n_tokens = 148, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 4244, n_tokens = 148
/home/xxxxxxxxxx/idextend/llama.cpp/build/bin/libggml-base.so(+0x16dab)[0x7fb495e8ddab]
/home/xxxxxxxxxx/idextend/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7fb495e8e20f]
/home/xxxxxxxxxx/idextend/llama.cpp/build/bin/libggml-base.so(+0x2975f)[0x7fb495ea075f]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7fb495cdf20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7fb495cdf277]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7fb495cdf4d8]
/home/xxxxxxxxxx/idextend/llama.cpp/build/bin/libllama.so(+0x65f61)[0x7fb495f89f61]
/home/xxxxxxxxxx/idextend/llama.cpp/build/bin/libllama.so(_Z25llama_grammar_accept_implR13llama_grammari+0x26f)[0x7fb495fcd81f]
./build/bin/llama-server(+0x1e08ac)[0x55bff96ba8ac]
./build/bin/llama-server(+0xdfb13)[0x55bff95b9b13]
./build/bin/llama-server(+0x83c8d)[0x55bff955dc8d]
./build/bin/llama-server(+0x4af3d)[0x55bff9524f3d]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fb495928d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fb495928e40]
./build/bin/llama-server(+0x4c995)[0x55bff9526995]
terminate called after throwing an instance of 'std::runtime_error'
what(): Unexpected empty grammar stack after accepting piece:
<tool_call>
Aborted (core dumped)
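For context, the `llama_grammar_accept_impl` frame in the backtrace belongs to grammar-constrained sampling, which the `--jinja` tool-calling path uses to constrain the model's tool-call syntax. The following is a heavily simplified, hedged sketch of the failure mode only, not llama.cpp's actual data structures: candidate parse stacks are advanced one character at a time, and when no stack survives an accepted piece, the real code throws the `std::runtime_error` shown above.

```javascript
// Hedged illustration only; NOT llama.cpp's implementation. A
// grammar-constrained sampler keeps a set of candidate parse stacks and
// advances each surviving stack one character at a time. If no stack
// survives an accepted piece, the piece was not derivable from the
// grammar, which is the "empty grammar stack" condition in the log.
function acceptPiece(stacks, piece) {
  for (const ch of piece) {
    const next = [];
    for (const stack of stacks) {
      // The next expected character sits at the top (end) of the stack.
      if (stack[stack.length - 1] === ch) {
        next.push(stack.slice(0, -1)); // consume the matched character
      }
    }
    stacks = next;
  }
  if (stacks.length === 0) {
    throw new Error(`Unexpected empty grammar stack after accepting piece: ${piece}`);
  }
  return stacks;
}

// A stack expecting the literal "<tool_call>", top of stack at the end.
const stacks = [[..."<tool_call>"].reverse()];
console.log(acceptPiece(stacks, "<tool_call>")); // one fully-consumed stack survives
```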