Name and Version
./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6600 XT, gfx1032 (0x1032), VMM: no, Wave Size: 32
version: 6123 (79c1160)
built with AMD clang version 19.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.4.3 25224 d366fa84f3fdcbd4b10847ebd5db572ae12a34fb) for x86_64-unknown-linux-gnu
Ubuntu 22.04
Operating systems
Linux
GGML backends
BLAS, HIP
Hardware
CPU Specs: Model: Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz | Cores: 6 | Threads: 12 | Arch: x86_64
CPU Caches: L1d: 192 KiB | L1i: 192 KiB | L2: 1.5 MiB | L3: 12 MiB
CPU Governor: performance
GPU Specs: Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] | Name: amdgpu | Total VRAM: 8176MiB
Memory Specs: Total RAM: 62Gi
Models
So far, Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf and Qwen_Qwen2.5-Coder-14B-Instruct-GGUF_qwen2.5-coder-14b-instruct-q8_0- are the only models that don't hit this error (fully offloaded). All others I've tried do, including:
gpt-oss-20b-Q4_K_M.gguf
Qwen3-30B-A3B-Q4_K_M.gguf
deepseek-coder-6.7b-instruct.Q4_K_M.gguf
mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
Problem description & steps to reproduce
I get the same error reported here: #12878 (comment), though I can't understand how that issue was resolved/closed. My guess is that the dense models work because of their more predictable GPU memory access, while the MoE models all share an inherently more complex routing mechanism and more matrix multiplication operations, even when using --cpu-moe to offload the experts. Is that it in a nutshell, or is there a way around this?
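For context, a minimal sketch of the kind of invocation that triggers it for me (the model path and layer count below are placeholders; the context size matches n_ctx_slot = 16000 in the log):

./build/bin/llama-server \
    -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
    -c 16000 \
    -ngl 99 \
    --cpu-moe

With --cpu-moe the expert tensors stay in system RAM while the remaining layers are offloaded to the RX 6600 XT; the crash happens while processing the first 512-token prompt batch.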
First Bad Commit
No response
Relevant log output
slot launch_slot_: id 0 | task 0 | processing task
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16000, n_keep = 0, n_prompt_tokens = 8633
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 512, n_tokens = 512, progress = 0.059307
srv update_slots: decoding batch, n_tokens = 512
clear_adapter_lora: call
set_embeddings: value = 0
/home/gym/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:84: ROCm error
ROCm error: CUBLAS_STATUS_INTERNAL_ERROR
current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at /home/gym/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1943
hipblasGemmStridedBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, src0_ptr, cu_data_type_a, nb01/nb00, sma, src1_ptr, cu_data_type_b, s11, smb, beta, dst_t, cu_data_type, ne0, ne1*ne0, ne12*ne13, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
[New LWP 286318]
[New LWP 286840]
[New LWP 286841]
[New LWP 286842]
[New LWP 286843]
[New LWP 286844]
[New LWP 286845]
[New LWP 286846]
[New LWP 286847]
[New LWP 286848]
[New LWP 286849]
[New LWP 286850]
[New LWP 286851]
[New LWP 286852]
[New LWP 286875]
[New LWP 287147]
[New LWP 287148]
[New LWP 287149]
[New LWP 287150]
[New LWP 287151]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000722a26aea42f in __GI___wait4 (pid=287462, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x0000722a26aea42f in __GI___wait4 (pid=287462, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000722a2b463486 in ggml_print_backtrace () from /home/gym/llama.cpp/build/bin/libggml-base.so
#2 0x0000722a2b4636d9 in ggml_abort () from /home/gym/llama.cpp/build/bin/libggml-base.so
#3 0x0000722a27339eb2 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#4 0x0000722a27344b49 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#5 0x0000722a27341c02 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#6 0x0000722a2733fd80 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#7 0x0000722a2b47dff7 in ggml_backend_sched_graph_compute_async () from /home/gym/llama.cpp/build/bin/libggml-base.so
#8 0x0000722a2b2dd3f1 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/gym/llama.cpp/build/bin/libllama.so
#9 0x0000722a2b2dd06b in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/gym/llama.cpp/build/bin/libllama.so
#10 0x0000722a2b2de45e in llama_context::decode(llama_batch const&) () from /home/gym/llama.cpp/build/bin/libllama.so
#11 0x0000722a2b2e251b in llama_decode () from /home/gym/llama.cpp/build/bin/libllama.so
#12 0x0000000000352e32 in server_context::update_slots() ()
#13 0x00000000002d1284 in server_queue::start_loop() ()
#14 0x000000000028cbf9 in main ()
[Inferior 1 (process 286315) detached]
Aborted (core dumped)