Name and Version
./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6600 XT, gfx1032 (0x1032), VMM: no, Wave Size: 32
version: 6123 (79c1160)
built with AMD clang version 19.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.4.3 25224 d366fa84f3fdcbd4b10847ebd5db572ae12a34fb) for x86_64-unknown-linux-gnu
Ubuntu 22.04
Operating systems
Linux
GGML backends
BLAS, HIP
Hardware
CPU Specs: Model: Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz | Cores: 6 | Threads: 12 | Arch: x86_64
CPU Caches: L1d: 192 KiB | L1i: 192 KiB | L2: 1.5 MiB | L3: 12 MiB
CPU Governor: performance
GPU Specs: Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] | Name: amdgpu | Total VRAM: 8176MiB
Memory Specs: Total RAM: 62Gi
Models
So far, Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf and Qwen_Qwen2.5-Coder-14B-Instruct-GGUF_qwen2.5-coder-14b-instruct-q8_0- are the only models that don't hit this error (fully offloaded). All others I've tried do, including:
gpt-oss-20b-Q4_K_M.gguf
Qwen3-30B-A3B-Q4_K_M.gguf
deepseek-coder-6.7b-instruct.Q4_K_M.gguf
mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
Problem description & steps to reproduce
I get the same error reported here: #12878 (comment), though I can't understand how that issue was resolved/closed. My guess is that the dense models work because of their more predictable GPU memory access, while the MoE models all share an inherently more complex routing mechanism and more matrix multiplication operations, even when using --cpu-moe to offload the experts. Is that it in a nutshell, or is there a way around this?
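For context, a minimal sketch of the kind of invocation that triggers it for me (the model path and layer count below are placeholders; the context size matches n_ctx_slot = 16000 in the log):

./build/bin/llama-server \
    -m /path/to/Qwen3-30B-A3B-Q4_K_M.gguf \
    -c 16000 \
    -ngl 99 \
    --cpu-moe

With --cpu-moe the expert tensors stay in system RAM while the remaining layers are offloaded to the RX 6600 XT; the crash happens while processing the first 512-token prompt batch.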
First Bad Commit
No response
Relevant log output
slot launch_slot_: id 0 | task 0 | processing task
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16000, n_keep = 0, n_prompt_tokens = 8633
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 512, n_tokens = 512, progress = 0.059307
srv update_slots: decoding batch, n_tokens = 512
clear_adapter_lora: call
set_embeddings: value = 0
/home/gym/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:84: ROCm error
ROCm error: CUBLAS_STATUS_INTERNAL_ERROR
current device: 0, in function ggml_cuda_mul_mat_batched_cublas_impl at /home/gym/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1943
hipblasGemmStridedBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, src0_ptr, cu_data_type_a, nb01/nb00, sma, src1_ptr, cu_data_type_b, s11, smb, beta, dst_t, cu_data_type, ne0, ne1*ne0, ne12*ne13, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
[New LWP 286318]
[New LWP 286840]
[New LWP 286841]
[New LWP 286842]
[New LWP 286843]
[New LWP 286844]
[New LWP 286845]
[New LWP 286846]
[New LWP 286847]
[New LWP 286848]
[New LWP 286849]
[New LWP 286850]
[New LWP 286851]
[New LWP 286852]
[New LWP 286875]
[New LWP 287147]
[New LWP 287148]
[New LWP 287149]
[New LWP 287150]
[New LWP 287151]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000722a26aea42f in __GI___wait4 (pid=287462, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x0000722a26aea42f in __GI___wait4 (pid=287462, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000722a2b463486 in ggml_print_backtrace () from /home/gym/llama.cpp/build/bin/libggml-base.so
#2 0x0000722a2b4636d9 in ggml_abort () from /home/gym/llama.cpp/build/bin/libggml-base.so
#3 0x0000722a27339eb2 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#4 0x0000722a27344b49 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#5 0x0000722a27341c02 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#6 0x0000722a2733fd80 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/gym/llama.cpp/build/bin/libggml-hip.so
#7 0x0000722a2b47dff7 in ggml_backend_sched_graph_compute_async () from /home/gym/llama.cpp/build/bin/libggml-base.so
#8 0x0000722a2b2dd3f1 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/gym/llama.cpp/build/bin/libllama.so
#9 0x0000722a2b2dd06b in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/gym/llama.cpp/build/bin/libllama.so
#10 0x0000722a2b2de45e in llama_context::decode(llama_batch const&) () from /home/gym/llama.cpp/build/bin/libllama.so
#11 0x0000722a2b2e251b in llama_decode () from /home/gym/llama.cpp/build/bin/libllama.so
#12 0x0000000000352e32 in server_context::update_slots() ()
#13 0x00000000002d1284 in server_queue::start_loop() ()
#14 0x000000000028cbf9 in main ()
[Inferior 1 (process 286315) detached]
Aborted (core dumped)