Running on iGPUs with Vulkan can be fun - llama.cpp OOM error debugging #2

@geerlingguy

My current goal is to run DeepSeek R1 671B on a 4-node cluster of servers with 128 GB of shared RAM on each system (with an iGPU that works with Vulkan).

The problem is that it seems I can only use around 43 GB on each node (out of the 128 GB). I'm guessing it's a limit on how much shared VRAM the iGPU is allowed to claim, to prevent resource contention? Not sure. In the past, all my llama.cpp work has involved dedicated GPUs, using either VRAM only, or shared VRAM plus system memory.
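
(For anyone trying to reproduce the limit: a quick sketch of how to check what the iGPU actually exposes, assuming an amdgpu/RADV setup; the sysfs paths and card index may differ on other systems.)

# What does the Vulkan driver report as memory heaps? (vulkaninfo is in vulkan-tools)
vulkaninfo | grep -A 6 "memoryHeaps"

# On amdgpu, GTT (the slice of system RAM the GPU can map) is usually what caps an iGPU.
# These sysfs nodes report sizes in bytes; the card index may vary.
cat /sys/class/drm/card0/device/mem_info_gtt_total
cat /sys/class/drm/card0/device/mem_info_vram_total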

Running llama.cpp with RPC servers on the three other nodes, but not on the main node (which uses its local Vulkan backend directly):

load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:      Vulkan0 model buffer size = 96507.91 MiB
load_tensors: RPC[10.0.2.242:50052] model buffer size = 97323.94 MiB
load_tensors: RPC[10.0.2.223:50052] model buffer size = 103503.00 MiB
load_tensors: RPC[10.0.2.209:50052] model buffer size = 87857.67 MiB
load_tensors:          CPU model buffer size =   497.11 MiB
radv/amdgpu: Not enough memory for command submission.
llama_model_load: error loading model: vk::Queue::submit: ErrorDeviceLost
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/DeepSeek-R1-Q4_K_M.gguf'
main: error: unable to load model
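
For reference, the invocation is roughly shaped like this (a sketch, not the exact commands; the rpc-server binary, its --host/--port options, and the client's --rpc flag are assumed from stock llama.cpp, while the IPs and port come from the log above):

# On each of the three other nodes (10.0.2.242, 10.0.2.223, 10.0.2.209):
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main node, which also loads onto its own local Vulkan backend (Vulkan0):
./build/bin/llama-cli -m models/DeepSeek-R1-Q4_K_M.gguf --no-mmap -ngl 99 \
  --rpc 10.0.2.242:50052,10.0.2.223:50052,10.0.2.209:50052

Note that the model buffers above add up to roughly 377 GiB, i.e. about 86 to 101 GiB per node, well beyond the ~43 GB each iGPU seems willing to allocate.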

Running RPC on all four nodes (including the main node) results in an OOM kill on the main node, since llama.cpp brings up both the local Vulkan backend and the local RPC backend there:

print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
Killed
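
(The four-node variant is the same shape, just with an rpc-server also running locally on the main node and a fourth entry in --rpc; again a sketch, with the local address assumed.)

# Additionally, on the main node:
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# The client on the main node then lists all four servers:
./build/bin/llama-cli -m models/DeepSeek-R1-Q4_K_M.gguf --no-mmap -ngl 99 \
  --rpc 127.0.0.1:50052,10.0.2.242:50052,10.0.2.223:50052,10.0.2.209:50052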

Setting the following on the main node only:

export GGML_VK_PREFER_HOST_MEMORY=1

results in:

........................./opt/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:579: Remote RPC server crashed or returned malformed response
[New LWP 11588]
[New LWP 11587]
[New LWP 11586]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f00fe8876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007f00fe8876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007f00fe87b9da in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007f00fe87ba24 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007f00fe8eb5ef in wait4 () from /lib64/libc.so.6
#4  0x00007f0100b6c263 in ggml_print_backtrace () from /opt/llama.cpp/build/bin/libggml-base.so
#5  0x00007f0100b6c3af in ggml_abort () from /opt/llama.cpp/build/bin/libggml-base.so
#6  0x00007f0100e33c1a in ggml_backend_rpc_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /opt/llama.cpp/build/bin/libggml-rpc.so
#7  0x00007f0100c8a234 in llama_model_loader::load_all_data(ggml_context*, std::unordered_map<unsigned int, ggml_backend_buffer*, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, ggml_backend_buffer*> > >&, std::vector<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> >, std::allocator<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> > > >*, bool (*)(float, void*), void*) () from /opt/llama.cpp/build/bin/libllama.so
#8  0x00007f0100ccdb21 in llama_model::load_tensors(llama_model_loader&) () from /opt/llama.cpp/build/bin/libllama.so
#9  0x00007f0100c1b26d in llama_model_load_from_file_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, llama_model_params) () from /opt/llama.cpp/build/bin/libllama.so
#10 0x00007f0100c1bffc in llama_model_load_from_file () from /opt/llama.cpp/build/bin/libllama.so
#11 0x000000000051399f in common_init_from_params(common_params&) ()
#12 0x000000000042464d in main ()
[Inferior 1 (process 11585) detached]
Aborted (core dumped)
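
(One follow-up worth doing on the remote node after a crash like this: check whether its rpc-server was OOM-killed by the kernel. A quick sketch, using standard tools:)

# On the RPC node that dropped the connection:
sudo dmesg | grep -i -E "out of memory|killed process"
# or, on a systemd system:
journalctl -k --since "1 hour ago" | grep -i oom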
