Running on iGPUs with Vulkan can be fun - llama.cpp OOM error debugging #2

@geerlingguy

My current goal is to run DeepSeek R1 671B on a 4-node cluster of servers with 128 GB of shared RAM on each system (with an iGPU that works with Vulkan).

The problem is that it seems I can only use around 43 GB on each node (out of the 128 GB). I'm guessing it's a limit on how much shared VRAM the iGPU is allowed to claim, to prevent resource contention? Not sure. In the past, all my llama.cpp work has involved dedicated GPUs, using either VRAM only, or shared VRAM plus system memory.
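
(For anyone trying to reproduce the limit: a quick sketch of how to check what the iGPU actually exposes, assuming an amdgpu/RADV setup; the sysfs paths and card index may differ on other systems.)

# What does the Vulkan driver report as memory heaps? (vulkaninfo is in vulkan-tools)
vulkaninfo | grep -A 6 "memoryHeaps"

# On amdgpu, GTT (the slice of system RAM the GPU can map) is usually what caps an iGPU.
# These sysfs nodes report sizes in bytes; the card index may vary.
cat /sys/class/drm/card0/device/mem_info_gtt_total
cat /sys/class/drm/card0/device/mem_info_vram_total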

Running llama.cpp with RPC servers on the three other nodes, but not on the main node (which uses its local Vulkan backend directly):

load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:      Vulkan0 model buffer size = 96507.91 MiB
load_tensors: RPC[10.0.2.242:50052] model buffer size = 97323.94 MiB
load_tensors: RPC[10.0.2.223:50052] model buffer size = 103503.00 MiB
load_tensors: RPC[10.0.2.209:50052] model buffer size = 87857.67 MiB
load_tensors:          CPU model buffer size =   497.11 MiB
radv/amdgpu: Not enough memory for command submission.
llama_model_load: error loading model: vk::Queue::submit: ErrorDeviceLost
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/DeepSeek-R1-Q4_K_M.gguf'
main: error: unable to load model
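
For reference, the invocation is roughly shaped like this (a sketch, not the exact commands; the rpc-server binary, its --host/--port options, and the client's --rpc flag are assumed from stock llama.cpp, while the IPs and port come from the log above):

# On each of the three other nodes (10.0.2.242, 10.0.2.223, 10.0.2.209):
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main node, which also loads onto its own local Vulkan backend (Vulkan0):
./build/bin/llama-cli -m models/DeepSeek-R1-Q4_K_M.gguf --no-mmap -ngl 99 \
  --rpc 10.0.2.242:50052,10.0.2.223:50052,10.0.2.209:50052

Note that the model buffers above add up to roughly 377 GiB, i.e. about 86 to 101 GiB per node, well beyond the ~43 GB each iGPU seems willing to allocate.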

Running RPC on all four nodes (including the main node) results in an OOM kill on the main node, since llama.cpp brings up both the local Vulkan backend and the local RPC backend there:

print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
Killed
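
(The four-node variant is the same shape, just with an rpc-server also running locally on the main node and a fourth entry in --rpc; again a sketch, with the local address assumed.)

# Additionally, on the main node:
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# The client on the main node then lists all four servers:
./build/bin/llama-cli -m models/DeepSeek-R1-Q4_K_M.gguf --no-mmap -ngl 99 \
  --rpc 127.0.0.1:50052,10.0.2.242:50052,10.0.2.223:50052,10.0.2.209:50052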

Setting the following on the main node only:

export GGML_VK_PREFER_HOST_MEMORY=1

results in:

........................./opt/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:579: Remote RPC server crashed or returned malformed response
[New LWP 11588]
[New LWP 11587]
[New LWP 11586]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f00fe8876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007f00fe8876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007f00fe87b9da in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007f00fe87ba24 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007f00fe8eb5ef in wait4 () from /lib64/libc.so.6
#4  0x00007f0100b6c263 in ggml_print_backtrace () from /opt/llama.cpp/build/bin/libggml-base.so
#5  0x00007f0100b6c3af in ggml_abort () from /opt/llama.cpp/build/bin/libggml-base.so
#6  0x00007f0100e33c1a in ggml_backend_rpc_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /opt/llama.cpp/build/bin/libggml-rpc.so
#7  0x00007f0100c8a234 in llama_model_loader::load_all_data(ggml_context*, std::unordered_map<unsigned int, ggml_backend_buffer*, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, ggml_backend_buffer*> > >&, std::vector<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> >, std::allocator<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> > > >*, bool (*)(float, void*), void*) () from /opt/llama.cpp/build/bin/libllama.so
#8  0x00007f0100ccdb21 in llama_model::load_tensors(llama_model_loader&) () from /opt/llama.cpp/build/bin/libllama.so
#9  0x00007f0100c1b26d in llama_model_load_from_file_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, llama_model_params) () from /opt/llama.cpp/build/bin/libllama.so
#10 0x00007f0100c1bffc in llama_model_load_from_file () from /opt/llama.cpp/build/bin/libllama.so
#11 0x000000000051399f in common_init_from_params(common_params&) ()
#12 0x000000000042464d in main ()
[Inferior 1 (process 11585) detached]
Aborted (core dumped)
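
(One follow-up worth doing on the remote node after a crash like this: check whether its rpc-server was OOM-killed by the kernel. A quick sketch, using standard tools:)

# On the RPC node that dropped the connection:
sudo dmesg | grep -i -E "out of memory|killed process"
# or, on a systemd system:
journalctl -k --since "1 hour ago" | grep -i oom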
