Distributed inference GPU not being used and wrong memory value reported #11272
-
I am trying to do distributed inference on two older computers, but the backend PC is only reporting the VRAM (11330 MB) of one of the two GPUs in that server. nvidia-smi and nvcc both work without errors.

The command I use to start the rpc server is

The command I use to start the main server is:

./llama-cli -m /mnt/database/ds25/DeepSeek-V2.5-1210-Q6_K-00001-of-00005.gguf -p "Hello, my name is" -ngl 24 --rpc 192.168.1.46:50052

NVTOP shows all the GPUs on the main server active, but only one GPU being used on the backend PC. Are there any settings I can use to fix this?

Main server: Dell PowerEdge C4130 - 4 x M40 (24GB)

I also noticed the active GPUs are only using their memory; their GPU% is stuck at zero.
-
You need to start a separate rpc-server for each GPU that you have. Then, on the main host, list all of the RPC endpoints when launching llama-cli.
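A minimal sketch of the two steps, assuming the two backend GPUs sit in the machine at 192.168.1.46 and that ports 50052/50053 are free; the model path, prompt, and -ngl value are copied from the original command, and the rpc-server flags should be checked against ./rpc-server --help for your build:

```sh
# On the backend PC: one rpc-server per GPU, each pinned to a single
# device with CUDA_VISIBLE_DEVICES and listening on its own port.
CUDA_VISIBLE_DEVICES=0 ./rpc-server -H 0.0.0.0 -p 50052 &
CUDA_VISIBLE_DEVICES=1 ./rpc-server -H 0.0.0.0 -p 50053 &

# On the main host: pass every endpoint to --rpc as a comma-separated list.
./llama-cli -m /mnt/database/ds25/DeepSeek-V2.5-1210-Q6_K-00001-of-00005.gguf \
  -p "Hello, my name is" -ngl 24 \
  --rpc 192.168.1.46:50052,192.168.1.46:50053
```

With both endpoints listed, each rpc-server shows up as an additional device, so both GPUs on the backend PC should receive layers.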
-
+1
-
Thank you, I am now utilizing the GPUs over RPC. One more question: how can I squeeze a bit more power from these cards? The M40 24GB is only using 12213 MiB, and the M40 12GB is using ~7106 MiB.

print_info: max token length = 256
-
I am not sure why the models I am loading are not using more of my GPUs. DeepSeek and Llama 3.3 hover at around 50%. What variable could I use to increase the GPU%?
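Not a definitive answer, just a hedged sketch: the variable most directly tied to how much of the model runs on the GPUs is -ngl (--n-gpu-layers), which the original command sets to 24. Raising it, as far as the combined VRAM of the local and RPC devices allows, offloads more layers and should push both memory usage and GPU% up. The value below is illustrative only:

```sh
# Illustrative only: offload more layers by raising -ngl until VRAM runs out.
./llama-cli -m /mnt/database/ds25/DeepSeek-V2.5-1210-Q6_K-00001-of-00005.gguf \
  -p "Hello, my name is" -ngl 40 \
  --rpc 192.168.1.46:50052,192.168.1.46:50053
```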
-
Does the main host need a GPU to run? Or can it be deployed on a smaller machine as the host, with the RPC servers handling the GPU work? Does the main host only perform tokenization of the received requests and coordination? I'm trying to understand if I can use a lightweight CPU-only machine as the main coordinator while offloading all the heavy computation to separate RPC servers with GPUs. Thanks!