Distributed inference GPU not being used and wrong memory value reported #11272
-
I am trying to do distributed inference on two older computers, but the backend PC is only reporting the VRAM (11330 MB) of one of the two GPUs in that server. nvidia-smi and nvcc both work without errors.

The command I use to start the rpc server is

The command I use to start the main server is:

./llama-cli -m /mnt/database/ds25/DeepSeek-V2.5-1210-Q6_K-00001-of-00005.gguf -p "Hello, my name is" -ngl 24 --rpc 192.168.1.46:50052

NVTOP shows all the GPUs on the main server active, but only one GPU being used on the backend PC. Are there any settings I can use to fix this?

Main server: Dell PowerEdge C4130 - 4 x M40 (24GB)

I also noticed the active GPUs are only using their memory; their GPU% is stuck at zero.
-
You need to start a separate rpc-server for each GPU that you have. Then, on the main host, list all of the RPC endpoints when launching llama-cli.
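A minimal sketch of the two steps, assuming the two backend GPUs sit in the machine at 192.168.1.46 and that ports 50052/50053 are free; the model path, prompt, and -ngl value are copied from the original command, and the rpc-server flags should be checked against ./rpc-server --help for your build:

```sh
# On the backend PC: one rpc-server per GPU, each pinned to a single
# device with CUDA_VISIBLE_DEVICES and listening on its own port.
CUDA_VISIBLE_DEVICES=0 ./rpc-server -H 0.0.0.0 -p 50052 &
CUDA_VISIBLE_DEVICES=1 ./rpc-server -H 0.0.0.0 -p 50053 &

# On the main host: pass every endpoint to --rpc as a comma-separated list.
./llama-cli -m /mnt/database/ds25/DeepSeek-V2.5-1210-Q6_K-00001-of-00005.gguf \
  -p "Hello, my name is" -ngl 24 \
  --rpc 192.168.1.46:50052,192.168.1.46:50053
```

With both endpoints listed, each rpc-server shows up as an additional device, so both GPUs on the backend PC should receive layers.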
-
+1
-
Thank you, I am now utilizing the GPUs over RPC. One more question: how can I squeeze a bit more power from these cards? The M40 24GB is only using 12213 MiB, and the M40 12GB is using ~7106 MiB.

print_info: max token length = 256
-
I am not sure why the models I am loading are not using more of my GPUs. DeepSeek and Llama 3.3 hover at around 50%. What variable could I use to increase the GPU%?
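Not a definitive answer, just a hedged sketch: the variable most directly tied to how much of the model runs on the GPUs is -ngl (--n-gpu-layers), which the original command sets to 24. Raising it, as far as the combined VRAM of the local and RPC devices allows, offloads more layers and should push both memory usage and GPU% up. The value below is illustrative only:

```sh
# Illustrative only: offload more layers by raising -ngl until VRAM runs out.
./llama-cli -m /mnt/database/ds25/DeepSeek-V2.5-1210-Q6_K-00001-of-00005.gguf \
  -p "Hello, my name is" -ngl 40 \
  --rpc 192.168.1.46:50052,192.168.1.46:50053
```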
-
Does the main host need a GPU to run? Or can it be deployed on a smaller machine as the host, with the RPC servers handling the GPU work? Does the main host only perform tokenization of the received requests and coordination? I'm trying to understand if I can use a lightweight CPU-only machine as the main coordinator while offloading all the heavy computation to separate RPC servers with GPUs. Thanks!