-
Hello there, thanks for the great work. I'm wondering how to set the device order when using a multi-GPU system + RPC. Here is my example. I have a consumer motherboard, running Linux (Fedora):
And a Windows PC with an RTX 5090. I have a 10 Gbps NIC on both PCs. This complex example uses GLM 4.6 IQ4_XS. When running fully on GPU on the Linux PC with this command:
I get:
But when removing a 3090 from this PC and using the 40 Gbps NIC, running it with:
I get about 240 t/s PP and 16 t/s TG. Note that -mg 0 or -mg 1 makes no difference. When using the 40 Gbps NIC at PCIe 3.0 x1 (so about 9 Gbps), I get:
I noticed this when loading the model:
Here RPC seems to be listed first, and the compute buffers seem to follow that pattern as well.
For reference, the GPU order is this (I manually set the 5090 first):
It seems the bigger compute buffer is on RPC, despite specifying -mg 1. So I think it first computes on the RPC device and then sends the data via RPC (at about 4-5 Gbps) to the other PC before it starts working. Is there a way to reorder devices like CUDA_VISIBLE_DEVICES does, but for any device? Something like: GGML_VISIBLE_DEVICES=CUDA0,RPC0[192.168.50.2:50052],CUDA1,CUDA2,CUDA3,CUDA4,CUDA5. Thanks in advance!
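For context, this is how I can already reorder CUDA devices today; a minimal sketch with illustrative device indices and a placeholder model path, and note that this only affects CUDA devices, not the RPC one:

```sh
# Make CUDA device 1 enumerate first, then 0, then the rest.
# The RPC device is unaffected by this variable.
# model.gguf is a placeholder path.
CUDA_VISIBLE_DEVICES=1,0,2,3,4,5 ./llama-server -m model.gguf -ngl 99
```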
-
From tools/rpc/README.md:
In your case and example, it would be:
Edit: I think you will have to move the RPC device further down the list. RPC is not lossless performance-wise: even if the RPC server runs on the same machine, you will lose some performance compared to not using RPC at all. There is no free lunch.
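A minimal sketch of that setup, assuming the rpc-server binary from a standard llama.cpp build (the bind address, port, and model path are placeholders; the remote address is taken from the example above):

```sh
# On the Windows PC (the remote): expose the 5090 over RPC.
rpc-server --host 0.0.0.0 --port 50052

# On the Linux PC: attach the remote GPU via --rpc.
./llama-server -m model.gguf -ngl 99 --rpc 192.168.50.2:50052
```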
-
We always put RPC devices first in the device chain because we want to make sure we don't copy logits over the network (see PR #9296).
I am 90% sure, but PR #16276 (comment) made it easier to set the device name (thanks, @rgerganov). I had forgotten this.
It should now be:
--device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
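Putting it together, a minimal sketch for the example above (the model path is a placeholder, and this assumes a build recent enough to include PR #16276):

```sh
# Order the devices explicitly: local CUDA0 first, then the remote
# RPC device, then the remaining local GPUs.
./llama-server -m GLM-4.6-IQ4_XS.gguf -ngl 99 \
    --rpc 192.168.50.2:50052 \
    --device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
```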