-
Hello there, thanks for the great work. I'm wondering how to set the device order when using a multi-GPU system + RPC. Here is my example. I have a consumer motherboard, running Linux (Fedora):
And a Windows PC with an RTX 5090. I have a 10 Gbps NIC on both PCs. This complex example uses GLM 4.6 IQ4_XS. When running fully on GPU on the Linux PC with this command:
I get:
But when removing a 3090 from this PC and using the 40 Gbps NIC, running it with:
I get about 240 t/s PP and 16 t/s TG. Note that -mg 0 or -mg 1 makes no difference. When using the 40 Gbps NIC at PCIe 3.0 x1 (so about 9 Gbps), I get:
I noticed this when loading the model:
Here RPC seems to be listed first, and the compute buffers seem to follow that pattern as well.
For reference, the GPU order is this (I manually set the 5090 first):
It seems the bigger compute buffer is on RPC, despite specifying -mg 1. So I think it first computes on the RPC device and then sends the data via RPC (at about 4-5 Gbps) to the other PC before it starts working. Is there a way to reorder devices like CUDA_VISIBLE_DEVICES does, but for any device? Something like: GGML_VISIBLE_DEVICES=CUDA0,RPC0[192.168.50.2:50052],CUDA1,CUDA2,CUDA3,CUDA4,CUDA5. Thanks in advance!
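For context, this is how I can already reorder CUDA devices today; a minimal sketch with illustrative device indices and a placeholder model path, and note that this only affects CUDA devices, not the RPC one:

```sh
# Make CUDA device 1 enumerate first, then 0, then the rest.
# The RPC device is unaffected by this variable.
# model.gguf is a placeholder path.
CUDA_VISIBLE_DEVICES=1,0,2,3,4,5 ./llama-server -m model.gguf -ngl 99
```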
-
From tools/rpc/README.md:
In your case and example, it would be:
Edit: I think you will have to move the RPC device further down the list. RPC is not lossless performance-wise: even if the RPC server runs on the same machine, you will lose some performance compared to not using RPC at all. There is no free lunch.
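A minimal sketch of that setup, assuming the rpc-server binary from a standard llama.cpp build (the bind address, port, and model path are placeholders; the remote address is taken from the example above):

```sh
# On the Windows PC (the remote): expose the 5090 over RPC.
rpc-server --host 0.0.0.0 --port 50052

# On the Linux PC: attach the remote GPU via --rpc.
./llama-server -m model.gguf -ngl 99 --rpc 192.168.50.2:50052
```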
-
We always put RPC devices first in the device chain because we want to make sure we don't copy logits over the network (see PR #9296).
I am 90% sure, but PR #16276 (comment) made it easier to set the device name (thanks, @rgerganov). I had forgotten this.
It should now be:
--device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
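Putting it together, a minimal sketch for the example above (the model path is a placeholder, and this assumes a build recent enough to include PR #16276):

```sh
# Order the devices explicitly: local CUDA0 first, then the remote
# RPC device, then the remaining local GPUs.
./llama-server -m GLM-4.6-IQ4_XS.gguf -ngl 99 \
    --rpc 192.168.50.2:50052 \
    --device CUDA0,RPC0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5
```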