
Conversation


icwhite commented Oct 13, 2025

Previously, even if one specified CUDA_VISIBLE_DEVICES=4,5,6,7, the code would still run on GPUs 0,1,2,3. Respecting the mask matters when sharing machines with other users or running parallel experiments on the same node.

Now, if you set CUDA_VISIBLE_DEVICES="4,5,6,7", the code will use those GPUs by indexing into the list of visible devices. There is also error-catching logic so that if you want all of the GPUs on a machine, you don't have to set the CUDA environment variable at all.
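
A minimal sketch of the indexing behaviour described above; the function name `resolve_gpu_ids` and its signature are illustrative, not the actual helpers touched by this PR:

```python
import os


def resolve_gpu_ids(requested_indices):
    """Map logical GPU indices onto the devices listed in CUDA_VISIBLE_DEVICES."""
    mask = os.getenv("CUDA_VISIBLE_DEVICES")
    if not mask:
        # No mask set: assume every GPU on the machine is available and
        # use the requested indices directly.
        return list(requested_indices)
    visible_gpus = [int(g) for g in mask.split(",") if g.strip()]
    # Index into the visible devices: requesting [0, 1] under
    # CUDA_VISIBLE_DEVICES="4,5,6,7" yields physical GPUs [4, 5].
    return [visible_gpus[i] for i in requested_indices]


# Example: with CUDA_VISIBLE_DEVICES="4,5,6,7", this prints [4, 5].
print(resolve_gpu_ids([0, 1]))
```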

rafapi (Collaborator) commented Oct 17, 2025

Thanks for your PR and for digging into the CUDA mask issue! Unfortunately this approach breaks our launcher in a few ways:

- `VISIBLE_GPUS = os.getenv("CUDA_VISIBLE_DEVICES")` comes back None on many of our runners, so the new `.split(',')` in run_ref_llm, run_actor_llm, and run_finetune crashes and the orchestrator never starts.

- In run_finetune the extra gpu_str argument shifts `--gpu-ids` off its value, so accelerate.launch sees pipelinerl/entrypoints/run_finetune.py in the wrong position and exits.

- Even when the env var is set, re-indexing through `visible_gpus[gpu]` breaks whenever CUDA_VISIBLE_DEVICES doesn't list every physical device id. In multi-node jobs we already pass those ids from world.py, so remapping through the mask will either throw an IndexError or launch on the wrong device.

Given these regressions, we should keep using the scheduled GPUs from world.py and address a safer remapping strategy separately.
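
For context, here is a rough sketch of what such a safer remapping could look like; this is only an assumption, not the maintainers' plan, and `remap_scheduled_gpus` plus its fallback policy are hypothetical:

```python
import os


def remap_scheduled_gpus(scheduled_ids):
    """Remap GPU ids scheduled by world.py through CUDA_VISIBLE_DEVICES only when safe."""
    mask = os.getenv("CUDA_VISIBLE_DEVICES")
    if not mask:
        # Unset or empty mask: keep the scheduler's ids unchanged
        # (avoids crashing on None.split(',')).
        return list(scheduled_ids)
    visible_gpus = [int(g) for g in mask.split(",") if g.strip()]
    if any(i < 0 or i >= len(visible_gpus) for i in scheduled_ids):
        # The mask does not cover the scheduled indices, e.g. multi-node jobs
        # that already pass physical ids; fall back instead of raising IndexError.
        return list(scheduled_ids)
    return [visible_gpus[i] for i in scheduled_ids]
```

The key difference from the PR as written is that both failure modes degrade to the ids scheduled by world.py instead of crashing or launching on the wrong device.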

icwhite (Author) commented Oct 17, 2025

That makes sense. Is there a better way for me to implement this? This issue is currently blocking me from using pipeline RL on my local servers.

@icwhite icwhite marked this pull request as draft October 17, 2025 16:42
