
Conversation


icwhite commented Oct 13, 2025

Previously, even if one specified CUDA_VISIBLE_DEVICES=4,5,6,7, the code would still run on GPUs 0,1,2,3. Respecting the mask matters when sharing machines with other users or running parallel experiments on the same node.

Now, if you set CUDA_VISIBLE_DEVICES="4,5,6,7", the code will use those GPUs by indexing into the list of visible devices. There is also error-catching logic so that if you want all of the GPUs on a machine, you don't have to set the CUDA environment variable at all.
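
A minimal sketch of the indexing behaviour described above; the function name `resolve_gpu_ids` and its signature are illustrative, not the actual helpers touched by this PR:

```python
import os


def resolve_gpu_ids(requested_indices):
    """Map logical GPU indices onto the devices listed in CUDA_VISIBLE_DEVICES."""
    mask = os.getenv("CUDA_VISIBLE_DEVICES")
    if not mask:
        # No mask set: assume every GPU on the machine is available and
        # use the requested indices directly.
        return list(requested_indices)
    visible_gpus = [int(g) for g in mask.split(",") if g.strip()]
    # Index into the visible devices: requesting [0, 1] under
    # CUDA_VISIBLE_DEVICES="4,5,6,7" yields physical GPUs [4, 5].
    return [visible_gpus[i] for i in requested_indices]


# Example: with CUDA_VISIBLE_DEVICES="4,5,6,7", this prints [4, 5].
print(resolve_gpu_ids([0, 1]))
```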

rafapi (Collaborator) commented Oct 17, 2025

Thanks for your PR and for digging into the CUDA mask issue! Unfortunately this approach breaks our launcher in a few ways:

- `VISIBLE_GPUS = os.getenv("CUDA_VISIBLE_DEVICES")` comes back None on many of our runners, so the new `.split(',')` in run_ref_llm, run_actor_llm, and run_finetune crashes and the orchestrator never starts.

- In run_finetune the extra gpu_str argument shifts `--gpu-ids` off its value, so accelerate.launch sees pipelinerl/entrypoints/run_finetune.py in the wrong position and exits.

- Even when the env var is set, re-indexing through `visible_gpus[gpu]` breaks whenever CUDA_VISIBLE_DEVICES doesn't list every physical device id. In multi-node jobs we already pass those ids from world.py, so remapping through the mask will either throw an IndexError or launch on the wrong device.

Given these regressions, we should keep using the scheduled GPUs from world.py and address a safer remapping strategy separately.
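
For context, here is a rough sketch of what such a safer remapping could look like; this is only an assumption, not the maintainers' plan, and `remap_scheduled_gpus` plus its fallback policy are hypothetical:

```python
import os


def remap_scheduled_gpus(scheduled_ids):
    """Remap GPU ids scheduled by world.py through CUDA_VISIBLE_DEVICES only when safe."""
    mask = os.getenv("CUDA_VISIBLE_DEVICES")
    if not mask:
        # Unset or empty mask: keep the scheduler's ids unchanged
        # (avoids crashing on None.split(',')).
        return list(scheduled_ids)
    visible_gpus = [int(g) for g in mask.split(",") if g.strip()]
    if any(i < 0 or i >= len(visible_gpus) for i in scheduled_ids):
        # The mask does not cover the scheduled indices, e.g. multi-node jobs
        # that already pass physical ids; fall back instead of raising IndexError.
        return list(scheduled_ids)
    return [visible_gpus[i] for i in scheduled_ids]
```

The key difference from the PR as written is that both failure modes degrade to the ids scheduled by world.py instead of crashing or launching on the wrong device.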

icwhite (Author) commented Oct 17, 2025

That makes sense. Is there a better way for me to implement this? This issue is currently blocking me from using pipeline RL on my local servers.

@icwhite icwhite marked this pull request as draft October 17, 2025 16:42
