
Conversation

sali1293

Add Azure ML compatibility to ParallelRunner via distributed env var support (RANK/WORLD_SIZE) and env:// init method

Description

This PR introduces compatibility for Azure Machine Learning Studio in the ParallelRunner by adding support for distributed environment variables (e.g., RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) and using the 'env://' initialization method for torch.distributed when applicable.

What problem does this change solve?

It enables seamless parallel inference in cloud-based distributed environments like Azure ML, where Slurm (srun) is not used, by detecting and utilizing standard PyTorch distributed env vars instead of relying solely on Slurm or manual process spawning.

What issue or task does this change relate to?

N/A (This is an enhancement based on modifications for Azure ML compatibility; no specific GitHub issue linked.)

Additional notes

  • Changes are minimal and isolated to the _bootstrap_processes and _init_parallel methods in ParallelRunnerMixin to avoid disrupting existing Slurm or manual spawning workflows.

  • Added helper methods _using_distributed_env and _is_mpi_env for cleaner logic.

  • No breaking changes; falls back gracefully to existing behaviors.

  • Tested in a multi-GPU Azure ML environment; no updates to dependencies required.

  • MPI detection is optional and only used for non-CUDA backends when available.
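
For illustration, a minimal sketch of what env-var detection and env:// bootstrapping along these lines could look like (bootstrap_from_env and the function bodies are illustrative, not the PR's actual code; _using_distributed_env mirrors the helper named above):

import os

import torch.distributed as dist


def _using_distributed_env() -> bool:
    """True when the standard torch.distributed launcher variables are present."""
    return all(var in os.environ for var in ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"))


def bootstrap_from_env() -> tuple[int, int, int]:
    """Read ranks and world size from the environment and initialise the process group."""
    global_rank = int(os.environ["RANK"])
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ["WORLD_SIZE"])
    if not dist.is_initialized():
        # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE itself
        dist.init_process_group(backend="nccl", init_method="env://")
    return global_rank, local_rank, world_size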

@sali1293 changed the title from "Update parallel.py" to "Add Azure ML compatibility to ParallelRunner" on Sep 23, 2025
@sali1293 changed the title from "Add Azure ML compatibility to ParallelRunner" to "feat: Add Azure ML compatibility to ParallelRunner" on Sep 23, 2025
Member

@gmertes left a comment


Thanks for this, this looks okay at first glance. A while ago the idea did come up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).

The setting of all these variables like global_rank, local_rank, etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.

Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.

@cathalobrien
Contributor

Nice work! I would be happy to work with you on this.

I think it would be good if we made a parallel runner base class with the following abstract methods:
_bootstrap_processes
_init_parallel

and then create local, SLURM, and AzureML subclasses which implement these methods.
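
For illustration, a rough sketch of that hierarchy; only the two abstract method names come from the suggestion above, while the class names and comments are hypothetical:

from abc import ABC, abstractmethod


class ParallelRunnerBase(ABC):
    @abstractmethod
    def _bootstrap_processes(self) -> None:
        """Determine ranks and world size for this process."""

    @abstractmethod
    def _init_parallel(self) -> None:
        """Initialise the torch.distributed backend."""


class SlurmParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # read SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS

    def _init_parallel(self) -> None:
        ...  # tcp:// init using the Slurm-provided master address/port


class AzureMLParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # rely on RANK / LOCAL_RANK / WORLD_SIZE set by the launcher

    def _init_parallel(self) -> None:
        ...  # env:// init from MASTER_ADDR / MASTER_PORT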

@gmertes
Member

gmertes commented Sep 24, 2025

Would delegation be easier to manage instead of inheritance? Then the cluster environment can simply be part of the constructor through a lookup table, something like:

ENVIRONMENTS = {
    'mpi': MpiEnv,
    'slurm': SlurmEnv,
}

class ParallelRunner:
    def __init__(self, env='slurm'):
        self.env = ENVIRONMENTS[env](self)  # pass self so env has access to runner attributes if needed
        self.env.bootstrap_processes()
        self.env.init_parallel()
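
To complete the picture, a hypothetical sketch of what one such delegated environment class could look like (SlurmEnv and its attributes are illustrative, not existing code, and it assumes MASTER_ADDR / MASTER_PORT are exported by the job script):

import os

import torch.distributed as dist


class SlurmEnv:
    def __init__(self, runner):
        self.runner = runner  # keep a handle on the runner for its config and attributes

    def bootstrap_processes(self):
        self.runner.global_rank = int(os.environ["SLURM_PROCID"])
        self.runner.local_rank = int(os.environ["SLURM_LOCALID"])
        self.runner.world_size = int(os.environ["SLURM_NTASKS"])

    def init_parallel(self):
        addr, port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]
        dist.init_process_group(
            backend="nccl",
            init_method=f"tcp://{addr}:{port}",
            rank=self.runner.global_rank,
            world_size=self.runner.world_size,
        )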

@sali1293
Author

Thanks for this, this looks okay at first glance. A while ago the idea did come up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).

The setting of all these variables like global_rank, local_rank, etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.

Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.

@gmertes I'm happy with the two-PR approach: merging this one first and then a second one with further changes / enhancements. Are you happy for me to publish the PR (it's currently in draft state)?

@gmertes
Member

gmertes commented Sep 25, 2025

Yes that sounds good to me!

@sali1293 marked this pull request as ready for review on September 25, 2025 at 11:11
@sali1293
Author

@cathalobrien can you please have a look, as @gmertes requested? Thanks

@cathalobrien
Contributor

I'm on leave, I'll have a look on Monday. Cheers

@sali1293
Author

sali1293 commented Oct 1, 2025

Hi @cathalobrien, I'm wondering if you would have time this week to have a look and possibly merge this. Thanks

@cathalobrien
Contributor

Thanks for reminding me, I will have a look today.

@sali1293
Author

sali1293 commented Oct 7, 2025

Hi @cathalobrien, hope you are alright. Just wondering, did you manage to have a look?

@cathalobrien
Contributor

Hi @cathalobrien, hope you are alright. Just wondering, did you manage to have a look?

Hey @sali1293 I started a review, can you see it? https://github.com/ecmwf/anemoi-inference/pull/329/files/c39939e899b72fc2fd9cc71c779f9837973c214c

@sali1293
Author

Hi @cathalobrien, I'm not able to see any comments or feedback; have you left any?

sali1293 and others added 3 commits October 13, 2025 16:14
LOG.warning(
    f"world size ({self.config.world_size}) set in the config is ignored because we are launching via srun, using 'SLURM_NTASKS' instead"
)
elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
Contributor


I am worried RANK and WORLD_SIZE are too generic, and we would come in here too much by mistake. Also, this block will fail if MASTER_ADDR/PORT are not set.

What about changing this line to:
elif "MASTER_ADDR" in os.environ and "WORLD_SIZE" in os.environ:

    f"world size ({self.config.world_size}) set in the config is ignored because we are launching via srun, using 'SLURM_NTASKS' instead"
)
elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    # New branch for Azure ML / general distributed env (e.g., env:// mode)
Contributor


Can you please remove the mention of Azure ML here.
You have added a way to bootstrap torch dist from env://, one of the use cases of which is Azure ML, but I would rather keep the code more generic and not mention Azure ML.

self.master_port = os.environ.get("MASTER_PORT")
if self.master_addr is None or self.master_port is None:
    raise ValueError(
        "MASTER_ADDR and MASTER_PORT must be set for distributed initialization (e.g., in Azure ML)"
Contributor


Can you also remove the mention of Azure ML here please.

    model_comm_group = dist.new_group(model_comm_group_ranks)
else:
    if self._using_distributed_env():
        init_method = "env://"  # Azure ML recommended
Contributor


Can you remove Azure ML here please.

    return global_rank, local_rank, world_size

def _using_distributed_env(self) -> bool:
    """Checks for distributed env vars like those in Azure ML."""
Contributor


And remove Azure ML here too please.

@cathalobrien
Contributor

Sorry for the delay @sali1293, I didn't hit the button to publish the comments.
