
Conversation

sali1293

Add Azure ML compatibility to ParallelRunner via distributed env var support (RANK/WORLD_SIZE) and env:// init method

Description

This PR introduces compatibility for Azure Machine Learning Studio in the ParallelRunner by adding support for distributed environment variables (e.g., RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) and using the 'env://' initialization method for torch.distributed when applicable.

What problem does this change solve?

It enables seamless parallel inference in cloud-based distributed environments like Azure ML, where Slurm (srun) is not used, by detecting and utilizing standard PyTorch distributed env vars instead of relying solely on Slurm or manual process spawning.

What issue or task does this change relate to?

N/A (This is an enhancement based on modifications for Azure ML compatibility; no specific GitHub issue linked.)

Additional notes

  • Changes are minimal and isolated to the _bootstrap_processes and _init_parallel methods in ParallelRunnerMixin to avoid disrupting existing Slurm or manual spawning workflows.

  • Added helper methods _using_distributed_env and _is_mpi_env for cleaner logic.

  • No breaking changes; falls back gracefully to existing behaviors.

  • Tested in a multi-GPU Azure ML environment; no updates to dependencies required.

  • MPI detection is optional and only used for non-CUDA backends when available.
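
For illustration, a minimal sketch of what env-var detection and env:// bootstrapping along these lines could look like (bootstrap_from_env and the function bodies are illustrative, not the PR's actual code; _using_distributed_env mirrors the helper named above):

import os

import torch.distributed as dist


def _using_distributed_env() -> bool:
    """True when the standard torch.distributed launcher variables are present."""
    return all(var in os.environ for var in ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"))


def bootstrap_from_env() -> tuple[int, int, int]:
    """Read ranks and world size from the environment and initialise the process group."""
    global_rank = int(os.environ["RANK"])
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ["WORLD_SIZE"])
    if not dist.is_initialized():
        # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE itself
        dist.init_process_group(backend="nccl", init_method="env://")
    return global_rank, local_rank, world_size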

@sali1293 changed the title from "Update parallel.py" to "Add Azure ML compatibility to ParallelRunner" on Sep 23, 2025
@sali1293 changed the title from "Add Azure ML compatibility to ParallelRunner" to "feat: Add Azure ML compatibility to ParallelRunner" on Sep 23, 2025
Member

@gmertes left a comment


Thanks for this, this looks okay at first glance. A while ago the idea did come up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).

The setting of all these variables like global_rank, local_rank, etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.

Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.

@cathalobrien
Contributor

Nice work! I would be happy to work with you on this.

I think it would be good if we made a parallel runner base class with the following abstract methods:
_bootstrap_processes
_init_parallel

and then create local, SLURM, and AzureML subclasses which implement these methods.
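
For illustration, a rough sketch of that hierarchy; only the two abstract method names come from the suggestion above, while the class names and comments are hypothetical:

from abc import ABC, abstractmethod


class ParallelRunnerBase(ABC):
    @abstractmethod
    def _bootstrap_processes(self) -> None:
        """Determine ranks and world size for this process."""

    @abstractmethod
    def _init_parallel(self) -> None:
        """Initialise the torch.distributed backend."""


class SlurmParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # read SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS

    def _init_parallel(self) -> None:
        ...  # tcp:// init using the Slurm-provided master address/port


class AzureMLParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # rely on RANK / LOCAL_RANK / WORLD_SIZE set by the launcher

    def _init_parallel(self) -> None:
        ...  # env:// init from MASTER_ADDR / MASTER_PORT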

@gmertes
Member

gmertes commented Sep 24, 2025

Would delegation be easier to manage instead of inheritance? Then the cluster environment can simply be part of the constructor through a lookup table, something like:

ENVIRONMENTS = {
    'mpi': MpiEnv,
    'slurm': SlurmEnv,
}

class ParallelRunner:
    def __init__(self, env='slurm'):
        self.env = ENVIRONMENTS[env](self)  # pass self so env has access to runner attributes if needed
        self.env.bootstrap_processes()
        self.env.init_parallel()
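
To complete the picture, a hypothetical sketch of what one such delegated environment class could look like (SlurmEnv and its attributes are illustrative, not existing code, and it assumes MASTER_ADDR / MASTER_PORT are exported by the job script):

import os

import torch.distributed as dist


class SlurmEnv:
    def __init__(self, runner):
        self.runner = runner  # keep a handle on the runner for its config and attributes

    def bootstrap_processes(self):
        self.runner.global_rank = int(os.environ["SLURM_PROCID"])
        self.runner.local_rank = int(os.environ["SLURM_LOCALID"])
        self.runner.world_size = int(os.environ["SLURM_NTASKS"])

    def init_parallel(self):
        addr, port = os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]
        dist.init_process_group(
            backend="nccl",
            init_method=f"tcp://{addr}:{port}",
            rank=self.runner.global_rank,
            world_size=self.runner.world_size,
        )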

@sali1293
Author

Thanks for this, this looks okay at first glance. A while ago the idea did come up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).

The setting of all these variables like global_rank, local_rank, etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.

Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.

@gmertes I'm happy with the two-PR approach: merging this one first and then a second one with further changes / enhancements. Are you happy for me to publish the PR (it's currently in draft state)?

@gmertes
Member

gmertes commented Sep 25, 2025

Yes that sounds good to me!

@sali1293 marked this pull request as ready for review on September 25, 2025 at 11:11
@sali1293
Author

@cathalobrien can you please have a look, as @gmertes requested? Thanks

@cathalobrien
Contributor

I'm on leave, I'll have a look on Monday. Cheers

@sali1293
Author

sali1293 commented Oct 1, 2025

Hi @cathalobrien, I'm wondering if you would have time this week to have a look and possibly merge this. Thanks

@cathalobrien
Contributor

Thanks for reminding me, I will have a look today.

@sali1293
Author

sali1293 commented Oct 7, 2025

Hi @cathalobrien, hope you are alright. Just wondering, did you manage to have a look?

@cathalobrien
Contributor

Hi @cathalobrien, hope you are alright. Just wondering, did you manage to have a look?

Hey @sali1293 I started a review, can you see it? https://github.com/ecmwf/anemoi-inference/pull/329/files/c39939e899b72fc2fd9cc71c779f9837973c214c

@sali1293
Author

Hi @cathalobrien, I'm not able to see any comments or feedback; have you left any?

sali1293 and others added 3 commits October 13, 2025 16:14
LOG.warning(
    f"world size ({self.config.world_size}) set in the config is ignored because we are launching via srun, using 'SLURM_NTASKS' instead"
)
elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
Contributor


I am worried RANK and WORLD_SIZE are too generic, and we would come in here too much by mistake. Also, this block will fail if MASTER_ADDR/PORT are not set.

What about changing this line to:
elif "MASTER_ADDR" in os.environ and "WORLD_SIZE" in os.environ:

    f"world size ({self.config.world_size}) set in the config is ignored because we are launching via srun, using 'SLURM_NTASKS' instead"
)
elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    # New branch for Azure ML / general distributed env (e.g., env:// mode)
Contributor


Can you please remove the mention of Azure ML here.
You have added a way to bootstrap torch dist from env://, one of the use cases of which is Azure ML, but I would rather keep the code more generic and not mention Azure ML.

self.master_port = os.environ.get("MASTER_PORT")
if self.master_addr is None or self.master_port is None:
    raise ValueError(
        "MASTER_ADDR and MASTER_PORT must be set for distributed initialization (e.g., in Azure ML)"
Contributor


Can you also remove the mention of Azure ML here please.

    model_comm_group = dist.new_group(model_comm_group_ranks)
else:
    if self._using_distributed_env():
        init_method = "env://"  # Azure ML recommended
Contributor


Can you remove Azure ML here please.

    return global_rank, local_rank, world_size

def _using_distributed_env(self) -> bool:
    """Checks for distributed env vars like those in Azure ML."""
Contributor


And remove Azure ML here too please.

@cathalobrien
Contributor

Sorry for the delay @sali1293, I didn't hit the button to publish the comments.
