[220] [DRAFT] Init of Parallelization on LSF scheduler #913

simone99n · 2025-09-16T09:13:14Z

Description

The main idea is to have the Python code read a set of candidate environment variables that define the rank and world size of each process. Each HPC job scheduler exposes different variables: SLURM exposes SLURM_PROCID and SLURM_NTASKS, whereas LSF jobs launched with mpirun expose PMI_RANK and PMI_SIZE. Instead of hardcoding scheduler-specific names, the code iterates through a list of known variable names and uses the first ones it finds.
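The lookup described above could be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name `detect_rank_and_size`, the `RANK_SIZE_CANDIDATES` list, and the single-process fallback of `(0, 1)` are assumptions made for the example. Only the four environment variable names come from the description.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Hypothetical candidate list of (rank variable, size variable) pairs.
# Only these variable names are taken from the PR description.
RANK_SIZE_CANDIDATES = [
    ("SLURM_PROCID", "SLURM_NTASKS"),  # SLURM
    ("PMI_RANK", "PMI_SIZE"),  # LSF jobs launched with mpirun
]


def detect_rank_and_size(env: Mapping[str, str] | None = None) -> tuple[int, int]:
    """Return (rank, size) from the first candidate pair fully present in env."""
    if env is None:
        env = os.environ
    for rank_var, size_var in RANK_SIZE_CANDIDATES:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    # Assumption: with no known scheduler variables, treat it as a single process.
    return 0, 1


if __name__ == "__main__":
    rank, size = detect_rank_and_size()
    print(f"rank={rank} size={size}")
```

Passing the environment as an argument keeps the lookup easy to test without a scheduler, and supporting another launcher would only require adding a pair to the list.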

This code works on Cassandra.
It also works on Juwels-Booster, but currently requires some small changes (e.g. exporting "CUDA_VISIBLE_DEVICES" in the private bash launch file). The final goal is to keep the existing private code unchanged.

This PR is not intended for merging; it was opened to track progress on the LSF issue.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Issue Number

Tracks the work on #220

Code Compatibility

I have performed a self-review of my code

Code Performance and Testing

I ran the uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

I have ensured that the code is still pip-installable after the changes and runs
I have tested that new dependencies themselves are pip-installable.
I have not introduced new dependencies in the inference portion of the pipeline

Documentation

My code follows the style guidelines of this project
I have updated the documentation and docstrings to reflect the changes
I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

simone99n · 2025-09-16T09:13:42Z

@tjhunter

Simone Norberti added 4 commits September 16, 2025 10:39

Added LSF support

595c7b5

[220] LSF support

38723a0

[220] LSF support

8b4c79d

[220] minor change

ec314b4

tjhunter marked this pull request as draft September 16, 2025 09:57

tjhunter added this to WeatherGen-dev Sep 17, 2025

simone99n added 2 commits September 23, 2025 10:26

Merge branch 'ecmwf:develop' into simone99n/dev/220_lsf

409a15b

Merge branch 'ecmwf:develop' into simone99n/dev/220_lsf

37df869
