Skip to content

Conversation

simone99n
Copy link

Description

The main idea is to have the Python code read a set of possible environment variables that define the rank and size of each process. Since each HPC job scheduler exposes different environment variables (SLURM exposes SLURM_PROCID/SLURM_NTASKS, but LSF that use mpirun exposes PMI_RANK and PMI_SIZE), this approach avoids hardcoding scheduler-specific names. Instead, the model will iterate through a list of known variable names.

This code works for Cassandra.
This code works for Juwels-Booster, but require some small changes (e.g. the export of "CUDA_VISIBLE_DEVICES" in the private bash launch file), but the final goal is to keep the existing private code unchanged."

This PR was opened not for merging, but to track the progress on the LSF issue.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Track the work of #220

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran the uv run train and (if necessary) uv run evaluate on a least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

@simone99n
Copy link
Author

@tjhunter

@tjhunter tjhunter marked this pull request as draft September 16, 2025 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

1 participant