[220] [DRAFT] Init of Parallelization on LSF scheduler #913
Description
The main idea is to have the Python code read a set of candidate environment variables that define the rank and size of each process. Each HPC job scheduler exposes different environment variables: SLURM exposes SLURM_PROCID and SLURM_NTASKS, while LSF jobs launched with mpirun expose PMI_RANK and PMI_SIZE. Rather than hardcoding scheduler-specific names, the model iterates through a list of known variable names.
This code works for Cassandra.
This code also works for Juwels-Booster, but requires some small changes (e.g. exporting "CUDA_VISIBLE_DEVICES" in the private bash launch file). The final goal is to keep the existing private code unchanged.
This PR was opened to track progress on the LSF issue, not to be merged.
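The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (first_int_env, get_rank_and_size), the lookup order, and the single-process fallback of rank 0 and size 1 are all assumptions. Only the four variable names come from the description.

```python
import os

# Candidate variable names, tried in order. The names come from the PR
# description; the order (SLURM first, then PMI) is an assumption.
RANK_VARS = ("SLURM_PROCID", "PMI_RANK")
SIZE_VARS = ("SLURM_NTASKS", "PMI_SIZE")


def first_int_env(names, default, env=None):
    """Return the first variable in `names` that is set, parsed as an int.

    Hypothetical helper: falls back to `default` when none is set.
    """
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name)
        if value is not None and value.strip() != "":
            return int(value)
    return default


def get_rank_and_size(env=None):
    """Resolve (rank, size) without hardcoding a single scheduler.

    Assumed fallback: rank 0 and size 1, i.e. a single-process run.
    """
    rank = first_int_env(RANK_VARS, 0, env)
    size = first_int_env(SIZE_VARS, 1, env)
    return rank, size


if __name__ == "__main__":
    rank, size = get_rank_and_size()
    print(f"rank={rank} size={size}")
```

Supporting another scheduler would then mean appending its variable names to the two tuples, leaving the lookup logic untouched.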
Type of Change
Issue Number
Tracks the work on #220
Code Compatibility
Code Performance and Testing
uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
$WEATHER_GENERATOR_PRIVATE directory
Dependencies
Documentation
Additional Notes