
Conversation

@pablogoitia

This PR addresses #2697. While implementing the wrappers, I realized that there are Slurm directives that cannot be mapped directly onto Flux ones, because the set of Flux batch directives is more limited. As a solution, I have been exploring the option of launching jobs on the HPC by manually generating Jobspecs (represented as YAML files), which are processed using the Flux Python API. Jobspecs give us finer-grained control over resources. So far, I have not found a more direct way to launch jobs from their specification.
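To make the idea concrete, here is a minimal sketch of how a YAML Jobspec could be submitted through the flux-core Python bindings (flux.Flux, Jobspec.from_yaml_stream and flux.job.submit); the file name is illustrative and this is not the PR's actual code:

# Minimal sketch, not the PR's implementation: submit a YAML Jobspec via the Flux Python API.
# "wrapper_jobspec.yaml" is an illustrative file name.
import flux
import flux.job
from flux.job import Jobspec

handle = flux.Flux()  # connect to the enclosing Flux instance

with open("wrapper_jobspec.yaml") as f:
    jobspec = Jobspec.from_yaml_stream(f.read())

jobid = flux.job.submit(handle, jobspec)
print(f"submitted job {jobid}")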

In this PR, I provide a prototype for solving this problem. The implementation is not yet fully functional: I only provide limited support for vertical wrappers. However, extending the implementation to other types of wrappers would be relatively straightforward.

The harder part, which would require more dedication, is the correct construction of the Jobspecs, ensuring that the scheduling parameters (e.g., processors, tasks, threads) are accurately mapped.

In the future, this method would also ease introducing equivalents to Slurm's hetjobs, for which Flux still has no direct alternative.

Note: this PR would overwrite a significant part of the implementation carried out in PR #2708.

Check List
Not applicable, as the branch will not be merged into master yet.

@pablogoitia pablogoitia self-assigned this Nov 26, 2025
@pablogoitia pablogoitia added enhancement New feature or request working on Someone is working on it labels Nov 26, 2025
@pablogoitia pablogoitia moved this from Todo to In Progress in Autosubmit project Nov 26, 2025
@manuel-g-castro
Contributor

Hi, I was trying your branch to look into the issue with the module load, and I found some weird things.

I have the following yaml for the jobs:

JOBS:
  SIM:
    DEPENDENCIES: SIM-1
    RUNNING: chunk
    PROCESSORS: 10
    WALLCLOCK: 00:10
    TASKS: 1
    SCRIPT: |
        echo $PATH
        which module
        module load impi

And when I executed it, I found that Autosubmit was making the following request:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          33336145       gpp a01a_AST bsc03237 PD       0:00     10 (None)

Maybe the PROCESSORS is being misinterpreted as the number of nodes.

@dbeltrankyl
Collaborator

dbeltrankyl commented Dec 5, 2025

Maybe the PROCESSORS is being misinterpreted as the number of nodes.

I didn't check how it is coded, but another possibility is that you don't have PLATFORMS.PLATFORM.PROCESSORS_PER_NODE defined, so instead of 112 for mn5 it is taking the value as 1?

@manuel-g-castro
Contributor

Hi @dbeltrankyl ! Thanks for dropping in!

I have the following for the platform:

PLATFORMS:
  MARENOSTRUM5:
    TYPE: slurm
    HOST: ...
    PROJECT: ...
    USER: ...
    QUEUE: gp_debug
    SCRATCH_DIR: /gpfs/scratch
    ADD_PROJECT_TO_HOST: false
    MAX_WALLCLOCK: 48:00
    PROCESSORS_PER_NODE: 112
    MAX_PROCESSORS: 112

@dbeltrankyl
Collaborator

Then it is ignoring both max_processors and processors_per_node, no?

as 10 x 112 is way more than 112 😅

@manuel-g-castro
Contributor

Hi @pablogoitia, regarding the "command not found" issue with module.

I have executed your branch and faced the same issue.

To test it, I created this job file to see what is loaded in the environment of the inner job. I am doing all of my tests on MareNostrum 5.

JOBS:
  SIM:
    DEPENDENCIES: SIM-1
    RUNNING: chunk
    PROCESSORS: 1
    WALLCLOCK: 00:10
    TASKS: 1
    SCRIPT: printenv

Only to find that everything was unset, with the exception of Flux-specific variables (I attach the output at the end of this comment as an appendix).

So that is why it is not able to find any system executable. But then I altered your submission script to do the same right before the execution of srun flux start... and found that all the environment variables seem to be properly set.

Then I noticed that the ASThread job was producing an error (this is not transferred back to the local machine, not sure why). And there I saw the following message:

Dec 05 12:36:41.085766 CET 2025 job-list.err[0]: parse_jobspec: job f9Fs12P invalid jobspec; level 0: Expected integer, got object

So my guess is that something is failing in the jobspec, so it is not being executed properly.
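For reference, this failure can be reproduced in a reduced form with the Python bindings, assuming Jobspec.from_yaml_stream accepts the RFC 14 object-valued count (the YAML below is trimmed and illustrative, not the real inner jobspec):

# Sketch reproducing the TypeError shown in the appendix; not the actual flux_runner.py.
from flux.job import Jobspec

yaml_spec = """
version: 1
resources:
- type: node
  count:
    min: 1            # object-valued count, as in the inner jobspec below
  with:
  - type: slot
    label: task
    count: 1
    with:
    - type: core
      count: 1
tasks:
- command: ["true"]
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 600
"""

jobspec = Jobspec.from_yaml_stream(yaml_spec)
# resource_counts() walks the resource tree multiplying integer counts, so the
# object-valued node count raises:
#   TypeError: unsupported operand type(s) for *: 'int' and 'dict'
print(jobspec.resource_counts())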

APPENDIX

ASTHREAD FULL ERROR

load MINICONDA/24.1.2 (PATH, LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH,
CPLUS_INCLUDE_PATH, PKG_CONFIG_PATH, MANPATH) 
flux-start: /home/bsc/bsc032371/venvs/flux/libexec/flux/cmd/flux-broker python flux_runner.py
Dec 05 12:36:41.085766 CET 2025 job-list.err[0]: parse_jobspec: job f9Fs12P invalid jobspec; level 0: Expected integer, got object
Traceback (most recent call last):
  File "/gpfs/scratch/bsc32/bsc032371/a01a/LOG_a01a/flux_runner.py", line 27, in <module>
    print("RESOURCE COUNTS :" + str(jobspec.resource_counts()))
                                    ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/bsc/bsc032371/venvs/flux/lib/python3.13/site-packages/flux/job/Jobspec.py", line 779, in resource_counts
    for _, resource, count in self.resource_walk():
                              ~~~~~~~~~~~~~~~~~~^^
  File "/home/bsc/bsc032371/venvs/flux/lib/python3.13/site-packages/flux/job/Jobspec.py", line 739, in walk_helper
    res_count = count * resource["count"]
                ~~~~~~^~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for *: 'int' and 'dict'
Dec 05 12:36:41.113821 CET 2025 broker.err[0]: rc2.0: python flux_runner.py Exited (rc=1) 0.2s
srun: error: gs02r2b16: task 0: Exited with exit code 1
srun: Terminating StepId=33338054.0

INNER JOBSPEC

###############################################################################
#                   SIM a01a EXPERIMENT
###############################################################################

resources:
- type: node
  count:
    min: 1
  exclusive: false
  with:
  - type: slot
    label: task
    count: 1
    with:
    - type: core
      count: 1
tasks:
- command:
  - '{{tmpdir}}/script'
  slot: task
  count:
    per_slot: 1
attributes:
  system:
    duration: 600
    cwd: /gpfs/scratch/bsc32/bsc032371/a01a/LOG_a01a
    job:
      name: a01a_20000101_fc0_1_SIM
    shell:
      options:
        output:
          stdout:
            type: file
            path: /gpfs/scratch/bsc32/bsc032371/a01a/LOG_a01a/a01a_20000101_fc0_1_SIM.cmd.out.0
          stderr:
            type: file
            path: /gpfs/scratch/bsc32/bsc032371/a01a/LOG_a01a/a01a_20000101_fc0_1_SIM.cmd.err.0
    files:
      script:
        mode: 33216
        data: |+
          #!/bin/bash

          ###############################################################################
          # The following lines contain the script. [SIM a01a EXPERIMENT]
          ###############################################################################

          ###################
          # Autosubmit header
          ###################
          locale_to_set=$(locale -a | grep ^C.)
          if [ -z "$locale_to_set" ] ; then
              # locale installed...
              export LC_ALL=$locale_to_set
          else
              # locale not installed...
              locale_to_set=$(locale -a | grep ^en_GB.utf8)
              if [ -z "$locale_to_set" ] ; then
                  export LC_ALL=$locale_to_set
              else
                  export LC_ALL=C
              fi 
          fi

          set -xuve
          job_name_ptrn='/gpfs/scratch/bsc32/bsc032371/a01a/LOG_a01a/a01a_20000101_fc0_1_SIM'
          echo $(date +%s) > ${job_name_ptrn}_STAT_0

          ################### 
          # AS CHECKPOINT FUNCTION
          ###################
          # Creates a new checkpoint file upon call based on the current numbers of calls to the function

          AS_CHECKPOINT_CALLS=0
          function as_checkpoint {
              AS_CHECKPOINT_CALLS=$((AS_CHECKPOINT_CALLS+1))
              touch ${job_name_ptrn}_CHECKPOINT_${AS_CHECKPOINT_CALLS}
          }
          

          ###################
          # Autosubmit job
          ###################

          r=0
          set +e
          bash -e <<__AS_CMD__
          set -xuve
          printenv


          __AS_CMD__

          r=$?

          # Write the finish time in the job _STAT_
          echo $(date +%s) >> ${job_name_ptrn}_STAT_0

          # If the user-provided script failed, we exit here with the same exit code;
          # otherwise, we let the execution of the tailer happen, where the _COMPLETED
          # file will be created.
          if [ $r -ne 0 ]; then
              exit $r
          fi
          ###################
          # Autosubmit tailer
          ###################
          set -xuve
          touch ${job_name_ptrn}_COMPLETED
          exit 0

        encoding: utf-8
version: 1

INNER JOB PRINTENV

PMI_SIZE=1
PMI_FD=21
PWD=/gpfs/scratch/bsc32/bsc032371/a01a/LOG_a01a
FLUX_TASK_RANK=0
OMPI_MCA_btl_vader_backing_directory=/scratch/tmp/33338054/flux-owJ029/jobtmp-0-f9Fs12P
FLUX_KVS_NAMESPACE=job-5419040768
CUDA_DEVICE_ORDER=PCI_BUS_ID
FLUX_TERMINUS_SESSION=0
FLUX_JOB_NNODES=1
FLUX_JOB_SIZE=1
CUDA_VISIBLE_DEVICES=-1
FLUX_JOB_TMPDIR=/scratch/tmp/33338054/flux-owJ029/jobtmp-0-f9Fs12P
SHLVL=2
FLUX_PMI_LIBRARY_PATH=/home/bsc/bsc032371/venvs/flux/lib/flux/libpmi.so
FLUX_URI=local:///scratch/tmp/33338054/flux-owJ029/local-0
PMI_RANK=0
LD_LIBRARY_PATH=/home/bsc/bsc032371/venvs/flux/lib/flux
FLUX_JOB_ID_PATH=/f9Fs12P
FLUX_JOB_ID=f9Fs12P
LC_ALL=C
FLUX_TASK_LOCAL_ID=0
_=/usr/bin/printenv

@pablogoitia
Author

Maybe the PROCESSORS is being misinterpreted as the number of nodes.

Hi @manuel-g-castro! Fortunately, I can say that this is the expected behavior. For that task, you are requesting 10 PROCESSORS and 1 TASK. This is translated in the ASTHREAD script header into 10 tasks and 1 task per node, so the result is a request for a total of 10 nodes.
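To make the arithmetic explicit, a tiny illustration (variable names are mine, not Autosubmit internals):

# Illustrative arithmetic only; not Autosubmit's actual code.
processors = 10      # JOBS.SIM.PROCESSORS -> total tasks requested
tasks_per_node = 1   # JOBS.SIM.TASKS -> tasks placed on each node

nodes = processors // tasks_per_node
print(nodes)  # 10, matching the NODES column in the squeue output above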

@dbeltrankyl left a really good explanation of how the job resource parameters work in this issue.

@pablogoitia
Author

So my guess is that something is failing in the jobspec, so it is not being executed properly.

Hi @manuel-g-castro. Thank you so much for reporting this. I have not tested this specific case remotely yet, but it is a special one because it covers requests where no node count is specified but tasks per node are, for example (remember we talked about it yesterday in the meeting). In this case, what I do is request a minimum of one node using the min key in the node resource. I will check whether this is an error in the Jobspec or a matter of compatibility with Jobspec V1. I would initially say it is the former, because I have observed a discrepancy between my specification and an example in the docs. In that case, I will tell you when I upload the fix to the branch.

Meanwhile, any other job specification that includes the node count should work properly, including those where neither nodes nor tasks per node are provided.

@pablogoitia
Author

pablogoitia commented Dec 5, 2025

Hi again, @manuel-g-castro. After some testing, I have concluded that the job specification is right. There are some examples in RFC 14; specifically, Use Case 1.6 shows an example of a min count of nodes.

However, something is leading Flux to fail, and it does not matter whether Jobspec V1 or the general Jobspec is used, because it fails either way.

I will search for a way to handle this specific case so that I can avoid using the min key, and I will let you know the solution.

Thanks again for reporting the bug. If you want to keep testing, remember that I expect any other case to work. For now, do not test cases where tasks per node are specified without also specifying the node count.
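As a possible direction (an assumption on my side, not a committed fix), the wrapper could emit a plain integer node count whenever tasks per node are given without a node count, so the resource section stays within what the Python bindings' resource walk handles:

resources:
- type: node
  count: 1            # plain integer instead of "count: {min: 1}"
  exclusive: false
  with:
  - type: slot
    label: task
    count: 1
    with:
    - type: core
      count: 1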
