Integration of Flux as a wrapper engine (YAML Jobspecs method) #2720
base: flux-wrapper-engine
Conversation
Hi, I was trying your branch to see the issue with the module load, and I found some weird things. I have the following YAML for the jobs: And when I executed it, I found that Autosubmit was making the following request: Maybe PROCESSORS is being misinterpreted as the number of nodes.
I didn't check how it is coded, but another possibility is that you don't have PLATFORMS.PLATFORM.PROCESSORS_PER_NODE defined, so instead of 112 for MN5, it is taking the value as 1?
Hi @dbeltrankyl! Thanks for dropping in! I have the following for the platform:
Then it is ignoring both max_processors and processors_per_node, no? As 10 x 112 is way more than 112 😅
Hi @pablogoitia, regarding the "command not found: module" error: I have executed your branch and faced the same issue. To test it, I created this job file to see what is being loaded in the environment of the inner job (I am doing all of my tests on MareNostrum 5): only to find that everything was unset, with the exception of Flux-specific variables (I attach the output at the end of this comment as an appendix). That is why it is not able to find any executable of the system. But then I altered your submission script to do the same right before the execution of the Then I noticed that the ASThread job was producing an error (this is not transferred back to local, not sure why). And there I saw the following message: So my guess is that something is failing in the jobspec, so it is not properly being executed.
APPENDIX
ASTHREAD FULL ERROR
INNER JOBSPEC
INNER JOB PRINTENV
Hi @manuel-g-castro! Fortunately, I can say that this is the expected behavior. For that task, you are requesting 10 PROCESSORS and 1 TASK. This is translated in the ASTHREAD script header as 10 tasks and 1 task per node, so the result is a request for a total of 10 nodes. @dbeltrankyl left a really good explanation of how the job resource parameters work in this issue.
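The arithmetic above can be sketched as follows. This is an illustrative reconstruction of the header translation described in this thread, not Autosubmit's actual code; the function name is hypothetical:

```python
import math

def requested_nodes(processors: int, tasks_per_node: int) -> int:
    """Illustrative sketch: if PROCESSORS becomes the total task count
    and TASKS becomes tasks-per-node, the scheduler must allocate enough
    nodes to hold all tasks at that density."""
    return math.ceil(processors / tasks_per_node)

# PROCESSORS: 10, TASKS: 1  ->  10 tasks at 1 task per node  ->  10 nodes
print(requested_nodes(10, 1))    # -> 10

# If PROCESSORS_PER_NODE (112 on MN5) were honoured instead, the same
# 10 tasks would fit on a single node:
print(requested_nodes(10, 112))  # -> 1
```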
Hi @manuel-g-castro. Thank you so much for reporting this. I have not tested this specific case remotely yet, but it is a special one because it covers requests where no node count is specified but tasks per node are, for example (remember we talked about it yesterday in the meeting). In this case, what I do is request a minimum of one node using the Meanwhile, any other job specification that includes the node count should work properly, including those where neither nodes nor tasks per node are provided.
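For reference, the "minimum of one node" request could look roughly like the resource fragment below. This is an illustrative sketch modelled on the range syntax of Flux RFC 14 (whose Use Case 1.6 is cited later in this thread), not the jobspec actually generated by this branch; command and core counts are made up, and the strict V1 dialect may restrict `count` to a plain integer:

```yaml
# Illustrative only: request at least one node.
resources:
  - type: node
    count:
      min: 1            # RFC 14 range form; may not be accepted by strict V1
    with:
      - type: slot
        label: task
        count: 1
        with:
          - type: core
            count: 10
tasks:
  - command: ["./inner_job.sh"]
    slot: task
    count:
      per_slot: 1
```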
Hi again, @manuel-g-castro. After some testing, I have concluded that the job specification is right. There are some examples in RFC 14; specifically, Use Case 1.6 shows an example of a minimum count of nodes. However, something is leading Flux to fail, and it does not matter whether the Jobspec V1 or the general Jobspec is used, because it fails either way. I will look for a way to handle this specific case so that I can avoid the usage of the Thanks again for reporting the bug. If you want to keep testing, remember that I expect any other case to work. For now, do not test cases where tasks per node are specified without also specifying the node count.
This PR addresses #2697. While implementing the wrappers, I realized that there are Slurm directives that cannot be mapped directly onto Flux ones, because the set of Flux batch directives is more limited. As a solution to this problem, I have been exploring the option of launching jobs on the HPC by manually generating Jobspecs (represented in YAML files), which are processed using the Flux Python API. Jobspecs allow us to control resources in a more detailed and advanced way. So far, I have not found a more direct way to launch jobs using their specification.
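As a rough illustration of the idea (generating the jobspec programmatically instead of relying on batch directives), a V1-style jobspec can be built as a plain Python structure and later serialized to YAML for the Flux Python API. This is a hedged sketch: the helper name and field values are mine, not this PR's code, although the overall document shape follows the Flux jobspec format:

```python
def make_jobspec(command, nodes=1, ntasks=1, cores_per_task=1):
    """Build a minimal V1-style jobspec as a dict (illustrative only)."""
    return {
        "version": 1,
        "resources": [
            {
                "type": "node",
                "count": nodes,
                "with": [
                    {
                        "type": "slot",
                        "label": "task",
                        "count": ntasks,
                        "with": [{"type": "core", "count": cores_per_task}],
                    }
                ],
            }
        ],
        "tasks": [
            {"command": command, "slot": "task", "count": {"per_slot": 1}}
        ],
        "attributes": {"system": {"duration": 0}},  # 0 = no time limit
    }

spec = make_jobspec(["./wrapped_job.sh"], nodes=1, cores_per_task=10)
```

The resulting structure would then be dumped to a YAML file and handed to the Flux Python bindings for submission; the exact submission call this branch uses is in the PR diff.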
In this PR, I provide a prototype for solving this problem. The implementation is not yet fully functional: I only provide limited support for vertical wrappers. However, extending the implementation to other types of wrappers should be relatively straightforward.
The part that is not easy and would require more dedication is the correct construction of the Jobspecs, ensuring that the scheduling parameters (e.g., processors, tasks, threads...) are accurately mapped.
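To make the mapping problem concrete, here is a partial, illustrative correspondence between common Slurm directives and Flux submission options. It is based on the general flux run/submit option set and should be verified against `flux submit --help` for a given Flux version; the `None` entry is a hypothetical example of a directive with no direct counterpart, which is exactly what motivates the jobspec route:

```python
# Partial, illustrative mapping of Slurm directives to Flux submission
# options; None marks a directive assumed to have no direct equivalent.
SLURM_TO_FLUX = {
    "--nodes":         "--nodes",
    "--ntasks":        "--ntasks",
    "--cpus-per-task": "--cores-per-task",
    "--gpus-per-task": "--gpus-per-task",
    "--time":          "--time-limit",
    "--reservation":   None,  # assumption: no direct Flux option
}

unmapped = [s for s, f in SLURM_TO_FLUX.items() if f is None]
```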
This method would also ease the future introduction of equivalents to Slurm's hetjobs (which still have no direct alternative in Flux).
Note: this PR would overwrite a significant part of the implementation carried out in PR #2708.
Check List
Does not apply, as the branch will not be merged into master yet.