forked from dptech-corp/dflow
Open
Description
There are some TODOs for wlm-operator, including but not limited to:
- Develop a robust agent for forwarding the red-box socket. It should retry when the network is interrupted.
- Make the configurator more robust to interruptions in the socket forwarding.
- Wlm-operator can retrieve the logs of Slurm jobs, whereas Argo's resource template only outputs something like:
time="2022-07-13T02:39:55.042Z" level=info msg="Get slurmjobs 200"
time="2022-07-13T02:39:55.043Z" level=info msg="failure condition '{status.status == [Failed]}' evaluated false"
time="2022-07-13T02:39:55.043Z" level=info msg="success condition '{status.status == [Succeeded]}' evaluated false"
time="2022-07-13T02:39:55.044Z" level=info msg="0/1 success conditions matched"
time="2022-07-13T02:39:55.045Z" level=info msg="Waiting for resource slurmjob.wlm.sylabs.io/wlm-rhhbc-hello-dphos-hello-slurm-run-4203105651 in namespace argo resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."
Wlm-operator could persist these job logs on the local side.
- To avoid modifying Argo, dflow uses three steps to complete a wlm template: a prepare step, a run step, and a collect step. The prepare step copies the input artifacts from the container to a host path. The run step mounts the host directory and applies the wlm resource, which uploads the input files to the remote cluster, submits a Slurm job, and finally downloads the output files to a mounted host directory. The collect step copies the output artifacts from the host back to the container so that Argo can collect them. Can this procedure be simplified?
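
To make the first item more concrete, here is a minimal sketch of the retry behaviour, assuming the forwarding itself is wrapped in a callable. `forward_with_retry` and `forward_once` are hypothetical names used only for illustration; they are not part of wlm-operator or dflow, and the backoff values are arbitrary.

```python
import time
from typing import Callable


def forward_with_retry(
    forward_once: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call forward_once until it returns, retrying on network errors.

    forward_once is a hypothetical callable that forwards the socket and
    raises an OSError subclass (ConnectionError, TimeoutError, ...) when the
    link drops. Returns the number of attempts used; re-raises the last
    error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            forward_once()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be checked without actually waiting. A long-running agent would probably loop indefinitely rather than stop after a fixed number of attempts.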
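
For readers unfamiliar with the three-step procedure in the last item, the data flow can be imitated with plain directories. This is only an illustration under stated assumptions: it does not use the dflow or Argo APIs, the directory names are made up, and `fake_slurm_job` is a hypothetical stand-in for the wlm resource (upload, job submission, download).

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def prepare(container_inputs: Path, host_dir: Path) -> None:
    """Prepare step: copy input artifacts from the container to the host path."""
    shutil.copytree(container_inputs, host_dir / "inputs", dirs_exist_ok=True)


def run(host_dir: Path, submit_job: Callable[[Path, Path], None]) -> None:
    """Run step: pass the mounted host directory to the job runner."""
    outputs = host_dir / "outputs"
    outputs.mkdir(parents=True, exist_ok=True)
    submit_job(host_dir / "inputs", outputs)


def collect(host_dir: Path, container_outputs: Path) -> None:
    """Collect step: copy output artifacts from the host back to the container."""
    shutil.copytree(host_dir / "outputs", container_outputs, dirs_exist_ok=True)


def fake_slurm_job(inputs: Path, outputs: Path) -> None:
    """Hypothetical stand-in for the remote job: upper-case every input file."""
    for src in inputs.iterdir():
        (outputs / src.name).write_text(src.read_text().upper())


def demo() -> str:
    """Run all three steps on temporary directories and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        container_in = root / "container_in"
        container_in.mkdir()
        (container_in / "hello.txt").write_text("hello slurm")
        host = root / "host"
        host.mkdir()
        container_out = root / "container_out"

        prepare(container_in, host)
        run(host, fake_slurm_job)
        collect(host, container_out)
        return (container_out / "hello.txt").read_text()
```

In this sketch the prepare and collect steps are pure copies between the container and the host path, which is the part a simplification would most likely target.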