Enhance wlm-operator #70

@zjgemi

There are several to-dos for wlm-operator, including but not limited to:

  • Develop a robust agent for forwarding the red-box socket, one that can retry after network interruptions.
  • Make the configurator more robust to interruptions in socket forwarding.
  • Wlm-operator is able to get the logs of Slurm jobs, whereas Argo's resource template only outputs something like
time="2022-07-13T02:39:55.042Z" level=info msg="Get slurmjobs 200"
time="2022-07-13T02:39:55.043Z" level=info msg="failure condition '{status.status == [Failed]}' evaluated false"
time="2022-07-13T02:39:55.043Z" level=info msg="success condition '{status.status == [Succeeded]}' evaluated false"
time="2022-07-13T02:39:55.044Z" level=info msg="0/1 success conditions matched"
time="2022-07-13T02:39:55.045Z" level=info msg="Waiting for resource slurmjob.wlm.sylabs.io/wlm-rhhbc-hello-dphos-hello-slurm-run-4203105651 in namespace argo resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."

Wlm-operator could provide log persistence on the local side.

  • To avoid modifying Argo, dflow uses three steps to complete a wlm template: a prepare step, a run step, and a collect step. The prepare step copies the input artifacts from the container to a host path. The run step mounts the host directory and applies the wlm resource, which uploads the input files to the remote cluster, submits a Slurm job, and finally downloads the output files to a mounted host directory. The collect step copies the output artifacts from the host back to the container for Argo to collect. Can this procedure be simplified?
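
For the first item above (an agent that retries after network interruptions), one possible shape is a retry loop with capped exponential backoff. This is only a minimal sketch, not wlm-operator code: `run_with_retry` and the `connect_and_forward` callable are hypothetical names, and it assumes the forwarding routine raises `OSError` (or a subclass such as `ConnectionResetError`) when the connection drops and returns normally on a clean shutdown.

```python
import random
import time
from typing import Callable, Optional


def run_with_retry(
    connect_and_forward: Callable[[], None],  # hypothetical forwarding routine
    max_attempts: Optional[int] = None,       # None means retry indefinitely
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run connect_and_forward, retrying on OSError with exponential backoff.

    Returns the number of attempts made once the callable returns normally.
    Re-raises the last error if max_attempts is reached.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect_and_forward()
            return attempt
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Delay doubles on each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so several agents do not reconnect in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

With the defaults the agent would keep retrying indefinitely, waiting roughly 1 s, 2 s, 4 s and so on up to about a minute between attempts. A real agent would probably also reset the backoff after a connection has stayed up for some time.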
