@fishingfly fishingfly commented Feb 13, 2025

What is the problem you're trying to solve

  • A HyperNode plays a role similar to a switch or ToR (top-of-rack) switch, so switch vendors need to update both the Spec and Status fields on the HyperNode. Currently, the Status field contains only Conditions and NodeCount.
  • When an RDMA network card issue occurs on a node connected to the leaf switch or ToR, such as a high BER (bit error rate) or link flapping (as discussed in the paper "RDMA over Ethernet for Distributed Training at Meta Scale"), vendors can only report the network card status through the Conditions field.
  • The Conditions field reflects the overall status of the HyperNode, which may be connected to multiple nodes, and nothing currently indicates the health of individual nodes. Without that granularity, the scheduler cannot accurately identify and handle unhealthy nodes.
  • Because a HyperNode essentially functions as a switch or ToR, standard switch condition types should also be introduced for vendors to use in HyperNodeStatus.Conditions.

Describe the solution you'd like

  • Introduce a new field under Status, such as UnhealthyNodeNames, to explicitly list nodes that are currently unschedulable under the given HyperNode.
  • Define two common switch condition types to standardize switch health reporting:
    • HyperNodeSystemFailure: Indicates a system-level issue on the switch or ToR, such as CPU or memory overload, power failure, fan malfunction, or other critical system faults.
    • HyperNodeNetworkUnavailable: Indicates a network-related issue on the switch or ToR, such as abnormal link status, interface failures, or other network disruptions.
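
A minimal Go sketch of how the proposed status shape might look. This is an illustration, not the actual Volcano API definition: the `Condition` struct is a simplified stand-in for the Kubernetes condition type, the `HasCondition` helper is hypothetical, and the constant string values are an assumption taken from the example manifest in this proposal.

```go
package main

import "fmt"

// HyperNodeConditionType names a condition reported on a HyperNode.
type HyperNodeConditionType string

const (
	// Constant names follow the proposal text. The string values are an
	// assumption copied from the example manifest in this proposal.
	HyperNodeSystemFailure      HyperNodeConditionType = "SystemFailure"
	HyperNodeNetworkUnavailable HyperNodeConditionType = "NetworkUnavailable"
)

// Condition is a simplified, hypothetical stand-in for a Kubernetes condition.
type Condition struct {
	Type    HyperNodeConditionType `json:"type"`
	Status  string                 `json:"status"` // "True", "False", or "Unknown"
	Reason  string                 `json:"reason,omitempty"`
	Message string                 `json:"message,omitempty"`
}

// HyperNodeStatus sketches the status with the proposed UnhealthyNodeNames field.
type HyperNodeStatus struct {
	Conditions         []Condition `json:"conditions,omitempty"`
	NodeCount          int64       `json:"nodeCount,omitempty"`
	UnhealthyNodeNames []string    `json:"unhealthyNodeNames,omitempty"`
}

// HasCondition reports whether a condition of type t is present with status "True".
func HasCondition(s HyperNodeStatus, t HyperNodeConditionType) bool {
	for _, c := range s.Conditions {
		if c.Type == t && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	s := HyperNodeStatus{
		Conditions: []Condition{
			{Type: HyperNodeNetworkUnavailable, Status: "True"},
			{Type: HyperNodeSystemFailure, Status: "False"},
		},
		NodeCount:          3,
		UnhealthyNodeNames: []string{"worker-28", "worker-29"},
	}
	fmt.Println(HasCondition(s, HyperNodeNetworkUnavailable)) // true
	fmt.Println(HasCondition(s, HyperNodeSystemFailure))      // false
}
```

Keeping switch-level conditions separate from the per-node list lets a vendor report a degraded link on the switch while still marking only the affected nodes as unhealthy.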

Expected HyperNode Structure

The final HyperNode status should incorporate these enhancements to provide granular node-level health insights and common switch condition types, improving scheduler awareness and resource allocation efficiency.

The final expected HyperNode is as follows:

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  creationTimestamp: "2025-02-05T09:35:50Z"
  generation: 2
  name: leaf1
  resourceVersion: "341389665"
  uid: 0be6f513-0c58-4845-97e9-7da84f04a4d4
spec:
  members:
  - selector:
      exactMatch:
        name: worker-28
    type: Node
  - selector:
      exactMatch:
        name: worker-29
    type: Node
  - selector:
      exactMatch:
        name: worker-30
    type: Node
  tier: 1
status:
  conditions:
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: There are network-related problems with the switch
    reason: OPTICAL_LINK_SUBHEALTH_FourHundredGigE1_0_9
    status: "True"
    type: NetworkUnavailable
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: The switch is healthy
    reason: SwitchSystemIsHealthy
    status: "False"
    type: SystemFailure
  unhealthyNodeNames:
  - worker-28
  - worker-29
  nodeCount: 3
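
As a rough illustration of how a scheduler might consume the manifest above, the following Go sketch filters a HyperNode's members against `unhealthyNodeNames`. The function `SchedulableNodes` is hypothetical and is not part of Volcano; it only shows the intended effect of the new field.

```go
package main

import "fmt"

// SchedulableNodes returns the members that are not listed as unhealthy,
// preserving the original member order. Hypothetical helper for illustration.
func SchedulableNodes(members, unhealthy []string) []string {
	// Build a set of unhealthy names for constant-time lookups.
	bad := make(map[string]struct{}, len(unhealthy))
	for _, n := range unhealthy {
		bad[n] = struct{}{}
	}
	out := make([]string, 0, len(members))
	for _, m := range members {
		if _, found := bad[m]; !found {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// Values mirror the example manifest: three members, two reported unhealthy.
	members := []string{"worker-28", "worker-29", "worker-30"}
	unhealthy := []string{"worker-28", "worker-29"}
	fmt.Println(SchedulableNodes(members, unhealthy)) // [worker-30]
}
```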

@volcano-sh-bot
Collaborator

Welcome @fishingfly!

It looks like this is your first PR to volcano-sh/apis.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign kevin-wangzefeng
You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yeahdongcn

yeahdongcn commented Feb 13, 2025

Hi @Monokaix
As we discussed offline on Tuesday, this is the initial proposal.

@fishingfly fishingfly changed the title from "Proposal:add UnhealthyNodeNames feild in HyperNode status and common SwitchConditionType" to "Proposal:add UnhealthyNodeNames feild in HyperNode status and common HyperNodeConditionType" Feb 13, 2025
@Monokaix
Member

/lgtm
