@lxin lxin commented Sep 4, 2025

When a slave is detached from a bond (e.g. during NetworkManager reactivation) before LACP negotiation completes, some Cisco switches mishandle LACP packets with agg=0, and disable the port.

To work around this, if a bond slave is UP and LACP is enabled, wait up to 5 seconds for both ends to reach the "collecting,distributing" state before reattaching the interface. If negotiation does not complete within this window, a warning is logged but configuration continues.

Note: This is a switch bug; sending agg=0 is valid. The workaround only hides the issue on affected hardware and may reduce the system's fault tolerance.
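
For illustration, the wait described above could be sketched in bash roughly as follows. This is not the code from the patch: the helper names (`lacp_state_ready`, `lacp_port_ready`, `wait_for_lacp`) and the `SYSFS_NET` variable are hypothetical, and the sketch assumes the kernel exposes `ad_actor_oper_port_state` and `ad_partner_oper_port_state` under each port's `bonding_slave` sysfs directory. In the LACP port-state byte, Collecting is bit 0x10 and Distributing is bit 0x20.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the patch.

SYSFS_NET=${SYSFS_NET:-/sys/class/net}   # overridable so the logic can be tested

# Succeeds when both Collecting (0x10) and Distributing (0x20) are set
# in a numeric LACP port-state value.
lacp_state_ready() {
  local state=$1
  [[ $state =~ ^[0-9]+$ ]] || return 1   # rejects non-numeric values such as "N/A"
  (( (10#$state & 0x30) == 0x30 ))
}

# Succeeds when both the actor (local) and partner (switch) side of a
# bond port report collecting+distributing.
lacp_port_ready() {
  local dir="$SYSFS_NET/$1/bonding_slave" actor partner
  actor=$(cat "$dir/ad_actor_oper_port_state" 2>/dev/null) || return 1
  partner=$(cat "$dir/ad_partner_oper_port_state" 2>/dev/null) || return 1
  lacp_state_ready "$actor" && lacp_state_ready "$partner"
}

# Polls once per second for up to $2 seconds (default 5). On timeout it
# logs a warning and returns 1 so the caller can continue anyway.
wait_for_lacp() {
  local port=$1 timeout=${2:-5} i
  for (( i = 0; i < timeout; i++ )); do
    lacp_port_ready "$port" && return 0
    sleep 1
  done
  lacp_port_ready "$port" && return 0
  echo "WARNING: LACP negotiation on $port did not complete within ${timeout}s" >&2
  return 1
}
```

A caller that wants the "warn but continue" behaviour would ignore the return value, for example `wait_for_lacp "$port" 5 || true`.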

Signed-off-by: Xin Long <[email protected]>
@openshift-ci-robot openshift-ci-robot added labels Sep 4, 2025:
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@lxin: This pull request references Jira Issue OCPBUGS-60805, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 4, 2025
Contributor

openshift-ci bot commented Sep 4, 2025

Hi @lxin. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cybertron
Member

/ok-to-test
/lgtm

This patch was tested in an environment where the problem reproduces and was found to fix it. Should be good to go.

@openshift-ci openshift-ci bot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Sep 5, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 5, 2025
for conn in "${connections[@]}"; do
  local slave_type=$($NMCLI_GET_VALUE connection.slave-type connection show "$conn")
  if [ "$slave_type" = "team" ] || [ "$slave_type" = "bond" ]; then
    # Work around a Cisco switch issue: if a slave is detached from a bond
Contributor

Just curious: can't this be fixed in NetworkManager?

It would be possible to implement something similar in NetworkManager. When NM detaches the interface from an 802.3ad bond, it could wait until the LACP negotiation is finished, up to a certain time.

However, adding this workaround to NM is more complicated, because it requires implementing new netlink parsing and making the port detach asynchronous.

Also, it's not clear if this workaround is something that we want to enforce in all situations where NM is running. Perhaps there are scenarios where the user doesn't care if the port gets disabled? In the configure-ovs script the workaround is applied when we are reactivating the ports, and so it's a more controlled environment.

I think it's better for now to do the change in the MCO, and create a task for NM to investigate whether this workaround should be implemented there.
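
To make the "controlled environment" point concrete, here is a minimal bash sketch of the precondition stated in the PR description (port is UP and the bond uses LACP). It is an assumption-laden illustration, not the patch: the function names and the `SYSFS_NET` variable are hypothetical, and it relies on the kernel's sysfs layout, where an enslaved port has a `master` link to its bond and the bond's `bonding/mode` file reads like `802.3ad 4`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the patch.

SYSFS_NET=${SYSFS_NET:-/sys/class/net}   # overridable so the logic can be tested

# Succeeds when the port's controller is a bond running in 802.3ad (LACP) mode.
port_uses_lacp() {
  local mode_file="$SYSFS_NET/$1/master/bonding/mode" mode
  [[ -r $mode_file ]] || return 1        # not enslaved, or controller is not a bond
  read -r mode _ < "$mode_file"          # first word of e.g. "802.3ad 4"
  [[ $mode == "802.3ad" ]]
}

# Succeeds when the port is operationally up.
port_is_up() {
  local state_file="$SYSFS_NET/$1/operstate" state
  [[ -r $state_file ]] || return 1
  read -r state < "$state_file"
  [[ $state == "up" ]]
}

# Succeeds only when waiting for LACP negotiation makes sense for this port.
should_wait_for_lacp() {
  port_is_up "$1" && port_uses_lacp "$1"
}
```

A caller would run its wait step only when `should_wait_for_lacp "$port"` succeeds, which keeps the delay away from ports that are down or that belong to non-LACP bonds.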

@rbbratta
Contributor

/verified by testing in Cisco environment,
Details in the Jira.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Sep 22, 2025
@openshift-ci-robot
Contributor

@rbbratta: This PR has been marked as verified by testing in Cisco environment.

@cybertron
Member

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Sep 22, 2025
@openshift-ci-robot
Contributor

@cybertron: This pull request references Jira Issue OCPBUGS-60805, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rbbratta

@openshift-ci openshift-ci bot requested a review from rbbratta September 22, 2025 15:10
@cybertron
Member

/assign @yuqi-zhang
/backport release-4.20,release-4.19,release-4.18

@cybertron
Member

/cherry-pick release-4.20 release-4.19 release-4.18

One of these days I'll learn the backport syntax. ;-)

@openshift-cherrypick-robot

@cybertron: once the present PR merges, I will cherry-pick it on top of release-4.20 in a new PR and assign it to you.

Contributor

@yuqi-zhang yuqi-zhang left a comment

logically lgtm

Contributor

openshift-ci bot commented Sep 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, lxin, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 22, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7d591ce and 2 for PR HEAD 327d9fc in total

Contributor

openshift-ci bot commented Sep 23, 2025

@lxin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                    Commit   Details  Required  Rerun command
ci/prow/e2e-gcp-mco-disruptive               327d9fc  link     false     /test e2e-gcp-mco-disruptive
ci/prow/e2e-azure-ovn-upgrade-out-of-change  327d9fc  link     false     /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-gcp-op-ocl                       327d9fc  link     false     /test e2e-gcp-op-ocl
ci/prow/e2e-aws-mco-disruptive               327d9fc  link     false     /test e2e-aws-mco-disruptive
ci/prow/e2e-aws-ovn                          327d9fc  link     unknown   /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • verified: Signifies that the PR passed pre-merge verification criteria

8 participants