OCPBUGS-60805: configure-ovs: work around a Cisco switch issue #5274
When a slave is detached from a bond (e.g. during NetworkManager reactivation) before LACP negotiation completes, some Cisco switches mishandle LACP packets with agg=0 and disable the port.

To work around this, if a bond slave is UP and LACP is enabled, wait up to 5 seconds for both ends to reach the "collecting,distributing" state before reattaching the interface. If negotiation does not complete within this window, a warning is logged but configuration continues.

Note: This is a switch error; sending agg=0 is valid. This workaround only hides the issue on affected hardware and may reduce the system's fault tolerance.

Signed-off-by: Xin Long <[email protected]>
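The bounded wait described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: `lacp_is_forwarding`, `wait_for_lacp`, and the probe callback are hypothetical names, and the probe is assumed to print the actor and partner LACP port-state bytes as two numbers. The only protocol fact relied on is the standard LACP state-bit layout, where Collecting is bit 4 (16) and Distributing is bit 5 (32).

```shell
# Sketch only: these helpers are hypothetical and not part of configure-ovs.

# Succeed when an LACP port-state byte has both Collecting (16) and
# Distributing (32) set. A missing value is treated as 0 (not ready).
lacp_is_forwarding() {
  local state=${1:-0}
  [ $(( state & 48 )) -eq 48 ]
}

# Poll once per second until both ends are collecting and distributing,
# or until the timeout expires.
#   $1: probe command printing "<actor_state> <partner_state>" (assumption)
#   $2: timeout in seconds (defaults to 5)
wait_for_lacp() {
  local probe=$1 timeout=${2:-5} i=0
  while [ "$i" -lt "$timeout" ]; do
    # Split the probe output into $1 (actor) and $2 (partner).
    set -- $("$probe")
    if lacp_is_forwarding "$1" && lacp_is_forwarding "$2"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  # Mirror the described behaviour: warn, but let the caller continue.
  echo "WARNING: LACP negotiation did not complete within ${timeout}s" >&2
  return 1
}
```

A caller would pass a probe that reads the real state for one port (for example, from the bonding driver's per-port details) and would ignore the non-zero return so that configuration proceeds after the warning.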
@lxin: This pull request references Jira Issue OCPBUGS-60805, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hi @lxin. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
This patch was tested in an environment where the problem reproduces and was found to fix it. Should be good to go.
for conn in "${connections[@]}"; do
  local slave_type=$($NMCLI_GET_VALUE connection.slave-type connection show "$conn")
  if [ "$slave_type" = "team" ] || [ "$slave_type" = "bond" ]; then
    # Work around a Cisco switch issue: if a slave is detached from a bond
Just curious, this can't be fixed in network manager?
It would be possible to implement something similar in NetworkManager. When NM detaches the interface from an 802.3ad bond, it could wait until the LACP negotiation is finished, up to a certain time.
However, adding this workaround to NM is more complicated, because it requires implementing new netlink parsing and making the port detach asynchronous.
Also, it's not clear if this workaround is something that we want to enforce in all situations where NM is running. Perhaps there are scenarios where the user doesn't care if the port gets disabled? In the configure-ovs script the workaround is applied when we are reactivating the ports, and so it's a more controlled environment.
I think it's better for now to do the change in the MCO, and create a task for NM to investigate whether this workaround should be implemented there.
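Since the workaround is only meant to apply when LACP is enabled, a script could gate it on the bond mode. The sketch below is an illustration under assumptions, not the check used in this PR: `bond_mode_is_lacp` is a hypothetical helper, and `bond0` in the usage comment is an assumed interface name. It relies on the Linux bonding driver exposing the mode through sysfs as a name followed by a number, such as `802.3ad 4`.

```shell
# Sketch only: bond_mode_is_lacp is a hypothetical helper name.
#   $1: contents of /sys/class/net/<bond>/bonding/mode, e.g. "802.3ad 4"
bond_mode_is_lacp() {
  case "$1" in
    802.3ad*) return 0 ;;  # LACP (IEEE 802.3ad) bond
    *) return 1 ;;         # any other mode, or empty input
  esac
}

# Example use on a live system (bond0 is an assumed interface name):
#   if bond_mode_is_lacp "$(cat /sys/class/net/bond0/bonding/mode)"; then
#     echo "bond0 uses LACP"
#   fi
```

Taking the mode string as an argument, rather than reading sysfs inside the helper, keeps the check easy to exercise without a real bond device.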
/verified by testing in Cisco environment
@rbbratta: This PR has been marked as verified by In response to this:
/jira refresh
@cybertron: This pull request references Jira Issue OCPBUGS-60805, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
/assign @yuqi-zhang
/cherry-pick release-4.20 release-4.19 release-4.18
One of these days I'll learn the backport syntax. ;-)
@cybertron: once the present PR merges, I will cherry-pick it on top of In response to this:
logically lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cybertron, lxin, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@lxin: The following tests failed, say