Skip to content

Conversation

@pperiyasamy
Copy link
Member

This PR adds required resilience e2e tests which ensures IPsec deployment and pod traffic are working in all such scenarios.

@pperiyasamy pperiyasamy marked this pull request as draft October 24, 2024 14:43
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 24, 2024
@openshift-ci openshift-ci bot requested review from danwinship and travier October 24, 2024 14:43
@pperiyasamy pperiyasamy force-pushed the ipsec-pod-restart-test branch from 49571f9 to 50854b0 Compare November 5, 2024 14:53
@pperiyasamy pperiyasamy marked this pull request as ready for review November 5, 2024 14:53
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2024
@pperiyasamy pperiyasamy changed the title Add IPsec resilience tests SDN-4168: Add IPsec resilience tests Nov 5, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 5, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 5, 2024

@pperiyasamy: This pull request references SDN-4168 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

This PR adds required resilience e2e tests which ensures IPsec deployment and pod traffic are working in all such scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from pacevedom and trozet November 5, 2024 14:54
@pperiyasamy
Copy link
Member Author

/assign @jcaamano @trozet

PTAL. Let's make use of this PR to add all missing tests for sufficient IPsec coverage.

@pacevedom
Copy link
Contributor

/retest

@pperiyasamy pperiyasamy force-pushed the ipsec-pod-restart-test branch from 50854b0 to a6da4a6 Compare November 15, 2024 13:55
@pperiyasamy
Copy link
Member Author

/retest

@openshift-trt
Copy link

openshift-trt bot commented Nov 26, 2024

Job Failure Risk Analysis for sha: a6da4a6

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn High
[sig-arch] Only known images used by tests
This test has passed 100.00% of 18 runs on jobs ['periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn'] in the last 14 days.

@pperiyasamy pperiyasamy force-pushed the ipsec-pod-restart-test branch from a6da4a6 to 6e19beb Compare January 8, 2025 12:24
@pperiyasamy
Copy link
Member Author

/payload-job ?

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

@pperiyasamy: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@pperiyasamy
Copy link
Member Author

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-ovn-ipsec-step-registry

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 8, 2025

@pperiyasamy: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

@pperiyasamy
Copy link
Member Author

/assign @jluhrsen

This commit adds required tests to ensure pod traffic across nodes are not impacted
upon multiple reboot of ipsec dameonset pods.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
@pperiyasamy pperiyasamy force-pushed the ipsec-pod-restart-test branch from 6e19beb to d5d96b1 Compare January 9, 2025 08:16
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 9, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pperiyasamy
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-trt
Copy link

openshift-trt bot commented Jan 9, 2025

Job Failure Risk Analysis for sha: d5d96b1

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade Low
[sig-node] static pods should start after being created
This test has passed 68.87% of 151 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Open Bugs
Static pod controller pods sometimes fail to start [etcd]
---
[sig-node] static pods should start after being created
This test has passed 69.33% of 150 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

Open Bugs
Static pod controller pods sometimes fail to start [etcd]

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 26, 2025

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-upgrade 6e19beb link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-single-node-serial 6e19beb link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-agnostic-ovn-cmd 6e19beb link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-openstack-ovn 6e19beb link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node 6e19beb link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-upgrade 6e19beb link false /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-cgroupsv2 6e19beb link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 6e19beb link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/e2e-vsphere-ovn-upi d5d96b1 link true /test e2e-vsphere-ovn-upi
ci/prow/e2e-aws-ovn-microshift d5d96b1 link true /test e2e-aws-ovn-microshift
ci/prow/e2e-aws-ovn-edge-zones d5d96b1 link true /test e2e-aws-ovn-edge-zones
ci/prow/e2e-vsphere-ovn d5d96b1 link true /test e2e-vsphere-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

},
Parallelism: 30,
MaximumAllowedFlakes: 15,
TestTimeout: 20 * time.Minute,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to have a discussion with the maintainers regarding this - did it occur on another comms channel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this test to ipsec test suite, so timeout change is not needed here.

func restartIPsecDaemonSet(oc *exutil.CLI) error {
ds, err := getDaemonSet(oc, ovnNamespace, ovnIPsecDsName)
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may need to enrich this error message with context because you could return an api error message and we cannot determine where in this func it occurs. Also, ln 242

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making it as ginkgo helper function now, so this is not needed now.

})

g.It("check pod traffic are working across nodes after ipsec daemonset restart", func() {
if ipsecMode != v1.IPsecModeFull {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not do this for other modes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous tested is valid for all ipsec modes, but this one is valid for full mode, added a comment in the code, hope that helps.

})

func createWebServerPods(oc *exutil.CLI, namespace string) []corev1.Pod {
g.GinkgoHelper()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/*
GinkgoHelper marks the function it's called in as a test helper.  When a failure occurs inside a helper function, Ginkgo will skip the helper when analyzing the stack trace to identify where the failure occurred.

This is an alternative, simpler, mechanism to passing in a skip offset when calling Fail or using Gomega.
*/

for _, targetPod := range pods {
if sourcePod.Name == targetPod.Name {
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you make sure also to only try to ping pods that are on different nodes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the sourcePod.Name == targetPod.Name check avoids the self ping.

for i := 1; i <= 5; i++ {
g.By(fmt.Sprintf("attempt#%d restarting IPsec pods", i))
err := restartIPsecDaemonSet(oc)
o.Expect(err).NotTo(o.HaveOccurred())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; i like to see the reasons why something failed with this expects otherwise you have to go look at the line of code that it failed on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh! yes martin, good catch. added restartIPsecDaemonSet as a test helper function, that will solve the problem that you described.

Copy link
Member Author

@pperiyasamy pperiyasamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you suggested, move this change into another PR #29563, let's continue our review there.
so i'm going to close this PR,

},
Parallelism: 30,
MaximumAllowedFlakes: 15,
TestTimeout: 20 * time.Minute,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this test to ipsec test suite, so timeout change is not needed here.

func restartIPsecDaemonSet(oc *exutil.CLI) error {
ds, err := getDaemonSet(oc, ovnNamespace, ovnIPsecDsName)
if err != nil {
return err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making it as ginkgo helper function now, so this is not needed now.

})

g.It("check pod traffic are working across nodes after ipsec daemonset restart", func() {
if ipsecMode != v1.IPsecModeFull {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the previous tested is valid for all ipsec modes, but this one is valid for full mode, added a comment in the code, hope that helps.

for i := 1; i <= 5; i++ {
g.By(fmt.Sprintf("attempt#%d restarting IPsec pods", i))
err := restartIPsecDaemonSet(oc)
o.Expect(err).NotTo(o.HaveOccurred())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh! yes martin, good catch. added restartIPsecDaemonSet as a test helper function, that will solve the problem that you described.

})

func createWebServerPods(oc *exutil.CLI, namespace string) []corev1.Pod {
g.GinkgoHelper()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/*
GinkgoHelper marks the function it's called in as a test helper.  When a failure occurs inside a helper function, Ginkgo will skip the helper when analyzing the stack trace to identify where the failure occurred.

This is an alternative, simpler, mechanism to passing in a skip offset when calling Fail or using Gomega.
*/

for _, targetPod := range pods {
if sourcePod.Name == targetPod.Name {
continue
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the sourcePod.Name == targetPod.Name check avoids the self ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants