use containerd 2.0 in presubmit scalability jobs #35073

Merged
merged 1 commit into from
Jul 7, 2025

Conversation

AnishShah
Contributor

@AnishShah AnishShah commented Jul 2, 2025

This change is to check pod startup latency with containerd 2.0

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2025
@k8s-ci-robot k8s-ci-robot requested review from mm4tt and wojtek-t July 2, 2025 17:43
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jul 2, 2025
Member

@BenTheElder BenTheElder left a comment

We have a single project with this much quota, so any 5k-node jobs must be scheduled so that they do not conflict with each other.

cc @kubernetes/sig-scalability / @kubernetes/sig-scalability-leads

FYI @kubernetes/sig-k8s-infra-leads @kubernetes/sig-testing-leads
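
The scheduling constraint above amounts to checking that the run windows of two 5k-node jobs never intersect. A minimal sketch of that check follows; the start times and durations are hypothetical placeholders, not the real schedules of any job in this repo.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, dur_a, start_b, dur_b):
    """Return True if the half-open windows [start, start + dur) intersect."""
    return start_a < start_b + dur_b and start_b < start_a + dur_a


# Hypothetical start times and durations, for illustration only.
day = datetime(2025, 7, 2)
job_a = (day + timedelta(hours=1), timedelta(hours=8))
job_b = (day + timedelta(hours=13), timedelta(hours=8))

print(windows_overlap(*job_a, *job_b))
```

Treating the windows as half-open means a job that starts exactly when another ends does not count as a conflict; a real check would probably want extra slack for cluster teardown.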

@BenTheElder
Member

The other question would be: should we just use containerd 2.x on the scale jobs?
I'm pretty sure most of our critical CI is using either cri-o or containerd 2.x, not sure the default scale jobs should be using 1.7 still.

SIG Node would probably know this best actually ... @SergeyKanzhelev or @dims most likely.

@ameukam
Member

ameukam commented Jul 2, 2025

Given there is no justification for this, we should probably not add this.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2025
@dims
Member

dims commented Jul 2, 2025

@BenTheElder a recent run of the 5k GCE job shows that we are still using cos-109-17800-519-40. So you are right: https://cloud.google.com/container-optimized-os/docs/release-notes/m109 shows that this image ships containerd v1.7.x.

I'd recommend starting with a clone of ci-kubernetes-e2e-gci-gce-scalability which is 100 nodes, make that work, once we have some faith there then add the clone in this PR that runs 5k (split the time between the 2 jobs), and then when both variants with 2.x work reliably, then drop the clones and switch the primaries to 2.x. But will leave the decision-making / time-line to sig-node folks led by @SergeyKanzhelev
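
For readers repeating the lookup above, here is a small sketch that maps a COS image name to its release-notes page. It assumes the image name has the form `cos-<milestone>-<build>` as in the example quoted in this comment, and that the release-notes URL follows the `m<milestone>` pattern of the link above; both helper names are hypothetical.

```python
import re

RELEASE_NOTES_BASE = "https://cloud.google.com/container-optimized-os/docs/release-notes"


def cos_milestone(image_name):
    """Extract the milestone from a name like 'cos-109-17800-519-40' (assumed format)."""
    match = re.match(r"^cos-(\d+)-", image_name)
    if match is None:
        raise ValueError(f"unrecognized COS image name: {image_name!r}")
    return int(match.group(1))


def release_notes_url(image_name):
    """Build the release-notes URL for the image's milestone."""
    return f"{RELEASE_NOTES_BASE}/m{cos_milestone(image_name)}"


print(release_notes_url("cos-109-17800-519-40"))
```

The containerd version itself still has to be read from the release notes; the sketch only locates the right page.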

@BenTheElder
Member

Here we are overriding it for pull-kubernetes-e2e-gce instead of the version in the COS image I think:

- --env=KUBE_UBUNTU_INSTALL_CONTAINERD_VERSION=v2.1.0
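
To see which containerd line a job pins, one can read the `--env=KEY=VALUE` flags out of its args list. The sketch below is illustrative only: `env_overrides` and `containerd_major` are hypothetical helpers, and the args list is a shortened example built from flags quoted in this thread.

```python
def env_overrides(args):
    """Collect KEY=VALUE pairs from '--env=KEY=VALUE' flags in a job's args list."""
    env = {}
    for arg in args:
        if arg.startswith("--env="):
            # Split on the first '=' only, since values may themselves contain '='.
            key, _, value = arg[len("--env="):].partition("=")
            env[key] = value
    return env


def containerd_major(version):
    """Return the major version from a tag such as 'v2.1.0'."""
    return int(version.lstrip("v").split(".")[0])


args = ["--cluster=", "--env=KUBE_UBUNTU_INSTALL_CONTAINERD_VERSION=v2.1.0"]
env = env_overrides(args)
print(containerd_major(env["KUBE_UBUNTU_INSTALL_CONTAINERD_VERSION"]))
```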

@BenTheElder
Member

> Given there is no justification for this, we should probably not add this.

There's an open question of whether we should phase over via additional jobs (as in this PR and @dims's suggestion) or not, but the issue is that there have been some scaling gaps between the legacy 1.x we are still using in these scale jobs and the 2.x we are using in most of CI now.

We should be aligning these to 2.x eventually, the question is the approach.

x-ref:

# This is an equivalent of the ci-kubernetes-e2e-gce-scale-correctness test
# at 100 node scale. It's an optional presubmit to simplify testing changes

This problem currently extends to the 100 node jobs, which are not blocking but informing. The default blocking job is on 2.x

@BenTheElder
Member

> Here we are overriding it for pull-kubernetes-e2e-gce instead of the version in the COS image I think:

... as @AnishShah reminded me asking questions out of band ... that's because that job is on Ubuntu, we can't install over the system containerd on COS.

We usually defer to @SergeyKanzhelev and SIG node for COS support & bump timing in the GCE jobs.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2025
@AnishShah
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@k8s-ci-robot
Contributor

@AnishShah: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-test-infra-gubernator
/test pull-test-infra-misc-image-build-test
/test pull-test-infra-prow-checkconfig
/test pull-test-infra-unit-test
/test pull-test-infra-verify-cri-o
/test pull-test-infra-verify-lint

The following commands are available to trigger optional jobs:

/test pull-test-infra-unit-test-race-detector-nonblocking

Use /test all to run the following jobs that were automatically triggered:

pull-test-infra-gubernator
pull-test-infra-prow-checkconfig
pull-test-infra-unit-test
pull-test-infra-unit-test-race-detector-nonblocking
pull-test-infra-verify-lint

In response to this:

/test pull-kubernetes-e2e-gce-100-performance

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AnishShah AnishShah changed the title from "add scale jobs with containerd 2.0" to "use containerd 2.0 in presubmit scalability jobs" Jul 2, 2025
@AnishShah
Contributor Author

I modified the pull-kubernetes-e2e-gce-100-performance presubmit job to use containerd 2.0.5 via the KUBE_COS_INSTALL_CONTAINERD_VERSION env var. PTAL

@AnishShah
Contributor Author

/assign @SergeyKanzhelev

@BenTheElder
Member

> /test pull-kubernetes-e2e-gce-100-performance

You have to do that in the repo it is defined for (so kubernetes/kubernetes), and you won't be able to do that before merge.

@@ -43,6 +43,8 @@ presubmits:
- --cluster=
- --env=HEAPSTER_MACHINE_TYPE=e2-standard-8
- --env=KUBEMARK_APISERVER_TEST_ARGS=--max-requests-inflight=80 --max-mutating-requests-inflight=0 --profiling --contention-profiling
- --env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5
- --env=KUBE_COS_INSTALL_RUNC_VERSION=v1.2.1
Member

I think this is OK since this job is not blocking anyhow, but we should be prepared to rollback.

@@ -43,6 +43,8 @@ presubmits:
- --cluster=
- --env=HEAPSTER_MACHINE_TYPE=e2-standard-8
- --env=KUBEMARK_APISERVER_TEST_ARGS=--max-requests-inflight=80 --max-mutating-requests-inflight=0 --profiling --contention-profiling
- --env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5
Contributor

How are we going to maintain and update it? What's the plan for bumping those versions?

Contributor Author

We manually bump containerd version in other sig-node jobs as well.
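
A manual bump of this kind boils down to rewriting one `--env=` flag in the job's args. The sketch below shows that edit on an in-memory list; `bump_env` is a hypothetical helper, not a tool that exists in test-infra, and the target version is a placeholder rather than a recommendation.

```python
def bump_env(args, key, new_value):
    """Return a copy of args with '--env=<key>=...' set to new_value, appending it if absent."""
    prefix = f"--env={key}="
    updated = [prefix + new_value if arg.startswith(prefix) else arg for arg in args]
    if not any(arg.startswith(prefix) for arg in args):
        updated.append(prefix + new_value)
    return updated


# Flags taken from the diff in this PR; the new version below is a placeholder.
args = [
    "--env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5",
    "--env=KUBE_COS_INSTALL_RUNC_VERSION=v1.2.1",
]
NEW_VERSION = "v0.0.0-example"

for line in bump_env(args, "KUBE_COS_INSTALL_CONTAINERD_VERSION", NEW_VERSION):
    print(line)
```

Returning a new list instead of mutating the input keeps the old pin available, which makes a rollback a matter of reusing the original list.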

Member

@SergeyKanzhelev SergeyKanzhelev left a comment

/lgtm
/approve

I think we generally try to run master tests on a COS family closest to the latest, with the corresponding containerd version. This will match the EOL of 1.34 best and is the easiest for maintainers and users in terms of unification.

I think bumping containerd as a first step is OK.

I think if we are open to rolling back right away when an issue is detected, the effort of temporarily duplicating this expensive job is not worth it. If there are no immediate concerns from @dims, please unhold.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 7, 2025
@AnishShah
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AnishShah, SergeyKanzhelev, upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2025
@k8s-ci-robot k8s-ci-robot merged commit 60d647e into kubernetes:master Jul 7, 2025
7 checks passed
@k8s-ci-robot
Contributor

@AnishShah: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-scalability-presubmit-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml

In response to this:

This change is to check pod startup latency with containerd 2.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AnishShah AnishShah deleted the containerd-2 branch July 7, 2025 21:35
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/config Issues or PRs related to code in /config area/jobs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
8 participants