use containerd 2.0 in presubmit scalability jobs #35073

AnishShah · 2025-07-02T17:43:29Z

This change is to check pod startup latency with containerd 2.0

BenTheElder

We have only a single project with this much quota, so any 5k-node jobs must be scheduled so that they do not conflict with each other.

cc @kubernetes/sig-scalability / @kubernetes/sig-scalability-leads

FYI @kubernetes/sig-k8s-infra-leads @kubernetes/sig-testing-leads
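To make the scheduling constraint above concrete, here is a minimal sketch of an overlap check for two jobs that share one quota pool. The start times and durations are hypothetical, chosen only for illustration; they are not the real schedules of any scale job, and the helper name `windows_overlap` is an assumption, not part of test-infra.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, hours_a, start_b, hours_b):
    """Return True if two job runs would hold the shared quota at the same time."""
    end_a = start_a + timedelta(hours=hours_a)
    end_b = start_b + timedelta(hours=hours_b)
    # Two half-open intervals overlap when each starts before the other ends.
    return start_a < end_b and start_b < end_a


# Hypothetical start times and durations (hours), for illustration only.
job_a = (datetime(2025, 7, 2, 1, 0), 5)
job_b = (datetime(2025, 7, 2, 8, 0), 5)

if __name__ == "__main__":
    print(windows_overlap(*job_a, *job_b))
```

This only compares two concrete runs; a real check would also need to account for recurring schedules and for runs that exceed their expected duration.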

BenTheElder · 2025-07-02T18:14:15Z

The other question would be: should we just use containerd 2.x on the scale jobs?
I'm pretty sure most of our critical CI is using either cri-o or containerd 2.x, so I'm not sure the default scale jobs should still be using 1.7.

SIG Node would probably know this best actually ... @SergeyKanzhelev or @dims most likely.

ameukam · 2025-07-02T18:28:24Z

Given there is no justification for this, we should probably not add this.
/hold

dims · 2025-07-02T18:32:19Z

@BenTheElder a recent run of the 5k GCE job shows that we are still using cos-109-17800-519-40, so you are right: https://cloud.google.com/container-optimized-os/docs/release-notes/m109 shows that this image ships containerd v1.7.x.

I'd recommend starting with a clone of ci-kubernetes-e2e-gci-gce-scalability, which is 100 nodes, and making that work. Once we have some faith there, add the clone in this PR that runs 5k (splitting the time between the two jobs). When both variants work reliably with 2.x, drop the clones and switch the primaries to 2.x. But I will leave the decision-making and timeline to the SIG Node folks led by @SergeyKanzhelev.

BenTheElder · 2025-07-02T20:15:50Z

Here we are overriding it for pull-kubernetes-e2e-gce instead of the version in the COS image I think:

test-infra/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml

Line 264 in 7f4dd38

- --env=KUBE_UBUNTU_INSTALL_CONTAINERD_VERSION=v2.1.0

BenTheElder · 2025-07-02T20:21:01Z

> Given there is no justification for this, we should probably not add this.

There's an open question of whether we should phase this in via additional jobs (as in this PR and @dims's suggestion) or not, but the issue is that there have been some scaling gaps between the legacy 1.x we are still using in these scale jobs and the 2.x we are using in most of CI now.

We should be aligning these to 2.x eventually; the question is the approach.

x-ref:

test-infra/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml

Lines 88 to 89 in 52f5173

    
    # This is an equivalent of the ci-kubernetes-e2e-gce-scale-correctness test
    # at 100 node scale. It's an optional presubmit to simplify testing changes

This problem currently extends to the 100-node jobs, which are informing rather than blocking. The default blocking job is on 2.x.

BenTheElder · 2025-07-02T20:51:54Z

> Here we are overriding it for pull-kubernetes-e2e-gce instead of the version in the COS image I think:

... as @AnishShah reminded me when asking questions out of band ... that's because that job is on Ubuntu; we can't install over the system containerd on COS.

We usually defer to @SergeyKanzhelev and SIG Node for COS support and bump timing in the GCE jobs.

AnishShah · 2025-07-02T21:39:38Z

/test pull-kubernetes-e2e-gce-100-performance

k8s-ci-robot · 2025-07-02T21:39:42Z

@AnishShah: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-test-infra-gubernator

/test pull-test-infra-misc-image-build-test

/test pull-test-infra-prow-checkconfig

/test pull-test-infra-unit-test

/test pull-test-infra-verify-cri-o

/test pull-test-infra-verify-lint

The following commands are available to trigger optional jobs:

/test pull-test-infra-unit-test-race-detector-nonblocking

Use /test all to run the following jobs that were automatically triggered:

pull-test-infra-gubernator

pull-test-infra-prow-checkconfig

pull-test-infra-unit-test

pull-test-infra-unit-test-race-detector-nonblocking

pull-test-infra-verify-lint

In response to this:

/test pull-kubernetes-e2e-gce-100-performance

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

AnishShah · 2025-07-02T21:42:03Z

I modified the pull-kubernetes-e2e-gce-100-performance presubmit job to use containerd 2.0.5 via the KUBE_COS_INSTALL_CONTAINERD_VERSION env var. PTAL

AnishShah · 2025-07-02T22:21:36Z

/assign @SergeyKanzhelev

BenTheElder · 2025-07-02T23:29:31Z

> /test pull-kubernetes-e2e-gce-100-performance

You have to do that in the repo the job is defined for (so kubernetes/kubernetes), and you won't be able to do that before this PR merges.

BenTheElder · 2025-07-02T23:30:22Z

config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml

@@ -43,6 +43,8 @@ presubmits:
        - --cluster=
        - --env=HEAPSTER_MACHINE_TYPE=e2-standard-8
        - --env=KUBEMARK_APISERVER_TEST_ARGS=--max-requests-inflight=80 --max-mutating-requests-inflight=0 --profiling --contention-profiling
+        - --env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5
+        - --env=KUBE_COS_INSTALL_RUNC_VERSION=v1.2.1


I think this is OK since this job is not blocking anyhow, but we should be prepared to roll back.

jprzychodzen · 2025-07-03T12:57:00Z

config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml

@@ -43,6 +43,8 @@ presubmits:
        - --cluster=
        - --env=HEAPSTER_MACHINE_TYPE=e2-standard-8
        - --env=KUBEMARK_APISERVER_TEST_ARGS=--max-requests-inflight=80 --max-mutating-requests-inflight=0 --profiling --contention-profiling
+        - --env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5


How are we going to maintain and update it? What's the plan for bumping those versions?

We manually bump the containerd version in other sig-node jobs as well.
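Since the bumps are manual, one way to keep track of what is pinned is to scan the job config for version pins before each bump. The following is a minimal sketch, not an existing test-infra tool: the function name `find_version_pins` and the regular expression are assumptions, and the sample text simply reuses the `--env` lines from the diff above.

```python
import re

# Matches args such as "--env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5".
# The pattern is an assumption based on the variable names seen in this PR.
PIN_RE = re.compile(r"--env=(KUBE_[A-Z0-9_]*_VERSION)=(\S+)")


def find_version_pins(config_text):
    """Return (line_number, variable, version) for each pinned version found."""
    pins = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for var, version in PIN_RE.findall(line):
            pins.append((lineno, var, version))
    return pins


# Sample input copied from the diff hunk in this thread.
SAMPLE = """\
- --env=HEAPSTER_MACHINE_TYPE=e2-standard-8
- --env=KUBE_COS_INSTALL_CONTAINERD_VERSION=v2.0.5
- --env=KUBE_COS_INSTALL_RUNC_VERSION=v1.2.1
"""

if __name__ == "__main__":
    for lineno, var, version in find_version_pins(SAMPLE):
        print(f"line {lineno}: {var} = {version}")
```

Running this over each job file would give a list of pins to review by hand; it does not decide which version to bump to.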

SergeyKanzhelev

/lgtm
/approve

I think we generally try to run master tests on the COS family closest to the latest, with the corresponding containerd version. That will match the EOL of 1.34 best and is the easiest for maintainers and users in terms of unification.

I think bumping containerd as a first step is OK.

I think if we are open to rolling back right away when an issue is detected, the effort of temporarily duplicating this expensive job is not worth it. If there are no immediate concerns from @dims, please unhold.

AnishShah · 2025-07-07T18:44:48Z

/unhold

k8s-ci-robot · 2025-07-07T21:20:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AnishShah, SergeyKanzhelev, upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~config/jobs/kubernetes/sig-scalability/OWNERS~~ [upodroid]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-07-07T21:35:30Z

@AnishShah: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

key sig-scalability-presubmit-jobs.yaml using file config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml

In response to this:

This change is to check pod startup latency with containerd 2.0

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2025

k8s-ci-robot requested review from mm4tt and wojtek-t July 2, 2025 17:43

BenTheElder reviewed Jul 2, 2025

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2025

use containerd 2.0.5 in presubmit scale test

a3fb174

AnishShah force-pushed the containerd-2 branch from e522fde to a3fb174 Compare July 2, 2025 21:39

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2025

AnishShah changed the title ~~add scale jobs with containerd 2.0~~ use containerd 2.0 in presubmit scalability jobs Jul 2, 2025

BenTheElder reviewed Jul 2, 2025

jprzychodzen reviewed Jul 3, 2025

SergeyKanzhelev approved these changes Jul 7, 2025

k8s-ci-robot assigned SergeyKanzhelev Jul 7, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 7, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2025

upodroid approved these changes Jul 7, 2025

k8s-ci-robot assigned upodroid Jul 7, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2025

k8s-ci-robot merged commit 60d647e into kubernetes:master Jul 7, 2025
7 checks passed

AnishShah deleted the containerd-2 branch July 7, 2025 21:35
