diff --git a/keps/prod-readiness/sig-node/4622.yaml b/keps/prod-readiness/sig-node/4622.yaml index 7fbe865efff..e3c839966dd 100644 --- a/keps/prod-readiness/sig-node/4622.yaml +++ b/keps/prod-readiness/sig-node/4622.yaml @@ -1,3 +1,5 @@ kep-number: 4622 beta: approver: "@jpbetz" +stable: + approver: "@jpbetz" diff --git a/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/README.md b/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/README.md index 013fe9364f7..3c39731e19c 100644 --- a/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/README.md +++ b/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/README.md @@ -1,26 +1,12 @@ # KEP-4622: New TopologyManager Policy which configure the value of maxAllowableNUMANodes - - - - - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) -- [Goals](#goals) -- [Non-Goals](#non-goals) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) - [Story 2](#story-2) @@ -48,41 +34,26 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) -- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) ## Release Signoff Checklist - - Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - - [ ] e2e Tests for all Beta API Operations (endpoints) - - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place - - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) -- [ ] (R) Production readiness review completed -- [ ] (R) Production readiness review approved -- [ ] "Implementation History" section is up-to-date for milestone -- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] -- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [X] (R) KEP approvers have approved the KEP status as `implementable` +- [X] (R) Design details are appropriately documented +- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [X] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance 
Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [X] (R) Graduation criteria is in place + - [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [X] (R) Production readiness review completed +- [X] (R) Production readiness review approved +- [X] "Implementation History" section is up-to-date for milestone +- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes - -[x] I/we understand the owners of the involved components may require updates to +[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. ##### Prerequisite testing updates - - ##### Unit tests - - - - - `k8s.io/kubernetes/pkg/kubelet/cm/topologymanager`: `20240405` - `91.5%` ##### Integration tests - - - - No new integration tests for kubelet are planned. ##### e2e tests - For beta: - Verify the input validation with the existing e2e tests(e.g. 9 or 10 or something bigger than the current default but not "too big") -For GA: - -- degrading the node and checking the node is reported as degraded - ### Graduation Criteria #### Beta @@ -249,123 +162,58 @@ For GA: #### GA -- Add a metrics: `kubelet_topology_manager_admission_time`. +- An existing metric: `topology_manager_admission_duration_ms` can be used. ### Upgrade / Downgrade Strategy - - We anticipate no repercussions. The new policy option is voluntary and operates independent of the existing options. 
### Version Skew Strategy + No changes needed. - ## Production Readiness Review Questionnaire - - ### Feature Enablement and Rollback - 1.31: + - enable by default - allow gate to disable the feature - release note -1.32: +1.35: + - promote to GA -- cannot be disabled +- LockToDefault: true (cannot be disabled) - release note -###### How can this feature be enabled / disabled in a live cluster? +1.36: - +###### How can this feature be enabled / disabled in a live cluster? -- [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: - - `TopologyManagerPolicyBetaOptions` - - `TopologyManagerPolicyOptions` +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `TopologyManagerPolicyBetaOptions` - Components depending on the feature gate: `kubelet` -- [x] Change the kubelet configuration to set a TopologyManager policy of static and a TopologyManager policy option of `max-allowable-numa-nodes` - - Will enabling / disabling the feature require downtime of the control plane? +- [X] Other + - Describe the mechanism: Change the kubelet configuration to set a TopologyManager policy of static and a TopologyManager policy option of `max-allowable-numa-nodes` + - Will enabling / disabling the feature require downtime of the control + plane? No - - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled). - Yes -- kubelet restart is required. -###### Does enabling the feature change any default behavior? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + Yes, Kubelet restart is required. - +###### Does enabling the feature change any default behavior? -No. +No. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? - - Yes, When it is disabled once (i.e. no value is set), this falls back to the default behavior. 
###### What happens if we reenable the feature if it was previously rolled back? @@ -376,81 +224,48 @@ Running containers won't be affected by the rollback of the feature, only newly This new `TopologyManager` policy option will start immediately from beta stage. The unit tests will test whether the configured value of `max-allowable-numa-nodes` is as expected and whether it is the default recommended value when it is not configured. - - ### Rollout, Upgrade and Rollback Planning + When the feature is not enabled or configured, the option takes its default value. The feature is fully contained in the kubelet and has no dependencies; rollbacks and upgrades both affect only newly created pods. + ###### How can a rollout or rollback fail? Can it impact already running workloads? -Rollout or rollout fail do not impact already running workloads, only impact the new workloads. - ###### What specific metrics should inform a rollback? - -We have a metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`. +We have an existing metric which records the topology manager admission time: `topology_manager_admission_duration_ms`. + ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? - Rollouts and upgrades do not impact already running workloads. We plan to add an e2e test for this in the future. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? - No. ### Monitoring Requirements - -We add a metric: `kubelet_topology_manager_admission_time` for kubelet, which can be used to check if the setting is causing unacceptable performance drops. ###### How can an operator determine if the feature is in use by workloads? - - Examine the kubelet configuration of a node to verify the existence of the feature gate and the utilization of the new policy option.
Use the following command to check whether the feature is enabled: ``` @@ -459,59 +274,28 @@ kubectl get --raw "/api/v1/nodes//proxy/configz" | jq '.kubeletconfig. ###### How can someone using this feature know that it is working for their instance? - - - [ ] Events - Event Reason: - [ ] API .status - Condition name: - Other field: -- [x] Other (treat as last resort) +- [X] Other (treat as last resort) - Details: If their system has more than 8 NUMA nodes, the TopologyManager is turned on and the kubelet is not crashing, then the feature is working. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? - The value of max-allowable-numa-nodes does not (in and of itself) affect the latency of pod admission. With the TopologyManager enabled, the time to admit a pod is tied to the number of NUMA nodes on the physical machine. In the past, this was hard-coded at 8 to ensure that pod admission always completed in a reasonable amount of time. If a machine had more than 8 NUMA nodes, the kubelet would crash with a log message stating that the TopologyManager is unsupported on machines with more than 8 NUMA nodes. With the new max-allowable-numa-nodes option, admins now have the ability to allow nodes with more than 8 NUMA nodes to run with the TopologyManager enabled. However, it is unknown exactly how much this will slow down pod admission on any given system. This feature is therefore to be used at-your-own-risk until we have a proper solution in place to reduce the state explosion that causes pod admission time to slow down as the number of NUMA nodes increases. ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? 
- - -- [ ] Metrics - - Metric name: kubelet_topology_manager_admission_time +- [X] Metrics + - Metric name: `topology_manager_admission_duration_ms` + - [Optional] Aggregation method: - Components exposing the metric: kubelet ###### Are there any missing metrics that would be useful to have to improve observability of this feature? - - The feature is not used by workloads in any way, shape, or form; it only (potentially) impacts how long it takes for the kubelet to start a workload. We can easily check if this feature is enabled by looking at the kubelet config, example: ```shell @@ -520,169 +304,60 @@ kubectl get --raw "/api/v1/nodes//proxy/configz" | jq '.kubeletconfig. ### Dependencies - - -N/A - ###### Does this feature depend on any specific services running in the cluster? - - No. It doesn't rely on other Kubernetes components. ### Scalability - - ###### Will enabling / using this feature result in any new API calls? - -No +No. ###### Will enabling / using this feature result in introducing new API types? - -No +No. ###### Will enabling / using this feature result in any new calls to the cloud provider? - -No +No. ###### Will enabling / using this feature result in increasing size or count of the existing API objects? - -No +No. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? - - It will slow down pod admission/start time on the node. The slowdown occurs because the kubelet's TopologyManager now has more combinations to consider when deciding where CPUs and devices can be allocated in an aligned way. The slowdown affects only nodes configured with the feature; there is no cluster-wide impact because the feature operates at node level. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? - - It will increase the kubelet's CPU usage time. 
If your system has more than 8 NUMA nodes, then you will not be able to run kubernetes on it without this feature. The purpose is to provide an escape hatch for those who are OK paying the price of increased latency for pod admission (and its associated CPU/RAM costs) in order to allow the kubelet to run on such a node. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? - - Same answer as above. ### Troubleshooting - - ###### How does this feature react if the API server and/or etcd is unavailable? + N/A ###### What are other known failure modes? -Setting a value lower 8 causes kubelet crash. +Keeping the default value will cause the kubelet to fail to start on machines with 9 or more NUMA cells if any topology manager policy other than `none` is also configured. ###### What steps should be taken if SLOs are not being met to determine the problem? - As a cluster administrator you should know the number of NUMA nodes on your nodes and adjust the value of the kubelet's topologyManager options or turn it off. +As a cluster administrator you should know the number of NUMA nodes on your nodes and adjust the value of the kubelet's topologyManager options or turn it off. ## Implementation History - - - 2024-05-08 - initial KEP draft created - 2024-06-06 - updates per review feedback +- 2025-10-07 - promote it to GA ## Drawbacks @@ -694,17 +369,5 @@ With this feature: you get a potential slowdown, but at least the kubelet will r ## Alternatives - - Adding a new kubelet configuration option. 
- \ No newline at end of file diff --git a/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/kep.yaml b/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/kep.yaml index 39c50f56bac..460d26f8b83 100644 --- a/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/kep.yaml +++ b/keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/kep.yaml @@ -2,37 +2,42 @@ title: New TopologyManager Policy which configure the value of maxAllowableNUMAN kep-number: 4622 authors: - "@cyclinder" + - "@ffromani" # ONLY for GA graduation and PRR review owning-sig: sig-node participating-sigs: [] status: implementable -creation-date: "2024-05-08" +creation-date: "2025-02-15" reviewers: - "@klueska" - "@ffromani" approvers: - - "@sig-node-tech-leads" -see-also: [] + - "@klueska" +see-also: + - "keps/sig-node/2625-cpumanager-policies-thread-placement/" + - "keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/" + - "keps/sig-node/3545-improved-multi-numa-alignment/" + - "keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/" + - "keps/sig-node/4540-strict-cpu-reservation" + - "keps/sig-node/4800-cpumanager-split-uncorecache/" replaces: [] # The target maturity stage in the current dev cycle for this KEP. -stage: beta +stage: stable # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.31" +latest-milestone: "v1.35" # The milestone at which this feature was, or is targeted to be, at each stage. 
+# (ffromani): started as beta xref: https://github.com/kubernetes/enhancements/issues/4622#issuecomment-2150320232 milestone: beta: "v1.31" - stable: "v1.32" + stable: "v1.35" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled feature-gates: - - name: "TopologyManagerPolicyBetaOptions" - components: - - kubelet - name: "TopologyManagerPolicyOptions" components: - kubelet @@ -40,4 +45,4 @@ disable-supported: true # The following PRR answers are required at beta release metrics: - - kubelet_topology_manager_admission_time + - topology_manager_admission_duration_ms
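
For reviewers who want to see what the "Other" enablement mechanism in the README looks like on a node, here is a minimal sketch of a kubelet configuration fragment. It is not part of this change. The policy `single-numa-node` and the value `"16"` are illustrative assumptions; any policy other than `none` and any limit suited to the node's hardware could stand in for them. The kubelet must be restarted after the file is edited, as the README notes.

```yaml
# Illustrative fragment only, not part of this diff.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Assumed policy; the option only matters when the policy is not "none".
topologyManagerPolicy: single-numa-node
topologyManagerPolicyOptions:
  # Assumed value; policy option values are strings in this map.
  max-allowable-numa-nodes: "16"
```

Leaving `max-allowable-numa-nodes` unset keeps the default limit of 8 described in the SLO answer of the README.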