Commit b6b9de2

Merge pull request #51878 from asa3311/sync-zh-191
[zh] sync dra
2 parents abd7c05 + 4c7abf0 commit b6b9de2

1 file changed (+60 -51 lines)
content/zh-cn/docs/concepts/cluster-administration/dra.md

Lines changed: 60 additions & 51 deletions
@@ -57,7 +57,7 @@ DRA 驱动是运行在集群的每个节点上的第三方应用,对接节点
 
 DRA drivers implement the [`kubeletplugin` package
 interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin).
-Your driver may support seamless upgrades by implementing a property of this
+Your driver may support _seamless upgrades_ by implementing a property of this
 interface that allows two versions of the same DRA driver to coexist for a short
 time. This is only available for kubelet versions 1.33 and above and may not be
 supported by your driver for heterogeneous clusters with attached nodes running
@@ -67,7 +67,7 @@ older versions of Kubernetes - check your driver's documentation to be sure.
 
 DRA 驱动实现
 [`kubeletplugin` 包接口](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin)
-你的驱动可能通过实现此接口的一个属性,支持两个版本共存一段时间,从而实现无缝升级
+你的驱动可能通过实现此接口的一个属性,支持两个版本共存一段时间,从而实现**无缝升级**
 该功能仅适用于 kubelet v1.33 及更高版本,对于运行旧版 Kubernetes 的节点所组成的异构集群,
 可能不支持这种功能。请查阅你的驱动文档予以确认。
 
@@ -98,7 +98,7 @@ observe that:
 <!--
 ### Confirm your DRA driver exposes a liveness probe and utilize it
 
-Your DRA driver likely implements a grpc socket for healthchecks as part of DRA
+Your DRA driver likely implements a gRPC socket for healthchecks as part of DRA
 driver good practices. The easiest way to utilize this grpc socket is to
 configure it as a liveness probe for the DaemonSet deploying your DRA driver.
 Your driver's documentation or deployment tooling may already include this, but
@@ -110,7 +110,7 @@ heal, reducing scheduling delays or troubleshooting time.
 -->
 ### 确认你的 DRA 驱动暴露了存活探针并加以利用 {#confirm-your-dra-driver-exposes-a-liveness-probe-and-utilize-it}
 
-你的 DRA 驱动可能已实现用于健康检查的 grpc 套接字,这是 DRA 驱动的良好实践之一。
+你的 DRA 驱动可能已实现用于健康检查的 gRPC 套接字,这是 DRA 驱动的良好实践之一。
 最简单的利用方式是将该 grpc 套接字配置为部署 DRA 驱动 DaemonSet 的存活探针。
 驱动文档或部署工具可能已包括此项配置,但如果你是自行配置或未以 Kubernetes Pod 方式运行 DRA 驱动,
 确保你的编排工具在该 grpc 套接字健康检查失败时能重启驱动。这样可以最大程度地减少 DRA 驱动的意外停机,
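As a concrete illustration of the probe wiring described in this hunk (illustrative only, not part of the commit): assuming the driver serves its health-check gRPC endpoint on a TCP port, the driver DaemonSet's container spec could carry a standard gRPC liveness probe. The image name and port 51515 are placeholders; check your driver's documentation for the real endpoint.

```yaml
# Excerpt of the driver DaemonSet's Pod template (names and port are placeholders)
containers:
- name: driver
  image: registry.example/dra-driver:v0.2.0
  livenessProbe:
    grpc:
      port: 51515          # port where the driver's health-check gRPC service listens
    initialDelaySeconds: 10
    periodSeconds: 20
    failureThreshold: 3
```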
@@ -136,25 +136,29 @@ ResourceClaim 或 ResourceClaimTemplate。
 <!--
 ## Monitor and tune components for higher load, especially in high scale environments
 
-Control plane component `kube-scheduler` and the internal ResourceClaim
-controller orchestrated by the component `kube-controller-manager` do the heavy
-lifting during scheduling of Pods with claims based on metadata stored in the
-DRA APIs. Compared to non-DRA scheduled Pods, the number of API server calls,
-memory, and CPU utilization needed by these components is increased for Pods
-using DRA claims. In addition, node local components like the DRA driver and
-kubelet utilize DRA APIs to allocated the hardware request at Pod sandbox
+Control plane component {{< glossary_tooltip text="kube-scheduler"
+term_id="kube-scheduler" >}} and the internal ResourceClaim controller
+orchestrated by the component {{< glossary_tooltip
+text="kube-controller-manager" term_id="kube-controller-manager" >}} do the
+heavy lifting during scheduling of Pods with claims based on metadata stored in
+the DRA APIs. Compared to non-DRA scheduled Pods, the number of API server
+calls, memory, and CPU utilization needed by these components is increased for
+Pods using DRA claims. In addition, node local components like the DRA driver
+and kubelet utilize DRA APIs to allocated the hardware request at Pod sandbox
 creation time. Especially in high scale environments where clusters have many
 nodes, and/or deploy many workloads that heavily utilize DRA defined resource
 claims, the cluster administrator should configure the relevant components to
 anticipate the increased load.
 -->
 ## 在大规模环境中在高负载场景下监控和调优组件 {#monitor-and-tune-components-for-higher-load-especially-in-high-scale-environments}
 
-控制面组件 `kube-scheduler` 以及 `kube-controller-manager` 中的内部 ResourceClaim
-控制器在调度使用 DRA 申领的 Pod 时承担了大量任务。与不使用 DRA 的 Pod 相比,这些组件所需的
-API 服务器调用次数、内存和 CPU 使用率都更高。此外,节点本地组件(如 DRA 驱动和 kubelet)也在创建
-Pod 沙箱时使用 DRA API 分配硬件请求资源。
-尤其在集群节点数量众多或大量工作负载依赖 DRA 定义的资源申领时,集群管理员应当预先为相关组件配置合理参数以应对增加的负载。
+控制面组件 {{< glossary_tooltip text="kube-scheduler" term_id="kube-scheduler" >}}
+以及 {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}}
+中的内部 ResourceClaim 控制器在调度使用 DRA 申领的 Pod 时承担了大量任务。与不使用 DRA 的 Pod 相比,
+这些组件所需的 API 服务器调用次数、内存和 CPU 使用率都更高。此外,
+节点本地组件(如 DRA 驱动和 kubelet)也在创建 Pod 沙箱时使用 DRA API 分配硬件请求资源。
+尤其在集群节点数量众多或大量工作负载依赖 DRA 定义的资源申领时,
+集群管理员应当预先为相关组件配置合理参数以应对增加的负载。
 
 <!--
 The effects of mistuned components can have direct or snowballing affects
@@ -171,26 +175,29 @@ client-go configuration within `kube-controller-manager` are critical.
 <!--
 The specific values to tune your cluster to depend on a variety of factors like
 number of nodes/pods, rate of pod creation, churn, even in non-DRA environments;
-see the [SIG-Scalability README on Kubernetes scalability
+see the [SIG Scalability README on Kubernetes scalability
 thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
 for more information. In scale tests performed against a DRA enabled cluster
 with 100 nodes, involving 720 long-lived pods (90% saturation) and 80 churn pods
 (10% churn, 10 times), with a job creation QPS of 10, `kube-controller-manager`
 QPS could be set to as low as 75 and Burst to 150 to meet equivalent metric
 targets for non-DRA deployments. At this lower bound, it was observed that the
-client side rate limiter was triggered enough to protect apiserver from
-explosive burst but was is high enough that pod startup SLOs were not impacted.
+client side rate limiter was triggered enough to protect the API server from
+explosive burst but was high enough that pod startup SLOs were not impacted.
 While this is a good starting point, you can get a better idea of how to tune
 the different components that have the biggest effect on DRA performance for
-your deployment by monitoring the following metrics.
+your deployment by monitoring the following metrics. For more information on all
+the stable metrics in Kubernetes, see the [Kubernetes Metrics
+Reference](/docs/reference/generated/metrics/).
 -->
 集群调优所需的具体数值取决于多个因素,如节点/Pod 数量、Pod 创建速率、变化频率,甚至与是否使用 DRA 无关。更多信息请参考
-[SIG-Scalability README 中的可扩缩性阈值](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
+[SIG Scalability README 中的可扩缩性阈值](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
 在一项针对启用了 DRA 的 100 节点集群的规模测试中,部署了 720 个长生命周期 Pod(90% 饱和度)和 80
 个短周期 Pod(10% 流失,重复 10 次),作业创建 QPS 为 10。将 `kube-controller-manager` 的 QPS
 设置为 75、Burst 设置为 150,能达到与非 DRA 部署中相同的性能指标。在这个下限设置下,
 客户端速率限制器能有效保护 API 服务器避免突发请求,同时不影响 Pod 启动 SLO。
 这可作为一个良好的起点。你可以通过监控下列指标,进一步判断对 DRA 性能影响最大的组件,从而优化其配置。
+有关 Kubernetes 中所有稳定指标的更多信息,请参阅 [Kubernetes 指标参考](/zh-cn/docs/reference/generated/metrics/)
 
 <!--
 ### `kube-controller-manager` metrics
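The QPS and Burst values discussed in the hunk above correspond to `kube-controller-manager`'s `--kube-api-qps` and `--kube-api-burst` flags. A minimal sketch of applying them on a kubeadm-style control plane (illustrative only, not part of the commit; excerpt of the static Pod manifest, other flags left unchanged):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --kube-api-qps=75      # client QPS toward the API server
    - --kube-api-burst=150   # client burst toward the API server
    # ...existing flags unchanged...
```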
@@ -203,24 +210,22 @@ managed by the `kube-controller-manager` component.
 以下指标聚焦于由 `kube-controller-manager` 组件管理的内部 ResourceClaim 控制器:
 
 <!--
-* Workqueue Add Rate: Monitor
-`sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))` to gauge how
-quickly items are added to the ResourceClaim controller.
+* Workqueue Add Rate: Monitor {{< highlight promql "hl_inline=true" >}} sum(rate(workqueue_adds_total{name="resource_claim"}[5m])) {{< /highlight >}} to gauge how quickly items are added to the ResourceClaim controller.
 * Workqueue Depth: Track
-`sum(workqueue_depth{endpoint="kube-controller-manager",
-name="resource_claim"})` to identify any backlogs in the ResourceClaim
+{{< highlight promql "hl_inline=true" >}}sum(workqueue_depth{endpoint="kube-controller-manager",
+name="resource_claim"}){{< /highlight >}} to identify any backlogs in the ResourceClaim
 controller.
-* Workqueue Work Duration: Observe `histogram_quantile(0.99,
+* Workqueue Work Duration: Observe {{< highlight promql "hl_inline=true">}}histogram_quantile(0.99,
 sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m]))
-by (le))` to understand the speed at which the ResourceClaim controller
+by (le)){{< /highlight >}} to understand the speed at which the ResourceClaim controller
 processes work.
 -->
-* 工作队列添加速率:监控 `sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))`
+* 工作队列添加速率:监控 {{< highlight promql "hl_inline=true" >}}sum(rate(workqueue_adds_total{name="resource_claim"}[5m])){{< /highlight >}}
 以衡量任务加入 ResourceClaim 控制器的速度。
-* 工作队列深度:跟踪 `sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"})`
+* 工作队列深度:跟踪 {{< highlight promql "hl_inline=true" >}}sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}){{< /highlight >}}
 识别 ResourceClaim 控制器中是否存在积压。
 * 工作队列处理时长:观察
-`histogram_quantile(0.99, sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le))`
+{{< highlight promql "hl_inline=true">}}histogram_quantile(0.99, sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le)){{< /highlight >}}
 以了解 ResourceClaim 控制器的处理速度。
 
 <!--
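If you scrape these metrics with Prometheus, the workqueue queries above can also back an alert. A minimal rule sketch (illustrative only, not part of the commit; the group name, 100-item threshold, and 10-minute window are arbitrary starting points to adjust for your cluster):

```yaml
groups:
- name: dra-resourceclaim-controller
  rules:
  - alert: ResourceClaimWorkqueueBacklog
    # Fires when the ResourceClaim controller's workqueue stays backed up.
    expr: sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: ResourceClaim controller workqueue is backing up
```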
@@ -249,7 +254,7 @@ manageable.
 The following scheduler metrics are high level metrics aggregating performance
 across all Pods scheduled, not just those using DRA. It is important to note
 that the end-to-end metrics are ultimately influenced by the
-kube-controller-manager's performance in creating ResourceClaims from
+`kube-controller-manager`'s performance in creating ResourceClaims from
 ResourceClainTemplates in deployments that heavily use ResourceClainTemplates.
 -->
 ### `kube-scheduler` 指标 {#kube-scheduler-metrics}
@@ -259,17 +264,17 @@ ResourceClainTemplates in deployments that heavily use ResourceClainTemplates.
 的性能影响,尤其在广泛使用 ResourceClaimTemplate 的部署中。
 
 <!--
-* Scheduler End-to-End Duration: Monitor `histogram_quantile(0.99,
+* Scheduler End-to-End Duration: Monitor {{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99,
 sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by
-(le))`.
-* Scheduler Algorithm Latency: Track `histogram_quantile(0.99,
+(le)){{< /highlight >}}.
+* Scheduler Algorithm Latency: Track {{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99,
 sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by
-(le))`.
+(le)){{< /highlight >}}.
 -->
 * 调度器端到端耗时:监控
-`histogram_quantile(0.99, sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le))`
+{{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99, sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le)){{< /highlight >}}。
 * 调度器算法延迟:跟踪
-`histogram_quantile(0.99, sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))`
+{{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99, sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)){{< /highlight >}}。
 
 <!--
 ### `kubelet` metrics
@@ -285,17 +290,17 @@ following metrics.
 `NodePrepareResources``NodeUnprepareResources` 方法。你可以通过以下指标从 kubelet 的角度观察其行为。
 
 <!--
-* Kubelet NodePrepareResources: Monitor `histogram_quantile(0.99,
+* Kubelet NodePrepareResources: Monitor {{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99,
 sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m]))
-by (le))`.
-* Kubelet NodeUnprepareResources: Track `histogram_quantile(0.99,
+by (le)){{< /highlight >}}.
+* Kubelet NodeUnprepareResources: Track {{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99,
 sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m]))
-by (le))`.
+by (le)){{< /highlight >}}.
 -->
 * kubelet 调用 PrepareResources:监控
-`histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m])) by (le))`
+{{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m])) by (le)){{< /highlight >}}。
 * kubelet 调用 UnprepareResources:跟踪
-`histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m])) by (le))`
+{{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m])) by (le)){{< /highlight >}}。
 <!--
 ### DRA kubeletplugin operations
 
@@ -313,21 +318,25 @@ DRA 驱动实现 [`kubeletplugin` 包接口](https://pkg.go.dev/k8s.io/dynamic-r
 你可以从内部 kubeletplugin 的角度通过以下指标观察其行为:
 
 <!--
-* DRA kubeletplugin gRPC NodePrepareResources operation: Observe `histogram_quantile(0.99,
+* DRA kubeletplugin gRPC NodePrepareResources operation: Observe {{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99,
 sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m]))
-by (le))`
-* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe `histogram_quantile(0.99,
+by (le)){{< /highlight >}}.
+* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
 sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m]))
-by (le))`.
+by (le)){{< /highlight >}}.
 -->
 * DRA kubeletplugin 的 NodePrepareResources 操作:观察
-`histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m])) by (le))`
+{{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m])) by (le)){{< /highlight >}}。
 * DRA kubeletplugin 的 NodeUnprepareResources 操作:观察
-`histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m])) by (le))`
+{{< highlight promql "hl_inline=true" >}}histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m])) by (le)){{< /highlight >}}。
 
 ## {{% heading "whatsnext" %}}
 
 <!--
-* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
+* [Learn more about
+DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
+* Read the [Kubernetes Metrics
+Reference](/docs/reference/generated/metrics/)
 -->
 * [进一步了解 DRA](/zh-cn/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
+* 阅读 [Kubernetes 指标参考](/zh-cn/docs/reference/generated/metrics/)
