@@ -57,7 +57,7 @@ DRA 驱动是运行在集群的每个节点上的第三方应用,对接节点
DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin).
- Your driver may support seamless upgrades by implementing a property of this
+ Your driver may support _seamless upgrades_ by implementing a property of this
interface that allows two versions of the same DRA driver to coexist for a short
time. This is only available for kubelet versions 1.33 and above and may not be
supported by your driver for heterogeneous clusters with attached nodes running
@@ -67,7 +67,7 @@ older versions of Kubernetes - check your driver's documentation to be sure.
DRA 驱动实现
[`kubeletplugin` 包接口](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin)。
- 你的驱动可能通过实现此接口的一个属性,支持两个版本共存一段时间,从而实现无缝升级。
+ 你的驱动可能通过实现此接口的一个属性,支持两个版本共存一段时间,从而实现**无缝升级**。
该功能仅适用于 kubelet v1.33 及更高版本,对于运行旧版 Kubernetes 的节点所组成的异构集群,
可能不支持这种功能。请查阅你的驱动文档予以确认。
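One way to take advantage of this coexistence property, assuming your driver and your kubelet version both support seamless upgrades, is to roll out the driver DaemonSet with a surge so that the old and new driver Pod briefly run side by side on each node. The sketch below is illustrative only: the DaemonSet name, namespace, labels, and image are placeholders, not taken from this page.

```yaml
# Illustrative only: names, labels, and image are placeholders.
# maxSurge on a DaemonSet briefly runs the old and new driver Pod together
# on the same node, which is the window the seamless-upgrade property of the
# kubeletplugin interface is meant to cover.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-dra-driver
  namespace: dra-example          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: example-dra-driver
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # start the replacement Pod before stopping the old one
      maxUnavailable: 0  # never leave a node without a running driver
  template:
    metadata:
      labels:
        app: example-dra-driver
    spec:
      containers:
      - name: driver
        image: example.com/dra-driver:v1.1.0   # the new driver version
```

If your driver does not support this property, keep the DaemonSet default of `maxUnavailable: 1` and `maxSurge: 0` so that only one driver Pod runs per node at a time.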
@@ -98,7 +98,7 @@ observe that:
<!--
### Confirm your DRA driver exposes a liveness probe and utilize it

- Your DRA driver likely implements a grpc socket for healthchecks as part of DRA
+ Your DRA driver likely implements a gRPC socket for healthchecks as part of DRA
driver good practices. The easiest way to utilize this gRPC socket is to
configure it as a liveness probe for the DaemonSet deploying your DRA driver.
Your driver's documentation or deployment tooling may already include this, but
@@ -110,7 +110,7 @@ heal, reducing scheduling delays or troubleshooting time.
-->
### 确认你的 DRA 驱动暴露了存活探针并加以利用 {#confirm-your-dra-driver-exposes-a-liveness-probe-and-utilize-it}

- 你的 DRA 驱动可能已实现用于健康检查的 grpc 套接字,这是 DRA 驱动的良好实践之一。
+ 你的 DRA 驱动可能已实现用于健康检查的 gRPC 套接字,这是 DRA 驱动的良好实践之一。
最简单的利用方式是将该 gRPC 套接字配置为部署 DRA 驱动的 DaemonSet 的存活探针。
驱动文档或部署工具可能已包括此项配置,但如果你是自行配置或未以 Kubernetes Pod 方式运行 DRA 驱动,
确保你的编排工具在该 gRPC 套接字健康检查失败时能重启驱动。这样可以最大程度地减少 DRA 驱动的意外停机,
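A minimal sketch of what this can look like for a driver deployed as a DaemonSet, assuming the driver serves gRPC health checks on a Unix socket under its kubelet plugin directory and that its image ships the `grpc_health_probe` binary; every name, path, and timing below is illustrative, not taken from this page:

```yaml
# Illustrative only: driver name, image, socket path, and probe timings are
# placeholders — check your DRA driver's documentation for the real values.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-dra-driver
spec:
  selector:
    matchLabels:
      app: example-dra-driver
  template:
    metadata:
      labels:
        app: example-dra-driver
    spec:
      containers:
      - name: driver
        image: example.com/dra-driver:v1.0.0    # hypothetical image
        volumeMounts:
        - name: plugin-dir
          mountPath: /var/lib/kubelet/plugins/example-dra-driver
        livenessProbe:
          exec:
            # Assumes grpc_health_probe is present in the image and that your
            # build of it supports Unix-socket addresses.
            command:
            - grpc_health_probe
            - -addr=unix:///var/lib/kubelet/plugins/example-dra-driver/dra.sock
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 3
      volumes:
      - name: plugin-dir
        hostPath:
          path: /var/lib/kubelet/plugins/example-dra-driver
          type: DirectoryOrCreate
```

The built-in `grpc` probe type only dials a TCP port, so an exec probe is the usual way to check a Unix-socket health endpoint; if your driver also exposes a TCP health port, the native gRPC probe works as well.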
@@ -136,25 +136,29 @@ ResourceClaim 或 ResourceClaimTemplate。
<!--
## Monitor and tune components for higher load, especially in high scale environments

- Control plane component `kube-scheduler` and the internal ResourceClaim
- controller orchestrated by the component `kube-controller-manager` do the heavy
- lifting during scheduling of Pods with claims based on metadata stored in the
- DRA APIs. Compared to non-DRA scheduled Pods, the number of API server calls,
- memory, and CPU utilization needed by these components is increased for Pods
- using DRA claims. In addition, node local components like the DRA driver and
- kubelet utilize DRA APIs to allocated the hardware request at Pod sandbox
+ Control plane component {{< glossary_tooltip text="kube-scheduler"
+ term_id="kube-scheduler" >}} and the internal ResourceClaim controller
+ orchestrated by the component {{< glossary_tooltip
+ text="kube-controller-manager" term_id="kube-controller-manager" >}} do the
+ heavy lifting during scheduling of Pods with claims based on metadata stored in
+ the DRA APIs. Compared to non-DRA scheduled Pods, the number of API server
+ calls, memory, and CPU utilization needed by these components is increased for
+ Pods using DRA claims. In addition, node local components like the DRA driver
+ and kubelet utilize DRA APIs to allocate the hardware request at Pod sandbox
creation time. Especially in high scale environments where clusters have many
nodes, and/or deploy many workloads that heavily utilize DRA defined resource
claims, the cluster administrator should configure the relevant components to
anticipate the increased load.
-->
## 在大规模环境中的高负载场景下监控和调优组件 {#monitor-and-tune-components-for-higher-load-especially-in-high-scale-environments}

- 控制面组件 `kube-scheduler` 以及 `kube-controller-manager` 中的内部 ResourceClaim
- 控制器在调度使用 DRA 申领的 Pod 时承担了大量任务。与不使用 DRA 的 Pod 相比,这些组件所需的
- API 服务器调用次数、内存和 CPU 使用率都更高。此外,节点本地组件(如 DRA 驱动和 kubelet)也在创建
- Pod 沙箱时使用 DRA API 分配硬件请求资源。
- 尤其在集群节点数量众多或大量工作负载依赖 DRA 定义的资源申领时,集群管理员应当预先为相关组件配置合理参数以应对增加的负载。
+ 控制面组件 {{< glossary_tooltip text="kube-scheduler" term_id="kube-scheduler" >}}
+ 以及 {{< glossary_tooltip text="kube-controller-manager" term_id="kube-controller-manager" >}}
+ 中的内部 ResourceClaim 控制器在调度使用 DRA 申领的 Pod 时承担了大量任务。与不使用 DRA 的 Pod 相比,
+ 这些组件所需的 API 服务器调用次数、内存和 CPU 使用率都更高。此外,
+ 节点本地组件(如 DRA 驱动和 kubelet)也在创建 Pod 沙箱时使用 DRA API 分配硬件请求资源。
+ 尤其在集群节点数量众多或大量工作负载依赖 DRA 定义的资源申领时,
+ 集群管理员应当预先为相关组件配置合理参数以应对增加的负载。

<!--
The effects of mistuned components can have direct or snowballing effects
@@ -171,26 +175,29 @@ client-go configuration within `kube-controller-manager` are critical.
<!--
The specific values to tune your cluster to depend on a variety of factors like
number of nodes/pods, rate of pod creation, churn, even in non-DRA environments;
- see the [SIG- Scalability README on Kubernetes scalability
+ see the [SIG Scalability README on Kubernetes scalability
thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
for more information. In scale tests performed against a DRA enabled cluster
with 100 nodes, involving 720 long-lived pods (90% saturation) and 80 churn pods
(10% churn, 10 times), with a job creation QPS of 10, `kube-controller-manager`
QPS could be set to as low as 75 and Burst to 150 to meet equivalent metric
targets for non-DRA deployments. At this lower bound, it was observed that the
- client side rate limiter was triggered enough to protect apiserver from
- explosive burst but was is high enough that pod startup SLOs were not impacted.
+ client side rate limiter was triggered enough to protect the API server from
+ explosive burst but was high enough that pod startup SLOs were not impacted.
While this is a good starting point, you can get a better idea of how to tune
the different components that have the biggest effect on DRA performance for
- your deployment by monitoring the following metrics.
+ your deployment by monitoring the following metrics. For more information on all
+ the stable metrics in Kubernetes, see the [Kubernetes Metrics
+ Reference](/docs/reference/generated/metrics/).
-->
集群调优所需的具体数值取决于多个因素,如节点/Pod 数量、Pod 创建速率、变化频率等,即使不使用 DRA 也是如此。更多信息请参考
- [SIG- Scalability README 中的可扩缩性阈值](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)。
+ [SIG Scalability README 中的可扩缩性阈值](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)。
在一项针对启用了 DRA 的 100 节点集群的规模测试中,部署了 720 个长生命周期 Pod(90% 饱和度)和 80
个短周期 Pod(10% 流失,重复 10 次),作业创建 QPS 为 10。将 `kube-controller-manager` 的 QPS
设置为 75、Burst 设置为 150,能达到与非 DRA 部署中相同的性能指标。在这个下限设置下,
客户端速率限制器能有效保护 API 服务器避免突发请求,同时不影响 Pod 启动 SLO。
这可作为一个良好的起点。你可以通过监控下列指标,进一步判断对 DRA 性能影响最大的组件,从而优化其配置。
+ 有关 Kubernetes 中所有稳定指标的更多信息,请参阅 [Kubernetes 指标参考](/zh-cn/docs/reference/generated/metrics/)。
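To make the tuning above concrete, here is a minimal sketch of how those values could be applied, assuming `kube-controller-manager` runs as a kubeadm-style static Pod whose manifest lives at `/etc/kubernetes/manifests/kube-controller-manager.yaml`; the path, image tag, and surrounding fields are illustrative, and only the two flags correspond to the scale test described above:

```yaml
# Illustrative static Pod fragment — adapt it to however your deployment
# tooling manages kube-controller-manager. Only the two flags below are the
# tuning knobs discussed in this section; 75/150 was the lower bound found
# in the 100-node scale test and is a starting point, not a recommendation.
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.33.0   # example version
    command:
    - kube-controller-manager
    - --kube-api-qps=75     # steady-state client QPS to the API server
    - --kube-api-burst=150  # short burst allowance above the QPS limit
    # ...keep the rest of the flags generated by your installer unchanged
```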
<!--
### `kube-controller-manager` metrics
@@ -203,24 +210,22 @@ managed by the `kube-controller-manager` component.
以下指标聚焦于由 `kube-controller-manager` 组件管理的内部 ResourceClaim 控制器:
<!--
- * Workqueue Add Rate: Monitor
- `sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))` to gauge how
- quickly items are added to the ResourceClaim controller.
+ * Workqueue Add Rate: Monitor {{< highlight promql "hl_inline=true" >}} sum(rate(workqueue_adds_total{name="resource_claim"}[5m])) {{< /highlight >}} to gauge how quickly items are added to the ResourceClaim controller.
* Workqueue Depth: Track
- `sum(workqueue_depth{endpoint="kube-controller-manager",
- name="resource_claim"})` to identify any backlogs in the ResourceClaim
+ {{< highlight promql "hl_inline=true" >}} sum(workqueue_depth{endpoint="kube-controller-manager",
+ name="resource_claim"}){{< /highlight >}} to identify any backlogs in the ResourceClaim
controller.
- * Workqueue Work Duration: Observe `histogram_quantile(0.99,
+ * Workqueue Work Duration: Observe {{< highlight promql "hl_inline=true">}} histogram_quantile(0.99,
sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m]))
- by (le))` to understand the speed at which the ResourceClaim controller
+ by (le)){{< /highlight >}} to understand the speed at which the ResourceClaim controller
processes work.
-->
- * 工作队列添加速率:监控 `sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))`,
+ * 工作队列添加速率:监控 {{< highlight promql "hl_inline=true" >}} sum(rate(workqueue_adds_total{name="resource_claim"}[5m])){{< /highlight >}},
以衡量任务加入 ResourceClaim 控制器的速度。
- * 工作队列深度:跟踪 `sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"})`,
+ * 工作队列深度:跟踪 {{< highlight promql "hl_inline=true" >}} sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}){{< /highlight >}},
识别 ResourceClaim 控制器中是否存在积压。
* 工作队列处理时长:观察
- `histogram_quantile(0.99, sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le))`,
+ {{< highlight promql "hl_inline=true">}} histogram_quantile(0.99, sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le)){{< /highlight >}},
以了解 ResourceClaim 控制器的处理速度。
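If you already run Prometheus against the control plane, these queries can be wired into alerting. Below is a sketch of a Prometheus rules file; the thresholds, durations, and rule names are placeholders to adjust to your own SLOs, and only the PromQL expressions come from the list above.

```yaml
# Illustrative Prometheus rules file — thresholds and names are examples only.
groups:
- name: dra-resourceclaim-controller
  rules:
  - alert: ResourceClaimWorkqueueBacklog
    # Sustained backlog in the ResourceClaim controller workqueue.
    expr: sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: ResourceClaim controller workqueue is backed up
  - alert: ResourceClaimWorkqueueSlow
    # p99 time spent processing one ResourceClaim workqueue item.
    expr: |
      histogram_quantile(0.99,
        sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m])) by (le)) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: ResourceClaim controller is processing work slowly
```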
<!--
@@ -249,7 +254,7 @@ manageable.
The following scheduler metrics are high level metrics aggregating performance
across all Pods scheduled, not just those using DRA. It is important to note
that the end-to-end metrics are ultimately influenced by the
- kube-controller-manager's performance in creating ResourceClaims from
+ `kube-controller-manager`'s performance in creating ResourceClaims from
ResourceClaimTemplates in deployments that heavily use ResourceClaimTemplates.
-->
### `kube-scheduler` 指标 {#kube-scheduler-metrics}
@@ -259,17 +264,17 @@ ResourceClainTemplates in deployments that heavily use ResourceClainTemplates.
的性能影响,尤其在广泛使用 ResourceClaimTemplate 的部署中。
<!--
- * Scheduler End-to-End Duration: Monitor `histogram_quantile(0.99,
+ * Scheduler End-to-End Duration: Monitor {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by
- (le))`.
- * Scheduler Algorithm Latency: Track `histogram_quantile(0.99,
+ (le)){{< /highlight >}}.
+ * Scheduler Algorithm Latency: Track {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by
- (le))`.
+ (le)){{< /highlight >}}.
-->
* 调度器端到端耗时:监控
- `histogram_quantile(0.99, sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le))`
+ {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99, sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le)){{< /highlight >}}。
* 调度器算法延迟:跟踪
- `histogram_quantile(0.99, sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))`
+ {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99, sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)){{< /highlight >}}。
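These two expressions are also reasonable candidates for Prometheus recording rules, so dashboards do not recompute the p99 on every refresh. A sketch under the same Prometheus assumption as above; the recorded metric names are arbitrary placeholders:

```yaml
# Illustrative recording rules — the record names are placeholders.
groups:
- name: dra-scheduler-latency
  rules:
  - record: cluster:scheduler_pod_scheduling_sli_duration_seconds:p99_5m
    expr: |
      histogram_quantile(0.99,
        sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by (le))
  - record: cluster:scheduler_scheduling_algorithm_duration_seconds:p99_5m
    expr: |
      histogram_quantile(0.99,
        sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le))
```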
<!--
### `kubelet` metrics
@@ -285,17 +290,17 @@ following metrics.
`NodePrepareResources` 和 `NodeUnprepareResources` 方法。你可以通过以下指标从 kubelet 的角度观察其行为。
<!--
- * Kubelet NodePrepareResources: Monitor `histogram_quantile(0.99,
+ * Kubelet NodePrepareResources: Monitor {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m]))
- by (le))`.
- * Kubelet NodeUnprepareResources: Track `histogram_quantile(0.99,
+ by (le)){{< /highlight >}}.
+ * Kubelet NodeUnprepareResources: Track {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m]))
- by (le))`.
+ by (le)){{< /highlight >}}.
-->
* kubelet 调用 PrepareResources:监控
- `histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m])) by (le))`
+ {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m])) by (le)){{< /highlight >}}。
* kubelet 调用 UnprepareResources:跟踪
- `histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m])) by (le))`
+ {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99, sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m])) by (le)){{< /highlight >}}。
<!--
### DRA kubeletplugin operations
@@ -313,21 +318,25 @@ DRA 驱动实现 [`kubeletplugin` 包接口](https://pkg.go.dev/k8s.io/dynamic-r
你可以从内部 kubeletplugin 的角度通过以下指标观察其行为:
<!--
- * DRA kubeletplugin gRPC NodePrepareResources operation: Observe `histogram_quantile(0.99,
+ * DRA kubeletplugin gRPC NodePrepareResources operation: Observe {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m]))
- by (le))`
- * DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe `histogram_quantile(0.99,
+ by (le)){{< /highlight >}}.
+ * DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99,
sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m]))
- by (le))`.
+ by (le)){{< /highlight >}}.
-->
* DRA kubeletplugin 的 NodePrepareResources 操作:观察
- `histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m])) by (le))`
+ {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m])) by (le)){{< /highlight >}}。
* DRA kubeletplugin 的 NodeUnprepareResources 操作:观察
- `histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m])) by (le))`
+ {{< highlight promql "hl_inline=true" >}} histogram_quantile(0.99, sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m])) by (le)){{< /highlight >}}。
## {{% heading "whatsnext" %}}

<!--
- * [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
+ * [Learn more about
+ DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
+ * Read the [Kubernetes Metrics
+ Reference](/docs/reference/generated/metrics/)
-->
* [进一步了解 DRA](/zh-cn/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
+ * 阅读 [Kubernetes 指标参考](/zh-cn/docs/reference/generated/metrics/)。