---
title: Good practices for Dynamic Resource Allocation as a Cluster Admin
content_type: concept
weight: 60
---

<!-- overview -->
This page describes good practices when configuring a Kubernetes cluster
utilizing Dynamic Resource Allocation (DRA). These instructions are for cluster
administrators.

<!-- body -->
## Separate permissions to DRA related APIs

DRA is orchestrated through a number of different APIs. Use authorization tools
(like RBAC, or another solution) to control access to the right APIs depending
on the persona of your user.

In general, DeviceClasses and ResourceSlices should be restricted to admins and
the DRA drivers. Cluster operators that will be deploying Pods with claims will
need access to the ResourceClaim and ResourceClaimTemplate APIs; both of these
APIs are namespace scoped.
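
For example, the following RBAC sketch grants a hypothetical operator group
write access to claims in one namespace, while keeping the cluster-scoped DRA
APIs read-only for that group. The namespace and group names are placeholders;
adapt them to your own authorization setup.

```yaml
# Sketch only: the namespace and group names are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dra-claims-editor
  namespace: example-workloads
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims", "resourceclaimtemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dra-claims-editor
  namespace: example-workloads
subjects:
- kind: Group
  name: example:workload-operators   # hypothetical operator group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dra-claims-editor
  apiGroup: rbac.authorization.k8s.io
---
# Write access to DeviceClasses and ResourceSlices stays with admins and the
# DRA drivers; operators only need read access, if that.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dra-device-viewer
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["deviceclasses", "resourceslices"]
  verbs: ["get", "list", "watch"]
```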

## DRA driver deployment and maintenance

DRA drivers are third-party applications that run on each node of your cluster
to interface with the hardware of that node and Kubernetes' native DRA
components. The installation procedure depends on the driver you choose, but the
driver is likely deployed as a DaemonSet to all of the nodes in your cluster, or
to a selection of them (using node selectors or similar mechanisms).
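
As an illustration, a driver's DaemonSet might be limited to the nodes that
actually have the relevant hardware with a node selector similar to the sketch
below; the label, image, and names are hypothetical, and your driver's manifests
will differ.

```yaml
# Sketch only: the node label, image, and names are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-dra-driver
  namespace: dra-driver
spec:
  selector:
    matchLabels:
      app: example-dra-driver
  template:
    metadata:
      labels:
        app: example-dra-driver
    spec:
      nodeSelector:
        example.com/has-accelerator: "true"   # only schedule onto nodes with the hardware
      containers:
      - name: driver
        image: registry.example.com/dra-driver:v1.0.0
```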

### Use drivers with seamless upgrade if available

DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin).
Your driver may support seamless upgrades by implementing a property of this
interface that allows two versions of the same DRA driver to coexist for a short
time. This is only available for kubelet versions 1.33 and above and may not be
supported by your driver for heterogeneous clusters with attached nodes running
older versions of Kubernetes, so check your driver's documentation to be sure.

If seamless upgrades are available for your situation, consider using them to
minimize scheduling delays when your driver updates.

If you cannot use seamless upgrades, then during driver downtime for upgrades
you may observe that:
* Pods cannot start unless the claims they depend on were already prepared for
  use.
* Cleanup after the last Pod that used a claim is delayed until the driver is
  available again. The Pod is not marked as terminated, which prevents the
  resources used by that Pod from being reused by other Pods.
* Running Pods will continue to run.

### Confirm your DRA driver exposes a liveness probe and utilize it

Your DRA driver likely implements a gRPC socket for health checks, as part of
DRA driver good practices. The easiest way to utilize this gRPC socket is to
configure it as a liveness probe for the DaemonSet deploying your DRA driver.
Your driver's documentation or deployment tooling may already include this, but
if you are building your configuration separately or are not running your DRA
driver as a Kubernetes Pod, be sure that your orchestration tooling restarts the
DRA driver when health checks against this gRPC socket fail. Doing so will
minimize any accidental downtime of the DRA driver and give it more
opportunities to self-heal, reducing scheduling delays and troubleshooting time.
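
If your driver serves the standard gRPC health service on a TCP port, the
container spec in its DaemonSet could declare the probe roughly as in the sketch
below. The port and timing values are illustrative, and some drivers expose the
health endpoint over a Unix socket instead, so check your driver's documentation.

```yaml
# Excerpt of a DaemonSet container spec (sketch only); the port number and
# probe timings are illustrative.
containers:
- name: driver
  image: registry.example.com/dra-driver:v1.0.0
  livenessProbe:
    grpc:
      port: 51515          # hypothetical health port exposed by the driver
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
```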

### When draining a node, drain the DRA driver as late as possible

The DRA driver is responsible for unpreparing any devices that were allocated to
Pods, and if the DRA driver is {{< glossary_tooltip text="drained"
term_id="drain" >}} before Pods with claims have been deleted, it will not be
able to finalize its cleanup. If you implement custom drain logic for nodes,
consider checking that there are no allocated or reserved ResourceClaims or
ResourceClaimTemplates before terminating the DRA driver itself.

## Monitor and tune components for higher load, especially in high scale environments

The `kube-scheduler` control plane component and the internal ResourceClaim
controller orchestrated by the `kube-controller-manager` component do the heavy
lifting during scheduling of Pods with claims, based on metadata stored in the
DRA APIs. Compared to Pods scheduled without DRA, the number of API server
calls, memory, and CPU utilization needed by these components is increased for
Pods using DRA claims. In addition, node-local components like the DRA driver
and the kubelet utilize DRA APIs to prepare the allocated hardware at Pod
sandbox creation time. Especially in high scale environments where clusters have
many nodes, and/or deploy many workloads that heavily utilize DRA-defined
resource claims, the cluster administrator should configure the relevant
components to anticipate the increased load.

The effects of mistuned components can be direct or can snowball, causing
different symptoms during the Pod lifecycle. If the `kube-scheduler`
component's QPS and burst configurations are too low, the scheduler might
quickly identify a suitable node for a Pod but take longer to bind the Pod to
that node. With DRA, during Pod scheduling, the QPS and Burst parameters in the
client-go configuration within `kube-controller-manager` are critical.
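
For `kube-scheduler`, these client-side limits live in the `clientConnection`
block of its configuration file; for `kube-controller-manager`, the equivalent
client-go settings are exposed as the `--kube-api-qps` and `--kube-api-burst`
flags. A sketch of the scheduler side follows; the values are purely
illustrative, not a recommendation.

```yaml
# kube-scheduler configuration sketch; the qps and burst values are illustrative.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
  qps: 100    # sustained requests per second to the API server
  burst: 200  # short-term burst allowance above the sustained QPS
```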

The specific values to tune your cluster to depend on a variety of factors, like
the number of nodes and Pods, the rate of Pod creation, and churn, even in
non-DRA environments; see the [SIG-Scalability README on Kubernetes scalability
thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
for more information. In scale tests performed against a DRA-enabled cluster
with 100 nodes, involving 720 long-lived pods (90% saturation) and 80 churn pods
(10% churn, 10 times), with a job creation QPS of 10, `kube-controller-manager`
QPS could be set to as low as 75 and Burst to 150 to meet equivalent metric
targets for non-DRA deployments. At this lower bound, it was observed that the
client-side rate limiter was triggered often enough to protect the API server
from explosive bursts but was high enough that Pod startup SLOs were not
impacted. While this is a good starting point, you can get a better idea of how
to tune the different components that have the biggest effect on DRA performance
for your deployment by monitoring the following metrics.
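
Before turning to the metrics, here is how the values from the scale test above
could be applied to `kube-controller-manager` on a cluster that runs it as a
static Pod (for example, a kubeadm-based cluster). This is a sketch of a
manifest excerpt, not a recommendation; tune the values for your own cluster.

```yaml
# Excerpt of a kube-controller-manager static Pod manifest (sketch only);
# the QPS and burst values mirror the scale test described above.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --kube-api-qps=75
    - --kube-api-burst=150
    # ... other flags unchanged ...
```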

### `kube-controller-manager` metrics

The following metrics look closely at the internal ResourceClaim controller
managed by the `kube-controller-manager` component.

* Workqueue Add Rate: Monitor
  `sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))` to gauge how
  quickly items are added to the ResourceClaim controller.
* Workqueue Depth: Track
  `sum(workqueue_depth{endpoint="kube-controller-manager",
  name="resource_claim"})` to identify any backlogs in the ResourceClaim
  controller.
* Workqueue Work Duration: Observe `histogram_quantile(0.99,
  sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m]))
  by (le))` to understand the speed at which the ResourceClaim controller
  processes work.

If you are experiencing a low Workqueue Add Rate, a high Workqueue Depth, and/or
a high Workqueue Work Duration, this suggests the controller isn't performing
optimally. Consider tuning parameters like QPS, burst, and CPU/memory
configurations.

If you are experiencing a high Workqueue Add Rate and a high Workqueue Depth,
but a reasonable Workqueue Work Duration, this indicates that the controller is
processing work but its concurrency might be insufficient. Concurrency is
hardcoded in the controller, so as a cluster administrator, you can tune for
this by reducing the Pod creation QPS so that the add rate to the ResourceClaim
workqueue is more manageable.
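
If you already run Prometheus, the workqueue depth query above could be turned
into an alerting rule to catch sustained backlogs. The threshold and duration in
the sketch below are arbitrary placeholders; pick values that reflect your own
baseline.

```yaml
# Prometheus rule file sketch; the threshold and "for" duration are placeholders.
groups:
- name: dra-resourceclaim-controller
  rules:
  - alert: ResourceClaimWorkqueueBacklog
    expr: sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: ResourceClaim controller workqueue depth is persistently high
```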

### `kube-scheduler` metrics

The following scheduler metrics are high-level metrics aggregating performance
across all Pods scheduled, not just those using DRA. Note that in deployments
that heavily use ResourceClaimTemplates, the end-to-end metrics are ultimately
influenced by the kube-controller-manager's performance in creating
ResourceClaims from those templates.

* Scheduler End-to-End Duration: Monitor `histogram_quantile(0.99,
  sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by
  (le))`.
* Scheduler Algorithm Latency: Track `histogram_quantile(0.99,
  sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by
  (le))`.

### `kubelet` metrics

When a Pod bound to a node must have a ResourceClaim satisfied, the kubelet
calls the `NodePrepareResources` and `NodeUnprepareResources` methods of the DRA
driver. You can observe this behavior from the kubelet's point of view with the
following metrics.

* Kubelet NodePrepareResources: Monitor `histogram_quantile(0.99,
  sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m]))
  by (le))`.
* Kubelet NodeUnprepareResources: Track `histogram_quantile(0.99,
  sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m]))
  by (le))`.

### DRA kubeletplugin operations

DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin),
which surfaces its own metrics for the underlying gRPC operations
`NodePrepareResources` and `NodeUnprepareResources`. You can observe this
behavior from the point of view of the internal kubeletplugin with the following
metrics.

* DRA kubeletplugin gRPC NodePrepareResources operation: Observe `histogram_quantile(0.99,
  sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m]))
  by (le))`.
* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe `histogram_quantile(0.99,
  sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m]))
  by (le))`.

## {{% heading "whatsnext" %}}

* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)