---
title: Good practices for Dynamic Resource Allocation as a Cluster Admin
content_type: concept
weight: 60
---

<!-- overview -->
This page describes good practices when configuring a Kubernetes cluster
utilizing Dynamic Resource Allocation (DRA). These instructions are for cluster
administrators.

<!-- body -->
## Separate permissions to DRA related APIs

DRA is orchestrated through a number of different APIs. Use authorization tools
(like RBAC, or another solution) to control access to the right APIs depending
on the persona of your user.

In general, DeviceClasses and ResourceSlices should be restricted to admins and
the DRA drivers. Cluster operators that will be deploying Pods with claims will
need access to the ResourceClaim and ResourceClaimTemplate APIs; both of these
APIs are namespace scoped.
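
For example, the following RBAC sketch grants a hypothetical operator group
write access to claims in one namespace, while keeping the cluster-scoped DRA
APIs read-only for that group. The namespace and group names are placeholders;
adapt them to your own authorization setup.

```yaml
# Sketch only: the namespace and group names are placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dra-claims-editor
  namespace: example-workloads
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims", "resourceclaimtemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dra-claims-editor
  namespace: example-workloads
subjects:
- kind: Group
  name: example:workload-operators   # hypothetical operator group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dra-claims-editor
  apiGroup: rbac.authorization.k8s.io
---
# Write access to DeviceClasses and ResourceSlices stays with admins and the
# DRA drivers; operators only need read access, if that.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dra-device-viewer
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["deviceclasses", "resourceslices"]
  verbs: ["get", "list", "watch"]
```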

## DRA driver deployment and maintenance

DRA drivers are third-party applications that run on each node of your cluster
to interface with the hardware of that node and Kubernetes' native DRA
components. The installation procedure depends on the driver you choose, but the
driver is likely deployed as a DaemonSet to all of the nodes in your cluster, or
to a selection of them (using node selectors or similar mechanisms).
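
As an illustration, a driver's DaemonSet might be limited to the nodes that
actually have the relevant hardware with a node selector similar to the sketch
below; the label, image, and names are hypothetical, and your driver's manifests
will differ.

```yaml
# Sketch only: the node label, image, and names are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-dra-driver
  namespace: dra-driver
spec:
  selector:
    matchLabels:
      app: example-dra-driver
  template:
    metadata:
      labels:
        app: example-dra-driver
    spec:
      nodeSelector:
        example.com/has-accelerator: "true"   # only schedule onto nodes with the hardware
      containers:
      - name: driver
        image: registry.example.com/dra-driver:v1.0.0
```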

### Use drivers with seamless upgrade if available

DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin).
Your driver may support seamless upgrades by implementing a property of this
interface that allows two versions of the same DRA driver to coexist for a short
time. This is only available for kubelet versions 1.33 and above and may not be
supported by your driver for heterogeneous clusters with attached nodes running
older versions of Kubernetes, so check your driver's documentation to be sure.

If seamless upgrades are available for your situation, consider using them to
minimize scheduling delays when your driver updates.

If you cannot use seamless upgrades, then during driver downtime for upgrades
you may observe that:
* Pods cannot start unless the claims they depend on were already prepared for
  use.
* Cleanup after the last Pod that used a claim is delayed until the driver is
  available again. The Pod is not marked as terminated, which prevents the
  resources used by that Pod from being reused by other Pods.
* Running Pods will continue to run.

### Confirm your DRA driver exposes a liveness probe and utilize it

Your DRA driver likely implements a gRPC socket for health checks, as part of
DRA driver good practices. The easiest way to utilize this gRPC socket is to
configure it as a liveness probe for the DaemonSet deploying your DRA driver.
Your driver's documentation or deployment tooling may already include this, but
if you are building your configuration separately or are not running your DRA
driver as a Kubernetes Pod, be sure that your orchestration tooling restarts the
DRA driver when health checks against this gRPC socket fail. Doing so will
minimize any accidental downtime of the DRA driver and give it more
opportunities to self-heal, reducing scheduling delays and troubleshooting time.
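
If your driver serves the standard gRPC health service on a TCP port, the
container spec in its DaemonSet could declare the probe roughly as in the sketch
below. The port and timing values are illustrative, and some drivers expose the
health endpoint over a Unix socket instead, so check your driver's documentation.

```yaml
# Excerpt of a DaemonSet container spec (sketch only); the port number and
# probe timings are illustrative.
containers:
- name: driver
  image: registry.example.com/dra-driver:v1.0.0
  livenessProbe:
    grpc:
      port: 51515          # hypothetical health port exposed by the driver
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
```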

### When draining a node, drain the DRA driver as late as possible

The DRA driver is responsible for unpreparing any devices that were allocated to
Pods, and if the DRA driver is {{< glossary_tooltip text="drained"
term_id="drain" >}} before Pods with claims have been deleted, it will not be
able to finalize its cleanup. If you implement custom drain logic for nodes,
consider checking that there are no allocated or reserved ResourceClaims or
ResourceClaimTemplates before terminating the DRA driver itself.

## Monitor and tune components for higher load, especially in high scale environments

The `kube-scheduler` control plane component and the internal ResourceClaim
controller orchestrated by the `kube-controller-manager` component do the heavy
lifting during scheduling of Pods with claims, based on metadata stored in the
DRA APIs. Compared to Pods scheduled without DRA, the number of API server
calls, memory, and CPU utilization needed by these components is increased for
Pods using DRA claims. In addition, node-local components like the DRA driver
and the kubelet utilize DRA APIs to prepare the allocated hardware at Pod
sandbox creation time. Especially in high scale environments where clusters have
many nodes, and/or deploy many workloads that heavily utilize DRA-defined
resource claims, the cluster administrator should configure the relevant
components to anticipate the increased load.

The effects of mistuned components can be direct or can snowball, causing
different symptoms during the Pod lifecycle. If the `kube-scheduler`
component's QPS and burst configurations are too low, the scheduler might
quickly identify a suitable node for a Pod but take longer to bind the Pod to
that node. With DRA, during Pod scheduling, the QPS and Burst parameters in the
client-go configuration within `kube-controller-manager` are critical.
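
For `kube-scheduler`, these client-side limits live in the `clientConnection`
block of its configuration file; for `kube-controller-manager`, the equivalent
client-go settings are exposed as the `--kube-api-qps` and `--kube-api-burst`
flags. A sketch of the scheduler side follows; the values are purely
illustrative, not a recommendation.

```yaml
# kube-scheduler configuration sketch; the qps and burst values are illustrative.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
  qps: 100    # sustained requests per second to the API server
  burst: 200  # short-term burst allowance above the sustained QPS
```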

The specific values to tune your cluster to depend on a variety of factors, like
the number of nodes and Pods, the rate of Pod creation, and churn, even in
non-DRA environments; see the [SIG-Scalability README on Kubernetes scalability
thresholds](https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md)
for more information. In scale tests performed against a DRA-enabled cluster
with 100 nodes, involving 720 long-lived pods (90% saturation) and 80 churn pods
(10% churn, 10 times), with a job creation QPS of 10, `kube-controller-manager`
QPS could be set to as low as 75 and Burst to 150 to meet equivalent metric
targets for non-DRA deployments. At this lower bound, it was observed that the
client-side rate limiter was triggered often enough to protect the API server
from explosive bursts but was high enough that Pod startup SLOs were not
impacted. While this is a good starting point, you can get a better idea of how
to tune the different components that have the biggest effect on DRA performance
for your deployment by monitoring the following metrics.
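
Before turning to the metrics, here is how the values from the scale test above
could be applied to `kube-controller-manager` on a cluster that runs it as a
static Pod (for example, a kubeadm-based cluster). This is a sketch of a
manifest excerpt, not a recommendation; tune the values for your own cluster.

```yaml
# Excerpt of a kube-controller-manager static Pod manifest (sketch only);
# the QPS and burst values mirror the scale test described above.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --kube-api-qps=75
    - --kube-api-burst=150
    # ... other flags unchanged ...
```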

### `kube-controller-manager` metrics

The following metrics look closely at the internal ResourceClaim controller
managed by the `kube-controller-manager` component.

* Workqueue Add Rate: Monitor
  `sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))` to gauge how
  quickly items are added to the ResourceClaim controller.
* Workqueue Depth: Track
  `sum(workqueue_depth{endpoint="kube-controller-manager",
  name="resource_claim"})` to identify any backlogs in the ResourceClaim
  controller.
* Workqueue Work Duration: Observe `histogram_quantile(0.99,
  sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m]))
  by (le))` to understand the speed at which the ResourceClaim controller
  processes work.

If you are experiencing a low Workqueue Add Rate, a high Workqueue Depth, and/or
a high Workqueue Work Duration, this suggests the controller isn't performing
optimally. Consider tuning parameters like QPS, burst, and CPU/memory
configurations.

If you are experiencing a high Workqueue Add Rate and a high Workqueue Depth,
but a reasonable Workqueue Work Duration, this indicates that the controller is
processing work but its concurrency might be insufficient. Concurrency is
hardcoded in the controller, so as a cluster administrator, you can tune for
this by reducing the Pod creation QPS so that the add rate to the ResourceClaim
workqueue is more manageable.
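
If you already run Prometheus, the workqueue depth query above could be turned
into an alerting rule to catch sustained backlogs. The threshold and duration in
the sketch below are arbitrary placeholders; pick values that reflect your own
baseline.

```yaml
# Prometheus rule file sketch; the threshold and "for" duration are placeholders.
groups:
- name: dra-resourceclaim-controller
  rules:
  - alert: ResourceClaimWorkqueueBacklog
    expr: sum(workqueue_depth{endpoint="kube-controller-manager", name="resource_claim"}) > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: ResourceClaim controller workqueue depth is persistently high
```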

### `kube-scheduler` metrics

The following scheduler metrics are high-level metrics aggregating performance
across all Pods scheduled, not just those using DRA. Note that in deployments
that heavily use ResourceClaimTemplates, the end-to-end metrics are ultimately
influenced by the kube-controller-manager's performance in creating
ResourceClaims from those templates.

* Scheduler End-to-End Duration: Monitor `histogram_quantile(0.99,
  sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by
  (le))`.
* Scheduler Algorithm Latency: Track `histogram_quantile(0.99,
  sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by
  (le))`.

### `kubelet` metrics

When a Pod bound to a node must have a ResourceClaim satisfied, the kubelet
calls the `NodePrepareResources` and `NodeUnprepareResources` methods of the DRA
driver. You can observe this behavior from the kubelet's point of view with the
following metrics.

* Kubelet NodePrepareResources: Monitor `histogram_quantile(0.99,
  sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m]))
  by (le))`.
* Kubelet NodeUnprepareResources: Track `histogram_quantile(0.99,
  sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m]))
  by (le))`.

### DRA kubeletplugin operations

DRA drivers implement the [`kubeletplugin` package
interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation/kubeletplugin),
which surfaces its own metrics for the underlying gRPC operations
`NodePrepareResources` and `NodeUnprepareResources`. You can observe this
behavior from the point of view of the internal kubeletplugin with the following
metrics.

* DRA kubeletplugin gRPC NodePrepareResources operation: Observe `histogram_quantile(0.99,
  sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m]))
  by (le))`.
* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe `histogram_quantile(0.99,
  sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m]))
  by (le))`.

## {{% heading "whatsnext" %}}

* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)