@@ -219,7 +219,7 @@ creating or modifying ResourceSlices.
Consider the following example ResourceSlice:

```yaml
-apiVersion: resource.k8s.io/v1beta1
+apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: cat-slice
@@ -233,14 +233,13 @@ spec:
  allNodes: true
  devices:
  - name: "large-black-cat"
-    basic:
-      attributes:
-        color:
-          string: "black"
-        size:
-          string: "large"
-        cat:
-          boolean: true
+    attributes:
+      color:
+        string: "black"
+      size:
+        string: "large"
+      cat:
+        boolean: true
```
This ResourceSlice is managed by the `resource-driver.example.com` driver in the
`black-cat-pool` pool. The `allNodes: true` field indicates that any node in the
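To illustrate how such attributes are consumed, a ResourceClaim could select this device with a CEL expression over the driver-qualified attributes. This is a minimal sketch, not part of the diff: the DeviceClass name `example-device-class` and the claim name are assumptions.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: black-cat-claim        # hypothetical name
spec:
  devices:
    requests:
    - name: cat
      exactly:
        # Hypothetical DeviceClass; any class that admits this driver works.
        deviceClassName: example-device-class
        selectors:
        - cel:
            # Attributes are qualified by the name of the driver that published them.
            expression: device.attributes["resource-driver.example.com"].cat == true
```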
@@ -399,7 +398,7 @@ admin access grants access to in-use devices and may enable additional
permissions when making the device available in a container:

```yaml
-apiVersion: resource.k8s.io/v1beta2
+apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
```
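The rest of this example is collapsed in the diff. For orientation, a minimal sketch of what an `adminAccess` request can look like under the v1 API; the request name and DeviceClass are assumptions, and using admin access may additionally require the `DRAAdminAccess` feature gate:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    devices:
      requests:
      - name: admin-cat                            # hypothetical request name
        exactly:
          deviceClassName: example-device-class    # hypothetical class
          adminAccess: true                        # grants access even to in-use devices
```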
@@ -441,7 +440,7 @@ allocated if it is available. But if it is not and two small white devices are a
the pod will still be able to run.

```yaml
-apiVersion: resource.k8s.io/v1beta2
+apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: prioritized-list-claim-template
```
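The spec is collapsed in this diff. A sketch of how `firstAvailable` expresses the preference described above; the DeviceClass and attribute names are assumptions:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: prioritized-list-claim-template
spec:
  spec:
    devices:
      requests:
      - name: cat-devices
        # Subrequests are tried in order; the first one that can be satisfied wins.
        firstAvailable:
        - name: large-black
          deviceClassName: example-device-class   # hypothetical class
          selectors:
          - cel:
              expression: |-
                device.attributes["resource-driver.example.com"].color == "black" &&
                device.attributes["resource-driver.example.com"].size == "large"
        - name: small-white
          deviceClassName: example-device-class   # hypothetical class
          count: 2
          selectors:
          - cel:
              expression: |-
                device.attributes["resource-driver.example.com"].color == "white" &&
                device.attributes["resource-driver.example.com"].size == "small"
```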
@@ -495,7 +494,7 @@ handles this and it is transparent to the consumer as the ResourceClaim API is n

```yaml
kind: ResourceSlice
-apiVersion: resource.k8s.io/v1beta2
+apiVersion: resource.k8s.io/v1
metadata:
  name: resourceslice
spec:
```
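The rest of the slice is collapsed here. A sketch of how partitionable devices can be modeled with shared counters; all names and quantities below are illustrative, not taken from the diff:

```yaml
kind: ResourceSlice
apiVersion: resource.k8s.io/v1
metadata:
  name: resourceslice
spec:
  nodeName: node-1                  # illustrative
  driver: dra.example.com           # illustrative
  pool:
    name: pool-1
    generation: 1
    resourceSliceCount: 1
  sharedCounters:
  - name: gpu-counters
    counters:
      memory:
        value: 6Gi
  devices:
  # Both partitions draw from the same counter set, so allocating one
  # reduces what is left for the other.
  - name: partition-0
    consumesCounters:
    - counterSet: gpu-counters
      counters:
        memory:
          value: 2Gi
  - name: partition-1
    consumesCounters:
    - counterSet: gpu-counters
      counters:
        memory:
          value: 4Gi
```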
@@ -632,4 +631,4 @@ spec:
- [Allocate devices to workloads using DRA](/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra/)
- For more information on the design, see the
[Dynamic Resource Allocation with Structured Parameters](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters)
KEP.
@@ -13,8 +13,13 @@ stages:
  - stage: beta
    defaultValue: false
    fromVersion: "1.32"
    toVersion: "1.33"
  - stage: stable
    defaultValue: true
    locked: false
    fromVersion: "1.34"

-# TODO: as soon as this is locked to "true" (= GA), comments about other DRA
+# TODO: as soon as this is locked to "true" (= some time after GA, *not* yet in 1.34), comments about other DRA
# feature gate(s) like "unless you also enable the `DynamicResourceAllocation` feature gate"
# can be removed (for example, in dra-admin-access.md).

@@ -1,7 +1,7 @@
---
title: Allocate Devices to Workloads with DRA
content_type: task
-min-kubernetes-server-version: v1.32
+min-kubernetes-server-version: v1.34
weight: 20
---
{{< feature-state feature_gate_name="DynamicResourceAllocation" >}}
@@ -157,6 +157,20 @@ claims in different containers.
kubectl apply -f https://k8s.io/examples/dra/dra-example-job.yaml
```

Try the following troubleshooting steps:

1. When the workload does not start as expected, drill down from the Job
   to its Pods to their ResourceClaims, and check the objects at each level
   with `kubectl describe` to see whether there are any status fields or
   events which might explain why the workload is not starting.
1. When creating a Pod fails with `must specify one of: resourceClaimName,
   resourceClaimTemplateName`, check that all entries in `pod.spec.resourceClaims`
   have exactly one of those two fields set. If they do, then it is possible
   that the cluster has a mutating Pod webhook installed which was built
   against APIs from Kubernetes < 1.32. Work with your cluster administrator
   to check this.
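For example, the drill-down from the first step could look like the following; every object name here is a placeholder for whatever your cluster actually contains:

```shell
# Inspect the Job and find its Pods.
kubectl describe job example-job
kubectl get pods --selector=job-name=example-job

# Inspect a Pod; its events often name the blocking ResourceClaim.
kubectl describe pod example-job-abc12

# Inspect the generated ResourceClaims and their allocation status.
kubectl get resourceclaims
kubectl describe resourceclaim example-job-abc12-gpu-xxxxx
```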

## Clean up {#clean-up}

To delete the Kubernetes objects that you created in this task, follow these
@@ -183,4 +197,4 @@ steps:

## {{% heading "whatsnext" %}}

* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
@@ -1,7 +1,7 @@
---
title: "Set Up DRA in a Cluster"
content_type: task
-min-kubernetes-server-version: v1.32
+min-kubernetes-server-version: v1.34
weight: 10
---
{{< feature-state feature_gate_name="DynamicResourceAllocation" >}}
@@ -37,30 +37,20 @@ For details, see

<!-- steps -->

-## Enable the DRA API groups {#enable-dra}
+## Optional: enable legacy DRA API groups {#enable-dra}

-To let Kubernetes allocate resources to your Pods with DRA, complete the
-following configuration steps:
+DRA graduated to stable in Kubernetes 1.34 and is enabled by default.
+Some older DRA drivers or workloads might still need the
+v1beta1 API from Kubernetes 1.30 or the v1beta2 API from Kubernetes 1.32.
+If, and only if, support for those is desired, enable the following
+{{< glossary_tooltip text="API groups" term_id="api-group" >}}:
+
+* `resource.k8s.io/v1beta1`
+* `resource.k8s.io/v1beta2`
+
+For more information, see
+[Enabling or disabling API groups](/docs/reference/using-api/#enabling-or-disabling).

-1. Enable the `DynamicResourceAllocation`
-   [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-   on all of the following components:
-
-   * `kube-apiserver`
-   * `kube-controller-manager`
-   * `kube-scheduler`
-   * `kubelet`
-
-1. Enable the following
-   {{< glossary_tooltip text="API groups" term_id="api-group" >}}:
-
-   * `resource.k8s.io/v1beta1`: required for DRA to function.
-   * `resource.k8s.io/v1beta2`: optional, recommended improvements to the user
-     experience.
-
-   For more information, see
-   [Enabling or disabling API groups](/docs/reference/using-api/#enabling-or-disabling).
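How the API groups get enabled depends on how the control plane is deployed. As a sketch, with a flag-configured kube-apiserver the legacy versions could be switched on like this:

```shell
# Enable the legacy beta API versions in addition to resource.k8s.io/v1.
kube-apiserver --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true
```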

## Verify that DRA is enabled {#verify}

To verify that the cluster is configured correctly, try to list DeviceClasses:
@@ -81,15 +71,15 @@ similar to the following:
```
error: the server doesn't have a resource type "deviceclasses"
```

Try the following troubleshooting steps:

1. Ensure that the `kube-scheduler` component has the `DynamicResourceAllocation`
   feature gate enabled *and* uses the
   [v1 configuration API](/docs/reference/config-api/kube-scheduler-config.v1/).
   If you use a custom configuration, you might need to perform additional steps
   to enable the `DynamicResource` plugin (see the sketch after this list).

   > **Contributor:** Is it worth preserving troubleshooting steps regarding
   > the feature gate in some form to say "make sure the feature gate is not
   > disabled"?
   >
   > **Contributor (author):** I added a second bullet point for the feature
   > gate. It does not exactly fit here because the section just checks for
   > the presence of the API group, so I had to mention potential failure
   > modes that could occur later on.
   >
   > Alternatively this could go into the user-facing allocate-devices-dra.md,
   > but users have no way of checking feature gates.
-1. Restart the `kube-apiserver` component and the `kube-controller-manager`
-   component to propagate the API group changes.
+1. Reconfigure and restart the `kube-apiserver` component.

1. If the complete `.spec.resourceClaims` field gets removed from Pods, or if
   Pods get scheduled without considering their ResourceClaims, then verify
   that the `DynamicResourceAllocation`
   [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
   is not turned off for kube-apiserver, kube-controller-manager,
   kube-scheduler, or the kubelet.
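For the scheduler step above, a sketch of an explicit configuration. Note that the scheduler framework registers the plugin under the name `DynamicResources`; with the default v1 configuration it is already enabled, so this is only needed for custom profiles:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: DynamicResources   # enabled by default in the v1 config API
```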

## Install device drivers {#install-drivers}

@@ -112,6 +102,12 @@ cluster-1-device-pool-1-driver.example.com-lqx8x cluster-1-node-1 driver
cluster-1-device-pool-2-driver.example.com-29t7b cluster-1-node-2 driver.example.com cluster-1-device-pool-2-446z 8s
```

Try the following troubleshooting steps:

1. Check the health of the DRA driver and look for error messages about
   publishing ResourceSlices in its log output. The vendor of the driver
   may have further instructions about installation and troubleshooting.
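For example, for a driver deployed as a DaemonSet, the check could look like the following; the namespace and object names depend entirely on the vendor's installation instructions:

```shell
# Confirm the driver Pods are running on every expected node.
kubectl get pods --namespace dra-driver

# Scan the driver logs for problems publishing ResourceSlices.
kubectl logs --namespace dra-driver daemonset/example-dra-driver | grep -i resourceslice
```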

## Create DeviceClasses {#create-deviceclasses}

You can define categories of devices that your application operators can
@@ -135,27 +131,25 @@ operators.
The output is similar to the following:

```yaml
-apiVersion: resource.k8s.io/v1beta1
+apiVersion: resource.k8s.io/v1
kind: ResourceSlice
# lines omitted for clarity
spec:
  devices:
-  - basic:
-      attributes:
-        type:
-          string: gpu
-      capacity:
-        memory:
-          value: 64Gi
-    name: gpu-0
-  - basic:
-      attributes:
-        type:
-          string: gpu
-      capacity:
-        memory:
-          value: 64Gi
-    name: gpu-1
+  - attributes:
+      type:
+        string: gpu
+    capacity:
+      memory:
+        value: 64Gi
+    name: gpu-0
+  - attributes:
+      type:
+        string: gpu
+    capacity:
+      memory:
+        value: 64Gi
+    name: gpu-1
  driver: driver.example.com
  nodeName: cluster-1-node-1
# lines omitted for clarity
```
@@ -186,4 +180,4 @@ kubectl delete -f https://k8s.io/examples/dra/deviceclass.yaml
## {{% heading "whatsnext" %}}

* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
* [Allocate Devices to Workloads with DRA](/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra)
2 changes: 1 addition & 1 deletion content/en/examples/dra/deviceclass.yaml
@@ -1,4 +1,4 @@
-apiVersion: resource.k8s.io/v1beta2
+apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-device-class
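The file's `spec` is collapsed in this diff. A complete DeviceClass could look like the following sketch; the CEL expression is illustrative:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-device-class
spec:
  selectors:
  - cel:
      # Admit every device published by this driver.
      expression: device.driver == "driver.example.com"
```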
2 changes: 1 addition & 1 deletion content/en/examples/dra/resourceclaim.yaml
@@ -1,4 +1,4 @@
-apiVersion: resource.k8s.io/v1beta2
+apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: example-resource-claim
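Again the `spec` is collapsed; a sketch of a complete v1 ResourceClaim requesting a single device from the class above (the request name is an assumption):

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: example-resource-claim
spec:
  devices:
    requests:
    - name: single-gpu           # hypothetical request name
      exactly:
        deviceClassName: example-device-class
        allocationMode: ExactCount
        count: 1
```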
2 changes: 1 addition & 1 deletion content/en/examples/dra/resourceclaimtemplate.yaml
@@ -1,4 +1,4 @@
-apiVersion: resource.k8s.io/v1beta2
+apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: example-resource-claim-template