
Commit 576dcfe

KEP-4815 DRA Partitionable devices support for multi-host
1 parent 4c85919 commit 576dcfe

3 files changed: +296 -16 lines changed

keps/sig-node/4815-dra-partitionable-devices/README.md

Lines changed: 295 additions & 16 deletions
@@ -4,13 +4,17 @@
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware](#dynamic-allocation-of-multi-instance-gpus-mig-on-nvidia-hardware)
  - [Multi-host Tensor Processing Unit (TPU) scheduling](#multi-host-tensor-processing-unit-tpu-scheduling)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
  - [Extending a device with as set of mixins](#extending-a-device-with-as-set-of-mixins)
  - [Defining device partitions in terms of consumed capacity in a composite device](#defining-device-partitions-in-terms-of-consumed-capacity-in-a-composite-device)
  - [Defining multi-host devices](#defining-multi-host-devices)
  - [Putting it all together for the MIG use-case](#putting-it-all-together-for-the-mig-use-case)
  - [Using DRA for the multi-host use-case](#using-dra-for-the-multi-host-use-case)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
@@ -72,6 +76,11 @@ partitions to be created on demand. This leads to increased resource
utilization as the size of each partitioned device can be matched in real-time
to the workload requesting it.

Devices represented in DRA don't necessarily have to be a single unit connected
to a single machine; a device can also be a logical device comprising multiple devices
connected to multiple machines. As with single-device partitioning, users
might require either the full multi-host device or only a subset of it.

As DRA has evolved from what we now call "classic" DRA to "structured
parameters" this ability to dynamically partition devices has been lost.
This KEP proposes a method to bring this capability back within the framework
@@ -92,9 +101,12 @@ allocated, rather than requiring them to be created before.

## Motivation

We have two primary motivating examples for supporting partitionable devices
with DRA. The first is partitioning a single GPU into smaller partitions, while
the second is multi-host scheduling of interconnected TPUs.

### Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware

MIG devices are represented as fixed-size partitions
of a full GPU that consume a portion of its capacity across multiple
dimensions. These dimensions include things like number of JPEG engines, number
of multiprocessors, and the allocation of a specific set of fixed-size memory
@@ -221,7 +233,60 @@ done *after* the scheduler has allocated these devices, keeping the GPU free to
be partitioned in different ways until the actual user-workload requesting them
has been submitted.

### Multi-host Tensor Processing Unit (TPU) scheduling

TPUs are connected to VMs, usually four TPUs per VM. In order to run large
workloads that require multiple TPUs, groups of TPUs can be connected over
a high-speed inter-chip interconnect, which is important for achieving the best
performance. However, not all TPUs in the group are connected to each other,
so we need to consider the topology when we make decisions about the allocation
of TPUs to workloads.

Due to the topology, only certain specific slices of TPUs can be used.
For example, in a 64-TPU node pool there will be 16 VMs, each with 4
TPUs. This allows for a number of possible multi-VM slices of different
sizes:
* 8x8 slice, which provides 64 TPUs across 16 nodes (shown in black)
* 4x8 slices, which provide 32 TPUs across 8 nodes (shown in purple)
* 4x4 slices, which provide 16 TPUs across 4 nodes (shown in green)
* 2x4 slices, which provide 8 TPUs across 2 nodes (shown in red)

![image](tpu-topology.png)

For example, a user can request a 4x4 slice of TPUs with a `ResourceClaim`
like the following:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-device
spec:
  devices:
    requests:
    - name: 4x4-tpu
      deviceClassName: tpu.google.com
      selectors:
      - cel:
          expression: "device.capacity['google-tpu'].tpus == '16'"
```

There are four "good" allocations for this request:
* All TPUs on nodes 1, 2, 5, and 6.
* All TPUs on nodes 3, 4, 7, and 8.
* All TPUs on nodes 9, 10, 13, and 14.
* All TPUs on nodes 11, 12, 15, and 16.

A request like the one above must be allocated one of the four 4x4 slices
or it should not succeed. A request asking for just 16 TPUs will likely
result in TPUs being allocated across many VMs without the interconnect,
leading to poor performance. So we need to allow users to request a
partition of a device (in this case an 8x8 slice of TPUs) and account for
the fact that this uses some of the capacity required for other slices.

With these motivating examples in mind, we define the following goals and
non-goals of this KEP.

### Goals
@@ -255,9 +320,10 @@ non-goals of this KEP.
The basic idea is the following:

1. Introduce a new device type called `CompositeDevice` which has the same
   fields as a `BasicDevice`, plus four additional fields. The first is a field called
   `Includes` and the second is a field called `ConsumesCapacityFrom`. The last
   two fields are `NodeName` and `NodeSelector`. Both full devices and their
   partitions are represented as instances of this new
   `CompositeDevice` type and are listed right next to one another in the
   top-level `Devices` list of a `ResourceSlice`.

@@ -274,13 +340,21 @@ The basic idea is the following:
   to allocate it. This essentially removes that capacity from any referenced
   devices, rendering them unallocatable on their own.

1. The `NodeName` and `NodeSelector` fields describe the node or set of nodes
   where the device is available. This is similar to the `NodeName`, `NodeSelector`,
   and `AllNodes` properties in the `ResourceSlice` spec, but it allows for
   associating individual devices with a node or set of nodes. That makes it possible
   to describe multi-host devices using the ResourceSlice API. The `NodeName` and
   `NodeSelector` fields are mutually exclusive, and neither can be specified if the
   `Spec.NodeName` or `Spec.NodeSelector` fields are specified on the `ResourceSlice`.

With these additions in place, the scheduler has everything it needs to support
the dynamic allocation of full devices, their (possibly overlapping)
fixed-size partitions, and multi-host devices. That is to say, the scheduler now
has the ability to "flatten" all devices by applying any mixins from their
`Includes` fields as well as track any capacities consumed from one device
by another through its `ConsumesCapacityFrom` field. More details on the
actual algorithm the scheduler follows to make allocation decisions based on the
`ConsumesCapacityFrom` field can be found in the Design Details section below.

## Design Details
@@ -405,6 +479,26 @@ type CompositeDevice struct {
    // +optional
    ConsumesCapacityFrom []DeviceRef `json:"consumesCapacityFrom,omitempty"`

    // NodeName identifies the node where the device is available.
    //
    // Must only be set if Spec.AllNodes is set.
    // Only one or none of NodeName and NodeSelector must be set.
    //
    // +optional
    // +oneOf=DeviceNodeSelection
    NodeName string

    // NodeSelector defines the nodes where the device is available.
    //
    // Must use exactly one term.
    //
    // Must only be set if Spec.AllNodes is set.
    // Only one or none of NodeName and NodeSelector must be set.
    //
    // +optional
    // +oneOf=DeviceNodeSelection
    NodeSelector *core.NodeSelector

    // Attributes defines the set of attributes for this device.
    // The name of each attribute must be unique in that set.
    //
@@ -454,9 +548,10 @@ type DeviceRef struct {
```

As mentioned previously, the main features being added here are (1) the ability
to include a set of mixins in a device definition, (2) the ability to
express that capacity from one device gets consumed by another device if/when
the scheduler decides to allocate it, and (3) the ability to define multi-host
devices.

To simplify the conversation, we discuss each new feature separately, starting
with "mixins" and the new `Includes` field, which allows a set of mixins to
@@ -647,7 +742,124 @@ When such a device is allocated, the scheduler will need to track the full
capacity required to satisfy each of the sink devices along the chain. In this
way, all intermediate sink devices will essentially be rendered
"unschedulable", with the last-level sink device pulling its capacity from the
device it references directly.
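
As a rough illustration of this bookkeeping, the sketch below shows how allocating a device could propagate consumed capacity along the `ConsumesCapacityFrom` chain. The `device` type and the `consume`/`fits` helpers are invented for illustration only; they are not part of the proposed API or the actual scheduler implementation.

```go
// Illustrative sketch only: a simplified model of the capacity tracking
// described above, assuming an acyclic ConsumesCapacityFrom chain and a
// `remaining` map initialized to each device's declared capacity.
type device struct {
    name         string
    capacity     map[string]int64 // e.g. {"tpus": 8}
    consumesFrom []string         // devices this one draws capacity from
}

// consume subtracts the given capacities from a device and, transitively,
// from every device it consumes capacity from.
func consume(name string, dims map[string]int64, byName map[string]device, remaining map[string]map[string]int64) {
    for dim, qty := range dims {
        remaining[name][dim] -= qty
    }
    for _, src := range byName[name].consumesFrom {
        consume(src, dims, byName, remaining)
    }
}

// fits reports whether a device still has enough of its own capacity left to
// be allocated; "sink" devices whose capacity has already been drawn down by
// an allocated partition will no longer fit.
func fits(dev device, remaining map[string]map[string]int64) bool {
    for dim, qty := range dev.capacity {
        if remaining[dev.name][dim] < qty {
            return false
        }
    }
    return true
}
```

In this simplified model, allocating a device amounts to checking `fits` and, on success, calling `consume` with the device's own capacity.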

### Defining multi-host devices

An example of a small 4x4 TPU slice with its partitions will look like the
following:

```yaml
kind: ResourceSlice
apiVersion: resource.k8s.io/v1beta1
...
spec:
  allNodes: true
  pool:
    ...
  driver: tpu.dra.example.com
  devices:
  # 4x4 slice
  - name: tpu-4x4-1
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
            - node-5
            - node-6
      capacity:
        tpus: "16"
      consumesCapacityFrom:
      - name: tpu-4x8-1
  # 2x4 slices
  - name: tpu-2x4-1
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
            - node-2
      capacity:
        tpus: "8"
      consumesCapacityFrom:
      - name: tpu-4x4-1
  - name: tpu-2x4-2
    composite:
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-5
            - node-6
      capacity:
        tpus: "8"
      consumesCapacityFrom:
      - name: tpu-4x4-1
  # 2x2 slices
  - name: tpu-2x2-1
    composite:
      nodeName: node-1
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-1
  - name: tpu-2x2-2
    composite:
      nodeName: node-2
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-1
  - name: tpu-2x2-3
    composite:
      nodeName: node-5
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-2
  - name: tpu-2x2-4
    composite:
      nodeName: node-6
      capacity:
        tpus: "4"
      consumesCapacityFrom:
      - name: tpu-2x4-2
```

In the example we defined a single 4x4 slice. That means 16 TPUs, and with
4 TPUs per node, the device is available across four nodes. The node selector
on the device selects the 4 nodes used by this device. In the example it
does this with the `In` operator on the `kubernetes.io/hostname` key, but this
could also be just a regular selector on a single label set on all the nodes.
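For instance, a device-level selector using a single shared label might look like the sketch below. The `example.com/tpu-slice` label key is hypothetical and would have to be applied to node-1, node-2, node-5, and node-6 by the administrator or the driver.

```yaml
nodeSelector:
  nodeSelectorTerms:
  - matchExpressions:
    - key: example.com/tpu-slice  # hypothetical label set on all four nodes
      operator: In
      values:
      - tpu-4x4-1
```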

The `ConsumesCapacityFrom` field declares that the smaller slices are partitions
of the larger one, and as described in the previous section, this
allows the scheduler to understand that allocating a partition of a device has
the consequence of making other partitions unavailable.

When a multi-host device is requested, the workload must have a number of pods
that equals the number of nodes that make up the device. These pods will share
the device, so they must be set up with a shared ResourceClaim. When the scheduler
attempts to schedule the first pod for the workload, it will find a device that
matches the request and allocate it for the ResourceClaim. Once a device has
been allocated for the claim, this also restricts the nodes where other pods using
the device can be scheduled. To make sure that future pods get scheduled on an
eligible node, the scheduler will use the `nodeName` or `nodeSelector` value from the
device to determine the `nodeSelector` field on the `AllocationResult`
in the `ResourceClaim`, rather than the `nodeName` or `nodeSelector` from the
`ResourceSlice`. This makes sure that all pods sharing the `ResourceClaim` will
be scheduled to the nodes that make up the device.
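
As an illustration (a sketch only, not normative API output): if the 2x4 device `tpu-2x4-1` from the example above were allocated, the claim's allocation result would carry a node selector derived from the device, roughly like the following. The request name and pool name shown here are illustrative.

```yaml
status:
  allocation:
    devices:
      results:
      - request: tpu-request        # illustrative request name
        driver: tpu.dra.example.com
        pool: tpu-pool              # illustrative pool name
        device: tpu-2x4-1
    nodeSelector:                   # derived from the device, not the ResourceSlice
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1
          - node-2
```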

### Putting it all together for the MIG use-case

@@ -1651,6 +1863,73 @@ devices:
- name: memory-slices-0-7
```

### Using DRA for the multi-host use-case

In order to allocate a 2x4 TPU slice using the ResourceSlice
[shown above](#defining-multi-host-devices), a ResourceClaim like the
following can be used:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-consumer-resource-claim
spec:
  devices:
    requests:
    - name: tpu-request
      deviceClassName: tpu.google.com
      selectors:
      - cel:
          expression: "device.capacity['tpu.google.com'].tpus == '8'"
```

This simply requests a device with 8 TPUs. Since there are 4 TPUs per node, this requires
two pods, one for each node. A Deployment can be used to create the necessary number of
pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tpu-consumer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tpu-consumer
  template:
    metadata:
      labels:
        app: tpu-consumer
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: tpu-consumer
            topologyKey: kubernetes.io/hostname
      resourceClaims:
      - name: "tpu"
        resourceClaimName: tpu-consumer-resource-claim
      containers:
      - name: workload
        image: my-app
        command: ["/bin/program"]
        resources:
          claims:
          - name: "tpu"
```

Since the PodSpec references a ResourceClaim rather than a ResourceClaimTemplate, the pods will
share the ResourceClaim. This also restricts the pods to run on the nodes that are
targeted by the node selector on the allocated device. In order to take
advantage of the TPUs that are connected to the two nodes, the pods need to be scheduled
on separate nodes; the anti-affinity stanza in the PodSpec makes sure this happens.

### Test Plan

<!--

keps/sig-node/4815-dra-partitionable-devices/kep.yaml

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ title: DRA Partitionable Devices
kep-number: 4815
authors:
  - "@klueska"
  - "@mortent"
owning-sig: sig-node
participating-sigs:
  - sig-scheduling
