@@ -76,7 +76,7 @@ partitions to be created on demand. This leads to increased resource
utilization as the size of each partitioned device can be matched in real-time
to the workload requesting it.

- Devices represented in DRA doesn't necessarily have to be a single unit connected
+ Devices represented in DRA don't necessarily have to be a single unit connected
to a single machine, but can also be a logical device comprised of multiple devices
connected to multiple machines. Similar to the single device partitioning, users
might require either the full multi-host device or a subset.
@@ -101,9 +101,11 @@ allocated, rather than requiring them to be created before.

## Motivation

- We have two primary motivating examples for supporting partitionable devices
- with DRA. The first is for partitioning a single GPU into smaller partitions, while
- the second is multi-host scheduling of interconnected TPUs.
+ We have several motivating examples for supporting partitionable devices
+ with DRA; the first two are described in detail in this document:
+ * Partitioning a single GPU into smaller partitions.
+ * Multi-host scheduling of interconnected TPUs.
+ * RDMA.

### Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware
MIG devices are represented as fixed-size partitions
@@ -747,7 +749,9 @@ device it references directly.
### Defining multi-host devices

An example of a small 4x4 TPU slice with its partitions will look like the
- following:
+ example below. Since the devices in the slice are connected to multiple nodes,
+ it will typically be the responsibility of a central controller to publish the
+ ResourceSlice.

```yaml
kind: ResourceSlice
@@ -847,18 +851,21 @@ of the larger one, and as described in the previous section, this will
allow the scheduler to understand that allocating a partition of a device has
the consequence of making other partitions unavailable.

- When a multi-host device is requested, the workload must have a number of pods
- that equals the number of nodes that make up the device. These pods will share
- the device, so they must be set up with a shared ResourceClaim. When the scheduler
- attempts to schedule the first pod for the workload, it will find a device that
- matches the request and allocate it for the ResourceClaim. Once the a device has
- been allocated for the claim, this also restricts the nodes where other pods using
- the device can be scheduled. To make sure that future pods do get scheduled on an
- eligible node, the scheduler will use `nodeName` or `nodeSelector` value from the
- device to determine the `nodeSelector` field on the `AllocationResult`
- in the `ResourceClaim`, rather than the `nodeName` or `nodeSelector` from the
- `ResourceSlice`. This makes sure that all pods sharing the `ResourceClaim` will
- be scheduled to the nodes that make up the device.
+ In the typical case, when a multi-host device is requested, the workload would
+ have a number of pods that equals the number of nodes that make up the device.
+ These pods will share the device, so they must be set up with a shared
+ ResourceClaim. When the scheduler attempts to schedule the first pod for the
+ workload, it will find a device that matches the request and allocate it for the
+ ResourceClaim. Once a device has been allocated for the claim, this also
+ restricts the nodes where future pods using the device can be scheduled. To make
+ sure that future pods are only attempted for scheduling on eligible nodes, the
+ scheduler will use the `nodeName` or `nodeSelector` value from the device to
+ determine the `nodeSelector` field on the `AllocationResult` in the
+ `ResourceClaim`, rather than the `nodeName` or `nodeSelector` from the
+ `ResourceSlice`. This makes sure that pods sharing the `ResourceClaim` cannot
+ get scheduled on nodes that aren't part of the device. However, there is no
+ guarantee that they can get scheduled on the nodes that make up the device,
+ since that will be subject to any other constraints (like sufficient CPU and
+ memory) during the scheduling process.
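As a rough illustration of this behavior (not the exact API surface proposed in this KEP; the API version, claim name, device class, and node names below are made up), a `ResourceClaim` allocated to a device spanning four nodes might end up with a node selector like the following in its allocation result:

```yaml
# Hypothetical sketch of a ResourceClaim after a multi-host device has been allocated.
# The nodeSelector in the allocation result is derived from the allocated device,
# not from the ResourceSlice, so pods sharing this claim can only be scheduled on
# the nodes that make up the device.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-slice-claim
spec:
  devices:
    requests:
    - name: tpu
      deviceClassName: tpu.example.com
status:
  allocation:
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1
          - node-2
          - node-3
          - node-4
```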

### Putting it all together for the MIG use-case