
Commit 0fcca23

Update to address comments
1 parent 576dcfe commit 0fcca23

File tree

1 file changed: +24 -17 lines changed

  • keps/sig-node/4815-dra-partitionable-devices


keps/sig-node/4815-dra-partitionable-devices/README.md

Lines changed: 24 additions & 17 deletions
@@ -76,7 +76,7 @@ partitions to be created on demand. This leads to increased resource
 utilization as the size of each partitioned device can be matched in real-time
 to the workload requesting it.
 
-Devices represented in DRA doesn't necessarily have to be a single unit connected
+Devices represented in DRA don't necessarily have to be a single unit connected
 to a single machine, but can also be a logical device comprised of multiple devices
 connected to multiple machines. Similar to the single device partitioning, users
 might require either the full multi-host device or a subset.
@@ -101,9 +101,11 @@ allocated, rather than requiring them to be created before.
 
 ## Motivation
 
-We have two primary motivating examples for supporting partitionable devices
-with DRA. The first is for partitioning a single GPU into smaller partitions, while
-the second is multi-host scheduling of interconnected TPUs.
+We have several motivating examples for supporting partitionable devices
+with DRA, with the first two described in detail in this document:
+* Partitioning a single GPU into smaller partitions.
+* Multi-host scheduling of interconnected TPUs.
+* RDMA.
 
 ### Dynamic allocation of Multi-Instance GPUs (MIG) on NVIDIA hardware
 MIG devices are represented as fixed-size partitions
@@ -747,7 +749,9 @@ device it references directly.
 ### Defining multi-host devices
 
 An example of a small 4x4 TPU slice with its partitions will look like the
-following:
+example below. Since the devices in the slice are connected to multiple nodes,
+it will typically be the responsibility of a central controller to publish the
+ResourceSlice.
 
 ```yaml
 kind: ResourceSlice
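
The ResourceSlice example itself is cut off by the hunk above. As a rough, non-authoritative sketch of the general shape such a multi-host ResourceSlice could take (assuming a hypothetical driver name, pool name, and node label, and placing the per-device `nodeSelector`/`nodeName` fields directly under each device as discussed later in this diff), it might look like the following:

```yaml
# Hypothetical sketch only: the apiVersion, driver, pool, device names, and node
# label are made up, and the exact placement of the per-device nodeSelector and
# nodeName fields is illustrative rather than taken verbatim from the KEP.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: tpu-4x4-slice
spec:
  driver: tpu.example.com
  pool:
    name: tpu-4x4-pool
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: tpu-4x4-full
    # The full multi-host device spans several nodes, so it uses a nodeSelector
    # that matches all of the nodes it is connected to.
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/tpu-host-group
          operator: In
          values: ["group-a"]
  - name: tpu-2x2-partition-0
    # A smaller partition attached to a single node can reference it by name.
    nodeName: tpu-node-0
```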
@@ -847,18 +851,21 @@ of the larger one, and as described in the previous section, this will
 allow the scheduler to understand that allocating a partition of a device has
 the consequence of making other partitions unavailable.
 
-When a multi-host device is requested, the workload must have a number of pods
-that equals the number of nodes that make up the device. These pods will share
-the device, so they must be set up with a shared ResourceClaim. When the scheduler
-attempts to schedule the first pod for the workload, it will find a device that
-matches the request and allocate it for the ResourceClaim. Once the a device has
-been allocated for the claim, this also restricts the nodes where other pods using
-the device can be scheduled. To make sure that future pods do get scheduled on an
-eligible node, the scheduler will use `nodeName` or `nodeSelector` value from the
-device to determine the `nodeSelector` field on the `AllocationResult`
-in the `ResourceClaim`, rather than the `nodeName` or `nodeSelector` from the
-`ResourceSlice`. This makes sure that all pods sharing the `ResourceClaim` will
-be scheduled to the nodes that make up the device.
+In the typical case, when a multi-host device is requested, the workload would
+have a number of pods that equals the number of nodes that make up the device.
+These pods will share the device, so they must be set up with a shared
+ResourceClaim. When the scheduler attempts to schedule the first pod for the
+workload, it will find a device that matches the request and allocate it for the
+ResourceClaim. Once a device has been allocated for the claim, this also
+restricts the nodes where future pods using the device can be scheduled. To make
+sure that future pods will only be attempted for scheduling on eligible nodes, the
+scheduler will use the `nodeName` or `nodeSelector` value from the device to determine
+the `nodeSelector` field on the `AllocationResult` in the `ResourceClaim`, rather
+than the `nodeName` or `nodeSelector` from the `ResourceSlice`. This makes sure
+that pods sharing the `ResourceClaim` cannot get scheduled on nodes that aren't
+part of the device. However, there is no guarantee that they can get scheduled on
+the nodes that make up the device, since that will be subject to any other
+constraints (like sufficient CPU and memory) during the scheduling process.
 
 
 ### Putting it all together for the MIG use-case
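
To make the last hunk more concrete, here is a rough, non-authoritative sketch of what an allocated ResourceClaim for such a multi-host device could look like, assuming the same hypothetical driver, pool, device, and node label as in the sketch above. The point being illustrated is that the `nodeSelector` on the `AllocationResult` is derived from the device, not from the `ResourceSlice`:

```yaml
# Hypothetical sketch only: the apiVersion, names, and node label are made up
# for illustration.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tpu-workload-claim
spec:
  devices:
    requests:
    - name: tpu
      deviceClassName: tpu.example.com
status:
  allocation:
    devices:
      results:
      - request: tpu
        driver: tpu.example.com
        pool: tpu-4x4-pool
        device: tpu-4x4-full
    # Derived from the device's nodeSelector rather than the ResourceSlice, so
    # every pod sharing this claim can only be scheduled onto the nodes that
    # make up the multi-host device.
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/tpu-host-group
          operator: In
          values: ["group-a"]
```

As the text above notes, this selector only restricts where the sharing pods may land; whether they actually fit on those nodes still depends on the usual constraints (such as sufficient CPU and memory) at scheduling time.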

0 commit comments
