Commit 489bbad

updates to address feedback

1 parent dd1e790

1 file changed: +33 −4 lines

keps/sig-scheduling/4815-dra-partitionable-devices/README.md
@@ -420,6 +420,19 @@ well as track the capacities consumed by allocated devices. More details
 on the actual algorithm the scheduler follows to make allocation decisions
 based on the capacity pools can be found in the Design Details section below.
+### Risks and Mitigations
+
+#### Partial scheduling of pods for multi-host devices
+
+With multi-host devices, there will typically be multiple pods sharing a single
+`ResourceClaim`. DRA guarantees that the pods will not end up on nodes that are
+not part of the multi-host device, but it cannot guarantee that all pods will be
+scheduled, since pods are subject to other constraints (such as sufficient
+CPU and memory) during scheduling.
+
+A better story should be in place for beta, including a plan for alignment and
+possible integration with Kueue.
+
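To make the shared-claim pattern above concrete, here is a sketch of one worker pod (all names are hypothetical; the `resourceClaims` field layout follows the current DRA pod API and may differ between Kubernetes versions):

```yaml
# Hypothetical sketch: one of several worker pods that all reference the same
# ResourceClaim for a multi-host device. DRA keeps these pods off nodes that
# are not part of the device, but each pod must still fit its CPU/memory
# requests on some node in the group, so partial scheduling remains possible.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                           # hypothetical; one pod per host
spec:
  resourceClaims:
  - name: shared-device
    resourceClaimName: multi-host-device   # the single claim shared by all workers
  containers:
  - name: app
    image: example.com/app:latest          # hypothetical image
    resources:
      requests:
        cpu: "4"                           # ordinary constraints still apply
        memory: 8Gi
      claims:
      - name: shared-device
```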
 ## Design Details
 
 The exact set of proposed API changes can be seen below:
@@ -1245,11 +1258,25 @@ scheduler will use `nodeName` or `nodeSelector` value from the device to determi
 `nodeSelector` field on the `AllocationResult` in the `ResourceClaim`, rather
 than the `nodeName` or `nodeSelector` from the `ResourceSlice`. This makes sure
 that pods sharing the `ResourceClaim` can not get scheduled on nodes that aren't
-part of the device. However, there is no guarantee that they can get scheduled on
+part of the device.
+
+#### Multi-host scheduling limitations
+The shared `ResourceClaim` and the device node selectors only guarantee that
+the pods for the workload will not be scheduled on nodes that are not part of
+the multi-host device. However, there is no guarantee that they can get scheduled on
 the nodes that make up the device, since that will be subject to any other
-contstraints (like sufficient CPU and memory) during the scheduling process. This
-design does not attempt to guarantee that all pods can be scheduled, but rather
-make sure pods are only considered for the correct nodes.
+constraints (like sufficient CPU and memory) during the scheduling process.
+
+Similarly, it is possible for users to create workloads that reference multiple
+`ResourceClaim`s. These might reference different multi-host devices whose
+node selectors are only partially overlapping, or not overlapping at all.
+In this situation, none or only a subset of the pods might end up being
+scheduled.
+
+DRA does not guarantee that all or none of the pods will be scheduled (i.e.
+group scheduling), so handling those situations is left to the user or to
+higher-level frameworks. For beta we aim to improve the story here,
+possibly through integration with Kueue.
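As a sketch of the mechanism described above (the field layout follows the `resource.k8s.io` API, but the claim name and node label are hypothetical):

```yaml
# Hypothetical sketch: a ResourceClaim after allocation of a multi-host device.
# The scheduler uses status.allocation.nodeSelector (the AllocationResult), not
# the nodeName/nodeSelector from the ResourceSlice, so every pod sharing this
# claim is restricted to the nodes that make up the device.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: multi-host-device                  # hypothetical name
status:
  allocation:
    nodeSelector:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/device-group    # hypothetical label set by the driver
          operator: In
          values: ["group-a"]
```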
 
 ### Putting it all together for the MIG use-case
 
@@ -2858,6 +2885,8 @@ ensure they are handled by the scheduler as described in this KEP.
 
 - Gather feedback
 - Additional tests are in Testgrid and linked in KEP
+- Define the alignment and possible integration with Kueue
+- Improve the story for group scheduling
 
 #### GA
 
0 commit comments