@@ -420,6 +420,19 @@ well as track the capacities consumed by allocated devices. More details
on the actual algorithm the scheduler follows to make allocation decisions
based on the capacity pools can be found in the Design Details section below.

+ ### Risks and Mitigations
+
+ #### Partial scheduling of pods for multi-host devices
+
+ With multi-host devices, there will typically be multiple pods sharing a single
+ `ResourceClaim`. DRA guarantees that the pods will not end up on nodes that are not
+ part of the multi-host device. But it cannot guarantee that all pods will be
+ scheduled, since pods will be subject to any other constraints (like sufficient
+ CPU and memory) during scheduling.
+
+ A better story should be in place for beta, including a plan for alignment and
+ possible integration with Kueue.
+
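+ As a minimal sketch of the pattern (assuming the `resource.k8s.io/v1beta1`
+ API; the class, claim, and pod names are hypothetical), two pods of a
+ workload could share one claim for a multi-host device like this:
+
+ ```yaml
+ apiVersion: resource.k8s.io/v1beta1
+ kind: ResourceClaim
+ metadata:
+   name: shared-tpu-claim
+ spec:
+   devices:
+     requests:
+     - name: tpu
+       deviceClassName: tpu.example.com  # hypothetical multi-host device class
+ ---
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: worker-0
+ spec:
+   resourceClaims:
+   - name: tpu
+     resourceClaimName: shared-tpu-claim  # every worker references this claim
+   containers:
+   - name: worker
+     image: registry.k8s.io/pause:3.9
+     resources:
+       claims:
+       - name: tpu
+ ---
+ # worker-1 is identical apart from its name and shares shared-tpu-claim, so it
+ # is restricted to nodes of the same multi-host device. If those nodes lack
+ # free CPU or memory, worker-1 can stay Pending while worker-0 runs.
+ ```
+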
## Design Details

The exact set of proposed API changes can be seen below:
@@ -1245,11 +1258,25 @@ scheduler will use `nodeName` or `nodeSelector` value from the device to determi
`nodeSelector` field on the `AllocationResult` in the `ResourceClaim`, rather
than the `nodeName` or `nodeSelector` from the `ResourceSlice`. This makes sure
that pods sharing the `ResourceClaim` can not get scheduled on nodes that aren't
- part of the device. However, there is no guarantee that they can get scheduled on
+ part of the device.
+
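+ As an illustration (a sketch assuming the `resource.k8s.io/v1beta1` API; the
+ driver, pool, device, and label names are hypothetical), the selector the
+ scheduler consults lives on the allocation result of the claim:
+
+ ```yaml
+ apiVersion: resource.k8s.io/v1beta1
+ kind: ResourceClaim
+ metadata:
+   name: tpu-multi-host-claim
+ spec:
+   devices:
+     requests:
+     - name: tpu
+       deviceClassName: tpu.example.com
+ status:
+   allocation:
+     devices:
+       results:
+       - request: tpu
+         driver: tpu.example.com
+         pool: tpu-pool-a
+         device: tpu-block-0
+     # Pods sharing this claim are scheduled against this selector, not the
+     # nodeName/nodeSelector published on the ResourceSlice.
+     nodeSelector:
+       nodeSelectorTerms:
+       - matchExpressions:
+         - key: example.com/tpu-block
+           operator: In
+           values: ["block-0"]
+ ```
+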
+ #### Multi-host scheduling limitations
+ The shared `ResourceClaim` and the device node selectors only guarantee that
+ the pods for the workload will not be scheduled on nodes that are not part of
+ the multi-host device. However, there is no guarantee that they can get scheduled on
the nodes that make up the device, since that will be subject to any other
- contstraints (like sufficient CPU and memory) during the scheduling process. This
- design does not attempt to guarantee that all pods can be scheduled, but rather
- make sure pods are only considered for the correct nodes.
+ constraints (like sufficient CPU and memory) during the scheduling process.
+
+ Similarly, it is possible for users to create workloads that reference multiple
+ `ResourceClaim`s. These might reference different multi-host devices which might
+ have node selectors that are only partially overlapping or not overlapping at
+ all. In this situation, none or only a subset of the pods might end up being
+ scheduled.
+
+ DRA does not guarantee that all or none of the pods can be scheduled (i.e.
+ group scheduling), so handling those situations will be up to the user or
+ higher-level frameworks. For beta we aim to improve the story here,
+ possibly through integration with Kueue.
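+
+ As a sketch of that failure mode (assuming the `resource.k8s.io/v1beta1` API;
+ the claim and class names are hypothetical), a pod referencing two claims can
+ only run on nodes where the allocated node selectors of both claims overlap:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: worker-0
+ spec:
+   resourceClaims:
+   - name: accel-a
+     resourceClaimName: multi-host-device-a  # allocation may select one set of nodes
+   - name: accel-b
+     resourceClaimName: multi-host-device-b  # ...and this one a disjoint set
+   containers:
+   - name: worker
+     image: registry.k8s.io/pause:3.9
+     resources:
+       claims:
+       - name: accel-a
+       - name: accel-b
+ # If the two allocations' node selectors do not overlap at all, this pod is
+ # unschedulable; if they overlap only partially, some pods of the workload may
+ # schedule while others stay Pending.
+ ```
+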
### Putting it all together for the MIG use-case

@@ -2858,6 +2885,8 @@ ensure they are handled by the scheduler as described in this KEP.

- Gather feedback
- Additional tests are in Testgrid and linked in KEP
+ - Define the alignment and possible integration with Kueue
+ - Improve the story for group scheduling

#### GA