@@ -420,6 +420,19 @@ well as track the capacities consumed by allocated devices. More details
on the actual algorithm the scheduler follows to make allocation decisions
based on the capacity pools can be found in the Design Details section below.

+ ### Risks and Mitigations
+
+ #### Partial scheduling of pods for multi-host devices
+
+ With multi-host devices, there will typically be multiple pods sharing a single
+ `ResourceClaim`. DRA guarantees that the pods will not end up on nodes that are not
+ part of the multi-host device. But it cannot guarantee that all pods will be
+ scheduled, since pods will be subject to any other constraints (like sufficient
+ CPU and memory) during scheduling.
+
+ A better story should be in place for beta, including a plan for alignment and
+ possible integration with Kueue.
+
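+ As a minimal sketch of the pattern (assuming the `resource.k8s.io/v1beta1`
+ API; the class, claim, and pod names are hypothetical), two pods of a
+ workload could share one claim for a multi-host device like this:
+
+ ```yaml
+ apiVersion: resource.k8s.io/v1beta1
+ kind: ResourceClaim
+ metadata:
+   name: shared-tpu-claim
+ spec:
+   devices:
+     requests:
+     - name: tpu
+       deviceClassName: tpu.example.com  # hypothetical multi-host device class
+ ---
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: worker-0
+ spec:
+   resourceClaims:
+   - name: tpu
+     resourceClaimName: shared-tpu-claim  # every worker references this claim
+   containers:
+   - name: worker
+     image: registry.k8s.io/pause:3.9
+     resources:
+       claims:
+       - name: tpu
+ ---
+ # worker-1 is identical apart from its name and shares shared-tpu-claim, so it
+ # is restricted to nodes of the same multi-host device. If those nodes lack
+ # free CPU or memory, worker-1 can stay Pending while worker-0 runs.
+ ```
+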
## Design Details

The exact set of proposed API changes can be seen below:
@@ -1245,11 +1258,25 @@ scheduler will use `nodeName` or `nodeSelector` value from the device to determi
`nodeSelector` field on the `AllocationResult` in the `ResourceClaim`, rather
than the `nodeName` or `nodeSelector` from the `ResourceSlice`. This makes sure
that pods sharing the `ResourceClaim` can not get scheduled on nodes that aren't
- part of the device. However, there is no guarantee that they can get scheduled on
+ part of the device.
+
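+ As an illustration (a sketch assuming the `resource.k8s.io/v1beta1` API; the
+ driver, pool, device, and label names are hypothetical), the selector the
+ scheduler consults lives on the allocation result of the claim:
+
+ ```yaml
+ apiVersion: resource.k8s.io/v1beta1
+ kind: ResourceClaim
+ metadata:
+   name: tpu-multi-host-claim
+ spec:
+   devices:
+     requests:
+     - name: tpu
+       deviceClassName: tpu.example.com
+ status:
+   allocation:
+     devices:
+       results:
+       - request: tpu
+         driver: tpu.example.com
+         pool: tpu-pool-a
+         device: tpu-block-0
+     # Pods sharing this claim are scheduled against this selector, not the
+     # nodeName/nodeSelector published on the ResourceSlice.
+     nodeSelector:
+       nodeSelectorTerms:
+       - matchExpressions:
+         - key: example.com/tpu-block
+           operator: In
+           values: ["block-0"]
+ ```
+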
+ #### Multi-host scheduling limitations
+ The shared `ResourceClaim` and the device node selectors only guarantee that
+ the pods for the workload will not be scheduled on nodes that are not part of
+ the multi-host device. However, there is no guarantee that they can get scheduled on
the nodes that make up the device, since that will be subject to any other
- contstraints (like sufficient CPU and memory) during the scheduling process. This
- design does not attempt to guarantee that all pods can be scheduled, but rather
- make sure pods are only considered for the correct nodes.
+ constraints (like sufficient CPU and memory) during the scheduling process.
+
+ Similarly, it is possible for users to create workloads that reference multiple
+ `ResourceClaim`s. These might reference different multi-host devices which might
+ have node selectors that are only partially overlapping or not overlapping at
+ all. In this situation, none or only a subset of the pods might end up being
+ scheduled.
+
+ DRA does not guarantee that all or none of the pods can be scheduled (i.e.
+ group scheduling), so handling those situations will be up to the user or
+ higher-level frameworks. For beta we aim to improve the story here,
+ possibly through integration with Kueue.
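+
+ As a sketch of that failure mode (assuming the `resource.k8s.io/v1beta1` API;
+ the claim and class names are hypothetical), a pod referencing two claims can
+ only run on nodes where the allocated node selectors of both claims overlap:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: worker-0
+ spec:
+   resourceClaims:
+   - name: accel-a
+     resourceClaimName: multi-host-device-a  # allocation may select one set of nodes
+   - name: accel-b
+     resourceClaimName: multi-host-device-b  # ...and this one a disjoint set
+   containers:
+   - name: worker
+     image: registry.k8s.io/pause:3.9
+     resources:
+       claims:
+       - name: accel-a
+       - name: accel-b
+ # If the two allocations' node selectors do not overlap at all, this pod is
+ # unschedulable; if they overlap only partially, some pods of the workload may
+ # schedule while others stay Pending.
+ ```
+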
### Putting it all together for the MIG use-case

@@ -2858,6 +2885,8 @@ ensure they are handled by the scheduler as described in this KEP.

- Gather feedback
- Additional tests are in Testgrid and linked in KEP
+ - Define the alignment and possible integration with Kueue
+ - Improve the story for group scheduling

#### GA