KEP-4815 DRA Partitionable devices support for multi-host #5069
Conversation
mortent commented on Jan 22, 2025
- One-line PR description: Updates the design to also support multi-host use cases
- Issue link: DRA: Partitionable Devices #4815
- Other comments:
Force-pushed from ee1ffbf to 576dcfe.
/assign @johnbelamaric
/wg device-management
```diff
way, all intermediate sink devices will essentially be rendered
"unschedulable", with the last-level sink device pulling its capacity from the
-devices it references directly.
+device it references directly.
```
Are the offered ResourceSlices going to be overlapping? In other words, are they going to be a Cartesian product of all possible allocations?
Have you also considered a different approach with just one offering, a ResourceSlice that provides only the biggest possible allocation? If a pod needs (claims) a portion of it, then the original ResourceSlice would be split into the remaining offerings, ready for the next scheduling cycle.
The advantage is that we would not lose the information that binding to the bigger offering will in fact remove the big one and leave only smaller offerings. If the scheduler had such information, it could decide to bind to a smaller one when it does not need the big one; without it, the scheduler would blindly pick whichever small one matches.
We could say that we could achieve the same thing using scoring, which is probably true, but another advantage is reducing the number of offerings (ResourceSlices) that the scheduler needs to process.
I'm aware that creating new ResourceSlices may be quite a heavy process that requires API server round trips, but I suspect that the DRA plugin could generate such ResourceSlices in memory just for scheduling purposes and perform the split once the scheduler decides to bind.
This approach might be especially useful if we want to offer resources of a continuous type, like memory. It's not possible to offer all memory allocation possibilities, but it should be possible to offer the maximum available memory on a given node. Once claimed, a new ResourceSlice would appear with the remaining memory.
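To make the proposal above concrete, here is a rough sketch of what a "maximum available allocation" offering and its post-claim remainder could look like. This only illustrates the idea being discussed, not anything the KEP defines; the driver name, device names, node name, and the single `memory` capacity are assumptions, and the field layout loosely follows the v1beta1 ResourceSlice API, so exact field names may differ.

```yaml
# Hypothetical initial offering: one device exposing all of the node's free memory.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-1-memory            # assumed name
spec:
  driver: memory.example.com     # assumed driver
  nodeName: node-1               # assumed node
  pool:
    name: node-1-memory
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: mem-all
    basic:
      capacity:
        memory:
          value: 512Gi           # the biggest possible allocation
---
# After a claim consumes 128Gi, the plugin would perform the split and publish
# the remaining capacity as a new offering (pool generation bumped).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-1-memory
spec:
  driver: memory.example.com
  nodeName: node-1
  pool:
    name: node-1-memory
    generation: 2
    resourceSliceCount: 1
  devices:
  - name: mem-remaining
    basic:
      capacity:
        memory:
          value: 384Gi
```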
The approach taken in DRA for now is that the ResourceSlice should list all available devices. There have been early discussions about relaxing this requirement, either by dynamically creating smaller devices like you propose here, or by representing a large number of identical devices in a more compact way. I suspect it is something we will look into at some point, since there are certain devices that support a very large number of partitions (AWS Inferentia devices are an example of this: https://docs.google.com/document/d/1lXGfnrBixRIMW9ESa-mv09Kisb2myVFV_A3nqPJ4FCQ/edit?tab=t.0#bookmark=id.mmv8k6mzxscm). But it is out of scope for the current change.
I don't have other concerns from the scheduler perspective, except this question regarding the approach to producing available offerings, but I don't think it's blocking.
Force-pushed from 7f25e5f to 3db8e71.
Force-pushed from 3db8e71 to e9bedb8.
Thank you @mortent, this looks great. /lgtm
@mimowo PTAL in case there are any major implications for Kueue and group scheduling in general.
We knew it wasn't going to stay "simple".
/approve on API
One aspect that is not discussed in this KEP is the high chance that the pods become unschedulable. IIUC, scheduling of the first pod determines the subset of nodes on which the remaining pods can be scheduled. Obviously, when there are no classic resources available on the selected nodes, the remaining pods won't be scheduled anywhere else.
Regarding the DRA-Kueue integration, it's a broad topic and we need to discuss it separately. Multi-node support in fact introduces an implicit topology, and the scheduler is currently not capable of guaranteeing scheduling of a group of pods (hence the issue above). On the other hand, Kueue supports TAS, so if Kueue were aware of DRA resource availability and this inter-chip topology, it could make a better decision about placement, but it does not have the capability to reserve resources (i.e., to guarantee that the scheduler doesn't schedule other pods in the meantime).
This is discussed in the KEP, but I've added another sentence to make it clearer: https://github.com/kubernetes/enhancements/pull/5069/files#diff-8985915ed27658e0b602c60b13c47d5d7e0309f9dff42df6cc9d514da47d652fR1248-R1252. DRA doesn't guarantee that the pods will be scheduled; it just makes sure they never get scheduled on the wrong nodes.
Yeah, we definitely should discuss the DRA and Kueue integration, but I don't think that is a blocker for an alpha API.
```yaml
- name: "tpu"
  resourceClaimName: tpu-consumer-resource-claim
```
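For context, here is a hedged reconstruction of the kind of pod spec this fragment presumably comes from. Only the `tpu` / `tpu-consumer-resource-claim` entry is taken from the snippet above; the pod name, container name, and image are made up for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-consumer          # assumed name
spec:
  containers:
  - name: workload            # assumed name
    image: registry.example/tpu-workload:latest   # assumed image
    resources:
      claims:
      - name: "tpu"           # refers to the entry under spec.resourceClaims
  resourceClaims:
  - name: "tpu"
    resourceClaimName: tpu-consumer-resource-claim
```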
How should this work if a set of pods like in this example shares two or more multi-host resource claims? As I understand it, for the first pod the scheduler can make sure that the node selected for it belongs to all selected devices, but the following pods will simply get the node selectors from the first pod, which, as I imagine, can lead to situations where they will not be able to schedule if the node selectors coming from different devices / resource claims are partially disjoint.
Yeah, referencing multiple claims means that there is a chance that either the claims can't be satisfied for any node, in which case no pods will be scheduled, or they can't be satisfied for all pods, which means that only a subset of the pods will get scheduled.
The DRA scheduler plugin will generate a node selector based on the selectors from either the ResourceSlice or the device (from the field added in this PR) across all claims. It is possible that this ends up selecting no nodes.
I think that is the expected behavior, as it means the claims can't be satisfied. Obviously it would be better if we could guarantee that no pods got scheduled in this situation, but that is out of scope for DRA (at least at this point). Tools like Kueue can probably help here.
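As a hedged illustration of that failure mode, the combined node selector might look roughly like the following, assuming the requirements coming from the two claims' devices end up ANDed into a single term; the label keys and values are invented. Because no node carries both labels, the selector matches nothing and the remaining pods stay pending.

```yaml
# Hypothetical node selector combining per-device selectors from two claims.
nodeSelector:
  nodeSelectorTerms:
  - matchExpressions:
    - key: example.com/tpu-slice            # assumed label from the first claim's device
      operator: In
      values: ["slice-a"]
    - key: example.com/accelerator-island   # assumed label from the second claim's device
      operator: In
      values: ["island-b"]
```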
I think it deserves a dedicated section, especially since there are no mechanisms that could guarantee scheduling, so in fact we don't know yet how to address it in the current scheduling model. Have you considered an approach in which inter-connected nodes are only preferred, but not required? I know it would be a different design, but at least it would allow scheduling most of the time.
/lgtm
Not sure whether you were waiting for me, but also LGTM.
/approve
Please document them in the Risks & Mitigations section and the beta criteria. In particular, about:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, mortent, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 489bbad to cdbf278.
/lgtm