**keps/sig-scheduling/4816-dra-prioritized-list/README.md** (67 changes: 52 additions & 15 deletions)
@@ -91,6 +91,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Scheduler Implementation](#scheduler-implementation)
- [Scoring](#scoring)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit tests](#unit-tests)
@@ -290,15 +291,13 @@ type `DeviceSubRequest`. The `DeviceSubRequest` type is similar to
available when providing multiple alternatives. The list provided in the
`FirstAvailable` field is considered a priority order, such that the
scheduler will use the first entry in the list that satisfies the
requirements.

-DRA does not yet implement scoring, which means that
-the selected devices might not be optimal. For example, if a prioritized
-list is provided in a request, DRA might choose entry number two on node A,
-even though entry number one would meet the requirements on node B. This is
-consistent with current behavior in DRA where the first match will be chosen.
-Scoring is something that will be implemented later, with early discussions
-in https://github.com/kubernetes/enhancements/issues/4970
+DRA does not yet implement full scoring (tracked in
+https://github.com/kubernetes/enhancements/issues/4970), but we will implement
+a limited form of scoring for this feature. This is to make sure nodes which
+can satisfy a claim with higher-ranked subrequests are preferred over others. The
+details are described in the [Scoring](#scoring) section.
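
To make the priority-order semantics concrete, here is a minimal, standalone Go sketch. The `firstAvailable` helper, the string device names, and the `free` map are hypothetical illustrations, not the actual allocator, which operates on `DeviceSubRequest` entries and structured parameters:

```go
package main

import "fmt"

// firstAvailable returns the index of the first alternative accepted by the
// fits predicate, mirroring the priority-order semantics of the
// FirstAvailable list: earlier entries are preferred, and the scan stops at
// the first entry that can be satisfied.
func firstAvailable[T any](alternatives []T, fits func(T) bool) (int, bool) {
	for i, alt := range alternatives {
		if fits(alt) {
			return i, true
		}
	}
	return -1, false // no alternative can be satisfied on this node
}

func main() {
	// Subrequests ordered by preference: a large GPU first, a smaller fallback.
	subRequests := []string{"gpu-80gb", "gpu-40gb"}
	// On this hypothetical node only the smaller model is free.
	free := map[string]bool{"gpu-40gb": true}

	if i, ok := firstAvailable(subRequests, func(g string) bool { return free[g] }); ok {
		fmt.Printf("allocated subrequest %d (%s)\n", i+1, subRequests[i])
	}
}
```

With exactly this input the node is still feasible via the fallback device; the scoring described below is what makes a node that can satisfy the first entry preferred.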

### User Stories

@@ -619,13 +618,11 @@ type DeviceRequest struct {
//
// This field may only be set in the entries of DeviceClaim.Requests.
//
-// DRA does not yet implement scoring, so the scheduler will
-// select the first set of devices that satisfies all the
-// requests in the claim. And if the requirements can
-// be satisfied on more than one node, other scheduling features
-// will determine which node is chosen. This means that the set of
-// devices allocated to a claim might not be the optimal set
-// available to the cluster. Scoring will be implemented later.
+// DRA does not yet implement full scoring, but it implements limited
+// scoring so that nodes that can satisfy high-ranked subrequests are
+// preferred over others. The node ultimately chosen also depends on
+// other scheduling features, so it is not guaranteed that the node
+// preferred by DRA is chosen.
//
// +optional
// +oneOf=deviceRequestType
@@ -899,6 +896,46 @@ would need a higher score, which currently is planned for beta of this feature.
For alpha, the scheduler may still pick a node with a less preferred device, if
there are nodes with each type of device available.

#### Scoring

> **Member:** Can you add a sentence about how scoring would affect ResourceClaims shared by many pods? IIUC the first pod assigned to the ResourceClaim will in fact pick the device, so the scoring for the following pods won't even matter, but it may be good to clarify that.
>
> **Member (Author):** Good point. I've updated the PR to explain the behavior when multiple pods reference a ResourceClaim.

Full support for scoring in DRA is not in scope for this feature, but we will
implement limited scoring to make sure that nodes that can satisfy claims with
higher-ranked subrequests are preferred over others.

We will do this by having the dynamicresources scheduler plugin implement
the `Score` and `NormalizeScore` interfaces.

The allocation result for each node will be given a score based on the ranking of
the chosen subrequests across all requests using the `FirstAvailable` field, across
all claims referenced by the Pod. Since the number of subrequests for each request
is capped at 8, we will compute a score between 1 and 8 for each request, with 8
being the best (i.e. the first subrequest was chosen) and 1 the worst (the 8th
subrequest was chosen). If more than one request uses the `FirstAvailable` field,
the scores from all of them will be added up to get the score for the pod on the
node. Since the score for every node is computed based on the same claims, we end
up with a ranking of the results from all nodes.
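
A minimal, standalone Go sketch of this raw scoring follows; `nodeAllocation`, `chosenRanks`, and `rawScore` are hypothetical names for illustration, not the plugin's actual code, which would read the chosen subrequest indices from the allocation result:

```go
package main

import "fmt"

// nodeAllocation is a hypothetical stand-in for the allocation result on one
// node: chosenRanks holds the 1-based index of the subrequest picked for each
// request that uses FirstAvailable, across all claims referenced by the pod.
type nodeAllocation struct {
	nodeName    string
	chosenRanks []int
}

const maxSubRequests = 8

// rawScore sums the per-request scores: a request scores 8 when its first
// subrequest was chosen and 1 when its 8th was, i.e. 9 - rank.
func rawScore(a nodeAllocation) int64 {
	var score int64
	for _, rank := range a.chosenRanks {
		score += int64(maxSubRequests + 1 - rank)
	}
	return score
}

func main() {
	// Two FirstAvailable requests: node-a satisfies both with the top-ranked
	// subrequest; node-b needs the second-ranked one for one of them.
	fmt.Println(rawScore(nodeAllocation{"node-a", []int{1, 1}})) // 16
	fmt.Println(rawScore(nodeAllocation{"node-b", []int{1, 2}})) // 15
}
```

Note that with linear scoring, shallow fallbacks on two requests (ranks 2 and 2, score 14) tie with a deep fallback on one request (ranks 1 and 3, also 14), which is the trade-off discussed in the thread below.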
> **Contributor:** Linear ranking might not match user intent. Would it make sense to use exponential ranking here, giving more priority to the nodes with higher-ranked devices?
>
> **Member (Author):** Are you suggesting that we should have something like this: the lowest ranked option gets a score of 1, then 2, 4, 8, 16, ...? I did think about other ways to do ranking, but none seemed clearly better than linear ranking. As an example, if I have a claim with two requests, each with three subrequests, would an allocation where the first subrequest gets allocated on the first request and the third on the second request be better than the second on both? I think linear has the benefit that it is pretty easy to understand and reason about.
>
> **Contributor (@bart0sh, Oct 14, 2025):** My idea was that users usually prefer the most performant hardware, so with your example it might be that a user would prefer first subrequest + third over second + second. Does this make sense to you?
>
> **Member:** I think that linear scoring is consistent with how other plugins score. We don't use exponential ranking anywhere. However, if you decide that exponential scoring is better here, I'm okay with that.
>
> **Contributor:** I'm ok with either option.
>
> **Member (Author):** I understand the argument, I'm just not sure if one is more appropriate than the other in the general case. Therefore I'm thinking we should implement the most intuitive solution.
>
> **Contributor:** Agreed!

> **Contributor (@bart0sh, Oct 10, 2025):** Would it make sense to explain how the total node score is computed if a pod requests multiple claims/devices? Is it a sum of the scores for each claim, a weighted sum, an average, etc.?
>
> **Member (Author):** I've added a sentence about this. I think we should just do a sum, since all scores for a single pod will have the same claims. And we will do normalization anyway to make sure the score falls within the allowed boundaries in the scheduling framework.

We will implement the `NormalizeScore` interface to normalize the results. We will do
this in a way where the score for the worst node is given a value of 0 and the
score for the best node is given a value of 100. This is easy to compute based
only on the available scores, using the formula
`(currentNodeScore - minScore) * 100 / (maxScore - minScore)`. This makes sure that
all options are ranked across the full range of available values. We also considered
ranking the scores in a way where the optimal solution for a pod is given a score of
100 and the worst possible option (where the last available option was chosen for
every use of the `FirstAvailable` field across all claims referenced by the pod) is
given a score of 0. We have chosen not to use this approach, as it isn't obviously
better than the chosen approach but is more complicated.

*Comment on lines +924 to +929:*

> **Contributor:** Both approaches look the same, unless I'm missing something. The max normalized score will be 100 and the min normalized score will be 0.
>
> **Member (Author):** The difference is that the proposed approach only looks at the actual scores from the nodes, while the rejected one would look at the theoretical set of possible scores.
>
> As an example, assume we have a claim with a single request that contains four subrequests. Then let's assume that for node 1, subrequest number 2 was chosen, while for node 2, subrequest number 3 was chosen. As a result, we get a score of 7 for node 1 and 6 for node 2.
>
> Normalizing with the first alternative would just look at the scores of 7 and 6 and normalize them between 0 and 100. So node 1 will get a score of 100, while node 2 will get a score of 0.
>
> With the alternative solution, we would look at the claim and see that the possible scores are 8, 7, 6, and 5 (since we have 4 subrequests). So we would use the same formula, just with different values for maxScore and minScore. The result will be that node 1 gets a score of 67, while node 2 gets a score of 33. So in this case neither node will get a score of 100, since neither of them ended up with the "best" allocation.
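
To make the normalization concrete, here is a minimal, standalone Go sketch using the numbers from the thread above; `normalize` is a hypothetical helper, not the plugin's actual `NormalizeScore` implementation:

```go
package main

import (
	"fmt"
	"math"
)

// normalize rescales raw node scores with the formula from the KEP,
// (currentNodeScore - minScore) * 100 / (maxScore - minScore), so the worst
// node ends up at 0 and the best at 100.
func normalize(scores map[string]int64) map[string]int64 {
	minScore, maxScore := int64(math.MaxInt64), int64(math.MinInt64)
	for _, s := range scores {
		if s < minScore {
			minScore = s
		}
		if s > maxScore {
			maxScore = s
		}
	}
	out := make(map[string]int64, len(scores))
	for node, s := range scores {
		if maxScore == minScore {
			out[node] = 100 // all nodes scored equally; avoid dividing by zero
			continue
		}
		out[node] = (s - minScore) * 100 / (maxScore - minScore)
	}
	return out
}

func main() {
	// The example from the review thread: subrequest 2 chosen on node-1
	// (raw score 7), subrequest 3 on node-2 (raw score 6).
	fmt.Println(normalize(map[string]int64{"node-1": 7, "node-2": 6}))
	// Output: map[node-1:100 node-2:0]
}
```

As the thread notes, under this approach the best node always normalizes to 100 and the worst to 0, regardless of whether either received the theoretically optimal allocation.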


We will give the plugin a weight of 2 since it reflects scoring based on user preference.

If multiple pods reference a `ResourceClaim`, the allocation of devices is decided
when the first pod is scheduled. Any later pods referencing the claim must be scheduled
on nodes where the allocated devices are available. But since the devices have
already been allocated, the dynamicresources scheduler plugin will not do any scoring
for later pods.

### Test Plan

<!--
**keps/sig-scheduling/4816-dra-prioritized-list/kep.yaml** (3 changes: 1 addition & 2 deletions)
@@ -28,13 +28,12 @@ stage: beta
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.34"
latest-milestone: "v1.35"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.33"
beta: "v1.34"
stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled