[data] Allocate GPU resources in ResourceManager #54445
Conversation
Pull Request Overview
Adds GPU-specific budgeting to the ResourceManager so that operators requesting GPUs receive a slice of the cluster’s GPUs rather than an infinite budget.
- Change the GPU budget in `update_usages` to allocate “global GPUs minus current usage” for GPU operators; non-GPU operators get zero (a minimal sketch follows this list).
- Update existing tests to expect a zero GPU budget instead of `inf` in memory-only cases.
- Add `test_gpu_allocation` and `test_multiple_gpu_operators` to verify GPU allocations.
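A minimal sketch of the budgeting rule described above, written as a standalone helper rather than the actual ResourceManager internals (the function name and arguments are illustrative assumptions):

```python
def gpu_budget_for_op(op_requests_gpu: bool, global_gpus: float, op_gpu_usage: float) -> float:
    """Approximate the per-operator GPU budget this PR introduces."""
    if not op_requests_gpu:
        # Non-GPU operators no longer get an infinite budget; they get zero GPUs.
        return 0.0
    # GPU operators are budgeted the cluster's GPUs minus what they already use.
    return max(global_gpus - op_gpu_usage, 0.0)


# Example: an 8-GPU cluster where the operator already holds 2 GPUs.
assert gpu_budget_for_op(True, global_gpus=8, op_gpu_usage=2) == 6
assert gpu_budget_for_op(False, global_gpus=8, op_gpu_usage=0) == 0
```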
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| python/ray/data/_internal/execution/resource_manager.py | Implement per-op GPU budgeting in `update_usages` |
| python/ray/data/tests/test_resource_manager.py | Adjust memory-budget assertions and add GPU allocation tests |
Comments suppressed due to low confidence (2)
python/ray/data/tests/test_resource_manager.py:672
- [nitpick] Add a test case where an operator’s current GPU usage exceeds the global GPU limit to verify that the allocator properly clamps the budget to zero (a sketch follows below).
```python
def test_multiple_gpu_operators(self, restore_data_context):
```
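The suggested edge case can be expressed as a hedged, self-contained sketch; the parametrized values and the inline budget computation are assumptions, not the PR’s actual fixtures:

```python
import pytest


@pytest.mark.parametrize("global_gpus, usage, expected", [(4, 2, 2), (4, 4, 0), (4, 10, 0)])
def test_gpu_budget_clamps_to_zero(global_gpus, usage, expected):
    # Stand-in for the budget computation; a real test would drive
    # ResourceManager.update_usages() through the existing fixtures.
    budget = max(global_gpus - usage, 0)
    assert budget == expected
```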
python/ray/data/_internal/execution/resource_manager.py:701
- Cache the result of `op.min_max_resource_requirements()` in a local variable to avoid calling the method twice and improve readability.
```python
if op.min_max_resource_requirements()[1].gpu > 0:
```
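A small, self-contained sketch of the suggested refactor; `_FakeOp` and `_Resources` are stand-ins for the real operator and `ExecutionResources` types, not the PR’s code:

```python
from dataclasses import dataclass


@dataclass
class _Resources:  # stand-in for ExecutionResources
    gpu: float = 0.0


class _FakeOp:  # stand-in for a physical operator
    def min_max_resource_requirements(self):
        return _Resources(gpu=0.0), _Resources(gpu=1.0)


op = _FakeOp()

# Before: the method is called every time its result is needed.
uses_gpu = op.min_max_resource_requirements()[1].gpu > 0

# After (the suggested refactor): call once, unpack, and reuse.
_, max_requirements = op.min_max_resource_requirements()
uses_gpu = max_requirements.gpu > 0
assert uses_gpu
```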
This PR also includes a few small drive-by fixes left over from #54376. cc @alexeykudinkin
```python
# Take the tightest limit across all backpressure policies; `None` means no
# policy imposes a limit for this operator.
max_bytes_to_read = min(
    (
        limit
        for policy in backpressure_policies
        if (limit := policy.max_task_output_bytes_to_read(op)) is not None
    ),
    default=None,
)
```
Nice.
```python
DataContext.get_current().op_resource_reservation_enabled = True
DataContext.get_current().op_resource_reservation_ratio = 0.5
```
Here and for the test below -- aren't these the defaults? Are they necessary for this test?
I'd like to make the test not depend on the defaults, so it doesn't break if we change the behavior. (We just need to test the logic; the defaults don't matter.)
```python
resource_manager._mem_op_internal = dict.fromkeys([o1, o2, o3], 0)
resource_manager._mem_op_outputs = dict.fromkeys([o1, o2, o3], 0)
```
OOC why do we need to configure these?
actually not needed. removed
Allocate GPU resources in ResourceManager.
Currently we just allocate all available GPUs to all operators that need GPUs. If you have multiple GPU ops, each of them will get all GPUs.
This PR is mainly to make the resource budget reporting correct.
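For illustration, a tiny sketch of the behavior described above; the operator names and the 8-GPU cluster are made up:

```python
global_gpus = 8.0
needs_gpu = {"train": True, "inference": True, "read": False}  # op -> requests GPUs?

# Every GPU operator is currently budgeted the full cluster GPU count (minus its
# own usage, zero here); non-GPU operators get a zero GPU budget.
budgets = {op: (global_gpus if gpu else 0.0) for op, gpu in needs_gpu.items()}
assert budgets == {"train": 8.0, "inference": 8.0, "read": 0.0}
```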