Conversation

raulchen (Contributor) commented on Jul 8, 2025

Allocate GPU resources in ResourceManager.
Currently we just allocate all available GPUs to all operators that need GPUs. If you have multiple GPU ops, each of them will get all GPUs.
This PR is mainly to make the resource budget reporting correct.

Copilot AI review requested due to automatic review settings, July 8, 2025 22:14
raulchen requested a review from a team as a code owner, July 8, 2025 22:14
Copilot AI left a comment

Pull Request Overview

Adds GPU-specific budgeting to the ResourceManager so that operators requesting GPUs receive a slice of the cluster’s GPUs rather than an infinite budget.

  • Change GPU budget in update_usages to allocate “global GPUs minus current usage” for GPU operators; non-GPU operators get zero (see the sketch after this list).
  • Update existing tests to expect a zero GPU budget instead of inf in memory-only cases.
  • Add test_gpu_allocation and test_multiple_gpu_operators to verify GPU allocations.
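
As a rough, standalone sketch of that budgeting rule (not the actual Ray implementation; gpu_budget, global_gpu_limit, and gpu_used are names made up for this illustration):

def gpu_budget(op_needs_gpu: bool, global_gpu_limit: float, gpu_used: float) -> float:
    # Operators that don't request GPUs get a zero GPU budget.
    if not op_needs_gpu:
        return 0.0
    # GPU operators get whatever is left of the cluster's GPUs,
    # clamped at zero so the budget never goes negative.
    return max(0.0, global_gpu_limit - gpu_used)

# Example: a 4-GPU cluster with 1 GPU already in use leaves a budget of 3.
assert gpu_budget(True, 4.0, 1.0) == 3.0
assert gpu_budget(False, 4.0, 1.0) == 0.0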

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • python/ray/data/_internal/execution/resource_manager.py: Implement per-op GPU budgeting in update_usages
  • python/ray/data/tests/test_resource_manager.py: Adjust memory-budget assertions and add GPU allocation tests
Comments suppressed due to low confidence (2)

python/ray/data/tests/test_resource_manager.py:672

  • [nitpick] Add a test case where an operator’s current GPU usage exceeds the global GPU limit to verify that the allocator properly clamps the budget to zero.
    def test_multiple_gpu_operators(self, restore_data_context):
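
One way to exercise the clamping behavior this nitpick asks for, written as plain Python rather than against the real ResourceManager API (the inline helper is hypothetical and only mirrors the max(0, limit - usage) rule):

def test_gpu_budget_clamps_to_zero_when_usage_exceeds_limit():
    # Hypothetical helper: remaining GPUs, never negative.
    def gpu_budget(global_gpu_limit: float, gpu_used: float) -> float:
        return max(0.0, global_gpu_limit - gpu_used)

    # Usage (5 GPUs) already exceeds the global limit (4 GPUs),
    # so the remaining budget must clamp to zero rather than go negative.
    assert gpu_budget(global_gpu_limit=4.0, gpu_used=5.0) == 0.0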

python/ray/data/_internal/execution/resource_manager.py:701

  • Cache the result of op.min_max_resource_requirements() in a local variable to avoid calling the method twice and improve readability.
            if op.min_max_resource_requirements()[1].gpu > 0:
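
A rough sketch of the suggested caching refactor, with stand-in types (_Resources and _FakeOp below are mocks for illustration; the real Ray classes and call sites differ):

from dataclasses import dataclass

@dataclass
class _Resources:
    gpu: float = 0.0

class _FakeOp:
    def min_max_resource_requirements(self):
        # Returns a (min, max) pair of resource requirements, mirroring the call in the diff.
        return _Resources(gpu=0.0), _Resources(gpu=1.0)

op = _FakeOp()

# Cache the (min, max) pair once instead of indexing into a fresh call each time.
_, max_resources = op.min_max_resource_requirements()
if max_resources.gpu > 0:
    print("operator requests GPUs; compute its GPU budget")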

raulchen added 3 commits, July 10, 2025 12:55 (each signed off by Hao Chen <[email protected]>)
raulchen (Author) commented:

This PR also includes a few small drive-by fixes left over from #54376. cc @alexeykudinkin

Comment on lines +454 to +461
max_bytes_to_read = min(
(
limit
for policy in backpressure_policies
if (limit := policy.max_task_output_bytes_to_read(op)) is not None
),
default=None,
)
Reviewer (Member):
Nice.
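
For readers less familiar with this pattern, a standalone illustration (made-up limit values) of how min over a generator with a default behaves:

# Smallest non-None limit across policies, or None when no policy sets one.
limits = [None, 512, 1024]
max_bytes_to_read = min(
    (limit for limit in limits if limit is not None),
    default=None,
)
assert max_bytes_to_read == 512

# With no limits at all, min() falls back to the default instead of raising.
assert min((limit for limit in [] if limit is not None), default=None) is None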

Comment on lines +629 to +630
DataContext.get_current().op_resource_reservation_enabled = True
DataContext.get_current().op_resource_reservation_ratio = 0.5
Reviewer (Member):
Here and for the test below -- aren't these the defaults? Are they necessary for this test?

raulchen (Author):

I'd like to make the test not depend on the defaults, so it doesn't break if we change the behavior. (We just need to test the logic; the defaults don't matter.)

Comment on lines +659 to +660
resource_manager._mem_op_internal = dict.fromkeys([o1, o2, o3], 0)
resource_manager._mem_op_outputs = dict.fromkeys([o1, o2, o3], 0)
Reviewer (Member):
Out of curiosity, why do we need to configure these?

raulchen (Author):
Actually, these aren't needed; removed.

raulchen added the "go (add ONLY when ready to merge, run all tests)" label on Jul 11, 2025
raulchen merged commit b081a6c into ray-project:master on Jul 14, 2025
6 checks passed
raulchen deleted the allocate-gpu-resource branch on July 15, 2025 18:01
alimaazamat pushed a commit to alimaazamat/ray that referenced this pull request on Jul 24, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request on Sep 11, 2025