[data] Allocate GPU resources in ResourceManager #54445
Conversation
Pull Request Overview
Adds GPU-specific budgeting to the ResourceManager so that operators requesting GPUs receive a slice of the cluster’s GPUs rather than an infinite budget.
- Change the GPU budget in `update_usages` to allocate “global GPUs minus current usage” for GPU operators; non-GPU operators get zero (a minimal sketch follows this list).
- Update existing tests to expect a zero GPU budget instead of `inf` in memory-only cases.
- Add `test_gpu_allocation` and `test_multiple_gpu_operators` to verify GPU allocations.
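A minimal sketch of the budgeting rule described above, written as a standalone helper rather than the actual ResourceManager internals (the function name and arguments are illustrative assumptions):

```python
def gpu_budget_for_op(op_requests_gpu: bool, global_gpus: float, op_gpu_usage: float) -> float:
    """Approximate the per-operator GPU budget this PR introduces."""
    if not op_requests_gpu:
        # Non-GPU operators no longer get an infinite budget; they get zero GPUs.
        return 0.0
    # GPU operators are budgeted the cluster's GPUs minus what they already use.
    return max(global_gpus - op_gpu_usage, 0.0)


# Example: an 8-GPU cluster where the operator already holds 2 GPUs.
assert gpu_budget_for_op(True, global_gpus=8, op_gpu_usage=2) == 6
assert gpu_budget_for_op(False, global_gpus=8, op_gpu_usage=0) == 0
```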
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| python/ray/data/_internal/execution/resource_manager.py | Implement per-op GPU budgeting in `update_usages` |
| python/ray/data/tests/test_resource_manager.py | Adjust memory-budget assertions and add GPU allocation tests |
Comments suppressed due to low confidence (2)
python/ray/data/tests/test_resource_manager.py:672
- [nitpick] Add a test case where an operator’s current GPU usage exceeds the global GPU limit to verify that the allocator properly clamps the budget to zero (a sketch follows below).
```python
def test_multiple_gpu_operators(self, restore_data_context):
```
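The suggested edge case can be expressed as a hedged, self-contained sketch; the parametrized values and the inline budget computation are assumptions, not the PR’s actual fixtures:

```python
import pytest


@pytest.mark.parametrize("global_gpus, usage, expected", [(4, 2, 2), (4, 4, 0), (4, 10, 0)])
def test_gpu_budget_clamps_to_zero(global_gpus, usage, expected):
    # Stand-in for the budget computation; a real test would drive
    # ResourceManager.update_usages() through the existing fixtures.
    budget = max(global_gpus - usage, 0)
    assert budget == expected
```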
python/ray/data/_internal/execution/resource_manager.py:701
- Cache the result of `op.min_max_resource_requirements()` in a local variable to avoid calling the method twice and improve readability.
```python
if op.min_max_resource_requirements()[1].gpu > 0:
```
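A small, self-contained sketch of the suggested refactor; `_FakeOp` and `_Resources` are stand-ins for the real operator and `ExecutionResources` types, not the PR’s code:

```python
from dataclasses import dataclass


@dataclass
class _Resources:  # stand-in for ExecutionResources
    gpu: float = 0.0


class _FakeOp:  # stand-in for a physical operator
    def min_max_resource_requirements(self):
        return _Resources(gpu=0.0), _Resources(gpu=1.0)


op = _FakeOp()

# Before: the method is called every time its result is needed.
uses_gpu = op.min_max_resource_requirements()[1].gpu > 0

# After (the suggested refactor): call once, unpack, and reuse.
_, max_requirements = op.min_max_resource_requirements()
uses_gpu = max_requirements.gpu > 0
assert uses_gpu
```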
This PR also includes a few small drive-by fixes left over from #54376. cc @alexeykudinkin
```python
# Take the tightest limit across all backpressure policies; `None` means no
# policy imposes a limit for this operator.
max_bytes_to_read = min(
    (
        limit
        for policy in backpressure_policies
        if (limit := policy.max_task_output_bytes_to_read(op)) is not None
    ),
    default=None,
)
```
Nice.
```python
DataContext.get_current().op_resource_reservation_enabled = True
DataContext.get_current().op_resource_reservation_ratio = 0.5
```
Here and for the test below -- aren't these the defaults? Are they necessary for this test?
I'd like to make the test not depend on the defaults, so it doesn't break if we change the behavior. (We just need to test the logic; the defaults don't matter.)
```python
resource_manager._mem_op_internal = dict.fromkeys([o1, o2, o3], 0)
resource_manager._mem_op_outputs = dict.fromkeys([o1, o2, o3], 0)
```
OOC why do we need to configure these?
actually not needed. removed
Allocate GPU resources in ResourceManager.
Currently we just allocate all available GPUs to all operators that need GPUs. If you have multiple GPU ops, each of them will get all GPUs.
This PR is mainly to make the resource budget reporting correct.
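For illustration, a tiny sketch of the behavior described above; the operator names and the 8-GPU cluster are made up:

```python
global_gpus = 8.0
needs_gpu = {"train": True, "inference": True, "read": False}  # op -> requests GPUs?

# Every GPU operator is currently budgeted the full cluster GPU count (minus its
# own usage, zero here); non-GPU operators get a zero GPU budget.
budgets = {op: (global_gpus if gpu else 0.0) for op, gpu in needs_gpu.items()}
assert budgets == {"train": 8.0, "inference": 8.0, "read": 0.0}
```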