[data] Extract backpressure-related code from ResourceManager as a policy #54376

Conversation
Pull Request Overview

This PR extracts the backpressure logic from `ResourceManager` into a standalone `ResourceBudgetBackpressurePolicy`, integrates configurable backpressure policies into the streaming executor for both task submission and task output, and adds a new `task_output_backpressure_time` metric.

- Introduce a `BackpressurePolicy` interface with methods for input (`can_add_input`) and output (`max_task_output_bytes_to_read`) backpressure, and extract a resource-based policy (a sketch of the interface follows this list).
- Refactor `process_completed_tasks` and `get_eligible_operators` to use a list of backpressure policies instead of tightly coupling to `ResourceManager`.
- Add a `task_output_backpressure_time` metric in `OpRuntimeMetrics` and notify operators of output backpressure.
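For orientation, here is a minimal sketch of the `BackpressurePolicy` interface described above. The constructor arguments and method names are taken from this PR's description and file summary; the default return values and bodies are assumptions, not the actual Ray Data implementation.

```python
# Minimal sketch only, not the actual Ray Data source.
from abc import ABC
from typing import Optional


class BackpressurePolicy(ABC):
    def __init__(self, data_context, topology, resource_manager):
        # Common parameters passed by the policy factory (per this PR).
        self._data_context = data_context
        self._topology = topology
        self._resource_manager = resource_manager

    def can_add_input(self, op) -> bool:
        # Task-submission backpressure: returning False blocks submitting new tasks for `op`.
        return True

    def max_task_output_bytes_to_read(self, op) -> Optional[int]:
        # Task-output backpressure: None means no limit on reading `op`'s task outputs.
        return None
```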
Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| python/ray/data/tests/test_streaming_executor.py | Updated tests to call `process_completed_tasks` and `get_eligible_operators` with policy lists and introduced a `TestBackpressurePolicy`. |
| python/ray/data/tests/test_resource_manager.py | Replaced direct `allocator.can_submit_new_task` calls with a helper that uses `get_budget`. |
| python/ray/data/_internal/execution/streaming_executor_state.py | Refactored `process_completed_tasks` and `get_eligible_operators` to use backpressure policy lists. |
| python/ray/data/_internal/execution/streaming_executor.py | Updated executor initialization to pass `data_context`, `topology`, and `resource_manager` into the policy factory and use a list of policies. |
| python/ray/data/_internal/execution/resource_manager.py | Exposed `max_task_output_bytes_to_read` and `get_budget` wrappers and removed `can_submit_new_task` from the allocator interface. |
| python/ray/data/_internal/execution/interfaces/physical_operator.py | Added `notify_in_task_output_backpressure` for output backpressure metrics. |
| python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py | Added `task_output_backpressure_time` metric and timing logic. |
| python/ray/data/_internal/execution/backpressure_policy/backpressure_policy.py | Defined `BackpressurePolicy` base class with a new constructor and a default output method. |
| python/ray/data/_internal/execution/backpressure_policy/resource_budget_backpressure_policy.py | New `ResourceBudgetBackpressurePolicy` implementation. |
| python/ray/data/_internal/execution/backpressure_policy/concurrency_cap_backpressure_policy.py | Adapted constructor to accept common parameters via `super()`. |
| python/ray/data/_internal/execution/backpressure_policy/__init__.py | Updated `get_backpressure_policies` to pass `data_context`, `topology`, and `resource_manager`. |
Comments suppressed due to low confidence (1)

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py:361

Tests for the new `task_output_backpressure_time` metric are missing. Add unit tests to verify that `on_toggle_task_output_backpressure` correctly starts, stops, and accumulates the backpressure timer.

```python
    task_output_backpressure_time: float = metric_field(
```
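Here is a self-contained sketch of the start/stop/accumulate behavior such a test would pin down. It uses a hypothetical `ToggleTimer` with an injected clock in place of `OpRuntimeMetrics` and wall time; it illustrates the expected semantics, not the real metric code.

```python
# Hypothetical stand-in for the metric's toggle-timer behavior, with an injected clock.
class ToggleTimer:
    def __init__(self, clock):
        self._clock = clock
        self._start = None
        self.total = 0.0

    def on_toggle(self, active: bool) -> None:
        if active and self._start is None:
            self._start = self._clock()                 # start timing on False -> True
        elif not active and self._start is not None:
            self.total += self._clock() - self._start   # accumulate on True -> False
            self._start = None


def test_toggle_timer_accumulates():
    now = [0.0]
    timer = ToggleTimer(clock=lambda: now[0])
    timer.on_toggle(True)
    now[0] = 2.0
    timer.on_toggle(False)   # +2s
    now[0] = 5.0
    timer.on_toggle(True)
    now[0] = 6.0
    timer.on_toggle(False)   # +1s
    assert timer.total == 3.0
```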
Small nits but lgtm!
```python
        if max_bytes_to_read is not None:
            max_bytes_to_read_per_op[state] = max_bytes_to_read
    for op, state in topology.items():
        # Check all backpressure policies for max_task_output_bytes_to_read
```
Nit: For future debugging purposes, do we want to log the backpressure policy that is being used? Wondering about the case where one might be overly restrictive but it becomes hard to tell which one?
Yes, I thought of that, but I'm not sure what's the best way to surface this info (too long for progress bars). I'll leave a TODO here.
Yeah I wonder if just logging at the debug level would be sufficient? If that would cause it to only appear in the ray_data.log file then maybe that would work? Fine to punt on that though.
Even logging might be too verbose. Given that we don't have that many policies right now, logging doesn't add much value. So I'll leave it for now.
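For reference, a hedged sketch of what that TODO could look like: track which policy produced the most restrictive output limit and log it at debug level so it only shows up in verbose logs. The function name and logger name are illustrative, not the actual Ray Data code.

```python
import logging

logger = logging.getLogger("ray.data")  # illustrative logger name


def most_restrictive_output_limit(op, backpressure_policies):
    limiting_policy = None
    max_bytes_to_read = None
    for policy in backpressure_policies:
        limit = policy.max_task_output_bytes_to_read(op)
        if limit is not None and (max_bytes_to_read is None or limit < max_bytes_to_read):
            limiting_policy = policy
            max_bytes_to_read = limit
    if limiting_policy is not None:
        # Debug level keeps this out of normal output while remaining traceable.
        logger.debug(
            "Task output of %s limited to %d bytes by %s",
            op,
            max_bytes_to_read,
            type(limiting_policy).__name__,
        )
    return max_bytes_to_read
```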
```python
        in_backpressure = not under_resource_limits or not all(
            p.can_add_input(op) for p in backpressure_policies
        )
        # Operator is considered being in task-submission back-pressure any
```
Nit: I think this is missing an "if": "Operator is considered being in task-submission back-pressure any"
-> "Operator is considered being in task-submission back-pressure if any"
```python
    task_output_backpressure_time: float = metric_field(
        default=0,
        description="Time spent in task output backpressure.",
        metrics_group=MetricsGroup.TASKS,
    )
```
Should we track this as a per-task metric? I don't think the cumulative one is very useful.
It keeps track of the total time the op is backpressured. An op-level metric is more useful because we can show it on the Grafana dashboard; a task-level metric would result in too many lines and make the chart difficult to read.
How can the cumulative time be useful?

> task-level metric will result in too many lines

Well, you don't show every line, you just show the distribution.
We can handle that later. Currently we already report a cumulative metric for task submission backpressure time.
```diff
-    def get_budget(self, op: PhysicalOperator) -> ExecutionResources:
-        return self._op_budgets[op]
+    def get_budget(self, op: PhysicalOperator) -> Optional[ExecutionResources]:
+        return self._op_budgets.get(op, None)
```
Suggested change:

```diff
-        return self._op_budgets.get(op, None)
+        return self._op_budgets.get(op)
```
I intentionally changed this to return an Optional, because ineligible ops don't have budgets.
Yeah, I understand that, but what's the point of specifying the default? It will return None anyway.
Got it. Somehow I thought it raises KeyError by default.
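To spell out the dict semantics being settled here (the key is just a placeholder):

```python
# .get() already defaults to None; plain subscripting is what raises KeyError.
budgets = {}
assert budgets.get("missing_op") is None        # same as budgets.get("missing_op", None)
try:
    budgets["missing_op"]
except KeyError:
    pass
```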
```python
        max_bytes_to_read = None
        for policy in backpressure_policies:
            policy_limit = policy.max_task_output_bytes_to_read(op)
            if policy_limit is not None:
                if max_bytes_to_read is None:
                    max_bytes_to_read = policy_limit
                else:
                    max_bytes_to_read = min(max_bytes_to_read, policy_limit)
```
Suggested change:

```diff
-        max_bytes_to_read = None
-        for policy in backpressure_policies:
-            policy_limit = policy.max_task_output_bytes_to_read(op)
-            if policy_limit is not None:
-                if max_bytes_to_read is None:
-                    max_bytes_to_read = policy_limit
-                else:
-                    max_bytes_to_read = min(max_bytes_to_read, policy_limit)
+        max_bytes = min([p.max_task_output_bytes_to_read(op) or float("inf") for p in policies])
```
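One hedged aside on the suggestion above: `x or float("inf")` also treats a limit of 0 bytes as "no limit", and it returns `inf` instead of `None` when no policy sets a limit. A None-safe variant that keeps the original loop's semantics could look like this sketch (the helper name is hypothetical):

```python
def min_limit(limits):
    """Return the smallest non-None limit, or None if no policy imposes one."""
    present = [limit for limit in limits if limit is not None]
    return min(present) if present else None


assert min_limit([None, 4, 0]) == 0      # a limit of 0 is preserved
assert min_limit([None, None]) is None   # no policy imposes a limit
```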
```python
        # Operator is considered being in task-submission back-pressure if any
        # back-pressure policy is violated
        in_backpressure = any(not p.can_add_input(op) for p in backpressure_policies)
```
As @omatthew98 pointed out above -- let's make back-pressuring traceable (by logging when the op becomes throttled for the first time).
The op will be flipping between backpressure and non-backpressure status pretty frequently. I don't think logging the first time would be useful.
It should be every time it goes from not-throttled to being throttled.
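A hedged sketch of that edge-triggered logging: emit a debug log only on the transition from not-throttled to throttled. How the previous state is stored and the logger name are assumptions, not the actual implementation.

```python
import logging

logger = logging.getLogger("ray.data")  # illustrative logger name


def record_submission_backpressure(op, in_backpressure: bool, previously_backpressured: bool) -> bool:
    # Edge-triggered: log only when the operator newly becomes throttled.
    if in_backpressure and not previously_backpressured:
        logger.debug("Operator %s entered task-submission backpressure.", op)
    return in_backpressure  # caller stores this as the new "previous" state
```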
```python
    ) as _mock:
        _mock.side_effect = lambda op: False if op is o2 else True
        assert _get_eligible_ops_to_run(ensure_liveness=False) == [o3]


class TestBackpressurePolicy(BackpressurePolicy):
```
Let's test the actual back-pressure policies that we have.
```python
    def can_add_input(self, op: "PhysicalOperator") -> bool:
        budget = self._resource_manager.get_budget(op)
        if budget is None:
            return True
        return op.incremental_resource_usage().satisfies_limit(budget)

    def max_task_output_bytes_to_read(self, op: "PhysicalOperator") -> Optional[int]:
        """Determine maximum bytes to read based on the resource budgets.

        Args:
            op: The operator to get the limit for.

        Returns:
            The maximum bytes that can be read, or None if no limit.
        """
        return self._resource_manager.max_task_output_bytes_to_read(op)
```
Let's make sure we keep existing coverage of this
yeah, it's already tested in test_resource_manager.
[data] Extract backpressure-related code from ResourceManager as a policy (ray-project#54376)

* Extract backpressure-related methods (`can_submit_new_tasks` and `max_task_output_bytes_to_read`) from ResourceManager as a standalone policy (`ResourceBudgetBackpressurePolicy`)
* Report task_output_backpressure_time metric.

Signed-off-by: Hao Chen <[email protected]>