Labels: P2 (Important issue, but not time-critical), bug (Something that is supposed to be working, but isn't), community-backlog, core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), stability
Description
What happened + What you expected to happen
FYI, I previously reported this in #51861.
It was closed, but the issue is still there. Here is an example stack trace from Ray 2.47.0:
2025-07-09 19:42:52,988 ERROR autoscaler.py:373 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 370, in update
    self._update()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 420, in _update
    self.process_completed_updates()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 768, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/load_metrics.py", line 131, in mark_active
    assert ip is not None, "IP should be known at this time"
           ^^^^^^^^^^^^^^
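For context, here is a minimal sketch of the kind of defensive guard that would avoid the crash. This mirrors the call shown in the traceback, but the wrapper function is hypothetical and not Ray's actual code; treating a missing IP as "skip this node" rather than asserting is my assumption about a possible fix, not the maintainers' intended behavior.

```python
import logging

logger = logging.getLogger(__name__)


def mark_active_safely(load_metrics, provider, node_id):
    """Hypothetical wrapper (not Ray's actual implementation): tolerate nodes
    whose internal IP is no longer known instead of asserting."""
    ip = provider.internal_ip(node_id)
    if ip is None:
        # The node may have been terminated between its updater finishing and
        # this bookkeeping pass; skip it rather than crashing the update loop.
        logger.warning(
            "Node %s has no known internal IP; skipping mark_active.", node_id
        )
        return
    load_metrics.mark_active(ip)
```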
P.S. There also appears to be a larger regression in 2.47.0 that causes clusters to get stuck, unable to scale up or down. I don't have a smoking gun, but we had to roll back our upgrade, and I am currently investigating which version introduced the regression. If you have any other reports of this issue, or fixes, please let me know.
Versions / Dependencies
Ray version 2.47.0
Reproduction script
Run a Ray job with hundreds/thousands of tasks and let the cluster scale up quickly to hundreds or thousands of nodes. We observe this happening with clusters as small as ~150 nodes. A sketch of such a workload is shown below.
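This is a minimal sketch of the kind of workload that triggers the scale-up, not an exact reproduction of our job; the task body, sleep duration, and task count are placeholder assumptions. Any job that fans out enough CPU-bound tasks to force the autoscaler to add and then remove many nodes should exercise the same code path.

```python
import time

import ray

ray.init(address="auto")  # connect to the autoscaling cluster


@ray.remote(num_cpus=1)
def busy(i: int) -> int:
    time.sleep(30)  # hold a CPU long enough to force scale-up
    return i


# Fan out a few thousand tasks so the autoscaler launches hundreds of nodes,
# then let them finish so the cluster scales back down.
refs = [busy.remote(i) for i in range(5000)]
ray.get(refs)
```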
Issue Severity
High: It blocks me from completing my task.