[core] EC2 Autoscaler race condition #54920

@akazakov

Description

What happened + What you expected to happen

FYI, I previously reported this in #51861.
That issue was closed, but the problem is still there. Here is an example stack trace from Ray 2.47.0:

2025-07-09 19:42:52,988 ERROR autoscaler.py:373 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 370, in update
    self._update()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 420, in _update
    self.process_completed_updates()
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/autoscaler.py", line 768, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/opt/ml/lib/python3.12/site-packages/ray/autoscaler/_private/load_metrics.py", line 131, in mark_active
    assert ip is not None, "IP should be known at this time"
           ^^^^^^^^^^^^^^
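
The failure mode in the trace can be illustrated in isolation. The sketch below is not Ray's code: `FakeProvider`, `FakeLoadMetrics`, and both `process_completed_updates_*` functions are hypothetical stand-ins, and it assumes (without confirmation from the trace) that the provider's `internal_ip` lookup can return `None` for a node whose IP is not yet visible. Under that assumption, the strict variant trips the same assertion, while a tolerant variant defers such nodes to a later pass.

```python
from typing import Dict, List, Optional


class FakeProvider:
    """Hypothetical stand-in for a node provider; NOT Ray's EC2 provider."""

    def __init__(self) -> None:
        self._ips: Dict[str, Optional[str]] = {}

    def add_node(self, node_id: str, ip: Optional[str] = None) -> None:
        # ip=None models a node whose IP is not yet known (assumption).
        self._ips[node_id] = ip

    def internal_ip(self, node_id: str) -> Optional[str]:
        return self._ips.get(node_id)


class FakeLoadMetrics:
    """Hypothetical stand-in that mirrors only the assertion seen in the trace."""

    def __init__(self) -> None:
        self.active_ips: List[str] = []

    def mark_active(self, ip: Optional[str]) -> None:
        assert ip is not None, "IP should be known at this time"
        self.active_ips.append(ip)


def process_completed_updates_strict(
    provider: FakeProvider, metrics: FakeLoadMetrics, node_ids: List[str]
) -> None:
    # Same call shape as the failing line: the lookup result is passed unchecked.
    for node_id in node_ids:
        metrics.mark_active(provider.internal_ip(node_id))


def process_completed_updates_tolerant(
    provider: FakeProvider, metrics: FakeLoadMetrics, node_ids: List[str]
) -> List[str]:
    # Skip nodes without an IP and return them so a later pass can retry.
    deferred: List[str] = []
    for node_id in node_ids:
        ip = provider.internal_ip(node_id)
        if ip is None:
            deferred.append(node_id)
            continue
        metrics.mark_active(ip)
    return deferred
```

Whether deferring is the right behavior for the real autoscaler is an open question; the sketch only shows that one unresolved IP is enough to abort the whole update pass in the strict form.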

PS: There also appears to be a larger regression in 2.47.0 that causes clusters to get stuck, unable to scale up or down. I don't have a smoking gun, but we had to roll back our upgrade. I'm currently investigating which version introduced the regression. If you have any other reports of this issue, or fixes for it, please let me know.

Versions / Dependencies

Ray version 2.47.0

Reproduction script

Run a Ray job with hundreds or thousands of tasks and let the cluster scale up quickly to hundreds or thousands of nodes. We observe this happening with clusters as small as ~150 nodes.

Issue Severity

High: It blocks me from completing my task.

Metadata

Labels

P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
community-backlog
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
stability
