Open
Labels
@external-author-action-required (alternate tag for PRs where the author doesn't have labeling permission), P2 (important issue, but not time-critical), azure, community-backlog, core (issues that should be addressed in Ray Core), stability
Description
I was running a Ray Autoscaler sample script on Azure when I noticed that the script was running only on the head node and that no worker nodes were launching. The error message was as follows:
File "/anaconda/lib/python3.8/site-packages/ray/autoscaler/_private/node_launcher.py", line 85, in run
self._launch_node(config, count, node_type)
File "/anaconda/lib/python3.8/site-packages/ray/autoscaler/_private/node_launcher.py", line 67, in _launch_node
self.provider.create_node(node_config, node_tags, count)
File "/anaconda/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 220, in create_node
create(
File "/anaconda/lib/python3.8/site-packages/azure/mgmt/resource/resources/v2021_04_01/operations/_deployments_operations.py", line 3008, in create_or_update
raw_result = self._create_or_update_initial(
File "/anaconda/lib/python3.8/site-packages/azure/mgmt/resource/resources/v2021_04_01/operations/_deployments_operations.py", line 2964, in _create_or_update_initial
raise exp
msrestazure.azure_exceptions.CloudError: Azure Error: MultipleErrorsOccurred
Message: Multiple error occurred: BadRequest,BadRequest. Please see details.
Exception Details:
Error Code: InvalidTemplateDeployment
Message: The template deployment failed with error: 'The resource with id: '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a0' failed validation with message: 'The requested size for resource '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a0' is currently not available in location 'westus' zones '' for subscription '42158598-412c-477a-a552-c34f8e8debde'. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.'.'.
Error Code: InvalidTemplateDeployment
Message: The template deployment failed with error: 'The resource with id: '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a1' failed validation with message: 'The requested size for resource '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a1' is currently not available in location 'westus' zones '' for subscription '42158598-412c-477a-a552-c34f8e8debde'. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.'.'.
^C
Shared connection to 168.62.205.103 closed.
I later found out this was caused by the worker node section:
ray.worker.default:
    # The minimum number of worker nodes of this type to launch.
    # This number should be >= 0.
    min_workers: 0
    # The maximum number of worker nodes of this type to launch.
    # This takes precedence over min_workers.
    max_workers: 2
    # The resources provided by this node type.
    resources: {"CPU": 2}
    # Provider-specific config, e.g. instance type.
    node_config:
        azure_arm_parameters:
            vmSize: Standard_D2s_v3
            # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
            imagePublisher: microsoft-dsvm
            imageOffer: ubuntu-1804
            imageSku: 1804-gen2
            imageVersion: 21.07.12
            # optionally set priority to use Spot instances
            priority: Spot
            # set a maximum price for spot instances if desired
            # billingProfile:
            #     maxPrice: -1
If I remove the "priority: Spot" line, Ray is able to obtain and launch worker nodes. So it looks like when Ray cannot get Spot instances with this line enabled, it does not fall back to regular instances either.
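To make the expected behavior concrete, here is a minimal sketch of a Spot-to-regular fallback. It is not Ray's implementation. The names `CapacityUnavailableError`, `without_spot`, `create_with_fallback`, and `create_fn` are hypothetical and stand in for the real deployment call and for however the "size not available" failure would be detected. It also assumes that dropping `priority` (and `billingProfile`, if set) is enough to request regular instances, which matches what happened when the line was removed by hand.

```python
import copy


class CapacityUnavailableError(Exception):
    """Hypothetical stand-in for the provider error raised when a VM size is unavailable."""


def without_spot(arm_parameters):
    """Return a copy of the ARM parameters with the Spot-specific keys removed."""
    regular = copy.deepcopy(arm_parameters)
    regular.pop("priority", None)
    regular.pop("billingProfile", None)
    return regular


def create_with_fallback(create_fn, arm_parameters):
    """Try the requested parameters first, then retry as regular VMs.

    create_fn is a hypothetical callable standing in for the real
    deployment call; it is assumed to raise CapacityUnavailableError
    when the requested size cannot be allocated.
    """
    try:
        return create_fn(arm_parameters)
    except CapacityUnavailableError:
        if arm_parameters.get("priority") != "Spot":
            raise  # already a regular request, nothing to fall back to
        return create_fn(without_spot(arm_parameters))
```

With a fallback along these lines, a failed Spot request would be retried once without the Spot settings, while a failed regular request would still surface the original error.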
CC'ing @DmitriGekhtman