Ray Autoscaler for Azure not getting Spot VMs #17414

@omnific9

Description

I was running a Ray Autoscaler sample script on Azure when I noticed that the script was running only on the head node and no worker nodes were launching. The error message was as follows:


  File "/anaconda/lib/python3.8/site-packages/ray/autoscaler/_private/node_launcher.py", line 85, in run
    self._launch_node(config, count, node_type)
  File "/anaconda/lib/python3.8/site-packages/ray/autoscaler/_private/node_launcher.py", line 67, in _launch_node
    self.provider.create_node(node_config, node_tags, count)
  File "/anaconda/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 220, in create_node
    create(
  File "/anaconda/lib/python3.8/site-packages/azure/mgmt/resource/resources/v2021_04_01/operations/_deployments_operations.py", line 3008, in create_or_update
    raw_result = self._create_or_update_initial(
  File "/anaconda/lib/python3.8/site-packages/azure/mgmt/resource/resources/v2021_04_01/operations/_deployments_operations.py", line 2964, in _create_or_update_initial
    raise exp
msrestazure.azure_exceptions.CloudError: Azure Error: MultipleErrorsOccurred
Message: Multiple error occurred: BadRequest,BadRequest. Please see details.
Exception Details:
      Error Code: InvalidTemplateDeployment
      Message: The template deployment failed with error: 'The resource with id: '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a0' failed validation with message: 'The requested size for resource '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a0' is currently not available in location 'westus' zones '' for subscription '42158598-412c-477a-a552-c34f8e8debde'. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.'.'.
      Error Code: InvalidTemplateDeployment
      Message: The template deployment failed with error: 'The resource with id: '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a1' failed validation with message: 'The requested size for resource '/subscriptions/42158598-412c-477a-a552-c34f8e8debde/resourceGroups/ray-cluster/providers/Microsoft.Compute/virtualMachines/ray-minimal-worker-c4da9c6a1' is currently not available in location 'westus' zones '' for subscription '42158598-412c-477a-a552-c34f8e8debde'. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.'.'.
^C
Shared connection to 168.62.205.103 closed.
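
For anyone triaging similar logs, here is a minimal sketch of how the "requested size ... is currently not available" validation failures could be picked out of the error text. The function name `find_unavailable_sizes` and the regular expression are my own illustration, not part of Ray or the Azure SDK; the pattern is based only on the wording of the message shown above and may not match other Azure error variants.

```python
import re

# Pattern based on the wording of the validation message in the traceback above.
# This is an assumption about the message format, not a documented Azure contract.
SKU_UNAVAILABLE = re.compile(
    r"The requested size for resource '(?P<resource>[^']+)' is currently "
    r"not available in location '(?P<location>[^']*)' zones '(?P<zones>[^']*)'"
)


def find_unavailable_sizes(error_text):
    """Return unique (vm_name, location) pairs for each size-unavailable error."""
    results = []
    for match in SKU_UNAVAILABLE.finditer(error_text):
        # The VM name is the last path segment of the ARM resource id.
        vm_name = match.group("resource").rsplit("/", 1)[-1]
        pair = (vm_name, match.group("location"))
        if pair not in results:
            results.append(pair)
    return results
```

Passing the text of the exception (for example `str(exc)`) to this helper would list the worker VMs whose requested size was rejected and the region involved, which makes it easier to see that every failure is a capacity or availability problem rather than a template bug.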

I later found out this was caused by the worker node section of the cluster config:

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

If I remove the "priority: Spot" line, Ray is able to get and launch worker nodes. So it looks like when Ray cannot get Spot instances with this line enabled, it does not fall back to regular instances either.
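
As a stopgap, the manual workaround above (removing the Spot setting) can be sketched in Python for a config that has already been loaded into a dict. The helper name `without_spot_priority` is hypothetical, and I am assuming the node types are passed in as a mapping shaped like the "ray.worker.default" entry shown above. This only edits the config; it does not make the autoscaler fall back to regular instances on its own.

```python
import copy


def without_spot_priority(node_types):
    """Return a deep copy of the node types with Spot settings removed.

    `node_types` is assumed to map node type names (e.g. "ray.worker.default")
    to dicts shaped like the YAML section above.
    """
    result = copy.deepcopy(node_types)
    for node_type in result.values():
        params = node_type.get("node_config", {}).get("azure_arm_parameters", {})
        if params.get("priority") == "Spot":
            # Dropping the key requests regular (non-Spot) VMs instead.
            del params["priority"]
            # maxPrice only applies to Spot, so drop it too if it was set.
            params.pop("billingProfile", None)
    return result


# Usage with a dict mirroring the worker section above:
node_types = {
    "ray.worker.default": {
        "min_workers": 0,
        "max_workers": 2,
        "resources": {"CPU": 2},
        "node_config": {
            "azure_arm_parameters": {
                "vmSize": "Standard_D2s_v3",
                "priority": "Spot",
            }
        },
    }
}
regular_only = without_spot_priority(node_types)
```

The original dict is left untouched, so both variants can be kept around and the Spot version restored once capacity is available again.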

CC'ing @DmitriGekhtman

Labels: @external-author-action-required, P2 (important issue, but not time-critical), azure, community-backlog, core, stability
