Microsoft Developer Community Blog

Deploying Ray on Azure for Scalable Distributed Reinforcement Learning

scgraham
Apr 22, 2020

Modern distributed applications, such as those used to train deep reinforcement learning models, need to scale up dynamically to meet computational speed and load requirements, and to scale down to control costs efficiently. In this post, we cover how to simplify the development and deployment of such distributed applications by using Ray on Azure.

 

What is Ray?

 

Ray is an open-source framework that lets you modify existing Python code to take advantage of remote, parallel execution. In addition, Ray simplifies the management of distributed compute by setting up a cluster and automatically scaling it based on the observed computational load. A recent contribution to Ray enables Azure to be used as the underlying compute infrastructure. Packages built on Ray, such as RLlib and Tune, use these scalable clusters to speed up the training of deep reinforcement learning models or hyperparameter tuning for machine learning models.
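
To make this concrete, here is a minimal sketch of what training a PPO agent with RLlib through Tune can look like on a running cluster. The environment, worker count, and stopping criterion are illustrative, and the exact config keys depend on the Ray version you install:

# minimal sketch: scale PPO training across the cluster with RLlib + Tune
# (assumes ray[rllib] and gym are installed; the values below are illustrative)
import ray
from ray import tune

ray.init(address="auto")  # connect to the running Ray cluster

tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",  # example Gym environment
        "num_workers": 3,      # rollout workers spread across the cluster
    },
    stop={"training_iteration": 10},
)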

 

Deploying Ray on Azure

There are currently two ways of provisioning Ray on Azure: the first uses a Deploy to Azure Button and the second uses Ray's built-in Azure autoscaler. We'll show you both approaches so you can decide which is right for your needs, and how to quickly set each one up so you can begin building distributed applications.

 

Method 1: Deploy to Azure Button

 

The easiest way to get started with Ray on Azure is to use the Deploy to Azure Button provided below (as well as in the Ray Autoscaling Documentation).

[Deploy to Azure button]

The Deploy to Azure Button uses an Azure Resource Manager (ARM) template to deploy the required resources on Azure, including a head node virtual machine and a virtual machine scale set (VMSS) for the worker nodes.

The virtual machines use the Azure Data Science Virtual Machine (DSVM) as the base image and are initialized with additional startup scripts for Ray. The Head Node Size and Worker Node Size options support multiple VM sizes, including GPU versions (see Linux virtual machines in Azure). By default, worker nodes use deeply discounted Azure Spot instances to reduce cluster costs.

 

A DSVM comes pre-installed with common data science packages and Python environments exposed through conda. Users can choose the base conda environment used for Ray (see the Conda Env option). Additional Python packages can also be supplied for installation on both the head and worker nodes.

 

Users can conveniently access the head node using SSH or JupyterLab (provided the Public Web UI option was selected). Managing the worker nodes through a VMSS enables auto-scaling based on several criteria managed through Azure (e.g., CPU load, memory consumption, remaining credits). The flexibility Azure provides for scaling the cluster is one of the primary differences between the two methods in this post.
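
Once you are on the head node, a quick way to confirm that scaled-out workers have actually joined the Ray cluster is to inspect it from Python. A minimal sketch, assuming Ray is already running on the cluster:

# minimal sketch: list the nodes and resources Ray currently sees
# (run from the head node of a running cluster)
import ray

ray.init(address="auto")        # attach to the running cluster
print(len(ray.nodes()))         # number of nodes that have joined
print(ray.cluster_resources())  # total CPUs/GPUs/memory across the cluster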

Method 2: Deploy using Ray Azure Autoscaler

 

The second method uses Ray's Azure autoscaler implementation to manage the cluster and is configured using YAML files found in the Ray repository. This approach is helpful if you want to modify configuration options that are not exposed in the Deploy to Azure form.

 

Start by installing the necessary Ray and Azure Python packages and configuring the Azure account to use:

# install Ray and the Azure CLI/SDK packages
pip install ray azure-cli azure-core

# authenticate with azure
az login

# set the subscription to use or modify the config yaml
az account set -s <YOUR_SUBSCRIPTION_ID>

Once the Azure command-line interface is configured to manage resources in your Azure account, you are ready to run the autoscaler. Copy the provided ray/python/ray/autoscaler/azure/example-full.yaml cluster configuration to your machine and edit the provider section to set the location and resource group name to use. As with the Deploy to Azure Button approach, this deploys a Standard_D2s_v3 head node by default and autoscales the cluster up to two Standard_D2s_v3 workers. All VMs use the DSVM image to provide a standard set of Python environments for your applications. These VM options can be adjusted in the head_node and worker_node sections, and the maximum size of the cluster is set using the max_workers property.
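
Normally you would edit the YAML file directly in a text editor. Purely to illustrate which keys are involved, here is a sketch that rewrites them programmatically; it assumes PyYAML is installed, the key names follow the example-full.yaml at the time of writing, and the location and resource group values are placeholders:

# illustrative sketch of the cluster config keys discussed above
# (key names follow example-full.yaml and may change between Ray versions)
import yaml  # assumes PyYAML is installed

with open("ray-example-cluster.yaml") as f:
    config = yaml.safe_load(f)

config["provider"]["location"] = "westus2"          # Azure region to deploy into
config["provider"]["resource_group"] = "my-ray-rg"  # placeholder resource group name
config["max_workers"] = 2                           # maximum number of worker nodes

with open("ray-example-cluster.yaml", "w") as f:
    yaml.safe_dump(config, f)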

 

The VS Code YAML extension provides IntelliSense for Ray cluster config files, which makes editing them more convenient. For this to work, the config filename needs to follow the pattern ray-*-cluster.yaml and the yaml.schemaStore.enable setting must be enabled for the extension.

Once the configuration file has been adjusted and saved, you're ready to deploy your cluster:

# create or update the cluster defined in the config file
ray up ./ray-example-cluster.yaml

Now check that you can connect to your head node and run a remote function:

# this will start a shell on the head node of the cluster
ray attach ./ray-example-cluster.yaml

# reload the shell to activate the conda environment
exec bash -l

# start python
python

>>> import ray
>>> ray.init(address="auto")
>>> @ray.remote
... def f(x):
...     return x * x
...
>>> futures = [f.remote(i) for i in range(4)]
>>> print(ray.get(futures))
[0, 1, 4, 9]

You can also open the Ray dashboard to view the cluster status:

# set up port forwarding to the dashboard running on the head node
ray dashboard ./ray-example-cluster.yaml

Then open http://localhost:8265 in your browser to view the Ray dashboard.

With either approach, you can quickly get started building distributed applications and training your own deep reinforcement learning agents with Ray on Azure. Of course, this is just the beginning of what Ray can do: a wide range of packages built on top of Ray support scalable Reinforcement Learning, Hyperparameter Tuning, Distributed Machine Learning, and more. Get started with instructions and links to tutorials in the Ray Documentation.

 

Thanks for all the help from marcozo, as well as Richard Liaw and others from the Ray community!

 

We can’t wait to see what you build with Ray on Azure!
