Training deep learning models at scale in Azure

Former Employee

May 19, 2020

This post was co-authored by @Chris Lauren , @Ian Finder , @David_Aronchick , Maxim Lukiyanov, Gopi Kumar

Microsoft, like much of the tech industry, has adopted deep learning to power features across our business and accelerate the value we can provide for users. We use deep learning models to improve user productivity and provide innovative experiences in Office, power code autocompletion in Visual Studio, improve search results in Bing, predictively optimize availability in Azure compute among many other scenarios across the company.

Part of what allowed us to move quickly to embrace this world changing technology was investing in the infrastructure and services necessary to increase our data scientists and machine learning engineers’ productivity in a cloud-first, highly scalable manner. The combination of our investments, tooling and experience has allowed us to make big advancements in the efficiency of our teams. When we launched Azure Machine Learning to GA 18 months ago, our goal was to bring our productive and powerful environment to users of all skill levels who want to leverage machine learning and accelerate the practice of data science.

Today, any ML developer can use Azure to experiment with innovative models, run distributed model training and deploy trained models quickly to infuse more intelligence into their business applications. The same machine learning platform, tools and powerful compute are part of Microsoft’s AI at Scale initiative that is enabling the next generation of AI capabilities

Let’s take a deeper look at the Azure AI infrastructure and the Azure Machine Learning service which together comprise the machine learning platform we use to power our AI innovations across Bing, Office, Teams and more.

How Microsoft uses AI at global scale to increase our user's productivity

One example of how Microsoft uses the Azure AI infrastructure and Azure Machine Learning service is the new “suggested replies” feature in Outlook. When you receive an email that can be answered with a quick response, Outlook suggests three responses that you can use to reply quickly, reducing the time and effort involved in replying to an email. This feature is powered by large scale deep learning natural language processing (NLP) model trained in Azure on powerful GPUs using Azure Machine Learning.

The Microsoft Outlook "Suggested Replies" feature uses Azure Machine Learning to train deep learning models at scale

The Outlook team uses Azure Machine Learning pipelines to process their data and train their models on a recurring basis in a repeatable manner. During the model training, the team uses GPU pools available in Azure. Once the model is created, data scientists can compare the model performance with previous models and evaluate which approaches perform better at recommending relevant suggested replies. Additionally, by using accelerated training with ONNX Runtime they were able to get up to an additional 45% improvement in training performance with minimal changes to their existing model training code. This increased the team's productivity even further by enabling more frequent machine learning experiments.

Training models at this scale and frequency would not be possible without the scalable Azure AI infrastructure and Azure Machine Learning service.

Powerful Azure AI infrastructure

Azure offers best-in-class infrastructure for AI workloads of all sizes, supporting GPU acceleration for popular frameworks like TensorFlow, PyTorch, and others from a single GPU, up to the flagship NDv2 VM offering eight 32 GB NVIDIA V100 GPUs with NVLink, as well as cluster-level 100 Gigabit InfiniBand EDR with out-of-box NCCL2 support to allow jobs to transparently harness the power of close to 1,000 GPUs concurrently.

Azure's virtualization technology harnesses features of the underlying hardware to offer nearly identical performance and architectural behavior to bare metal, along with the security and manageability benefits of virtual machines. Even at the driver level, Azure VMs employ standard NVIDIA device drivers, and the same Mellanox OFED InfiniBand RDMA drivers a customer might use on-premise, ensuring a rapid and seamless lift for existing AI workloads to begin leveraging Azure.

Going forward, Azure will continue to invest in the promise of distributed training and interconnects with the latest technologies to deliver higher bandwidth and scale to larger clusters of future GPU products, like NVIDIA’s new A100, with the same commitment to standard communication methods used by workloads such as NCCL2.

Azure Machine Learning service

Azure Machine Learning service runs on top of the Azure AI Infrastructure and provides a complete solution to manage the end to end machine learning lifecycle: preparing data, building/training models, deploying the models to cloud or the edge, and monitoring models performance to determine whether to retrain them on new data to improve over time.

Machine learning lifecycle using Azure Machine Learning service

Training large scale deep learning models on a budget

Using the latest GPU hardware at scale can be expensive. However, Azure Machine Learning makes it easy to minimize infrastructure costs to meet AI development budgets. AML has cost management and budget controls which enable your teams to share AML compute clusters which provision powerful GPU and CPU VMs on demand to train large scale models and turns them off when they are not being used. Additionally, you can further control costs using role based access control and quota management.

Pre-built environments for machine learning frameworks

Data scientists spend a lot of time preparing environments which contain combinations of open source software libraries. These libraries are tested individually, but as data scientists create their own software environment (often in a docker container) they must test the different versions of these libraries on their own to resolve version conflicts. This is a time consuming process which does not directly accrue value to training great models.

Azure ML reduces this complexity by providing pre-built environments for popular machine learning frameworks, and their popular distributed training flavors. Among supported frameworks are standard PyTorch, TensorFlow, their native distributed training backends, popular distributed training framework Horovod, and variety of communications protocols such as MPI, NCCL or Gloo. Azure Machine Learning makes it easy to explore new distributed models or take existing model from research community and run it as is on Azure compute with minimum or no modifications.

Additionally, Azure Machine Learning makes it easy to collaborate with other data scientists on your team using our new preview integrated Jupyter Notebooks. This further increases data scientists productivity and leverages nteract to bring Notebooks right into the AML Studio UI.

Azure Machine Learning's new preview integrated Jupyter notebooks will soon offer collaboration capabilities

Scalable experimentation platform

Building new machine learning models requires iterative experimentation; sometimes it takes hundreds of iterations over the model design, algorithms and hyperparameters to achieve optimal performance. Some experiments that appear promising initially may yield poor results and researchers will have to step back and reassess results from the previous experiments.

As data scientists experiment in this way, it becomes increasingly important to share experiment results easily, reproduce any experiment reliably and collaborate with their team using a platform that is able to scale to handle large models that take days to train and produce GBs of output metrics per run.

Azure ML provides a fully managed, highly scalable machine learning platform. ML developers can organize their model training runs into experiments, track with each run all of the relevant parameters and metrics of the model, versions of training data used, source code, git commit, hyperparameters and more. With all the relevant data tracked by default, Azure ML makes it easy to compare the experiment runs and determine which produced the best model to deploy.

Get started with Azure Machine Learning for free today!

Learn more:

Updated May 20, 2020

Version 4.0

azure machine learning

machine learning

Chris Lauren

Former Employee

Joined September 25, 2018

View Profile

Microsoft Foundry Blog

Follow this blog board to get notified when there's new activity