How to choose the best GPU optimized VM sizes for your project on Azure

Published Aug 04 2022 04:34 AM
Microsoft

A common problem that data scientists face when training and deploying their machine learning models is the choice of the right type and size of hardware.

Migrating machine learning tasks to the cloud has significantly simplified the data scientist's job: they just need to log in to the Azure portal or Azure ML studio and select, from a wide range of available resources with different sizes, capabilities and costs, the one most suitable for their use case. On the other hand, this long list of options can intimidate even experienced users. At the same time, GPU (Graphics Processing Unit) acceleration is a new and rapidly evolving field, and there is no true one-size-fits-all guidance for this product area. The goal of this blog post is to provide some guidelines that can help with this non-trivial task.

 

GPUs are ideal for compute- and graphics-intensive workloads and suit scenarios like high-end remote visualization, deep learning, and predictive analytics. The N-series is a family of Azure Virtual Machines with GPU capabilities: specialized virtual machines available with single, multiple, or fractional GPUs.

 

The main factors which influence the choice of compute resource type to use for a job are the following:

 

1. Performance: Certain frameworks and algorithms used to train a model may complete the training task in less time on more powerful hardware, but this comes at a higher cost. GPU performance varies per workload, but a quick overview can be found on NVIDIA's website.

 

2. Cost: Depending on requirements, a user may prefer a VM size that is either more cost-effective or better performing. If cost saving is a major requirement, reserved instances for virtual machines or low-priority virtual machines are worth exploring.

 

3. Location: Virtual machine availability may differ across Azure regions. Also, if data is required to remain within a certain region, this can affect the choice of VM. Resource availability per region can be explored using the VMs selector.

 

4. GPU memory size: Deep learning models need enough GPU memory to hold the model and the data being processed. The right GPU memory size depends on the memory requirements of the model to train (e.g. size of the dataset and number of parameters).

 

5. Use-case scenario:

 

    • The NC-series (powered by NVIDIA Tesla K80) and the NCv2-series (powered by NVIDIA Tesla P100), both of which will be retired by August 2023, and the newer NCv3-series (powered by NVIDIA Tesla V100) are used for machine-learning and high-performance computing workloads (reservoir modeling, DNA sequencing, protein analysis, Monte Carlo simulations). The NC-series is also a popular choice for developers and students learning about, developing for, or experimenting with GPU acceleration.

 

    • The ND-series (powered by NVIDIA Tesla P40), which will be retired by August 2023, and the newer NDv2-series (powered by 8 NVLink-connected NVIDIA Tesla V100 GPUs) and ND A100 v4-series (powered by 8 NVIDIA Ampere A100 Tensor Core GPUs) are designed for deep learning training and inference scenarios.

 

    • The NV-series (powered by NVIDIA Tesla M60), which will be retired by August 2023, and the newer NVv3-series (powered by NVIDIA Tesla M60), NVv4-series (powered by AMD Radeon Instinct MI25) and NCasT4_v3-series (powered by NVIDIA Tesla T4) are used for graphics-intensive applications (e.g. streaming, gaming, encoding) and/or remote visualization workloads. In particular, the NCasT4_v3-series currently offers the most performant GPU SKUs in Azure for a game development workstation.
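Factor 4 in the list above (GPU memory size) can be made a little more concrete with a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration, not an official sizing method: it assumes 32-bit values, one gradient per parameter, and two optimizer state values per parameter (as with Adam-style optimizers). The function names and the `activation_overhead` factor are hypothetical, and real usage also depends on batch size, framework and model architecture.

```python
# Rough estimate of GPU memory needed for training.
# All defaults are illustrative assumptions, not measured values.

BYTES_PER_GIB = 1024 ** 3


def estimate_training_memory_gib(num_parameters,
                                 bytes_per_value=4,
                                 optimizer_states_per_param=2,
                                 activation_overhead=0.0):
    """Estimate memory (GiB) for weights, gradients and optimizer state.

    activation_overhead is a hypothetical fraction added on top to
    account for activations and temporary buffers (0.5 means +50%).
    """
    # One copy of the weights, one of the gradients, plus optimizer state.
    copies_per_param = 1 + 1 + optimizer_states_per_param
    static_bytes = num_parameters * bytes_per_value * copies_per_param
    return static_bytes * (1.0 + activation_overhead) / BYTES_PER_GIB


def fits_on_gpu(num_parameters, gpu_memory_gib, **kwargs):
    """Return True if the rough estimate fits in the given GPU memory."""
    return estimate_training_memory_gib(num_parameters, **kwargs) <= gpu_memory_gib
```

Under these assumptions, a model with one billion parameters needs about 14.9 GiB before activations are counted, so `fits_on_gpu(1_000_000_000, 16)` is true only while `activation_overhead` stays small. Treat the result as a starting point and confirm it by monitoring a real run.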

A good practice for finding the optimal compute configuration is to run the workload, monitor the results and scale up as needed. There is a wide range of command-line tools available to monitor how your GPU compute is performing. One of the most common is the NVIDIA System Management Interface (nvidia-smi), which can report GPU statistics at a defined interval.
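As a minimal sketch of the monitoring step described above, the following Python snippet polls nvidia-smi at a fixed interval and prints utilization and memory use per GPU. It assumes nvidia-smi is installed and on the PATH of the VM; the helper names (`parse_gpu_stats`, `read_gpu_stats`, `monitor`) and the chosen query fields are illustrative choices, not part of any Azure tooling.

```python
import subprocess
import time

# Fields requested from nvidia-smi via its --query-gpu option.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        try:
            utilization, used, total = (float(value) for value in values)
        except ValueError:
            continue  # skip GPUs that report a non-numeric value
        stats.append({
            "utilization_pct": utilization,
            "memory_used_mib": used,
            "memory_total_mib": total,
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


def monitor(interval_seconds=5, samples=3):
    """Print GPU statistics every `interval_seconds`, `samples` times."""
    for _ in range(samples):
        for index, gpu in enumerate(read_gpu_stats()):
            print(f"GPU {index}: {gpu['utilization_pct']:.0f}% util, "
                  f"{gpu['memory_used_mib']:.0f}/"
                  f"{gpu['memory_total_mib']:.0f} MiB")
        time.sleep(interval_seconds)
```

Calling `monitor(interval_seconds=5)` while a training job runs gives a quick view of whether the GPU is saturated or mostly idle. From a shell, `nvidia-smi -l 5` refreshes the standard report every five seconds without any scripting.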

 

TL;DR

If you aren't looking for a deep dive into how to choose a GPU optimized VM and just want a quick overview of what Azure offers, here is a summary of the VM and GPU types you can choose from for your workload.

 

NC-series

Preferred scenarios: machine-learning and high-performance computing workloads that require a low absolute cost per GPU-hour.

 

| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| NC-series | NVIDIA Tesla K80 | 31/08/2023 |
| NCv2-series | NVIDIA Tesla P100 | 31/08/2023 |
| NCv3-series | NVIDIA Tesla V100 | - |
| NC A100 v4-series | NVIDIA A100 PCIe | - |
| NCasT4_v3-series | NVIDIA Tesla T4 | - |

 

ND-series

Preferred scenarios: training and inferencing scenarios for deep learning.

 

| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| ND-series | NVIDIA Tesla P40 | 31/08/2023 |
| NDv2-series | NVIDIA Tesla V100 | - |
| ND A100 v4-series | NVIDIA Ampere A100 | - |

 

 

NV-series

Preferred scenarios: remote visualization workloads and other graphics-intensive applications.

 

| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| NV-series | NVIDIA Tesla M60 | 31/08/2023 |
| NVv3-series | NVIDIA Tesla M60 | - |
| NVv4-series | AMD Radeon Instinct MI25 | - |
| NVadsA10 v5-series | NVIDIA A10 | - |
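For readers who want to script a first shortlist, the three summaries above can be captured in a small lookup structure. The sketch below simply transcribes the series names, GPU types and deprecation dates listed in this post; the variable and function names are hypothetical, and the data reflects the post at the time of writing, so the current Azure documentation should be checked before relying on it.

```python
from datetime import date

# Deprecation date listed in this post for the older series.
RETIREMENT = date(2023, 8, 31)

# (family, series, GPU type, deprecation date or None), as summarized above.
VM_SERIES = [
    ("NC", "NC-series", "NVIDIA Tesla K80", RETIREMENT),
    ("NC", "NCv2-series", "NVIDIA Tesla P100", RETIREMENT),
    ("NC", "NCv3-series", "NVIDIA Tesla V100", None),
    ("NC", "NC A100 v4-series", "NVIDIA A100 PCIe", None),
    ("NC", "NCasT4_v3-series", "NVIDIA Tesla T4", None),
    ("ND", "ND-series", "NVIDIA Tesla P40", RETIREMENT),
    ("ND", "NDv2-series", "NVIDIA Tesla V100", None),
    ("ND", "ND A100 v4-series", "NVIDIA Ampere A100", None),
    ("NV", "NV-series", "NVIDIA Tesla M60", RETIREMENT),
    ("NV", "NVv3-series", "NVIDIA Tesla M60", None),
    ("NV", "NVv4-series", "AMD Radeon Instinct MI25", None),
    ("NV", "NVadsA10 v5-series", "NVIDIA A10", None),
]


def available_series(family, on_date):
    """Return the series of a family that are not yet deprecated on on_date."""
    return [
        series
        for fam, series, _gpu, retires in VM_SERIES
        if fam == family and (retires is None or on_date <= retires)
    ]
```

For example, `available_series("ND", date(2023, 9, 1))` leaves out the ND-series, since its listed deprecation date has passed by then.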

 

Last update: Aug 04 2022 02:36 AM