A common problem that data scientists face when training and deploying their machine learning models is the choice of the right type and size of hardware.
Migrating machine learning tasks to the cloud has significantly simplified the data scientist's job: they just need to log in to the Azure portal or Azure ML studio and select, from a wide range of available resources with different sizes, capabilities, and costs, the most suitable one for their use case. On the other hand, this long list of options can sometimes intimidate even experienced users. At the same time, GPU (Graphics Processing Unit) acceleration is a rapidly evolving field, and there is no true one-size-fits-all guidance for this product area. The goal of this blog post is to provide some guidelines to help with this non-trivial task.
GPUs are ideal for compute- and graphics-intensive workloads, suiting scenarios like high-end remote visualization, deep learning, and predictive analytics. The N-series is a family of Azure Virtual Machines with GPU capabilities: specialized virtual machines available with single, multiple, or fractional GPUs.
The main factors which influence the choice of compute resource type to use for a job are the following:
1. Performance: Certain frameworks and algorithms used to train a model may complete the training task in less time but at a higher cost. GPU performance varies per workload, but a quick overview can be found on NVIDIA's website.
2. Cost: Depending on requirements, a user may prefer a configuration that is either more cost-effective or more performant. If cost saving is a major requirement, reserved instances for virtual machines or low-priority virtual machines can be explored as solutions.
3. Location: Virtual machine availability may differ across Azure regions. Also, if data is required to remain within a certain region, this can constrain the choice. Resource availability per region can be explored using the VMs selector.
4. GPU memory size: Deep learning workloads benefit from the right GPU memory size, which is driven by the memory requirements of the model to train (e.g., the size of the dataset and the number of parameters).
5. Use-case scenario:
- The NC-series (powered by NVIDIA Tesla K80 GPUs) and NCv2-series (powered by NVIDIA Tesla P100), both retiring by August 2023, and the newer NCv3-series (powered by NVIDIA Tesla V100) are used for machine learning and high-performance computing workloads (reservoir modeling, DNA sequencing, protein analysis, Monte Carlo simulations). The NC-series is also a popular choice for developers and students learning about, developing for, or experimenting with GPU acceleration.
- The ND-series (powered by NVIDIA Tesla P40), retiring by August 2023, and the newer NDv2-series (powered by 8 NVLINK-connected NVIDIA Tesla V100 GPUs) and ND A100 v4-series (powered by 8 NVIDIA Ampere A100 Tensor Core GPUs) are designed for deep learning training and inference scenarios.
- The NV-series (powered by NVIDIA Tesla M60), retiring by August 2023, and the newer NVv3-series (powered by NVIDIA Tesla M60), NVv4-series (powered by AMD Radeon Instinct MI25), and NCasT4_v3-series (powered by NVIDIA Tesla T4) are used for graphics-intensive applications (e.g., streaming, gaming, encoding) and/or remote visualization workloads. In particular, the NCasT4_v3-series is currently the most performant GPU SKU in Azure for a game development workstation.
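To make the GPU memory factor above concrete, here is a back-of-envelope estimate of the memory needed to train a model. This is a rough sketch under stated assumptions (fp32 weights, gradients, two Adam optimizer state copies, and an illustrative activation overhead factor); it is not official Azure sizing guidance, and real requirements vary with batch size and architecture.

```python
def training_memory_gib(n_params, bytes_per_param=4, optimizer_copies=2,
                        activation_factor=1.5):
    """Rough estimate of GPU memory (GiB) needed to train a model.

    Assumes fp32 (4 bytes/parameter) for weights and gradients, plus
    `optimizer_copies` extra state tensors (2 for Adam's moments), then
    multiplies by an illustrative fudge factor for activations.
    """
    # weights + gradients + optimizer states, all sized like the parameters
    state_bytes = n_params * bytes_per_param * (2 + optimizer_copies)
    total_bytes = state_bytes * activation_factor
    return total_bytes / 2**30  # bytes -> GiB

# A hypothetical 350M-parameter model trained in fp32 with Adam:
print(f"{training_memory_gib(350e6):.1f} GiB")  # -> 7.8 GiB
```

An estimate like this helps narrow the SKU choice: a model that needs ~8 GiB fits comfortably on a 16 GiB Tesla V100 or T4, while larger models may require an A100 or multi-GPU SKU.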
A good practice for finding the optimal compute configuration is to run the workload, monitor the results, and scale up as needed. A wide range of command-line tools is available to monitor how your GPU compute is performing. One of the most common is the NVIDIA System Management Interface (nvidia-smi), which can be run at a defined interval.
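As a sketch of what that monitoring loop might look like, the snippet below parses the CSV output of `nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 5` (sampled every 5 seconds). The sample output is hardcoded for illustration; on an actual GPU VM you would capture it via `subprocess`.

```python
import csv
import io

# Illustrative sample of nvidia-smi CSV output; values are made up.
# On a GPU VM, capture real output with e.g.:
#   subprocess.run(["nvidia-smi", "--query-gpu=timestamp,utilization.gpu,"
#                   "memory.used,memory.total", "--format=csv"],
#                  capture_output=True, text=True).stdout
SAMPLE = """\
timestamp, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
2023/05/01 10:00:00.000, 87 %, 14200 MiB, 16160 MiB
2023/05/01 10:00:05.000, 91 %, 14600 MiB, 16160 MiB
"""

def parse_gpu_stats(text):
    """Parse nvidia-smi CSV output into a list of dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        rows.append({
            "timestamp": row["timestamp"],
            # values look like "87 %" or "14200 MiB"; keep the number only
            "gpu_util_pct": int(row["utilization.gpu [%]"].split()[0]),
            "mem_used_mib": int(row["memory.used [MiB]"].split()[0]),
            "mem_total_mib": int(row["memory.total [MiB]"].split()[0]),
        })
    return rows

stats = parse_gpu_stats(SAMPLE)
peak = max(r["mem_used_mib"] for r in stats)
print(f"samples={len(stats)} peak_mem_mib={peak}")
```

If peak memory usage sits near the card's total, that is a signal to scale up to a SKU with more GPU memory; sustained low utilization suggests the opposite.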
TL;DR
If you aren't looking for a deep dive into how to choose a GPU-optimized VM and just want a quick overview of the Azure offering, here is a summary of the VM/GPU types and sizes you can choose from for your workload.
NC-series
Preferred scenarios: machine-learning and high-performance computing workloads with low absolute cost-per-GPU-hour requirements.
| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| NC-series | NVIDIA Tesla K80 | 31/08/2023 |
| NCv2-series | NVIDIA Tesla P100 | 31/08/2023 |
| NCv3-series | NVIDIA Tesla V100 | - |
| NC A100 v4-series | NVIDIA A100 PCIe | - |
| NCasT4_v3-series | NVIDIA Tesla T4 | - |
ND-series
Preferred scenarios: training and inferencing scenarios for deep learning.
| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| ND-series | NVIDIA Tesla P40 | 31/08/2023 |
| NDv2-series | NVIDIA Tesla V100 | - |
| ND A100 v4-series | NVIDIA Ampere A100 | - |
NV-series
Preferred scenarios: remote visualization workloads and other graphics-intensive applications.
| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| NV-series | NVIDIA Tesla M60 | 31/08/2023 |
| NVv3-series | NVIDIA Tesla M60 | - |
| NVv4-series | AMD Radeon Instinct MI25 | - |
| NVadsA10 v5-series | NVIDIA A10 | - |