A common problem that data scientists face when training and deploying their machine learning models is the choice of the right type and size of hardware.
Migrating machine learning tasks to the cloud has significantly simplified the data scientist's job: they just need to log in to the Azure portal or Azure ML studio and select, from a wide range of available resources with different sizes, capabilities, and costs, the one most suitable for their use case. On the other hand, this long list of options can intimidate even experienced users. At the same time, GPU (Graphics Processing Unit) acceleration is a new and rapidly evolving field, and there is no true one-size-fits-all guidance for this product area. The goal of this blog post is to provide some guidelines that can help with this non-trivial task.
GPUs are ideal for compute-intensive and graphics-intensive workloads, suiting scenarios like high-end remote visualization, deep learning, and predictive analytics. The N-series is a family of Azure Virtual Machines with GPU capabilities, that is, specialized virtual machines available with a single GPU, multiple GPUs, or a fraction of a GPU.
The main factors which influence the choice of compute resource type to use for a job are the following:
1. Performance: Some frameworks and algorithms finish a training task in less time on more powerful GPUs, but that hardware comes at a higher cost. GPU performance varies per workload, but a quick overview can be found on NVIDIA's website.
2. Cost: Depending on requirements, a user may prefer a VM that is either more cost-effective or better performing. If cost saving is a major requirement, reserved instances for virtual machines or low-priority virtual machines can be explored as solutions.
3. Location: Virtual machine availability may differ across Azure regions. Also, if data is required to remain within a certain region, this can constrain the choice of VM. Resource availability per region can be explored using the VMs selector.
4. GPU memory size: Deep learning models benefit from the right selection of GPU memory size. The choice is driven by the memory requirements of the model to be trained (e.g., the size of the dataset and the number of parameters).
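5. Use-case scenario: Each VM series targets particular workloads; the preferred scenarios per series are listed in the TL;DR section below.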
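To make the GPU memory factor (item 4) more concrete, here is a back-of-the-envelope sketch in Python. It assumes 32-bit training with an Adam-style optimizer, so each parameter is stored as a weight, a gradient, and two optimizer states. Activations and the data batches are not counted, so the real requirement will be higher. The function names and the 20% headroom are illustrative assumptions, not part of any Azure or NVIDIA tool.

```python
def estimate_training_memory_gib(num_params, bytes_per_param=4, optimizer_states=2):
    """Rough lower bound on GPU memory needed to train a model.

    Assumes one copy each for weights and gradients, plus `optimizer_states`
    extra copies (2 for Adam-style optimizers). Activations are NOT included.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    total_bytes = num_params * bytes_per_param * copies
    return total_bytes / 1024**3


def fits_on_gpu(num_params, gpu_memory_gib, headroom=0.8):
    """Check the estimate against a GPU, keeping some memory free for activations.

    `headroom` (hypothetical default: use at most 80% of the card) is an
    arbitrary safety margin chosen for this sketch.
    """
    return estimate_training_memory_gib(num_params) <= gpu_memory_gib * headroom


# Example: a model with one billion parameters.
needed = estimate_training_memory_gib(1_000_000_000)
print(f"Estimated minimum: {needed:.1f} GiB")
print("Fits on a 16 GiB GPU:", fits_on_gpu(1_000_000_000, 16))
print("Fits on a 40 GiB GPU:", fits_on_gpu(1_000_000_000, 40))
```

Under these assumptions, a one-billion-parameter model needs roughly 15 GiB before activations are counted, which is why it is worth checking the GPU memory of a VM size before committing to it.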
A good practice for finding the optimal compute configuration is to run the workload, monitor the results, and scale up as needed. There is a wide range of command-line tools available to monitor how your GPU compute is performing. One of the most common is the NVIDIA System Management Interface (nvidia-smi), which can be run at a defined interval.
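As an illustration of interval-based monitoring, the sketch below wraps nvidia-smi in a small Python helper. It relies on the `--query-gpu`, `--format=csv,noheader,nounits`, and `-l` (loop every N seconds) options of nvidia-smi. The helper names (`build_command`, `parse_sample`, `monitor`) and the choice of fields are assumptions made for this example. It also assumes the driver reports numeric values for every field; some GPUs return `[N/A]`, which this sketch does not handle.

```python
import itertools
import subprocess

# Fields requested from nvidia-smi; with "nounits", utilization is in percent
# and memory values are in MiB.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def build_command(interval_s=5):
    """Build an nvidia-smi command that prints one CSV line per GPU every interval."""
    return [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]


def parse_sample(line):
    """Turn one CSV line such as '35, 2048, 16384' into a dictionary."""
    util, used, total = (float(value.strip()) for value in line.split(","))
    return {
        "gpu_util_pct": util,
        "mem_used_mib": used,
        "mem_total_mib": total,
        "mem_used_pct": 100.0 * used / total,
    }


def monitor(interval_s=5, max_lines=10):
    """Print up to `max_lines` samples, then stop nvidia-smi.

    On a multi-GPU VM each interval produces one line per GPU.
    """
    proc = subprocess.Popen(build_command(interval_s),
                            stdout=subprocess.PIPE, text=True)
    try:
        for line in itertools.islice(proc.stdout, max_lines):
            print(parse_sample(line))
    finally:
        proc.terminate()
```

On a VM with the NVIDIA driver installed, calling `monitor(interval_s=5)` while the training job runs prints utilization and memory usage every five seconds. Consistently low utilization suggests the VM is oversized, while memory usage close to the total suggests scaling up.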
TL;DR
If you aren't looking for a deep dive into how to choose a GPU-optimized VM and just want a quick overview of the Azure offering, here is a summary of the VM/GPU types and sizes you can choose from for your workload.
NC-series
Preferred scenarios: machine learning and high-performance computing workloads that require a low absolute cost per GPU-hour.
| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| NC-series | NVIDIA Tesla K80 | 31/08/2023 |
| NCv2-series | NVIDIA Tesla P100 | 31/08/2023 |
| NCv3-series | NVIDIA Tesla V100 | - |
| NC A100 v4-series | NVIDIA A100 PCIe | - |
| NCasT4_v3-series | NVIDIA Tesla T4 | - |
ND-series
Preferred scenarios: training and inferencing scenarios for deep learning.
| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| ND-series | NVIDIA Tesla P40 | 31/08/2023 |
| NDv2-series | NVIDIA Tesla V100 | - |
| ND A100 v4-series | NVIDIA Ampere A100 | - |
NV-series
Preferred scenarios: remote visualization workloads and other graphics-intensive applications.
| VM Type | GPU type | Deprecation date |
| --- | --- | --- |
| NV-series | NVIDIA Tesla M60 | 31/08/2023 |
| NVv3-series | NVIDIA Tesla M60 | - |
| NVv4-series | AMD Radeon Instinct MI25 | - |
| NVadsA10 v5-series | NVIDIA A10 | - |
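For readers who want to script their selection, the summary above can also be captured as a small lookup structure. The sketch below copies the series names, GPU types, and deprecation dates from the three tables. The layout of the data and the `available_series` helper are assumptions made for illustration and do not come from an Azure SDK.

```python
from datetime import date

RETIRED_2023 = date(2023, 8, 31)  # the 31/08/2023 deprecation date in the tables

# family -> list of (VM series, GPU type, deprecation date or None)
GPU_VM_SERIES = {
    "NC": [
        ("NC-series", "NVIDIA Tesla K80", RETIRED_2023),
        ("NCv2-series", "NVIDIA Tesla P100", RETIRED_2023),
        ("NCv3-series", "NVIDIA Tesla V100", None),
        ("NC A100 v4-series", "NVIDIA A100 PCIe", None),
        ("NCasT4_v3-series", "NVIDIA Tesla T4", None),
    ],
    "ND": [
        ("ND-series", "NVIDIA Tesla P40", RETIRED_2023),
        ("NDv2-series", "NVIDIA Tesla V100", None),
        ("ND A100 v4-series", "NVIDIA Ampere A100", None),
    ],
    "NV": [
        ("NV-series", "NVIDIA Tesla M60", RETIRED_2023),
        ("NVv3-series", "NVIDIA Tesla M60", None),
        ("NVv4-series", "AMD Radeon Instinct MI25", None),
        ("NVadsA10 v5-series", "NVIDIA A10", None),
    ],
}


def available_series(family, on):
    """Return the series of a family that are not past their deprecation date on `on`."""
    return [
        series
        for series, _gpu, retires in GPU_VM_SERIES[family]
        if retires is None or on <= retires
    ]


# Example: which NC options remain after the 2023 deprecations?
print(available_series("NC", date(2024, 1, 1)))
```

Such a table is only a snapshot: it should be checked against the current Azure documentation before it drives any deployment decision.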