A common problem that data scientists face when training and deploying their machine learning models is the choice of the right type and size of hardware.
Moving machine learning workloads to the cloud has significantly simplified the data scientist's job: they just need to log in to the Azure portal or Azure ML studio and select, from a wide range of available resources with different sizes, capabilities and costs, the most suitable one for their use case. On the other hand, this long list of options can intimidate even experienced users. At the same time, GPU (Graphics Processing Unit) acceleration is a new and rapidly evolving field, and there is no true one-size-fits-all guidance for this product area. The goal of this blog post is to provide some guidelines to help with this non-trivial task.
GPUs are ideal for compute- and graphics-intensive workloads, suiting scenarios like high-end remote visualization, deep learning, and predictive analytics. The N-series is a family of Azure Virtual Machines with GPU capabilities, that is, specialized virtual machines available with single, multiple, or fractional GPUs.
The main factors which influence the choice of compute resource type to use for a job are the following:
3. Location: Virtual machine availability may differ across Azure regions. Also, if data is required to remain within a certain region, this can restrict the choice of VM. Resource availability per region can be explored using the VMs selector.
4. GPU memory size: Deep learning models benefit from the right selection of GPU memory size. The choice of GPU memory size is driven by the memory requirements of the model being trained (e.g. size of the dataset and number of parameters).
5. Use-case scenario:
The NC-series (powered by NVIDIA K80 GPUs), NCv2-series (powered by NVIDIA Tesla P100) - that will be retired by August 2023 - and the newest NCv3-series (powered by NVIDIA Tesla V100) are used for machine-learning and high-performance computing workloads (reservoir modeling, DNA sequencing, protein analysis, Monte Carlo simulations). The NC-series is also a popular choice for developers and students learning about, developing for, or experimenting with GPU acceleration.
The ND-series (powered by NVIDIA Tesla P40) - that will be retired by August 2023 - and the newest NDv2-series (powered by 8 NVIDIA Tesla V100 NVLINK-connected GPUs) and ND A100 v4-series (powered by 8 NVIDIA Ampere A100 Tensor Core GPUs) are designed for training and inferencing scenarios for deep learning.
The NV-series (powered by NVIDIA Tesla M60) - that will be retired by August 2023 - and the newest NVv3-series (powered by NVIDIA Tesla M60), NVv4-series (powered by AMD Radeon Instinct MI25) and NCasT4_v3-series (powered by NVIDIA Tesla T4) are used for graphics-intensive applications (e.g. streaming, gaming, encoding) and/or remote visualization workloads. In particular, the NCasT4_v3-series is currently the most performant GPU SKU in Azure for a game development workstation.
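To make the GPU memory factor above more concrete, here is a rough sketch of how the number of parameters translates into a lower bound on GPU memory. It assumes 32-bit weights and gradients and an Adam-style optimizer that keeps two extra values per parameter. It deliberately ignores activations, which depend on batch size and architecture and often dominate. The function name and its defaults are hypothetical, chosen only for illustration.

```python
def estimate_training_memory_gib(num_params, bytes_per_param=4, optimizer_states=2):
    """Lower-bound GPU memory (GiB) for weights, gradients and optimizer state.

    Assumptions (not from the article): fp32 values (4 bytes each) and an
    Adam-style optimizer with two state tensors per parameter.
    Activation memory is NOT included.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    return num_params * bytes_per_param * copies / 1024**3


if __name__ == "__main__":
    for label, n in [("100M parameters", 100_000_000),
                     ("1B parameters", 1_000_000_000)]:
        gib = estimate_training_memory_gib(n)
        print(f"{label}: at least {gib:.1f} GiB before activations")
```

An estimate like this only helps to rule out GPUs that are clearly too small; the actual footprint should still be measured by running the workload.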
A good practice to find the optimal compute configuration is to run the workload, monitor the results and scale up as needed. There is a wide range of command-line tools available to monitor how your GPU compute is performing. One of the most common is the NVIDIA System Management Interface (nvidia-smi), which can run at a defined interval.
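As an illustration of this monitoring step, the sketch below polls nvidia-smi from Python at a fixed interval and prints utilization and memory use per GPU. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the helper names, the chosen fields and the 5-second interval are assumptions made for this example. For a quick check directly in a terminal, `nvidia-smi -l 5` refreshes the standard report every 5 seconds.

```python
import shutil
import subprocess
import time

# Fields requested from nvidia-smi (an illustrative selection).
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(value):
    """Return an int, or None when the driver reports e.g. '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, name, util, used, total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "name": name,
            "utilization_pct": _to_int(util),
            "memory_used_mib": _to_int(used),
            "memory_total_mib": _to_int(total),
        })
    return stats


def query_gpus():
    """Run nvidia-smi once and return the parsed statistics."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on a GPU VM with NVIDIA drivers.")
    else:
        for _ in range(3):  # take three samples, 5 seconds apart
            for gpu in query_gpus():
                print(f"GPU {gpu['index']} ({gpu['name']}): "
                      f"{gpu['utilization_pct']}% util, "
                      f"{gpu['memory_used_mib']}/{gpu['memory_total_mib']} MiB")
            time.sleep(5)
```

If memory use stays close to the total, or utilization stays low while the job is slow, that is a hint to try a different VM size.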
If you aren't looking for a deep dive into how to choose a GPU-optimized VM and just want a quick overview of what Azure offers, here is a summary of the VM/GPU types and sizes you can choose from for your workload.
Preferred scenarios: machine-learning and high-performance computing workloads that require a low absolute cost per GPU-hour.
NC-series: NVIDIA Tesla K80
NCv2-series: NVIDIA Tesla P100
NCv3-series: NVIDIA Tesla V100
NC A100 v4-series: NVIDIA A100 PCIe
NCasT4_v3-series: NVIDIA Tesla T4
Preferred scenarios: training and inferencing scenarios for deep learning.
ND-series: NVIDIA Tesla P40
NDv2-series: NVIDIA Tesla V100
ND A100 v4-series: NVIDIA Ampere A100
Preferred scenarios: remote visualization workloads and other graphics-intensive applications.