Azure AI Supercomputer Delivers Record MLPerf Results

Microsoft

Dec 01, 2021

For customers seeking the most powerful computing for a range of AI workloads, from image classification to reinforcement learning, Microsoft Azure AI supercomputers are proving their value in published industry-standard benchmarks. In the latest (December 2021) MLPerf 1.1 results, Azure's debut submission placed #2 overall and #1 among cloud providers. Microsoft Azure’s publicly available AI Supercomputing capabilities are led by the new NDm A100 v4 series virtual machines (VMs), powered by NVIDIA A100 Tensor Core GPUs with 80 GB of HBM2e memory and NVIDIA Quantum InfiniBand high-performance networking. The results showcase Azure’s commitment to raising the bar in scale and performance for AI training using cloud computing.

Benchmark results

Trained the BERT Large natural language processing model in approximately 25 seconds on 2,048 GPUs
Processed up to 3.8M images per second with ResNet-50 v1.5 image classification on 2,048 GPUs
Completed the Minigo (reinforcement learning) benchmark in under 17.5 minutes with 1,792 GPUs
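
To put these figures in per-GPU and per-VM terms, the arithmetic can be sketched in a few lines of Python. This is a rough back-of-the-envelope calculation, not part of the MLPerf submission. The function names are illustrative, and the figure of 8 GPUs per VM is an assumption that is not stated in this post.

```python
# Figures quoted in the benchmark results above.
RESNET_IMAGES_PER_SEC = 3.8e6  # "up to 3.8M images per second"
RESNET_GPUS = 2048
MINIGO_GPUS = 1792

# Assumption (not stated in this post): each VM exposes 8 GPUs.
GPUS_PER_VM = 8


def per_gpu_throughput(total_per_sec: float, gpus: int) -> float:
    """Average throughput per GPU, assuming work is spread evenly."""
    return total_per_sec / gpus


def vms_needed(gpus: int, gpus_per_vm: int = GPUS_PER_VM) -> int:
    """Number of VMs required to supply `gpus` GPUs (ceiling division)."""
    return -(-gpus // gpus_per_vm)


if __name__ == "__main__":
    rate = per_gpu_throughput(RESNET_IMAGES_PER_SEC, RESNET_GPUS)
    print(f"ResNet-50: about {rate:.0f} images/sec per GPU")
    print(f"2,048 GPUs -> {vms_needed(RESNET_GPUS)} VMs")
    print(f"1,792 GPUs -> {vms_needed(MINIGO_GPUS)} VMs")
```

Under the 8-GPU assumption, the 2,048-GPU runs correspond to 256 VMs and the 1,792-GPU Minigo run to 224 VMs.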

The results were generated using Azure CycleCloud to orchestrate a virtual cluster of more than 256 InfiniBand-connected VMs built on the larger underlying AI supercomputer. The Slurm scheduler, configured with NVIDIA Pyxis and Enroot, was used to schedule the GPU-optimized AI software containers from the NVIDIA NGC catalog. This configuration enabled the team to stand up the environment quickly and run the benchmarks with industry-leading performance and scalability. For more information on the setup, see cc-slurm-ngc.
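
As a rough illustration of how a containerized job is launched through Slurm with Pyxis, the sketch below assembles an `srun` command line in Python. It is not the command used for the MLPerf submission. The `--container-image` and `--container-mounts` options are provided by the Pyxis plugin. The image tag, mount paths, training script name, and the `build_srun_command` helper are hypothetical placeholders, and the node and GPU counts are assumptions.

```python
import shlex
from typing import List, Sequence


def build_srun_command(
    nodes: int,
    gpus_per_node: int,
    image: str,
    mounts: Sequence[str],
    command: Sequence[str],
) -> List[str]:
    """Assemble an srun invocation that runs `command` inside a container.

    Hypothetical helper: it only builds the argument list and does not
    submit anything to Slurm.
    """
    args = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks-per-node={gpus_per_node}",  # one task per GPU
        f"--gpus-per-node={gpus_per_node}",
        f"--container-image={image}",          # Pyxis option
    ]
    if mounts:
        # Pyxis option: comma-separated SRC:DST pairs.
        args.append("--container-mounts=" + ",".join(mounts))
    args.extend(command)
    return args


if __name__ == "__main__":
    cmd = build_srun_command(
        nodes=256,                                 # assumed node count
        gpus_per_node=8,                           # assumed GPUs per VM
        image="nvcr.io#nvidia/pytorch:21.09-py3",  # illustrative NGC image tag
        mounts=["/shared/data:/data"],             # hypothetical paths
        command=["python", "train.py"],            # hypothetical script
    )
    print(shlex.join(cmd))
```

Building the argument list separately from submitting it makes the command easy to inspect or log before it is handed to the scheduler.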

The high-memory NDm A100 v4 series delivers AI Supercomputing power to the broadest audience of users, offering ease of access combined with the agility to flex scale, technologies, and costs as needed. Innovative AI users are deploying Azure 40 GB ND A100 v4 VMs and 80 GB NDm A100 v4 VMs to meet their large-scale production AI and machine learning needs, unlocking competitive advantage with leading performance and scalability results. To learn more about Azure’s new NDm A100 v4 super-clusters of virtual machines, see our launch blog, published on November 15, 2021.

More about MLPerf

MLPerf is a consortium of AI leaders from academia, research labs, and industry whose mission is to “build fair and useful benchmarks” that provide unbiased evaluations of training and inference performance for hardware, software, and services—all conducted under prescribed conditions. To stay on the cutting edge of industry trends, MLPerf continues to evolve, holding new tests at regular intervals and adding new workloads that represent the state of the art in AI.

MLPerf’s tests are transparent and objective, so users can rely on the results to make informed buying decisions. The industry benchmarking group, formed in May 2018, is backed by dozens of industry leaders. The benchmark suite covers both training and inference, and it is increasingly the key test that hardware and software vendors use to demonstrate performance.

Updated Oct 25, 2022

hpc

RachelPruitt

Azure High Performance Computing (HPC) Blog