By Hugo Affaticati (Technical Program Manager), Sonal Doomra (Technical Program Manager 2), and Jon Shelley (Principal TPM Manager).
Introduction
Azure is pleased to showcase results from our MLPerf Training v3.0 submission. For this submission, we benchmarked our ND H100 v5 virtual machine (preview), which features:
- 8x NVIDIA H100 Tensor Core GPUs interconnected via next-generation NVSwitch and NVLink 4.0
- 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU with 3.2 Tb/s per VM in a non-blocking fat-tree network
- NVSwitch and NVLink 4.0 with 3.6 TB/s bisection bandwidth between 8 local GPUs within each VM
- 4th Gen Intel Xeon Scalable processors
- PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
- 16 channels of 4800 MHz DDR5 DIMMs
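The per-VM InfiniBand figure in the list above follows from the per-GPU links: eight links at 400 Gb/s each add up to 3.2 Tb/s. A minimal sketch of that arithmetic (the function name `ib_total_gbps` is illustrative only, not part of any Azure or NVIDIA tooling):

```shell
# Aggregate InfiniBand bandwidth per VM: number of links x per-link rate (Gb/s).
ib_total_gbps() {
    echo $(( $1 * $2 ))
}

total=$(ib_total_gbps 8 400)      # 8 GPUs, 400 Gb/s each
echo "${total} Gb/s per VM"       # 3200 Gb/s, i.e. 3.2 Tb/s
```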
Full results are available on the MLCommons website.
How to replicate the results in Azure
Prerequisites:
Deploy and set up an ND H100 v5 virtual machine on Azure using Azure Portal or Azure CycleCloud.
Set up the environment
First, download the container from NVIDIA NGC (an account is required). Then clone the publicly available MLCommons GitHub repository, which contains the Azure-specific code.
cd /share
docker pull nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet
git clone https://github.com/mlcommons/training_results_v3.0.git
cd /share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
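Before running the commands in this walkthrough, it can help to confirm that the tools they rely on are installed on the VM. The following is a minimal sketch, not part of the MLPerf submission; `require_cmds` is a hypothetical helper name, and the tool list is simply the set of commands used in this post.

```shell
# Report any required command-line tools that are not on PATH.
# Returns 0 if all are present, 1 if at least one is missing.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools used by the commands in this walkthrough.
require_cmds docker git wget tar || echo "install the missing tools before continuing"
```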
Get the dataset for ResNet
ResNet uses the ImageNet dataset from 2012 (ILSVRC2012). MLPerf Training v3.0 requires both the Training images (Task 1 & 2) and the Validation images (all tasks).
For the Training images:
mkdir /share/data && cd /share/data
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read -r NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
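The extraction loop above relies on the shell parameter expansion `${NAME%.tar}`, which strips a trailing `.tar` so that each per-class archive is unpacked into a directory named after it. A small illustration, using one archive name as an example (the helper `strip_tar_suffix` exists only for this demonstration):

```shell
# ${NAME%.tar} removes a trailing ".tar" from NAME and leaves other names unchanged.
strip_tar_suffix() {
    NAME="$1"
    echo "${NAME%.tar}"
}

strip_tar_suffix "./n01440764.tar"   # prints ./n01440764
```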
For the Validation images:
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val
tar -xvf ILSVRC2012_img_val.tar && rm -f ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
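Once both splits are unpacked, a quick count can catch an interrupted download or extraction. ILSVRC2012 has 1000 classes, 1,281,167 training images, and 50,000 validation images, so each split should contain 1000 class directories. The sketch below assumes the `/share/data` layout used above and images with a `.JPEG` extension; `count_split` is a hypothetical helper name.

```shell
# Print "<class directories> <JPEG files>" for one dataset split.
count_split() {
    dirs=$(find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l)
    files=$(find "$1" -type f -name '*.JPEG' | wc -l)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "$((dirs)) $((files))"
}

# Report each split if it exists (paths follow the layout used in this post).
for split in train val; do
    if [ -d "/share/data/$split" ]; then
        echo "$split: $(count_split "/share/data/$split")"
    fi
done
```

With a complete dataset, the train line should report 1000 directories and 1281167 files, and the val line 1000 directories and 50000 files.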
Run the ResNet benchmark
Running the benchmark consists of sourcing the configuration file and then launching the run script.
cd /share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
source config_DGXH100.sh
CONT=nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet DATADIR=/share/data LOGDIR=results ./run_with_docker.sh
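MLPerf logging writes lines prefixed with `:::MLLOG`, each followed by a JSON record that includes `key` and `time_ms` fields; the time to train is the gap between the `run_start` and `run_stop` records. Assuming the logs written under `LOGDIR` follow that format, the sketch below extracts the elapsed time in minutes. The helper names are hypothetical, and the log file name is left as a placeholder because it depends on the run.

```shell
# Extract time_ms from the first MLLOG line with the given key.
mllog_time_ms() {
    grep ':::MLLOG' "$1" | grep -E "\"key\": *\"$2\"" | head -n 1 |
        sed -E 's/.*"time_ms": *([0-9]+).*/\1/'
}

# Print the minutes elapsed between run_start and run_stop in a log file.
mllog_minutes() {
    start=$(mllog_time_ms "$1" run_start)
    stop=$(mllog_time_ms "$1" run_stop)
    if [ -z "$start" ] || [ -z "$stop" ]; then
        echo "run_start/run_stop not found in $1" >&2
        return 1
    fi
    awk -v s="$start" -v e="$stop" 'BEGIN { printf "%.3f\n", (e - s) / 60000 }'
}

# Usage (placeholder file name): mllog_minutes results/<log file>
```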
The same steps apply to the other MLPerf Training v3.0 benchmarks, using the corresponding configuration file and data preprocessing steps for each.
#AzureHPCAI #MakeAIYourReality