Azure High Performance Computing (HPC) Blog

4 MIN READ

A quick start guide to benchmarking AI models in Azure: MLPerf Training v2.0

HugoAffaticati

Microsoft

Aug 03, 2022

By: Sonal Doomra, Program Manager 2, Hugo Affaticati, Program Manager, and Daramfon Akpan, Program Manager

Useful resources

Information on the NDm A100 v4-series

MLCommons® provides a distributed AI training benchmark suite: MLPerf^TM Training. Here is how to run MLPerf^TM training v2.0 benchmarks on NDm A100 v4 virtual machines.

1- Select and set up the virtual machine ND96amsr A100 v4 using the information given in this document.

2- Git clone the MLcommons® repo:

cd /mnt/resource_nvme
git clone https://github.com/mlcommons/training_results_v2.0.git

3- Set permissions

sudo chown -R $USER:$USER training_results_v2.0/

4- Navigate into the benchmark directory. BENCHMARK_NAME should be replaced by the benchmark you want to run (e.g. bert, rnnt, dlrm...).

cd training_results_v2.0/Azure/benchmarks/<BENCHMARK_NAME>/implementations/ND96amsr_A100_v4/

5- Make changes for NUMA bindings in azure.sh

vi azure.sh

For NDm A100 v4-series, paste the following lines in the file.

bind_cpu_cores=([0]="24-47" [1]="24-47" [2]="0-23" [3]="0-23" [4]="72-95" [5]="72-95" [6]="48-71" [7]="48-71")
bind_mem=([0]="1" [1]="1" [2]="0" [3]="0" [4]= "3" [5]="3" [6]="2" [7]="2")

6- Make changes to run_and_time.sh to reflect the right path to azure.sh (around line 125)

vi run_and_time.sh

Replace the line with the following.

CMD=( '/bm_utils/bind.sh' '--cpu=/bm_utils/azure.sh' '--mem=/bm_utils/azure.sh' '--ib=single' '--cluster=${cluster}' '--' ${NSYSCMD} 'python' '-u')

7- Make the changes to run_with_docker.sh file to point to correct path in mounted run_and_time.sh (around the middle of the file).

docker exec -it "${_config_env[@]}" "${CONT_NAME}" \
${TORCH_RUN} --nproc_per_node=${DGXNGPU} /bm_utils/run_and_time.sh
) |& tee "${LOG_FILE_BASE}_${_experiment_index}.log"

8- Make changes to config file to account for hyperthreads and number of GPUs, only the values below must be updated:

vi config_DGXA100_1x8x56x1.sh

export DGXNGPU=8
export DGXSOCKETCORES=48
export DGXNSOCKET=2
export DGXHT=1

9- Run the command to source the config file:

source ./config_DGXA100_1x8x56x1.sh

The next steps are different for each benchmark.

10- Follow Readme.txt for the benchmark to download and prepare the data.

Note: While downloading the data, make sure you have enough space. Tip: Use the /mnt/resource_nvme directory to store the data.

11- Run the following command to get the docker image name and tag.

docker images

Note the image name and tag associated with the benchmark you are running. <CONTAINER_NAME> in the next step is <REPOSITORY>:<TAG>

12- The command below runs the benchmark. Note that each benchmark has its own environment variables to set before we run. Please read the explanation of the variables to understand what value to give to each variable.

Run the command below to set the number of experiments to run

export NEXP=10

BERT

CONT=<CONTAINER_NAME> DATADIR=<path/to/4320_shards_varlength/dir> DATADIR_PHASE2=<path/to/4320_shards_varlength/dir> EVALDIR=<path/to/eval_varlength/dir> CHECKPOINTDIR=<path/to/result/checkpointdir> CHECKPOINTDIR_PHASE1=<path/to/pytorch/ckpt/dir> ./run_with_docker.sh

The variables in the above command refer to the directory structure created by the Data download and preprocessing steps.

DATADIR: Point this to the 4320_shards_varlength folder downloaded with the training dataset.
DATADIR_PHASE2: Point this to the 4320_shards_varlength folder downloaded with the training dataset.
EVALDIR: Point this to the eval_varlength folder downloaded with the validation dataset.
CHECKPOINTDIR: Point this to a new results folder under bert data directory.
CHECKPOINTDIR_PHASE1: Point this to the phase1 folder within the bert data directory.

RNNT

CONT=<CONTAINER_NAME> DATADIR= </path/to/rnnt/datasets/dir> METADATA_DIR=</path/to/tokenized/folder/under/data/dir> SENTENCEPIECES_DIR=</path/to/sentencepieces/folder/under/data/dir> LOGDIR=./results ./run_with_docker.sh

DATADIR: Point this to the directory where RNNT data is downloaded.
METADATA_DIR: Point this to the folder called ‘tokenized’ within the downloaded RNNT data.
SENTENCEPIECES_DIR: Point this to the folder called “sentencepieces” within the downloaded RNNT data.

ResNet50

CONT=<CONTAINER_NAME> DATADIR=/path/to/resnet_data/prep_data/ LOGDIR=./results ./run_with_docker.sh

DATADIR: Point this to the folder called “prep_data” inside the downloaded Resnet data.

Minigo

CONT=<CONTAINER_NAME> DATADIR=/path/to/minigo_data/ ./run_with_docker.sh

DLRM

CONT=<CONTAINER_NAME> DATADIR=/path/to/dlrm_data / LOGDIR=./results ./run_with_docker.sh

SSD

CONT=<CONTAINER_NAME> DATADIR=/path/to/ssd_data TORCH_HOME=/torch-home LOGDIR=./results ./run_with_docker.sh

TORCH_HOME: Create a new folder. Mkdir /torch-home.
Point this variable to the newly created /torch-home directory.

Mask R-CNN

CONT=<CONTAINER_NAME> DATADIR=/path/to/maskrcnn_data/ LOGDIR=./results ./run_with_docker.sh

Updated Nov 01, 2022

Version 5.0

hpc