Accelerating Distributed Training in Azure Machine Learning service using SR-IOV

Author: Ravi Shankar Kolli

This post is co-authored by Mathew Salvaris, Aashna Garg, Vaibhav Jain, Reyhan Patia, Caghan Demirci, Alex Sutton

 

Today’s state-of-the-art deep learning models, such as BERT, require distributed multi-machine training to reduce training time from weeks to days. The interconnect is one of the key components for reducing communication overhead and achieving good scaling efficiency in distributed multi-machine training.

Azure Machine Learning users can now speed up their training by taking advantage of Azure Virtual Machines with SR-IOV and InfiniBand support (https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/hpc/enable-infiniband). In September 2018, Azure introduced the NC, ND, and H-series of VMs with dedicated InfiniBand networks. All RDMA-enabled sizes are capable of leveraging that network using Intel MPI. SR-IOV stands for “single root input/output virtualization”, which optimizes sharing of PCI Express devices in a system with virtual machines. In Azure, SR-IOV for InfiniBand enables near bare-metal performance for any MPI library.

MPI, or Message Passing Interface, is a communication standard whose implementations are commonly used for distributed training between GPUs on many systems. NVIDIA’s NCCL library can be used together with MPI to make distributed training easier in deep learning frameworks like PyTorch and TensorFlow.
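As a rough illustration of how an MPI launcher and a framework fit together: when a PyTorch script is started with Open MPI's mpirun, each process sees Open MPI's OMPI_COMM_WORLD_* environment variables, while torch.distributed's env:// initialization reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. The sketch below maps one set of names onto the other. The helper name, the default master address, and the default port are assumptions made for illustration and are not taken from the reference implementation.

```python
import os


def ompi_to_torch_env(environ, master_addr="127.0.0.1", master_port=29500):
    """Map Open MPI launcher variables to the names torch.distributed's env:// init reads.

    Hypothetical helper: the default address and port are placeholders, and the
    fallbacks (rank 0, world size 1) describe a single-process run without mpirun.
    """
    rank = int(environ.get("OMPI_COMM_WORLD_RANK", 0))
    world_size = int(environ.get("OMPI_COMM_WORLD_SIZE", 1))
    local_rank = int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "LOCAL_RANK": str(local_rank),
        "MASTER_ADDR": environ.get("MASTER_ADDR", master_addr),
        "MASTER_PORT": str(environ.get("MASTER_PORT", master_port)),
    }


if __name__ == "__main__":
    os.environ.update(ompi_to_torch_env(os.environ))
    # With PyTorch installed, the process group could then be created with:
    #   import torch.distributed as dist
    #   dist.init_process_group(backend="nccl", init_method="env://")
    print({key: os.environ[key] for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK")})
```

In a real multi-node job, MASTER_ADDR would need to point at the node hosting rank 0 rather than the loopback placeholder used here.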

Azure now supports using any MPI library with SR-IOV-enabled VM families such as NCv3 and NDv2, as well as HC and HB for HPC applications. Older GPU hardware with InfiniBand, such as NCv2 and NDv1, will be updated for SR-IOV in 2020.
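One way to check from inside a Linux VM or container whether an InfiniBand device is visible is to look under /sys/class/infiniband, where the kernel's RDMA stack lists its devices. The following is a minimal sketch with a hypothetical function name. An empty result only means no device is exposed to this environment; it does not by itself indicate whether SR-IOV is enabled on the VM size.

```python
import os


def list_infiniband_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted RDMA device names found under sysfs_root, or [] if none."""
    try:
        return sorted(os.listdir(sysfs_root))
    except (FileNotFoundError, NotADirectoryError):
        # The directory is absent when no RDMA driver is loaded.
        return []


if __name__ == "__main__":
    devices = list_infiniband_devices()
    if devices:
        print("InfiniBand devices:", ", ".join(devices))
    else:
        print("No InfiniBand devices visible to this VM or container")
```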

Intel MPI version 5.x will continue to be supported, as will all subsequent Intel MPI versions. In addition, all other MPI implementations supported by the OpenFabrics Enterprise Distribution (OFED) are supported, as are Open MPI and NVIDIA’s NCCL2 library, which provides optimized performance for GPUs.

These enhancements will provide customers with higher InfiniBand bandwidth, lower latencies, and most importantly, better distributed application performance. InfiniBand connectivity provides higher throughput and lower latencies than an Ethernet-based connection. SR-IOV enables communication over an InfiniBand network using any flavor of MPI. A reference implementation of BERT in Azure Machine Learning using SR-IOV and InfiniBand can be found on GitHub (https://github.com/microsoft/AzureML-BERT).

 

Throughput Improvement in BERT

SR-IOV and InfiniBand improved the throughput of the BERT Large model by up to 75%. With SR-IOV enabled, throughput rises to about 28 sequences/second/GPU, which is 75% better than the baseline. The charts below show the throughput improvement for BERT Large pretraining on 16 Azure Standard_NC24s_v3 VMs. The model is implemented in PyTorch and uses torch.distributed and Open MPI for multi-node training. Note that the charts below do not reflect the best achievable throughput of BERT on Azure.
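The baseline throughput is not stated directly, but it follows from the two figures above: 28 sequences/second/GPU at a 75% improvement implies a baseline of about 28 / 1.75 = 16 sequences/second/GPU. The short sketch below does that arithmetic. The cluster-wide number is illustrative only: it assumes 4 GPUs per VM and that every GPU sustains the per-GPU rate, neither of which is a measured result from the post.

```python
def baseline_from_improvement(improved, improvement_pct):
    """Recover the baseline value given the improved value and the relative gain in percent."""
    return improved / (1.0 + improvement_pct / 100.0)


sriov_per_gpu = 28.0  # sequences/second/GPU with SR-IOV, from the text
baseline_per_gpu = baseline_from_improvement(sriov_per_gpu, 75.0)

# Assumption for illustration: 16 VMs with 4 GPUs each, all at the per-GPU rate.
total_gpus = 16 * 4
cluster_estimate = sriov_per_gpu * total_gpus

print(f"implied baseline: {baseline_per_gpu:.1f} sequences/second/GPU")
print(f"illustrative cluster total: {cluster_estimate:.0f} sequences/second")
```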

 

clipboard_image_27.png

clipboard_image_28.png

 

Throughput Improvement in ResNet

To observe the speed improvements for PyTorch, we ran a selection of ResNet models from Torchvision on synthetic data at full precision. This allowed us to estimate the throughput without having to worry about I/O overhead. The figures below compare clusters with SR-IOV enabled against clusters without SR-IOV. We used NC24rs_v3 VMs, each equipped with 4 V100 GPUs. Therefore, a result reported for 8 GPUs spans 2 nodes, and a result for 16 GPUs spans 4 nodes. Across models and GPU configurations, SR-IOV offers a 2-3x improvement over the configuration without SR-IOV.
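The measurement idea described above, timing training steps on pre-generated data so that input loading does not affect the result, can be sketched in a framework-agnostic way. This is not the benchmark code used for the figures; the function names, warm-up count, and step count are assumptions, and the training step here is a CPU stand-in.

```python
import time


def measure_throughput(step_fn, batch_size, warmup_steps=5, timed_steps=20):
    """Return samples/second for step_fn, excluding warm-up iterations."""
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(timed_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * timed_steps / elapsed


def nodes_for_gpus(num_gpus, gpus_per_node=4):
    """Number of nodes needed for num_gpus, rounding up (4 GPUs per NC24rs_v3 VM)."""
    return -(-num_gpus // gpus_per_node)


if __name__ == "__main__":
    def fake_training_step():
        # Stand-in for a forward/backward/optimizer step on a synthetic batch.
        sum(i * i for i in range(10_000))

    rate = measure_throughput(fake_training_step, batch_size=64)
    print(f"{rate:.1f} samples/second")
    print(nodes_for_gpus(8), nodes_for_gpus(16))
```

On a GPU, the step function would also need to wait for queued work to finish (for example with torch.cuda.synchronize()) before the timer is read, otherwise asynchronous kernel launches make the step look faster than it is.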

clipboard_image_29.png

clipboard_image_30.png

clipboard_image_31.png

In the figures below, the number reported in the center of each bar is the scaling efficiency on RDMA-enabled VMs. For Horovod and DistributedDataParallel, both using NCCL, the scaling efficiency is over 90% across all three models, with performance almost doubling each time the number of GPUs doubles.
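The post does not spell out how scaling efficiency is computed. A common definition, assumed here, is measured throughput divided by the ideal throughput obtained by scaling a smaller reference run linearly with the GPU count. The numbers in the example are hypothetical and are not read from the charts.

```python
def scaling_efficiency(throughput, num_gpus, base_throughput, base_gpus=1):
    """Measured throughput as a fraction of ideal linear scaling from a reference run."""
    ideal = base_throughput * (num_gpus / base_gpus)
    return throughput / ideal


# Hypothetical example: 800 images/second on 4 GPUs, 1480 images/second on 8 GPUs.
efficiency = scaling_efficiency(1480.0, num_gpus=8, base_throughput=800.0, base_gpus=4)
print(f"scaling efficiency: {efficiency:.1%}")
```

With this definition, an efficiency above 90% when doubling the GPU count corresponds to throughput growing by more than 1.8x, which matches the "almost doubling" described above.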

clipboard_image_32.png

clipboard_image_33.png

clipboard_image_34.png

 

Summary

SR-IOV yielded significant throughput improvements for distributed multi-machine training. BERT Large throughput increased by 75% with SR-IOV, and certain ResNet models were about 2-3x faster with SR-IOV. Throughput also scaled nearly linearly on ResNet models as the number of NC24rs_v3 nodes grew from 1 to 2, 4, and 8 instances.

 

Stay tuned for our next blog post on scaling distributed deep learning training on Azure NDv2 VMs (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu#nd-series). These VMs feature 8 NVIDIA Tesla V100 GPUs interconnected with NVLink, 32 GB of HBM2 memory per GPU, and a 100 Gbps EDR InfiniBand interconnect.

 

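Get started with distributed deep learning training on Azure Machine Learning (https://azure.microsoft.com/en-us/services/machine-learning/). Report any implementation issues or observed throughput improvements of SR-IOV on Azure Machine Learning on Stack Overflow (https://stackoverflow.com/questions/tagged/azure-machine-learning-service).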