NEW REFERENCE ARCHITECTURE: Distributed training of deep learning models on Azure
Microsoft

 

Our sixth AI reference architecture (on the Azure Architecture Center) is authored by Mathew Salvaris of AzureCAT, edited by Nanette Ray, and published by Mike Wasson.

Reference architectures provide a consistent approach and best practices for a given solution. Each architecture includes recommended practices, along with considerations for scalability, availability, manageability, security, and more. This architecture includes a deployable solution as well. The full array of reference architectures is available on the Azure Architecture Center.

[Figure: deep_learning_models_refarch.png]

This reference architecture shows how to conduct distributed training of deep learning models across clusters of GPU-enabled virtual machines (VMs). The scenario is image classification, but the solution can be generalized for other deep-learning scenarios, such as segmentation and object detection.
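
To make "distributed training" concrete, here is a minimal sketch of one common strategy, synchronous data parallelism, using plain Python and a toy one-parameter model. This post does not say which strategy the architecture uses, so treat the approach as an assumption. The function names (local_gradient, split_into_shards, data_parallel_step, train) are hypothetical and are not taken from the reference architecture. The "workers" here are ordinary loop iterations standing in for GPU-enabled VMs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split_into_shards(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Round-robin split of the data set.

    Assumption: len(data) is divisible by num_workers, so every shard has the
    same size and the average of shard gradients equals the full-batch gradient.
    """
    return [data[i::num_workers] for i in range(num_workers)]


def data_parallel_step(w: float, shards: List[List[Sample]], lr: float) -> float:
    """One synchronous update: every worker computes a gradient, then they are averaged."""
    # On a real cluster each of these would run on a separate GPU VM.
    grads = [local_gradient(w, shard) for shard in shards]
    # Stand-in for the gradient exchange (for example an all-reduce) between workers.
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


def train(data: List[Sample], num_workers: int = 4,
          lr: float = 0.01, steps: int = 200) -> float:
    """Run several synchronous data-parallel steps starting from w = 0."""
    shards = split_into_shards(data, num_workers)
    w = 0.0
    for _ in range(steps):
        w = data_parallel_step(w, shards, lr)
    return w


if __name__ == "__main__":
    # Toy data generated from y = 3 * x; training should recover w close to 3.
    toy_data = [(float(x), 3.0 * x) for x in range(1, 9)]
    print(train(toy_data))
```

With equally sized shards, the averaged gradient matches the gradient a single machine would compute on the whole batch, which is why adding workers in this scheme changes throughput rather than the result of each step.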

This architecture consists of the following components:

  • Azure Batch AI plays the central role in this architecture by scaling resources up and down according to need.
  • Blob storage is used to stage the data.
  • Azure Files is used to store the scripts, logs, and the final results from the training.
  • Batch AI file server is a single-node NFS share used in this architecture to store the training data.
  • Docker Hub is used to store the Docker image that Batch AI uses to run the training. Azure Container Registry can also be used.

 

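Topics covered include:

  • Performance considerations
  • Scalability considerations
  • Storage considerations
  • Security considerations
      • Restrict access to Azure Blob Storage
      • Encrypt data at rest and in motion
      • Secure data in a virtual network
  • Monitoring considerations
  • Deployment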

Head over to the Azure Architecture Center to learn more about the "Distributed training of deep learning models on Azure" reference architecture: https://docs.microsoft.com/azure/architecture/reference-architectures/ai/training-deep-learning

 

See Also

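Additional related AI reference architectures:

  • Batch scoring on Azure for deep learning models
  • Batch scoring of Python models on Azure
  • Real-time scoring of Python Scikit-Learn and deep learning models on Azure
  • Real-time scoring of R machine learning models
  • Build a real-time recommendation API on Azure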

Find all our reference architectures at http://aka.ms/RefArchs.

 

AzureCAT Guidance

"Hands-on solutions, with our heads in the Cloud!"