Azure offers a preferred platform for PyTorch, a popular machine learning framework used for deep learning models. Specifically, the Azure Machine Learning (AzureML) service offers the Azure Container for PyTorch (ACPT), a carefully curated environment that includes the latest PyTorch version along with optimization software for efficient training and inference, such as DeepSpeed and ONNX Runtime. ACPT is thoroughly tested and tuned for seamless integration with the Azure AI infrastructure, making it an excellent choice for running PyTorch-based projects. The curated environment gives users a ready-to-use distributed training setup for running experiments on the latest multi-node GPU infrastructure offered in Azure.
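As a sketch of how the curated environment is consumed, an AzureML CLI v2 command-job specification can reference ACPT by name. The environment name, compute target, and script paths below are assumptions for illustration; check the environment registry in your AzureML workspace for the exact curated environment name and version available to you:

```yaml
# Hypothetical AzureML CLI v2 job spec using an ACPT curated environment.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python train.py --epochs 10
code: ./src                                       # your training script folder
environment: azureml:AzureML-ACPT-pytorch-2.0-cuda11.7@latest  # assumed name; verify in your workspace
compute: azureml:gpu-cluster                      # your multi-node GPU compute target
distribution:
  type: pytorch
  process_count_per_instance: 4                   # one process per GPU on each node
resources:
  instance_count: 2                               # number of nodes
```

Submitting such a spec (for example with `az ml job create --file job.yml`) runs the training script inside ACPT with PyTorch's distributed launcher configured across the requested nodes.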
We are announcing the general availability of the ACPT curated environment in AzureML. This GA release follows our public preview last October and brings quality and performance improvements along with updated versions of the deep learning libraries in ACPT.
The Azure Container for PyTorch image now also supports the newly released PyTorch 2.0, making early experimentation easy. PyTorch 2.0 includes several performance improvements, including optimizations for GPU usage and memory management through torch.compile, which is built on new technologies such as TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor. You can now try these improvements easily using ACPT on Azure.
We are also excited to introduce Nebula, a new fast-checkpointing capability in ACPT. Nebula lets you save and manage checkpoints faster and more easily for distributed, large-scale model training jobs with PyTorch on Azure Machine Learning. With Nebula, you can save your checkpoints up to 1,000 times faster through a simple API that works asynchronously with your training process. In a test with a medium-sized Hugging Face GPT2-XL checkpoint (20.6 GB), Nebula achieved a 96.9 percent reduction in single-checkpoint save time. The speed gain grows further with larger models and more GPUs: our results demonstrate that with Nebula, saving a 97 GB checkpoint in a training job on 128 NVIDIA A100 GPUs drops from 20 minutes to just 1 second. With the potential to reduce checkpoint times from hours to seconds (a reduction of 95 to 99.9 percent), Nebula provides a solution for frequent saving and reduced end-to-end training time and cost in large-scale training jobs. Nebula is fully compatible with distributed PyTorch training frameworks and libraries, including PyTorch Lightning and DeepSpeed, and you can also use it with different Azure Machine Learning compute targets. For more information about Nebula, please visit aka.ms/NebulaACPT.
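To illustrate the shape of the API, the non-runnable sketch below is based on the Nebula documentation; the `nebulaml` package is only available inside the ACPT environment, and the exact function signatures shown here are assumptions that may differ from the released version (see aka.ms/NebulaACPT for the authoritative reference):

```python
# Illustrative sketch only -- nebulaml ships inside ACPT; signatures are assumed.
import nebulaml as nm

# One-time initialization pointing at the storage location for checkpoints.
nm.init(persistent_storage_path="/mnt/checkpoints")

# Inside the training loop: the save call returns quickly and the
# checkpoint is persisted asynchronously, off the training critical path.
checkpoint = nm.Checkpoint()
checkpoint.save("epoch-10", model)

# On resume: recover the most recent checkpoint.
latest = nm.get_latest_checkpoint()
state = latest.load("epoch-10")
```

Because the save is asynchronous, training can checkpoint far more frequently without stalling the GPUs, which is where the reported time reductions come from.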
Along with the GA release of the ACPT curated environment, we are releasing a step-by-step guide of best practices for running large-scale distributed training on AzureML (https://aka.ms/azureml-largescale). The techniques and optimizations covered in the guide span all stages of the data science workflow needed to manage an enterprise-grade MLOps lifecycle, from resource setup and data loading to training optimizations, evaluation, and inference optimization.
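The distributed-training setup that the guide's recommendations build on can be sketched in a few lines. The example below is a minimal single-process illustration using PyTorch's DistributedDataParallel with the CPU `gloo` backend; the address and port values are placeholders, and on a real multi-node AzureML job the PyTorch distribution settings populate `RANK`, `WORLD_SIZE`, and `MASTER_ADDR` for you.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings; AzureML sets these on real multi-node jobs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 4)
ddp_model = DDP(model)  # gradients are all-reduced across workers on backward()

x = torch.randn(8, 16)
loss = ddp_model(x).sum()
loss.backward()  # with world_size > 1, gradient sync happens here

dist.destroy_process_group()
```

The same script scales to multiple nodes unchanged; only the process-group initialization values differ, which is why the guide treats resource setup and training code as separate concerns.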
PyTorch 2.0 brings several new features and improvements that you can use on Azure through the Azure Container for PyTorch image, and Nebula gives you faster checkpointing. With these ACPT advancements, you can now easily manage your AI workloads at scale: create ML resources, register training datasets, create new or reuse prebuilt environments, load data efficiently, and optimize training on Azure for your large training workloads.