Today, Azure is proud to take the next step in our commitment to enabling customers to harness the power of AI (artificial intelligence) at scale. For AI, the bar for innovation has never been higher, with hardware requirements for training models far outpacing Moore’s Law. Technology leaders across industries are discovering new ways to apply the power of machine learning, accelerated analytics, and AI to make sense of unstructured data. The natural language models of today are orders of magnitude larger than the largest models of four short years ago.
OpenAI’s GPT-3 model, for instance, has more than three orders of magnitude more parameters than the ResNet-50 image classification model that was at the forefront of AI in the mid-2010s. These kinds of demanding workloads required the development of a new class of system within Microsoft Azure, built from the ground up using the latest hardware innovations.
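The parameter comparison above can be checked with a few lines of arithmetic. This is a minimal sketch: the two parameter counts are the commonly published approximate figures for each model and are not stated in this post, and the helper name `orders_of_magnitude` is purely illustrative.

```python
import math

# Approximate published parameter counts (assumptions; not given in the post).
GPT3_PARAMS = 175e9       # GPT-3: about 175 billion parameters
RESNET50_PARAMS = 25.6e6  # ResNet-50: about 25.6 million parameters


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return the base-10 logarithm of the ratio between two sizes."""
    return math.log10(larger / smaller)


ratio = GPT3_PARAMS / RESNET50_PARAMS
gap = orders_of_magnitude(GPT3_PARAMS, RESNET50_PARAMS)

print(f"GPT-3 is about {ratio:,.0f}x the size of ResNet-50")
print(f"That is about {gap:.2f} orders of magnitude")
```

Under these assumed counts the gap lands between three and four orders of magnitude, in line with the comparison above.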
The Azure team has built on our experience virtualizing the latest GPU technology and building the public cloud industry’s leading InfiniBand-enabled HPC virtual machines to offer something totally new for AI in the cloud. Each deployment of an ND A100 v4 cluster rivals the largest AI supercomputers in the industry in terms of raw scale and advanced technology. These VMs provide 1.6 Tb/s of total dedicated InfiniBand bandwidth per VM, plus AMD Rome-powered compute cores behind every NVIDIA A100 GPU, the same building blocks used by the most powerful dedicated on-premises HPC systems. Azure adds the massive scale, elasticity, and versatility of deployment expected by Microsoft’s customers and internal AI engineering teams.
This unparalleled scale and capability of interconnect in a cloud offering, with each GPU directly paired with a high-throughput low-latency InfiniBand interface, offers our customers a unique dimension of scaling on demand without managing their own datacenters.
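To make the interconnect figure concrete, the sketch below splits the stated 1.6 Tb/s per VM across its GPUs and estimates an idealized transfer time. The GPU count of 8 per VM is an assumption not stated in the text above, the fp16 payload size is only an illustrative example, and the function names are hypothetical. The timing is a line-rate lower bound that ignores protocol overhead, topology, and collective-communication algorithms.

```python
VM_BANDWIDTH_TBPS = 1.6  # from the post: total dedicated InfiniBand bandwidth per VM
GPUS_PER_VM = 8          # assumption: GPU count per VM is not stated in the text


def per_gpu_gbps(vm_tbps: float, gpus: int) -> float:
    """Evenly split VM bandwidth across GPUs; result in gigabits per second."""
    return vm_tbps * 1000 / gpus


def ideal_transfer_seconds(payload_bytes: float, link_tbps: float) -> float:
    """Lower bound on transfer time at line rate, ignoring all overhead."""
    return payload_bytes * 8 / (link_tbps * 1e12)


gpu_gbps = per_gpu_gbps(VM_BANDWIDTH_TBPS, GPUS_PER_VM)

# Illustrative payload: 175 billion parameters stored as 2-byte fp16 values.
payload_bytes = 175e9 * 2
seconds = ideal_transfer_seconds(payload_bytes, VM_BANDWIDTH_TBPS)

print(f"Per-GPU InfiniBand bandwidth: {gpu_gbps:.0f} Gb/s")
print(f"Ideal time to move {payload_bytes / 1e9:.0f} GB per VM: {seconds:.2f} s")
```

Real training jobs will see lower effective throughput than this bound, so treat the numbers as a sizing aid rather than a performance prediction.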
Today, at SC20, we’re announcing the public preview of the ND A100 v4 VM family, available from one virtual machine to world-class supercomputer scale, with each individual VM featuring:
Like other Azure GPU virtual machines, ND A100 v4 is also available with the Azure Machine Learning (AML) service for interactive AI development, distributed training, batch inferencing, and automation with MLOps. Customers can choose to deploy through AML or traditional VM Scale Sets, and soon through many other Azure-native deployment options such as Azure Kubernetes Service. With all of these, optimized configuration of the systems and the InfiniBand backend network is handled automatically.
Azure Machine Learning provides a tuned virtual machine (pre-installed with the required drivers and libraries) and container-based environments optimized for the ND A100 v4 family. Sample recipes and Jupyter Notebooks help users get started quickly with multiple frameworks, including PyTorch and TensorFlow, and with training state-of-the-art models like BERT. With Azure Machine Learning, customers have access to the same tools and capabilities in Azure as our AI engineering teams.
Accelerate your innovation and unlock your AI potential with the ND A100 v4.
Preview sign-up is open. Request access now.