Authored by:
Sherry Xu, Partner Lead SoC Architect, Azure Maia
Chandru Ramakrishnan, Partner Software Engineering Manager
As the advancement of artificial intelligence continues to demand new innovations in the cloud, we find ourselves squarely in a moment where the co-optimization of hardware and software is critical to delivering AI infrastructure with peak performance, scalability, and fungibility.
At Hot Chips 2024, Microsoft shared specifications on Maia 100, Microsoft's first-generation custom AI accelerator designed specifically for large-scale AI workloads deployed in Azure. Vertically integrated to optimize performance and reduce costs, the Maia 100 system includes a platform architecture featuring custom server boards with tailor-made racks and a software stack built to increase performance and cost efficiency for advanced AI capabilities on services like Azure OpenAI Services.
Chip architecture designed to support advanced machine learning needs
The Maia 100 accelerator is purpose-built for a wide range of cloud-based AI workloads. The chip measures roughly 820 mm² and is built on TSMC’s N5 process with CoWoS-S interposer technology. Equipped with large on-die SRAM, Maia 100’s reticle-size SoC die, combined with four HBM2E die, provides a total of 1.8 terabytes per second of bandwidth and 64 gigabytes of capacity to accommodate AI-scale data handling requirements.
Designed to support up to 700W TDP but provisioned at 500W, Maia 100 can deliver high performance while managing power efficiently based on its targeted workloads.
An AI accelerator built for high throughput and diverse data formats
Maia 100’s architecture, tailored to modern machine learning needs, reflects thoughtful research into AI systems aimed at optimal computational speed, performance, and accuracy.
- A high-speed tensor unit offers rapid processing for training and inferencing while supporting a wide range of data types, including low-precision data types such as the MX data format, first introduced by Microsoft through the MX Consortium in 2023 (a sketch of the block-scaling idea behind MX follows this list). The tensor unit is constructed as a 16xRx16 unit.
- The vector processor is a loosely coupled superscalar engine built with custom instruction set architecture (ISA) to support a wide range of data types, including FP32 and BF16.
- A Direct Memory Access (DMA) engine supports different tensor sharding schemes.
- Hardware semaphores enable asynchronous programming on the Maia system.
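To make the low-precision support above more concrete, the sketch below illustrates the block-scaling idea behind MX-style formats: a block of values shares a single power-of-two scale, so each element needs only a few bits. The block size, element width, and scale encoding here are illustrative assumptions for the sketch, not Maia’s actual format parameters.

```python
import numpy as np

def mx_quantize_block(values, elem_bits=8, block_size=32):
    """Illustrative block-scaled ("MX-style") quantization.

    Each block of `block_size` values shares one power-of-two scale,
    and each element is stored with only `elem_bits` of precision.
    Parameters are illustrative, not Maia's actual format.
    """
    assert values.size == block_size
    # Shared scale: smallest power of two that brings the block's
    # largest magnitude into the representable element range.
    max_mag = np.max(np.abs(values))
    elem_max = 2 ** (elem_bits - 1) - 1          # e.g. 127 for 8-bit elements
    scale = 2.0 ** np.ceil(np.log2(max_mag / elem_max)) if max_mag > 0 else 1.0
    # Quantize each element against the shared scale.
    q = np.clip(np.round(values / scale), -elem_max, elem_max).astype(np.int8)
    return q, scale

def mx_dequantize_block(q, scale):
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)
q, scale = mx_quantize_block(block)
print("max abs error:", np.max(np.abs(block - mx_dequantize_block(q, scale))))
```

Because the scale is shared across the block, the per-element storage stays small while the dynamic range of the block as a whole is preserved, which is what makes such formats attractive for tensor-unit math.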
A software-led approach to data utilization and power efficiency
The Maia accelerator is designed with a lower-precision storage data type and a data compression engine to reduce the bandwidth and capacity required for large inferencing jobs, which are often bottlenecked by data movement. To further improve data utilization and power efficiency, large L1 and L2 scratch pads are software-managed.
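Because the scratchpads are software-managed rather than hardware caches, the program itself decides which tiles of data are staged on-chip and when. The sketch below is a minimal, hypothetical illustration of that idea using a tiled matrix multiply; on real hardware the explicit copies would be asynchronous DMA transfers overlapped with compute, and all names here are made up for illustration.

```python
import numpy as np

TILE = 128  # illustrative tile size chosen to fit a scratchpad

def tiled_matmul_with_scratchpad(A, B):
    """Tiled matmul where tiles are explicitly 'staged' into scratchpad
    buffers, standing in for software-managed L1/L2. Purely illustrative."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            acc = np.zeros((min(TILE, M - i), min(TILE, N - j)), dtype=A.dtype)
            for k in range(0, K, TILE):
                # Explicit staging: software chooses exactly which tiles to
                # copy on-chip (on real hardware, an async DMA transfer).
                a_tile = np.copy(A[i:i+TILE, k:k+TILE])   # "load into scratchpad"
                b_tile = np.copy(B[k:k+TILE, j:j+TILE])
                acc += a_tile @ b_tile                     # compute on staged tiles
            C[i:i+TILE, j:j+TILE] = acc                    # "write back" result tile
    return C

A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)
print(np.allclose(tiled_matmul_with_scratchpad(A, B), A @ B, atol=1e-3))
```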
Ethernet-based interconnects support large-scale AI models
In 2023, Microsoft led the development of the Ultra Ethernet Consortium, helping enable the industry to use Ethernet-based interconnects designed for ultra-high-bandwidth compute. Maia 100 supports up to 4800 Gbps of all-gather and scatter-reduce bandwidth, and 1200 Gbps of all-to-all bandwidth. This Ethernet interconnect utilizes a custom RoCE-like protocol, offering enhanced reliability and balance. Maia’s backend network protocol supports AES-GCM encryption, also making it ideal for confidential compute. Maia 100 is also supported by a unified backend network for scale-up and scale-out workloads, providing flexibility to support both direct and switch connectivity.
Enabling quick deployment and model portability on the Maia SDK
With hardware and software architecture designed from the ground up to run large-scale workloads more efficiently, Maia 100 vertically integrates what we have learned across every layer of our cloud architecture – from advanced cooling and networking needs to the software stack that allows quick deployment of models. The Maia software development kit (SDK) allows users to quickly port their models written in PyTorch and Triton to Maia.
The Maia SDK provides a comprehensive set of components for developers to enable quick deployment of models to Azure OpenAI Services:
- Framework integration: a first-class PyTorch backend which supports both eager mode and graph mode;
- Developer tools: tools for debugging and performance-tuning models, including a debugger, profiler, visualizer, and model quantization and validation tools;
- Compilers: we have two programming models and compilers for Maia. The Triton programming model offers agility and portability, while the Maia API is suited for the highest performance;
- Kernel and Collective Library: Using the compilers, we’ve developed a set of highly optimized ML compute and communication kernels enabling you to get started quickly on Maia. Authoring of custom kernels is also supported.
- Maia Host/Device Runtime: A host-device runtime layer comes with a hardware abstraction layer that is responsible for memory allocation, kernel launches, scheduling, and device management.
Dual programming models ensure efficient data handling and synchronization
The Maia programming model leverages asynchronous programming with semaphores for synchronization, enabling the overlap of computation with memory and network transfers. It operates with two execution streams: control processors issuing asynchronous commands via queues and hardware threads executing these commands, ensuring efficient data handling through semaphore-based synchronization.
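As a mental model of this programming style, the hypothetical sketch below uses ordinary Python threads to mimic a control stream enqueueing asynchronous commands while a worker "hardware thread" executes them, with a semaphore signaling when a transfer has completed so dependent compute can proceed. It models the concept only; the names and structure are not the Maia runtime API.

```python
import threading, queue

cmd_queue = queue.Queue()            # control processor -> hardware thread
copy_done = threading.Semaphore(0)   # signaled when the "DMA" finishes

def hardware_thread():
    """Executes commands issued asynchronously by the control stream."""
    while True:
        cmd, payload = cmd_queue.get()
        if cmd == "dma_copy":
            payload["dst"][:] = payload["src"]   # pretend DMA transfer
            copy_done.release()                   # signal completion
        elif cmd == "stop":
            return

worker = threading.Thread(target=hardware_thread, daemon=True)
worker.start()

# Control stream: issue the copy asynchronously, do unrelated work,
# then wait on the semaphore only when the data is actually needed.
src, dst = [1, 2, 3], [0, 0, 0]
cmd_queue.put(("dma_copy", {"src": src, "dst": dst}))
unrelated = sum(range(1000))          # overlapped "compute"
copy_done.acquire()                   # dependent compute waits here
print(dst, unrelated)
cmd_queue.put(("stop", None))
```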
To program Maia, developers can choose from two programming models: Triton, a popular open-source domain-specific language (DSL) for deep neural networks (DNNs) that simplifies coding and runs on both GPUs and Maia, or the Maia API, a Maia-specific custom programming model built for maximum performance with more detailed control. Triton requires fewer lines of code and handles memory and semaphore management automatically, while the Maia API requires more code and explicit management by the programmer.
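To give a feel for the Triton side of this choice, here is a minimal Triton kernel of the usual vector-add variety, written in standard Triton syntax with no Maia-specific extensions; per the description above, the same source that targets GPUs is what the Maia backend runs. Note how Triton handles the memory staging and synchronization that the Maia API would expose explicitly.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```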
Optimizing data flow with gather-based matrix multiplication
Maia uses a gather-based approach for large distributed General Matrix Multiplications (GEMMs), as opposed to an all-reduce-based approach. This offers several advantages: enhanced processing speed and efficiency by fusing the post-GEMM activation function (such as GELU) directly in SRAM; reduced idle time by overlapping computation with network communication; and reduced latency by sending quantized data over the network, leading to faster data transmission and better overall system performance.
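To see the difference in data flow, the toy numpy sketch below contrasts two shardings of Y = activation(X @ W) across n "devices": an all-reduce style split along the inner (K) dimension, which must sum partial outputs over the network before the activation can be applied, versus a gather-style split along the output (N) dimension, where each device gathers the input once and can fuse the activation locally on its own output shard; quantizing what is gathered is what shrinks the network traffic. The sharding math is standard tensor parallelism, not Maia-specific code.

```python
import numpy as np

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

n_dev, M, K, N = 4, 8, 64, 32
X = np.random.randn(M, K).astype(np.float32)
W = np.random.randn(K, N).astype(np.float32)
reference = gelu(X @ W)

# All-reduce style: shard the inner dimension K; every device produces a
# partial (M, N) output, and the partials must be summed (all-reduced)
# across the network before the activation can run.
partials = [X[:, k::n_dev] @ W[k::n_dev, :] for k in range(n_dev)]
allreduce_out = gelu(sum(partials))

# Gather style: shard the output dimension N; each device gathers the full
# input X once, computes only its output columns, and fuses the activation
# locally (on Maia, in SRAM) with no post-GEMM reduction needed.
gather_out = np.concatenate(
    [gelu(X @ W[:, d::n_dev]) for d in range(n_dev)], axis=1)

print(np.allclose(allreduce_out, reference, atol=1e-4))   # True
# Columns are interleaved by device, so compare against the same layout.
perm = np.concatenate([np.arange(N)[d::n_dev] for d in range(n_dev)])
print(np.allclose(gather_out, reference[:, perm], atol=1e-4))  # True
```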
Additionally, we leverage Static Random-Access Memory (SRAM) at the cluster level to buffer activations and intermediate results. Network reads and writes are also served directly from this cluster SRAM (CSRAM), which significantly reduces HBM reads and improves latency.
We further enhance performance by parallelizing computations across clusters and utilizing the Network On Chip (NOC) for on-chip activation gathering.
Optimizing workload performance with portability and flexibility
Key to Maia 100’s fungibility is its ability to execute PyTorch models against Maia with a single line of code. This is supported by a PyTorch backend, which operates in both eager mode for the optimal developer experience and graph mode for the best performance. Leveraging PyTorch with Triton, developers can optimize workload performance with complete portability and flexibility between hardware backends without sacrificing efficiency or the ability to target AI workloads.
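The exact spelling of that single line depends on how the backend is registered, so the snippet below is only a hypothetical illustration: it assumes the Maia SDK exposes a "maia" device string for eager mode and a torch.compile backend of the same name for graph mode, neither of which is confirmed by this post.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

# Eager mode: the hypothetical single line that retargets the model,
# assuming the backend registers a "maia" device type with PyTorch.
model = model.to("maia")                           # hypothetical device string

# Graph mode: compile the same model for best performance, assuming the
# SDK registers a torch.compile backend under a name such as "maia".
compiled = torch.compile(model, backend="maia")    # hypothetical backend name

x = torch.randn(8, 4096, device="maia")            # hypothetical device string
y = compiled(x)
```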
With its advanced architecture, comprehensive developer tools, and seamless integration with Azure, the Maia 100 is revolutionizing the way Microsoft manages and executes AI workloads. Through the algorithmic co-design of hardware with software, built-in hardware optionality for both model developers and custom kernel authors, and a vertically integrated design to optimize performance and improve power efficiency while reducing costs, Maia 100 offers a new option for running advanced cloud-based AI workloads on Microsoft’s AI infrastructure.