Microsoft has been working closely with NVIDIA to optimize the most popular models, such as the Meta Llama models, using NVIDIA TensorRT-LLM (TRT-LLM). This ongoing effort ensures that Azure AI Foundry customers benefit from state-of-the-art inference performance and increased cost efficiency while maintaining response quality.
Optimized Llama Models Now Available
The following Llama models have been optimized, delivering significant throughput and latency improvements:
- Llama 3.3 70B
- Llama 3.1 70B
- Llama 3.1 8B
- Llama 3.1 405B
These enhancements are applied automatically, so customers using Llama models from the model catalog in Azure AI Foundry will see improved performance seamlessly, with no additional steps or actions required.
Real-World Performance Gains
Synopsys has been using the optimized Llama models on Azure AI Foundry and has observed significant performance gains.
“At Synopsys, we rely on cutting-edge AI models to drive innovation, and the optimized Meta Llama models on Azure AI Foundry have delivered exceptional performance. We've seen substantial improvements in both throughput and latency, allowing us to accelerate our workloads while optimizing costs. These advancements make Azure AI Foundry an ideal platform for scaling AI applications efficiently.”
— Arun Venkatachar, VP Engineering, Synopsys Central Engineering
Real-world testing confirms these optimizations have led to significant throughput and latency improvements, making Llama models faster and more cost-efficient than ever before.
How NVIDIA TensorRT-LLM Powers These Gains
The Microsoft and NVIDIA collaboration has led to deep technical optimizations that enhance both performance and efficiency. Key innovations include:
🔹 GEMM SwiGLU Activation Plugin
- Fuses two General Matrix Multiplications (GEMMs) and the SwiGLU activation into a single kernel
- Boosts computational efficiency on NVIDIA Hopper GPUs
🔹 Reduce Fusion
- Combines ResidualAdd and LayerNorm operations after AllReduce into a single kernel
- Optimizes latency, especially for small batch sizes and high token-intensive workloads
🔹 User Buffer
- Eliminates unnecessary memory copies, improving inter-GPU communication performance
- Particularly effective for FP8 precision in large-scale Llama models
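To make the first of these fusions concrete, here is a minimal pure-Python sketch of what a fused GEMM + SwiGLU kernel computes. This is an illustration of the math only, not the actual TensorRT-LLM kernel: the real plugin runs both matrix multiplications and the activation in one GPU kernel, so the intermediate gate/up tensors never round-trip through global memory.

```python
import math

def silu(v):
    # SiLU (swish) activation: v * sigmoid(v)
    return v * (1.0 / (1.0 + math.exp(-v)))

def matvec(W, x):
    # Plain matrix-vector product: one "GEMM" in the unfused path
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_unfused(x, W_gate, W_up):
    # Two separate GEMMs, then the activation: three logical kernels,
    # with full intermediate tensors materialized in between
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def swiglu_fused(x, W_gate, W_up):
    # What a fused GEMM + SwiGLU kernel computes in a single pass:
    # each output element is produced directly, without materializing
    # the intermediate gate/up tensors
    out = []
    for row_g, row_u in zip(W_gate, W_up):
        g = sum(w * xi for w, xi in zip(row_g, x))
        u = sum(w * xi for w, xi in zip(row_u, x))
        out.append(silu(g) * u)
    return out
```

Both paths produce identical results; the performance difference on real hardware comes from eliminating the intermediate memory traffic and kernel-launch overhead, which this toy code cannot show.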
These low-level optimizations improve throughput, enhance GPU utilization, and enable inference workloads to run faster, consume fewer resources, and lower total cost of ownership (TCO).
Empowering Developers with Flexible, Enterprise-Ready LLM Inference
Azure AI Foundry eliminates infrastructure complexity, enabling developers to:
- Deploy optimized Llama models with serverless APIs
- Scale effortlessly with pay-as-you-go pricing
- Ensure enterprise-grade security for AI applications
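As a sketch of the serverless path, the snippet below builds a chat-completions request against a Foundry serverless endpoint using only the Python standard library. The endpoint URL, model name, and response shape here are illustrative placeholders; copy the actual endpoint and key from your deployment's details page, or use the Azure AI Inference SDK instead of raw HTTP.

```python
import json
import urllib.request

def build_chat_request(endpoint, api_key, model, prompt):
    # Construct (but do not send) a chat-completions request for an
    # Azure AI Foundry serverless deployment. The endpoint and model
    # values are placeholders to be replaced with your deployment's.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{endpoint}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Sending the request requires a live deployment, e.g.:
# req = build_chat_request("https://<your-endpoint>.models.ai.azure.com",
#                          "<api-key>", "Llama-3.3-70B-Instruct", "Hello!")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is pay-as-you-go, this is the entire client-side footprint: no GPU provisioning, model download, or engine build step is involved.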
For developers who prefer to manage their own models, Azure offers flexible NVIDIA-accelerated computing options depending on the level of abstraction they need to develop and deploy their applications. These include deploying models directly on Azure VMs or on Azure Kubernetes Service (AKS) using NVIDIA TensorRT-LLM for optimized performance. Additionally, developers can get enterprise-grade support for production deployments of NVIDIA TensorRT-LLM through NVIDIA AI Enterprise, available in the Azure Marketplace.
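For the self-managed path, a TensorRT-LLM engine is typically built ahead of time with `trtllm-build`, where the optimizations described earlier are exposed as build flags. The sketch below assumes a recent TensorRT-LLM release and an already-converted Llama checkpoint; flag names and defaults vary across versions, so check `trtllm-build --help` for your installation before running.

```shell
# Build a TensorRT engine for a Llama checkpoint with the fusions
# described above enabled (flag names per recent TensorRT-LLM releases):
#   --gemm_swiglu_plugin : fused GEMM + SwiGLU kernel (FP8 on Hopper)
#   --reduce_fusion      : fuse ResidualAdd + LayerNorm after AllReduce
#   --user_buffer        : avoid extra copies in multi-GPU communication
trtllm-build \
  --checkpoint_dir ./llama-3.1-8b-ckpt \
  --output_dir ./llama-3.1-8b-engine \
  --gemm_swiglu_plugin fp8 \
  --reduce_fusion enable \
  --user_buffer enable
```

The resulting engine directory can then be served from an Azure VM or an AKS pod with GPU nodes.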
In addition, at NVIDIA GTC, Microsoft and NVIDIA announced the integration of NVIDIA NIM with Azure AI Foundry, further expanding the choices available to developers:
- TensorRT-LLM is for model builders who want to customize, fine-tune, and optimize their own models
- NVIDIA NIM provides pre-optimized AI models with enterprise support for AI application developers
With Azure AI Foundry, businesses can scale seamlessly, reduce deployment costs, and maximize performance—whether they choose a fully managed MaaS solution or custom infrastructure deployment.
Try NVIDIA Optimized Llama Models Today
Try out the optimized Llama model APIs on Azure AI Foundry and experience transformational performance improvements firsthand.
Learn more about all the Microsoft Azure and NVIDIA announcements at NVIDIA GTC.
Updated Mar 18, 2025
Version 1.0
Sharmichock, Microsoft
AI - Machine Learning Blog