AI - Machine Learning Blog

Unlock Performance Gains with NVIDIA Inference Optimizations on Azure AI Foundry

Sharmichock
Microsoft
Mar 18, 2025

Microsoft has been working closely with NVIDIA to optimize popular models like the Meta Llama family using NVIDIA TensorRT-LLM (TRT-LLM). This ongoing effort ensures that Azure AI Foundry customers benefit from state-of-the-art inference performance improvements and increased cost efficiency while maintaining response quality. 

Optimized Llama Models Now Available 

The following Llama models have been optimized, delivering significant throughput and latency improvements: 

  • Llama 3.3 70B  
  • Llama 3.1 70B 
  • Llama 3.1 8B  
  • Llama 3.1 405B  

These enhancements are automatically applied, so customers using Llama models from the model catalog in Azure AI Foundry will experience improved performance seamlessly—no additional steps or actions required. 

 

Real-World Performance Gains 

Synopsys has been leveraging the optimized Llama models on Azure AI Foundry and has observed significant performance gains. 

“At Synopsys, we rely on cutting-edge AI models to drive innovation, and the optimized Meta Llama models on Azure AI Foundry have delivered exceptional performance. We've seen substantial improvements in both throughput and latency, allowing us to accelerate our workloads while optimizing costs. These advancements make Azure AI Foundry an ideal platform for scaling AI applications efficiently.”  
— Arun Venkatachar, VP Engineering, Synopsys Central Engineering 

Real-world testing confirms these optimizations have led to significant throughput and latency improvements, making Llama models faster and more cost-efficient than ever before. 

How NVIDIA TensorRT-LLM Powers These Gains 

The collaboration between Microsoft and NVIDIA has led to deep technical optimizations that enhance both performance and efficiency. Key innovations include: 

🔹 GEMM SwiGLU Activation Plugin  

  • Fuses two General Matrix Multiplications (GEMMs) and the SwiGLU activation into a single kernel 
  • Boosts computational efficiency on NVIDIA Hopper GPUs 
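The algebra behind this fusion can be sketched with a toy NumPy example. This is not TRT-LLM's actual CUDA kernel (which fuses the work into a single GPU pass); it only illustrates why computing both GEMMs together with the activation is mathematically equivalent to the unfused sequence. All function and variable names here are illustrative.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_unfused(x, w_gate, w_up):
    # Two separate GEMMs, then the SwiGLU activation.
    return silu(x @ w_gate) * (x @ w_up)

def swiglu_fused(x, w_gate, w_up):
    # One GEMM against the concatenated weights, then split and activate.
    # This mimics what a fused kernel does in a single pass over the data.
    w = np.concatenate([w_gate, w_up], axis=1)
    gate, up = np.split(x @ w, 2, axis=1)
    return silu(gate) * up

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate = rng.standard_normal((8, 16))
w_up = rng.standard_normal((8, 16))
assert np.allclose(swiglu_unfused(x, w_gate, w_up), swiglu_fused(x, w_gate, w_up))
```

On a GPU, the fused version saves a kernel launch and avoids writing the intermediate GEMM outputs back to memory, which is where the efficiency gain comes from.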

🔹 Reduce Fusion  

  • Combines ResidualAdd and LayerNorm operations after AllReduce into a single kernel 
  • Optimizes latency, especially for small batch sizes and token-intensive workloads 
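A minimal single-process sketch of the idea, with a NumPy sum standing in for the multi-GPU AllReduce: the fused path folds the residual add and LayerNorm into the reduction epilogue instead of running them as separate steps. This is illustrative only; the real optimization happens inside fused CUDA kernels.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def unfused(partials, residual):
    # Three separate steps: AllReduce (here, a sum over per-GPU partials),
    # ResidualAdd, then LayerNorm -- each a round trip through memory.
    reduced = np.sum(partials, axis=0)
    return layer_norm(reduced + residual)

def fused(partials, residual):
    # The fused version applies ResidualAdd + LayerNorm as part of the
    # reduction, touching the activations once.
    acc = np.zeros_like(residual)
    for p in partials:
        acc += p
    acc += residual
    return layer_norm(acc)

rng = np.random.default_rng(1)
partials = rng.standard_normal((4, 2, 8))  # 4 "GPUs", batch 2, hidden 8
residual = rng.standard_normal((2, 8))
assert np.allclose(unfused(partials, residual), fused(partials, residual))
```

The results are identical; the payoff is fewer kernel launches and memory round trips, which matters most at the small batch sizes called out above.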

🔹 User Buffer  

  • Eliminates unnecessary memory copies, improving inter-GPU communication performance 
  • Particularly effective for FP8 precision in large-scale Llama models 

These low-level optimizations improve throughput, enhance GPU utilization, and enable inference workloads to run faster, consume fewer resources, and lower total cost of ownership (TCO). 

 

Empowering Developers with Flexible, Enterprise-Ready LLM Inference 

Azure AI Foundry eliminates infrastructure complexities, enabling developers to: 

  • Deploy optimized Llama models with serverless APIs 
  • Scale effortlessly with pay-as-you-go pricing 
  • Ensure enterprise-grade security for AI applications 
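As a sketch of what a serverless deployment call looks like, the snippet below builds an OpenAI-style chat-completions request, which is the shape these endpoints expose. The endpoint URL and key are placeholders to be replaced with the values from your own deployment, and the payload shape should be checked against the current Azure AI model inference reference.

```python
import json

# Placeholders -- substitute the endpoint and key from your own
# Azure AI Foundry serverless Llama deployment.
ENDPOINT = "https://<your-deployment>.inference.ai.azure.com/chat/completions"
API_KEY = "<your-api-key>"

def build_request(prompt, max_tokens=256):
    # Builds the headers and JSON body for an OpenAI-style
    # chat-completions request (assumed payload shape).
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return headers, json.dumps(body)

headers, body = build_request("Summarize TensorRT-LLM in one sentence.")
```

From here, any HTTP client (or the Azure AI inference SDK) can POST the body to the endpoint; because the deployment is serverless, there is no cluster or GPU quota to manage on your side.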

For developers who prefer to manage their own models, Azure offers flexible NVIDIA-accelerated computing options depending on the level of abstraction they need to develop and deploy their applications. These include deploying models directly on Azure VMs or on Azure Kubernetes Service (AKS) using NVIDIA TensorRT-LLM for optimized performance. Additionally, developers can get enterprise-grade support for their production TensorRT-LLM deployments through NVIDIA AI Enterprise, which is available in the Azure Marketplace. 

In addition, at NVIDIA GTC, Microsoft and NVIDIA announced the integration of NVIDIA NIM with Azure AI Foundry, further expanding the choices available to developers: 

  • TensorRT-LLM is for model builders who want to customize, fine-tune, and optimize their own models 
  • NVIDIA NIM provides pre-optimized AI models with enterprise support for AI application developers 

With Azure AI Foundry, businesses can scale seamlessly, reduce deployment costs, and maximize performance—whether they choose a fully managed MaaS solution or custom infrastructure deployment. 

 

Try NVIDIA Optimized Llama Models Today 

Try out the optimized Llama model APIs on Azure AI Foundry and experience transformational performance improvements firsthand. 

Learn more about all the Microsoft Azure and NVIDIA announcements at NVIDIA GTC. 

 

Updated Mar 18, 2025
Version 1.0