Microsoft has been working closely with NVIDIA to optimize the most popular models, such as the Meta Llama models, using NVIDIA TensorRT-LLM (TRT-LLM). This ongoing effort ensures that Azure AI Foundry customers benefit from state-of-the-art inference performance and increased cost efficiency while maintaining response quality.
Optimized Llama Models Now Available
The following Llama models have been optimized, delivering significant throughput and latency improvements:
- Llama 3.3 70B
- Llama 3.1 70B
- Llama 3.1 8B
- Llama 3.1 405B
These enhancements are applied automatically, so customers using Llama models from the model catalog in Azure AI Foundry will see improved performance seamlessly, with no additional steps or actions required.
Real-World Performance Gains
Synopsys has been using the optimized Llama models on Azure AI Foundry and has observed significant performance gains.
“At Synopsys, we rely on cutting-edge AI models to drive innovation, and the optimized Meta Llama models on Azure AI Foundry have delivered exceptional performance. We've seen substantial improvements in both throughput and latency, allowing us to accelerate our workloads while optimizing costs. These advancements make Azure AI Foundry an ideal platform for scaling AI applications efficiently.”
— Arun Venkatachar, VP Engineering, Synopsys Central Engineering
Real-world testing confirms these optimizations have led to significant throughput and latency improvements, making Llama models faster and more cost-efficient than ever before.
How NVIDIA TensorRT-LLM Powers These Gains
The Microsoft and NVIDIA collaboration has led to deep technical optimizations that enhance both performance and efficiency. Key innovations include:
🔹 GEMM SwiGLU Activation Plugin
- Fuses two General Matrix Multiplications (GEMMs) and the SwiGLU activation into a single kernel
- Boosts computational efficiency on NVIDIA Hopper GPUs
🔹 Reduce Fusion
- Combines ResidualAdd and LayerNorm operations after AllReduce into a single kernel
- Optimizes latency, especially for small batch sizes and high token-intensive workloads
🔹 User Buffer
- Eliminates unnecessary memory copies, improving inter-GPU communication performance
- Particularly effective for FP8 precision in large-scale Llama models
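To make the first of these fusions concrete, here is a minimal pure-Python sketch of what a fused GEMM + SwiGLU kernel computes. This is an illustration of the math only, not the actual TensorRT-LLM kernel: the real plugin runs both matrix multiplications and the activation in one GPU kernel, so the intermediate gate/up tensors never round-trip through global memory.

```python
import math

def silu(v):
    # SiLU (swish) activation: v * sigmoid(v)
    return v * (1.0 / (1.0 + math.exp(-v)))

def matvec(W, x):
    # Plain matrix-vector product: one "GEMM" in the unfused path
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_unfused(x, W_gate, W_up):
    # Two separate GEMMs, then the activation: three logical kernels,
    # with full intermediate tensors materialized in between
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def swiglu_fused(x, W_gate, W_up):
    # What a fused GEMM + SwiGLU kernel computes in a single pass:
    # each output element is produced directly, without materializing
    # the intermediate gate/up tensors
    out = []
    for row_g, row_u in zip(W_gate, W_up):
        g = sum(w * xi for w, xi in zip(row_g, x))
        u = sum(w * xi for w, xi in zip(row_u, x))
        out.append(silu(g) * u)
    return out
```

Both paths produce identical results; the performance difference on real hardware comes from eliminating the intermediate memory traffic and kernel-launch overhead, which this toy code cannot show.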
These low-level optimizations improve throughput, enhance GPU utilization, and enable inference workloads to run faster, consume fewer resources, and lower total cost of ownership (TCO).
Empowering Developers with Flexible, Enterprise-Ready LLM Inference
Azure AI Foundry eliminates infrastructure complexity, enabling developers to:
- Deploy optimized Llama models with serverless APIs
- Scale effortlessly with pay-as-you-go pricing
- Ensure enterprise-grade security for AI applications
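As a sketch of the serverless path, the snippet below builds a chat-completions request against a Foundry serverless endpoint using only the Python standard library. The endpoint URL, model name, and response shape here are illustrative placeholders; copy the actual endpoint and key from your deployment's details page, or use the Azure AI Inference SDK instead of raw HTTP.

```python
import json
import urllib.request

def build_chat_request(endpoint, api_key, model, prompt):
    # Construct (but do not send) a chat-completions request for an
    # Azure AI Foundry serverless deployment. The endpoint and model
    # values are placeholders to be replaced with your deployment's.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{endpoint}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Sending the request requires a live deployment, e.g.:
# req = build_chat_request("https://<your-endpoint>.models.ai.azure.com",
#                          "<api-key>", "Llama-3.3-70B-Instruct", "Hello!")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is pay-as-you-go, this is the entire client-side footprint: no GPU provisioning, model download, or engine build step is involved.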
For developers who prefer to manage their own models, Azure offers flexible NVIDIA-accelerated computing options depending on the level of abstraction they need to develop and deploy their applications. These include deploying models directly on Azure VMs or on Azure Kubernetes Service (AKS) using NVIDIA TensorRT-LLM for optimized performance. Additionally, developers can get enterprise-grade support for production deployments of NVIDIA TensorRT-LLM through NVIDIA AI Enterprise, available in the Azure Marketplace.
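For the self-managed path, a TensorRT-LLM engine is typically built ahead of time with `trtllm-build`, where the optimizations described earlier are exposed as build flags. The sketch below assumes a recent TensorRT-LLM release and an already-converted Llama checkpoint; flag names and defaults vary across versions, so check `trtllm-build --help` for your installation before running.

```shell
# Build a TensorRT engine for a Llama checkpoint with the fusions
# described above enabled (flag names per recent TensorRT-LLM releases):
#   --gemm_swiglu_plugin : fused GEMM + SwiGLU kernel (FP8 on Hopper)
#   --reduce_fusion      : fuse ResidualAdd + LayerNorm after AllReduce
#   --user_buffer        : avoid extra copies in multi-GPU communication
trtllm-build \
  --checkpoint_dir ./llama-3.1-8b-ckpt \
  --output_dir ./llama-3.1-8b-engine \
  --gemm_swiglu_plugin fp8 \
  --reduce_fusion enable \
  --user_buffer enable
```

The resulting engine directory can then be served from an Azure VM or an AKS pod with GPU nodes.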
In addition, at NVIDIA GTC, Microsoft and NVIDIA announced the integration of NVIDIA NIM with Azure AI Foundry, further expanding the choices available to developers:
- TensorRT-LLM is for model builders who want to customize, fine-tune, and optimize their own models
- NVIDIA NIM provides pre-optimized AI models with enterprise support for AI application developers
With Azure AI Foundry, businesses can scale seamlessly, reduce deployment costs, and maximize performance—whether they choose a fully managed MaaS solution or custom infrastructure deployment.
Try NVIDIA Optimized Llama Models Today
Try out the optimized Llama model APIs on Azure AI Foundry and experience transformational performance improvements firsthand.
Learn more about all the Microsoft Azure and NVIDIA announcements at NVIDIA GTC.
Updated Mar 18, 2025
Version 1.0
Sharmichock, Microsoft
AI - Machine Learning Blog