Startups at Microsoft

Azure OpenAI best practices: A quick-reference guide to optimize your deployments

rmmartins, Microsoft
Apr 11, 2025

As organizations increasingly integrate Azure OpenAI into their applications, it's essential to be aware of the comprehensive best practices that Microsoft has published. However, these valuable resources are often dispersed across various documentation pages, making it challenging to access them efficiently.​

This quick-reference guide consolidates the key best practices for deploying and managing Azure OpenAI workloads. By bringing together architectural considerations, security measures, governance strategies, networking configurations, and more, this guide aims to provide a centralized resource to help you optimize your Azure OpenAI deployments effectively.

Architectural considerations

A robust architecture is the foundation of any successful Azure OpenAI deployment. Azure's Well-Architected Framework provides guidance to design and implement solutions that are reliable, secure, and efficient.​

Key recommendations:

  • Design for scalability: Utilize Azure's scalable services to handle varying loads, ensuring consistent performance during peak times.​
  • Optimize cost: Monitor and manage resources to avoid unnecessary expenditures. Implement auto-scaling and choose appropriate pricing tiers based on workload demands.​

Example: An e-commerce platform using Azure OpenAI for personalized recommendations can leverage auto-scaling to handle increased traffic during sales events, ensuring users receive timely suggestions without over-provisioning resources.​
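The scaling decision described above can be sketched as a simple capacity calculation. This is an illustrative model only — the per-instance capacity and instance limits are hypothetical values, not Azure defaults:

```python
import math

def desired_instances(requests_per_min: int, per_instance_capacity: int = 600,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Return the instance count needed for the current load, clamped to limits.

    per_instance_capacity, min_instances, and max_instances are illustrative
    assumptions; real autoscale rules would come from load testing.
    """
    needed = math.ceil(requests_per_min / per_instance_capacity)
    return max(min_instances, min(max_instances, needed))
```

During a sales event at 2,500 requests/minute, this sketch would scale to 5 instances; at idle it holds the floor of 1, avoiding over-provisioning.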

For detailed architectural guidance, refer to the Architecture Best Practices for Azure OpenAI Service.

Security best practices

Protecting sensitive data and ensuring compliance are paramount when deploying AI solutions. Azure provides a comprehensive security baseline tailored for Azure OpenAI services.​

Key recommendations:

  • Data encryption: Implement encryption for data at rest and in transit to safeguard against unauthorized access.​
  • Access controls: Utilize Azure's Role-Based Access Control (RBAC) to restrict access to AI resources, ensuring only authorized personnel can interact with sensitive data.​

Example: A healthcare provider deploying Azure OpenAI for patient diagnostics should encrypt patient data and restrict access based on roles, ensuring compliance with regulations like HIPAA.​
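The RBAC principle above — each role gets only the actions it needs — can be illustrated with a minimal permission check. The role and permission names here are hypothetical, chosen to mirror the healthcare example:

```python
# Illustrative role-to-permission mapping (names are hypothetical examples).
ROLE_PERMISSIONS = {
    "clinician": {"read_diagnostics"},
    "data_engineer": {"read_diagnostics", "manage_deployment"},
}

def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles and unlisted actions are rejected."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

In Azure itself this mapping is expressed as RBAC role assignments rather than application code, but the deny-by-default shape is the same.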

For comprehensive security guidelines, consult the Azure Security Baseline for Azure OpenAI.

Governance strategies

Effective governance ensures that AI deployments align with organizational policies and regulatory requirements. Azure's governance recommendations provide a framework for managing AI resources.​

Key recommendations:

  • Resource tagging: Implement consistent tagging for AI resources to facilitate tracking, management, and cost allocation.​
  • Policy enforcement: Use Azure Policy to enforce organizational standards and assess compliance across AI resources.​

Example: A company can use resource tagging to allocate AI resource costs to specific departments, ensuring transparency and accountability.​
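The cost-allocation idea above can be sketched as a roll-up of resource costs by a `department` tag. The resource records and tag key are illustrative, not an Azure Cost Management API:

```python
from collections import defaultdict

def cost_by_department(resources: list[dict]) -> dict[str, float]:
    """Sum monthly cost per 'department' tag; untagged resources surface
    separately so gaps in the tagging policy are visible."""
    totals: dict[str, float] = defaultdict(float)
    for resource in resources:
        dept = resource.get("tags", {}).get("department", "untagged")
        totals[dept] += resource["monthly_cost"]
    return dict(totals)
```

Surfacing an explicit `untagged` bucket is deliberate: it turns tagging-policy violations into a visible line item instead of silently misallocated spend.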

For detailed governance strategies, refer to the Governance Recommendations for AI Workloads on Azure.

Networking considerations

Efficient and secure networking is crucial for AI workloads, especially when dealing with large datasets and real-time processing. Azure offers networking recommendations tailored for AI services.​

Key recommendations:

  • Virtual networks (VNet): Isolate AI resources within VNets to enhance security and control traffic flow.​
  • Private endpoints: Use private endpoints to connect securely to AI services, reducing exposure to the public internet.​

VNet Connectivity Patterns:

When you need AI resources in two VNets to talk to each other, there are two primary approaches:

1. Gateway‑to‑Gateway VPN

  • Encryption: Built‑in IPsec/IKE tunnel, ensuring all traffic is encrypted in transit.
  • Transit Support: Enables hub‑and‑spoke or multi‑region topologies without mesh peerings — just connect each spoke to a central transit VNet.
  • When to use: Regulated workloads, cross‑region connectivity, or any scenario demanding IPsec in-flight encryption.

2. VNet Peering

  • Performance: Lowest latency over Microsoft's backbone network.
  • Cost: No gateway data‑processing charges (peering is metered only on data egress).
  • When to use: VNets in the same region/tenant, where encryption‑tunnel overhead isn’t required and you want simplicity and speed.

Note:

  • Peering is non‑transitive by default: A↔B and B↔C peerings don't automatically connect A to C. To achieve transit, enable gateway transit on your peerings or route through a VPN hub.
  • If you require both low latency and encrypted traffic, you can combine peering for the data path with Azure Route Server and NVA‑based IPsec, or stick with VPN for simplicity.
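The non-transitivity rule in the note above is easy to misread, so here is a minimal sketch of it: only explicitly peered VNet pairs can talk, and chaining two peerings does not create a third connection:

```python
def directly_connected(peerings: set[tuple[str, str]], a: str, b: str) -> bool:
    """VNet peering is non-transitive: only explicitly peered pairs connect,
    in either direction. A-B plus B-C does NOT imply A-C."""
    return (a, b) in peerings or (b, a) in peerings

# Hub-and-spoke-style illustration: A and C are each peered to hub B only.
peerings = {("A", "B"), ("B", "C")}
```

To make A reach C you would add an explicit A↔C peering, enable gateway transit on the hub, or route through a VPN hub, as the note describes.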

Example: A financial institution processing real-time transactions with AI can use VNets and private endpoints to ensure data remains within a secure network boundary, mitigating risks of data breaches.

For comprehensive networking guidelines, consult the Networking Recommendations for AI Workloads on Azure.

Quota management and optimization

Azure imposes quotas to manage resource usage effectively. Understanding and optimizing these quotas ensures uninterrupted AI operations.

Key recommendations:

  • Monitor usage: Regularly monitor token usage and request rates to stay within allocated quotas.
  • Request increases proactively: If approaching quota limits, request increases in advance to avoid service disruptions.

Example: A chatbot service experiencing increased user interactions should monitor token usage and anticipate quota adjustments to maintain seamless user experiences.

For detailed quota management, refer to the Azure OpenAI Service quotas and limits documentation.
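The "request increases proactively" recommendation implies an early-warning check rather than waiting for throttling errors. A minimal sketch, with an assumed 80% alert threshold:

```python
def quota_alert(tokens_used: int, quota: int, threshold: float = 0.8) -> bool:
    """Flag when usage crosses a fraction of the quota, so a quota-increase
    request can be filed before requests start getting throttled.

    The 0.8 threshold is an assumption; tune it to how quickly your usage grows.
    """
    return tokens_used >= threshold * quota
```

Wiring this to your telemetry (e.g., token counts from Azure Monitor metrics) turns quota exhaustion from an outage into a routine capacity request.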

Provisioned throughput units (PTUs)

For workloads requiring consistent and predictable performance, Azure offers Provisioned Throughput Units (PTUs).​

Key recommendations:

  • Assess workload needs: Determine if PTUs align with your workload's performance requirements and cost considerations.​
  • Plan for scalability: Allocate PTUs based on anticipated growth, ensuring the AI system can handle increased demand.​
  • Monitor utilization: Regularly monitor PTU utilization to ensure optimal performance and cost-effectiveness.​

Example: A streaming service using Azure OpenAI for content recommendations can deploy PTUs to guarantee consistent performance during peak viewing times.​
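Sizing a PTU purchase for the scenario above comes down to rounding peak demand up to whole units. The throughput-per-PTU figure varies by model and is supplied here as an input, not a real number:

```python
import math

def ptus_needed(peak_tokens_per_min: int, tokens_per_min_per_ptu: int,
                min_increment: int = 1) -> int:
    """Round peak throughput demand up to whole PTUs.

    tokens_per_min_per_ptu is model-dependent; check current Azure OpenAI
    docs for your model rather than hard-coding a value.
    """
    units = math.ceil(peak_tokens_per_min / tokens_per_min_per_ptu)
    return max(min_increment, units)
```

Sizing against *peak* viewing-time demand (with monitoring to confirm utilization) is what gives PTUs their predictable-performance guarantee; sizing against the average would leave peaks throttled.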

For detailed information on PTUs, refer to the Provisioned Throughput Units (PTUs) in Azure OpenAI Service.

Monitoring and logging

Comprehensive monitoring and logging are vital for maintaining the health and performance of AI systems. Azure provides tools to monitor AI services effectively.​

Key recommendations:

  • Enable diagnostic logs: Capture detailed logs for troubleshooting and performance analysis.​
  • Set up alerts: Configure alerts for anomalies or performance degradation to enable proactive responses.​
  • Utilize Azure Monitor: Collect, analyze, and act on telemetry data from your Azure OpenAI resources.​

Example: An online retailer using Azure OpenAI for customer support chatbots can set up alerts to detect unusual spikes in response times, allowing for immediate investigation and resolution.​
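The spike-detection alert in the retailer example can be sketched as a z-score check over recent response times. The threshold of 3 standard deviations is an illustrative assumption, not an Azure Monitor default:

```python
from statistics import mean, stdev

def is_spike(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag a latency sample that sits more than z_threshold standard
    deviations above the recent mean. Needs at least two history points."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest > mu
    return (latest - mu) / sigma > z_threshold
```

In practice you would let an Azure Monitor metric alert (static or dynamic threshold) do this evaluation for you; the sketch just shows the logic such an alert encodes.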

For comprehensive monitoring guidelines, consult the Monitor Azure OpenAI Service documentation.

Multi-region gateway deployment strategy for Azure OpenAI

To enhance reliability, latency, and resilience for geographically distributed Azure OpenAI users, a multi-region API gateway architecture is strongly recommended. This has become a key focus for engineering teams and field specialists, and for good reason: regional outages, high traffic scenarios, or backend limitations can impact availability. A well-architected gateway setup helps mitigate these issues.

Why This Matters

  • You can route requests intelligently across multiple Azure OpenAI deployments or models.
  • You minimize latency by serving traffic from the closest region.
  • You reduce single points of failure and improve your disaster recovery posture.

Implementation Patterns

There are two main patterns for implementing this in production:

Option 1: Azure API Management Premium – Multi-region deployment (recommended for enterprise scale)

This option leverages Azure API Management's built-in multi-region deployment capability, available with the Premium tier.

Benefits:

  • Replicates the gateway component across multiple Azure regions.
  • Traffic is automatically routed to the nearest regional gateway based on latency.
  • Ensures localized access points and high availability in case of regional failures.

Considerations:

  • Requires Premium tier (higher cost).
  • Management plane and developer portal remain in the primary region.

Option 2: Standard tier APIM with external load balancer (cost-effective alternative)

If Premium tier is not feasible, you can deploy separate APIM instances (Standard tier or higher) in each region and use a global load balancer like Azure Front Door or Traffic Manager to distribute traffic.

Steps:

  1. Deploy multiple APIM instances independently in different regions.
  2. Use Azure Front Door or Traffic Manager to route traffic based on geo-proximity or latency.
  3. Maintain consistent configuration across all APIM instances.

Trade-offs:

  • No built-in multi-region replication; manual config sync needed.
  • More flexible cost-wise and supports gradual scaling.
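Both options ultimately perform the same routing decision: send each request to the lowest-latency healthy gateway. A minimal sketch of that selection, with hypothetical region names and latency figures:

```python
def pick_region(latencies_ms: dict[str, float], healthy: set[str]) -> str:
    """Choose the lowest-latency region among those currently healthy,
    mimicking the latency-based routing Front Door / APIM perform."""
    candidates = {r: lat for r, lat in latencies_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)
```

With `{"eastus": 25, "westeurope": 90}` and both regions healthy, traffic goes to eastus; mark eastus unhealthy and the same call fails over to westeurope — the behavior both deployment patterns are buying you.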

Additional strategies to strengthen resilience

  • Multi-backend gateway pattern: Configure your APIM to route requests to different OpenAI deployments/models based on performance, availability, or workload type.​
  • Backbone connectivity: Use gateways that carry traffic over the Microsoft global network backbone to improve performance and reduce exposure to public internet routing.​
  • Business continuity & disaster recovery (BCDR): Integrate failover rules, caching, and retry policies to ensure seamless experiences during disruptions.
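The retry-and-failover policy mentioned in the BCDR bullet can be sketched as ordered backends with bounded retries. The backend list and attempt count are illustrative; a production APIM policy would express this declaratively:

```python
from typing import Callable

def call_with_failover(backends: list[str], send: Callable[[str], str],
                       max_attempts_per_backend: int = 2) -> str:
    """Try each backend in priority order, retrying transient failures a
    bounded number of times, before moving to the next backend."""
    last_err: Exception | None = None
    for backend in backends:
        for _ in range(max_attempts_per_backend):
            try:
                return send(backend)
            except ConnectionError as err:
                last_err = err
    raise RuntimeError("all backends failed") from last_err
```

Bounding retries per backend matters: unlimited retries against a down region would add latency for every request instead of failing over quickly.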

Example: A multinational company deploying Azure OpenAI for internal employee support creates deployments in East US, West Europe, and Southeast Asia. They set up regional APIM gateways using the Premium tier and route traffic intelligently through Azure Front Door. If the East US region is unavailable, users are routed to West Europe automatically — with minimal latency impact — ensuring uptime and productivity.

Bonus: Download the full Azure OpenAI review checklist

If you're looking for a structured way to assess your Azure OpenAI implementation, the Azure Review Checklists project now provides a comprehensive AI Landing Zone checklist with 180+ best-practice items covering every critical area: Governance, Operations, Networking, Identity, Cost Management, and Business Continuity & Disaster Recovery (BCDR):

  1. Download the official Review Checklist Excel workbook.
  2. Select AI Landing Zone and click Import latest checklist.
  3. Load the AI Landing Zone checklist and explore categorized recommendations with direct reference links to Microsoft documentation.

This checklist serves as a powerful tool to validate architecture decisions, uncover gaps, and guide implementation discussions across technical and governance domains.

Conclusion

By adhering to these best practices, organizations can effectively manage and secure their Azure OpenAI workloads, ensuring they are reliable, efficient, and aligned with industry standards.

Updated Apr 17, 2025
Version 12.0