Effective AI Governance with Azure

Microsoft

Jul 10, 2025

Why is AI Governance needed?

As organizations increasingly adopt AI in their cloud environments, effective governance is essential to ensure sustainability, security, and operational excellence. Without proper oversight, AI workloads can escalate costs, expose vulnerabilities, and struggle with resiliency under dynamic conditions. AI Governance provides a structured approach to managing AI investments, securing sensitive data, optimizing performance, and ensuring compliance with evolving regulations. By implementing governance best practices, enterprises can balance innovation with control, enabling AI-driven solutions to scale efficiently and responsibly. This blog explores key areas of AI Governance, including cost management, security, resiliency, operational optimization, and model oversight.

Five Pillars of AI Governance

Manage AI Costs

Choose the right billing model: For unpredictable usage, the pay-as-you-go model works best, while predictable workloads benefit from Provisioned Throughput Units (PTUs). Combining PTU endpoints with consumption-based endpoints helps reduce cost: PTUs cover the steady baseline load, while consumption-based endpoints absorb any overflow demand.
Choose the right model: Model selection should balance performance requirements against cost. Select less expensive models unless the use case demands a higher-cost option. During fine-tuning, make full use of each billing period you pay for, so that unused time does not turn into additional charges.
Reservations: By committing to a reservation for Provisioned Throughput Units (PTUs) over a period of one month or one year, you can realize savings. Most OpenAI models offer reservations, with discounts typically ranging from 30% to 60%.
Track and control token usage: The Generative AI Gateway helps manage costs by tracking and throttling token usage, applying circuit breakers, and routing requests to multiple AI endpoints. Incorporating a semantic cache can further optimize both performance and expenses when using LLMs. Additionally, setting model-based provisioning quotas ensures better cost control by preventing unnecessary usage.
Policies to shut down unused instances: Establish a policy requiring AI resources to enable the automatic shutdown feature on virtual machines and compute instances in Azure AI Foundry and Azure Machine Learning. This requirement applies to nonproduction environments and production workloads that can be taken offline periodically.
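The PTU-plus-consumption pattern described in this section can be sketched as a small routing rule. This is a minimal illustration, not how any Azure gateway is implemented: the class name `SpilloverRouter`, the labels "ptu" and "paygo", and the capacity figure are all assumptions, and a real gateway would typically react to throttling responses from the provisioned deployment rather than count tokens itself.

```python
from dataclasses import dataclass, field


@dataclass
class SpilloverRouter:
    """Route requests to a provisioned (PTU) endpoint first, then spill over.

    All names and numbers are hypothetical; they only illustrate the idea of
    filling reserved capacity before paying per token.
    """
    ptu_tokens_per_minute: int           # assumed capacity of the PTU deployment
    window_seconds: float = 60.0
    _usage: list = field(default_factory=list)  # (timestamp, tokens) sent to PTU

    def _used_in_window(self, now: float) -> int:
        # Drop entries older than the window, then sum what remains.
        self._usage = [(t, n) for t, n in self._usage if now - t < self.window_seconds]
        return sum(n for _, n in self._usage)

    def route(self, tokens: int, now: float) -> str:
        """Return "ptu" while capacity remains, otherwise "paygo"."""
        if self._used_in_window(now) + tokens <= self.ptu_tokens_per_minute:
            self._usage.append((now, tokens))
            return "ptu"
        return "paygo"


router = SpilloverRouter(ptu_tokens_per_minute=1000)
decisions = [router.route(400, now=t) for t in (0.0, 1.0, 2.0, 61.5)]
print(decisions)  # ['ptu', 'ptu', 'paygo', 'ptu']
```

The third request would exceed the assumed 1,000-token window and is sent to the consumption endpoint; once the window has rolled over, traffic returns to the provisioned capacity.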

Secure AI Workloads

AI threat protection: Defender for Cloud provides real-time monitoring of generative AI applications to detect security vulnerabilities. AI threat protection works with Azure AI Content Safety Prompt Shields and Microsoft's threat intelligence to identify risks such as data leakage, data poisoning, jailbreak attempts, and credential theft. Integration with Defender XDR enables security teams to centralize alerts for AI workloads within the Defender XDR portal.
Access and identity controls: Grant users the minimum access they need to centralized AI resources. Use managed identities across supported Azure AI services and restrict access to essential AI model endpoints only. Implement just-in-time access to enable temporary elevation of permissions when required. Disable local (key-based) authentication wherever workloads can authenticate with Microsoft Entra ID instead.
Key management: Azure AI services provide two API keys for each resource so that secrets can be rotated without downtime: one key stays in use while the other is regenerated. Regular rotation limits the exposure if a key leaks. Store all keys securely in Azure Key Vault.
Regulatory compliance: AI regulatory compliance involves utilizing industry-specific initiatives in Azure Policy and applying relevant policies for services like Azure AI Foundry and Azure Machine Learning. Compliance checklists designed for specific industries and locations, along with standards like ISO/IEC 23053:2022, assist in reviewing and confirming that AI workloads meet regulatory requirements.
Network security: Azure AI services use a layered security model to restrict access to specific networks. Configuring network rules ensures that only applications from designated networks can access the account. Access can be further filtered by IP addresses, ranges, or Azure Virtual Network subnets. When network rules are in effect, applications must be authorized using Microsoft Entra ID credentials or a valid API key.
Data security: Maintain strict data security boundaries by cataloging data to avoid feeding sensitive information to public-facing AI endpoints. Use legally licensed data for AI model grounding or training, and implement tools like Protected Material Detection to prevent copyright infringement. Establish version control for grounding data to track and revert changes, ensuring consistency and compliance across deployments. Regularly review outputs for intellectual property adherence. Tag sensitive information using Azure Information Protection.
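The two-key rotation mentioned under key management can be sketched as follows. This is a simulation under stated assumptions: `TwoKeyResource` is a stand-in class, the `vault` dictionary stands in for a secret store such as Azure Key Vault, and the secret name `ai-api-key` is hypothetical. No Azure SDK calls are made.

```python
import secrets


class TwoKeyResource:
    """Stand-in for a resource that exposes two interchangeable API keys."""

    def __init__(self) -> None:
        self.keys = {"key1": secrets.token_hex(16), "key2": secrets.token_hex(16)}

    def regenerate(self, name: str) -> str:
        # Replaces one key; the other keeps working, so callers see no outage.
        self.keys[name] = secrets.token_hex(16)
        return self.keys[name]

    def is_valid(self, key: str) -> bool:
        return key in self.keys.values()


def rotate(resource: TwoKeyResource, vault: dict, active: str) -> str:
    """Rotate without downtime and return the name of the new active key."""
    standby = "key2" if active == "key1" else "key1"
    # 1. Refresh the standby key, which no client is using.
    new_key = resource.regenerate(standby)
    # 2. Point clients at it by updating the stored secret.
    vault["ai-api-key"] = new_key
    # 3. Invalidate the previously active (possibly leaked) key.
    resource.regenerate(active)
    return standby


resource = TwoKeyResource()
vault = {"ai-api-key": resource.keys["key1"]}
old_key = vault["ai-api-key"]
active = rotate(resource, vault, active="key1")
print(active, resource.is_valid(vault["ai-api-key"]), resource.is_valid(old_key))
```

In practice, step 3 should wait until every client has picked up the new secret; the sketch performs the steps back to back only to keep the example short.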

Risk scenario	Risk impact	Resiliency mitigation example
Cyberattacks	Ransomware, distributed denial of service (DDoS), or unauthorized access.	To reduce impact, include robust security measures, including an appropriate backup and recovery process, in your adoption strategy and plan.
System failures	Hardware or software malfunctions.	Design for quick recovery and data integrity restoration. Handle transient faults in your applications, and provide redundancy in your infrastructure, such as multiple replicas with automatic failover.
Configuration issues	Deployment errors or misconfigurations.	Treat configuration changes as code changes by using infrastructure as code (IaC). Use continuous integration/continuous deployment (CI/CD) pipelines, canary deployments, and rollback mechanisms to minimize the impact of faulty updates or deployments.
Demand spikes or overload	Performance degradation during peak usage or spikes in traffic.	Use elastic scalability to ensure that systems automatically scale to handle an increased demand without disruption to service.
Compliance failures	Breaches of regulatory standards.	Adopt compliance tools like Microsoft Purview and use Azure Policy to enforce compliance requirements.
Natural disasters	Datacenter outages caused by earthquakes, floods, or storms.	Plan for failover, high availability, and disaster recovery by using availability zones, multiple regions, or even multicloud approaches.
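The "handle transient faults" mitigation in the table above is commonly implemented as retry with exponential backoff and jitter. The following is a minimal sketch; `TransientError` is a hypothetical exception type standing in for whatever your client raises on timeouts or throttling, and the delay values are illustrative defaults rather than recommendations.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (timeouts, throttling)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the fault to the caller
            # Double the delay cap on each attempt, then scale it by a random
            # factor so many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(cap * rng())


# Usage: an operation that fails twice and then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("temporarily unavailable")
    return "ok"


delays = []
result = call_with_retry(flaky, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)  # ok [0.5, 1.0]
```

The `sleep` and `rng` parameters are injected only so the behavior can be checked without waiting; production code would leave them at their defaults.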

Resilience for AI Platforms

Deploy AI landing zones: AI landing zones are pre-designed, scalable environments that provide a structured foundation for deploying AI workloads in Azure. They integrate various Azure services to ensure governance, compliance, security, and operational efficiency. AI landing zones help streamline AI deployments while maintaining best practices for scalability and performance.
Reliable scaling strategy: AI applications require effective scaling strategies, such as autoscaling and automatic scaling. Autoscaling operates on predefined threshold rules, whereas automatic scaling uses intelligent algorithms that adapt resources to learned usage patterns.
Disaster recovery planning: Disaster recovery planning is a critical component of business continuity. It requires High Availability (HA) and Disaster Recovery (DR) techniques for your AI endpoints and AI data: deploy zonal services within a region to ensure HA, and provision instances in a secondary region to enable effective DR.
Building global resilience: Global deployment optimizes capacity utilization and throughput for generative AI by accessing distributed pools across regions. Intelligent routing prioritizes less busy instances, ensuring processing efficiency and reliability. Azure API Management (APIM) in the Premium tier supports resilient global deployments, maintaining a single endpoint for seamless failover and enhanced scalability without burdening applications.
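The single-endpoint failover idea in this section can be illustrated with a small priority-based client that skips a failing region for a cooldown period (a simple circuit breaker). This is a sketch under assumptions: the backends are plain Python callables, the region names are placeholders, and a gateway such as APIM would do this work in front of the application rather than inside it.

```python
class FailoverClient:
    """Try regional backends in priority order behind one entry point.

    A backend that fails is skipped for `cooldown` seconds. Backends are
    plain callables here; all names are hypothetical.
    """

    def __init__(self, backends, cooldown=30.0):
        self.backends = backends            # list of (name, callable)
        self.cooldown = cooldown
        self.open_until = {}                # name -> time the circuit closes

    def send(self, prompt, now):
        for name, call in self.backends:
            if self.open_until.get(name, 0.0) > now:
                continue                    # circuit open: skip this region
            try:
                return name, call(prompt)
            except ConnectionError:
                self.open_until[name] = now + self.cooldown
        raise RuntimeError("no healthy backend available")


calls = {"primary": 0}


def failing_primary(prompt):
    calls["primary"] += 1
    raise ConnectionError("region unavailable")


def secondary(prompt):
    return f"echo: {prompt}"


client = FailoverClient([("region-a", failing_primary), ("region-b", secondary)])
first = client.send("hello", now=0.0)    # primary fails, secondary answers
second = client.send("hello", now=10.0)  # primary skipped while circuit is open
print(first, second, calls["primary"])
```

Because the circuit stays open for 30 seconds in this sketch, the second request never touches the failing region, which keeps failover latency low for callers.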

Optimizing AI Operations

Latency: With generative AI, inferencing time far outweighs network latency, making network time negligible in overall operations. A global deployment, leveraging intelligent routing to identify less busy capacity pools worldwide, ensures faster processing by utilizing idle resources effectively. This approach transforms traditional latency considerations, emphasizing the scalability and efficiency of globally distributed models over proximity. Additionally, seasonal differences across regions further enhance the potential for optimized performance.
Capacity and throughput: Global deployments optimize capacity and throughput by accessing larger pools and leveraging intelligent routing to direct requests to less busy instances, ensuring faster processing and quota fulfillment. Data Zones balance broader capacity access with compliance for regions with sovereignty needs, while Provisioned Throughput Units (PTUs) can further improve utilization by dynamically managing token distribution across pools for maximum efficiency. Standard options remain limited and may restrict throughput under heavy demand.
AI observability: GenAI observability encompasses monitoring model performance, capacity utilization, token throughput, and compliance across distributed systems. It tracks token utilization to ensure efficient distribution and optimize throughput, which is especially relevant for provisioned capacity such as PTUs. General observability features include latency tracking, resource allocation insights, error rate monitoring, and proactive alerting, enabling seamless operations and adherence to data sovereignty requirements while maximizing performance and efficiency.

Azure OpenAI observability metrics

Category	Metric	Unit	Dimensions	Aggregation	Description
HTTP Requests	Total Request Count	Count	Endpoint, API Operation, Region	Sum	Tracks the total number of HTTP requests made to the Azure OpenAI endpoints.
	Failed Requests	Count	Status Code, Region, API Operation	Sum	Monitors the count of requests resulting in errors (e.g., 4xx, 5xx response codes).
	Request Rate	Requests/second	Endpoint, Region	Average	Measures the rate of incoming requests to analyze traffic patterns.
Latency	Request Latency	Milliseconds (ms)	Endpoint, Region, API Operation	Average, Percentiles (50th, 90th, 99th)	Captures the average response time of requests, broken down by endpoint or API call.
	Response Time Percentiles	Milliseconds (ms)	Endpoint, Region, API Operation	Percentiles (50th, 90th, 99th)	Identifies outliers or slow responses in terms of latency across different percentiles.
Usage	Token Utilization	Tokens	API Key, Region, Instance Type	Sum, Average	Tracks the number of tokens processed (prompt and completion) to monitor quota usage.
	Throttled Requests	Count	API Key, Region	Sum	Counts requests delayed or rejected due to throttling or quota limits.
Actions	Cache Hits/Misses	Count	Cache Type, Region, Endpoint	Ratio (Hits vs Misses), Sum	Monitors the efficiency of semantic or prompt caching to optimize token usage.
	Request Routing Efficiency	Percentage (%)	Region, Capacity Pool	Average	Tracks the accuracy of routing requests to the least busy capacity pool for better processing.
	Throughput	Tokens/second	Endpoint, Region	Sum, Average	Measures successfully processed tokens or requests per second to ensure capacity optimization.
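Several of the metrics in the table can be derived from raw request logs. The sketch below shows one way to compute them; the `RequestLog` record shape and the field names in the summary are assumptions made for illustration and do not correspond to Azure Monitor metric names.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    status: int        # HTTP status code
    latency_ms: float
    tokens: int        # prompt + completion tokens


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(logs, window_seconds):
    latencies = [r.latency_ms for r in logs]
    ok = [r for r in logs if r.status < 400]
    return {
        "total_requests": len(logs),
        "failed_requests": len(logs) - len(ok),
        "throttled_requests": sum(r.status == 429 for r in logs),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p90_ms": percentile(latencies, 90),
        "tokens_per_second": sum(r.tokens for r in ok) / window_seconds,
    }


logs = [
    RequestLog(200, 120.0, 300),
    RequestLog(200, 180.0, 500),
    RequestLog(429, 15.0, 0),
    RequestLog(200, 950.0, 1200),
    RequestLog(500, 60.0, 0),
]
summary = summarize(logs, window_seconds=10)
print(summary)
```

For this sample the median latency is 120 ms while the 90th percentile is 950 ms, which shows why the table lists percentiles alongside averages: a single slow request is invisible in the median but dominates the tail.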

Govern AI Models

Control the models: Azure Policy can be used to control which models teams are permitted to deploy from the Azure AI Foundry catalog. Organizations are advised to start with audit mode, which monitors model usage without restricting deployments. Transitioning to deny mode should only occur after thoroughly understanding workload teams’ development needs to avoid unnecessary disruption. It’s important to note that deny mode does not automatically remove noncompliant models already deployed, and these must be addressed manually.
Evaluating models: Evaluation is a critical aspect of the generative AI lifecycle, ensuring models meet accuracy, performance, security, and ethical standards while mitigating biases and validating robustness before deployment. It plays a role at every stage, from selecting the base model to pre-production validation and post-production monitoring. Azure provides several tools to support systematic evaluation, including Azure AI Foundry, which offers built-in metrics for assessing AI model performance. The Evaluation API in Azure OpenAI Service enables automated quality checks by integrating evaluations into CI/CD pipelines. Additionally, organizations can leverage Azure DevOps and GitHub Actions to conduct bulk evaluations, ensuring AI models remain compliant, optimized, and trustworthy throughout their lifecycle.
Content filters for models: Organizations are advised to define baseline content filters for generative AI models using Azure AI Content Safety. This system evaluates both prompts and completions through classification models that identify and mitigate harmful content across various categories. Key features include prompt shields, groundedness detection, protected material detection, and scanning of both images and text. Establishing a process for application teams to communicate governance needs ensures alignment and comprehensive oversight of safety measures.
Ground AI models: To effectively manage generative AI output, utilize system messages and the retrieval augmented generation (RAG) pattern to ensure responses are grounded and reliable. Test grounding techniques using tools like prompt flow for structured workflows or the open-source red teaming framework PyRIT to identify potential vulnerabilities. These strategies help refine model behavior and maintain alignment with governance requirements.
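The audit-versus-deny behavior described under "Control the models" can be modeled in a few lines. This sketch only mimics the semantics stated in the text; it is not the Azure Policy engine, and the function names and model names (`model-a`, `model-b`, `model-c`) are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    compliant: bool
    note: str


def evaluate_deployment(model: str, approved: set, effect: str) -> Decision:
    """Mimic audit vs. deny semantics for a model allowlist (illustrative only)."""
    if effect not in {"audit", "deny"}:
        raise ValueError(f"unsupported effect: {effect}")
    if model in approved:
        return Decision(True, True, "model is on the approved list")
    if effect == "audit":
        # Audit records the violation but lets the deployment proceed.
        return Decision(True, False, "noncompliant deployment logged")
    return Decision(False, False, "deployment blocked")


def existing_noncompliant(deployed: list, approved: set) -> list:
    # Deny does not remove what is already deployed, so list it for cleanup.
    return [m for m in deployed if m not in approved]


approved = {"model-a", "model-b"}
audit_result = evaluate_deployment("model-c", approved, "audit")
deny_result = evaluate_deployment("model-c", approved, "deny")
leftovers = existing_noncompliant(["model-a", "model-c"], approved)
print(audit_result)
print(deny_result)
print(leftovers)  # ['model-c']
```

Running in audit mode first produces the list of noncompliant deployments without blocking anyone, which is the information a team needs before switching the effect to deny.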

Updated May 14, 2025

Version 1.0

Microsoft Foundry Blog