Updated 5/21/2024 for the latest GenAI features in Azure API Management
Azure OpenAI Service is a cloud-based platform that enables you to access and use powerful AI models such as GPT-4, DALL-E, and more. These models can generate natural language, images, code, and other types of content based on your inputs and parameters. However, to use these models effectively, you need to consider the quotas and limits that apply to your Azure OpenAI resources, such as tokens per minute, requests per minute, and provisioned throughput units. You also need to handle the possible failures and errors that might occur when calling these models, such as network issues, timeouts, or service unavailability.
In this blog post, we will show you how to use Azure API Management to improve the resiliency and capacity of your Azure OpenAI Service. Azure API Management is a service that helps you create, publish, manage, and secure APIs. It provides features such as routing, caching, throttling, authentication, transformation, and more. By using Azure API Management, you can:
- Load balance requests to multiple instances of the Azure OpenAI Service using the round-robin load balancing technique. This can help you distribute the load across different resources and regions and increase the availability and performance of your service.
- Implement the circuit breaker pattern to protect your backend service from being overwhelmed by excessive requests. This can help you prevent cascading failures and improve the stability and resiliency of your service. You can configure the circuit breaker property in the backend resource, and define rules for tripping the circuit breaker, such as the number or percentage of failure conditions within a defined time interval and a range of status codes indicating failures.
You can find the code for this article in this GitHub repository.
Circuit breaker pattern
The circuit breaker pattern is a way to prevent an application from performing an operation that is likely to fail, such as calling an external API that is overloaded or unavailable. The circuit breaker monitors the success or failure of the operation and can open or close the circuit accordingly. When the circuit is open, the operation is not performed, and a fallback response is returned instead. When the circuit is closed, the operation is performed as normal. This way, the circuit breaker can avoid wasting resources and improve the user experience by reducing latency and errors.
API Management exposes a circuit breaker property on the backend resource to protect a backend service from being overwhelmed by too many requests. When the acceptRetryAfter value is set, it can trip the circuit dynamically for the period specified in the retry-after header returned by the backend.
Azure OpenAI enforces rate limiting on invocations, and requests receive a 429 response code when the service is rate limiting. The Azure OpenAI backends configured in API Management can be set up to break the circuit whenever rate limiting is observed; while the circuit is open, the backend is marked as unhealthy.
Below is a sample configuration that breaks the circuit, for the period the backend specifies in the retry-after header, whenever a 429 error occurs on the backend:
resource aoaiBackend1 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
  parent: apimService // existing API Management service (see the sketch below)
  name: 'aoai-backend-1'
  properties: {
    url: 'https://<openai endpoint>'
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          failureCondition: {
            count: 1
            interval: 'PT10S'
            statusCodeRanges: [
              {
                min: 429
                max: 429
              }
            ]
          }
          name: 'myBreakerRule'
          tripDuration: 'PT10S'
          acceptRetryAfter: true // honor the backend's retry-after header
        }
      ]
    }
  }
}
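The backend above is declared as a child of an API Management service, so the template also needs a reference to that service. Below is a minimal sketch, assuming an existing APIM instance; the symbolic name apimService and the service name my-apim-instance are placeholders:
// Hypothetical reference to the existing API Management instance
// that the backends in this article attach to.
resource apimService 'Microsoft.ApiManagement/service@2023-05-01-preview' existing = {
  name: 'my-apim-instance' // placeholder: your APIM service name
}
The same reference can be reused by the load-balancer pool resource shown later in this article.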
Load balance requests to multiple instances of the Azure OpenAI Service
To load balance requests to multiple instances of the Azure OpenAI Service, you need to create a backend resource for each instance and then use the load-balancing policy to distribute requests among them. The load-balancing policy uses the round-robin algorithm to select a backend resource for each request. Alternatively, you can use priority or weighted load balancing, or a combination of the two.
Create a backend resource for each Azure OpenAI endpoint, with circuit breaking enabled for when the endpoint throttles. A backend configured with circuit breaking is marked unhealthy while its circuit is open; load balancing ignores unhealthy backends and therefore skips the OpenAI endpoints that are throttling.
Below is a sample configuration that load balances multiple backend resources. Backends with the lowest priority number are tried first; within the same priority group, traffic is split according to the weights (3:1 here between aoai-backend-1 and aoai-backend-2), and aoai-backend-3 receives traffic only when the priority 1 backends are unhealthy:
resource aoaiLbPool 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
  parent: apimService // same API Management service as above
  name: 'aoai-lb-pool'
  properties: {
    description: 'Load balance openai instances'
    type: 'Pool'
    protocol: 'http'
    url: 'https://does-not-matter'
    pool: {
      services: [
        {
          id: '/backends/aoai-backend-1'
          priority: 1 // lower number = higher precedence
          weight: 3
        }
        {
          id: '/backends/aoai-backend-2'
          priority: 1
          weight: 1
        }
        {
          id: '/backends/aoai-backend-3'
          priority: 2 // used only when priority 1 backends are unhealthy
        }
      ]
    }
  }
}
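The pool above references aoai-backend-2 and aoai-backend-3, which are not shown in the snippet. Below is a sketch under the assumption that they mirror aoai-backend-1, each pointing at a different Azure OpenAI endpoint; the endpoint placeholders and symbolic names are illustrative:
// Assumption: the remaining pool members mirror aoai-backend-1,
// differing only in the Azure OpenAI endpoint they target.
var extraEndpoints = [
  'https://<second openai endpoint>'
  'https://<third openai endpoint>'
]

resource extraBackends 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = [for (endpoint, i) in extraEndpoints: {
  parent: apimService
  name: 'aoai-backend-${i + 2}' // yields aoai-backend-2 and aoai-backend-3
  properties: {
    url: endpoint
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          failureCondition: {
            count: 1
            interval: 'PT10S'
            statusCodeRanges: [
              {
                min: 429
                max: 429
              }
            ]
          }
          name: 'myBreakerRule'
          tripDuration: 'PT10S'
          acceptRetryAfter: true
        }
      ]
    }
  }
}]
Declaring the members in a loop keeps the breaker rule identical across endpoints, so any throttled instance is removed from rotation in the same way.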
Combining the circuit breaker and load-balancing options provides a basic and simple way to increase the availability and capacity of Azure OpenAI deployments using the built-in features of Azure API Management.
Rate Limiting and Metric Collection
Limiting the tokens consumed by each subscriber is a key aspect of controlling usage and cost for accountability reasons. The Azure OpenAI Token Limit policy allows you to manage and enforce limits per API consumer based on the usage of Azure OpenAI tokens.
It is equally important to observe consumption patterns by collecting usage metrics, which can easily be achieved using the Emit Token Metric policy. The metrics are sent to Azure Application Insights, capturing prompt, completion, and total token usage across specified dimensions such as subscription ID and client IP address.
Metric collection works with streaming responses as well.
Summing Up
With all the features combined, the policy definition for the API looks like the one below. The retry policy in the backend section retries a request that receives a 429 response, which gives the load balancer a chance to pick a different healthy backend once the throttled one trips its circuit breaker.
<policies>
    <inbound>
        <set-backend-service id="lb-backend" backend-id="aoai-lb-pool" />
        <azure-openai-token-limit tokens-per-minute="10000" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="true" tokens-consumed-header-name="consumed-tokens" remaining-tokens-header-name="remaining-tokens" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <azure-openai-emit-token-metric namespace="genaimetrics">
            <dimension name="Subscription ID" />
            <dimension name="Client IP" value="@(context.Request.IpAddress)" />
        </azure-openai-emit-token-metric>
        <base />
    </inbound>
    <backend>
        <retry condition="@(context.Response.StatusCode == 429)" count="2" interval="1" first-fast-retry="true">
            <forward-request />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
Sample Implementation
The GitHub repo here contains a sample implementation demonstrating the policies discussed in this article.