Updated 5/21/2024 for the latest GenAI features in Azure API Management
Azure OpenAI Service is a cloud-based platform that enables you to access and use powerful AI models such as GPT-4, DALL-E, and more. These models can generate natural language, images, code, and other types of content based on your inputs and parameters. However, to use these models effectively, you need to consider the quotas and limits that apply to your Azure OpenAI resources, such as tokens per minute, requests per minute, and provisioned throughput units. You also need to handle the possible failures and errors that might occur when calling these models, such as network issues, timeouts, or service unavailability.
In this blog post, we will show you how to use Azure API Management to improve the resiliency and capacity of your Azure OpenAI Service. Azure API Management is a service that helps you create, publish, manage, and secure APIs. It provides features such as routing, caching, throttling, authentication, transformation, and more. By using Azure API Management, you can:
- Load balance requests to multiple instances of the Azure OpenAI Service using the round-robin load balancing technique. This can help you distribute the load across different resources and regions and increase the availability and performance of your service.
- Implement the circuit breaker pattern to protect your backend service from being overwhelmed by excessive requests. This can help you prevent cascading failures and improve the stability and resiliency of your service. You can configure the circuit breaker property in the backend resource, and define rules for tripping the circuit breaker, such as the number or percentage of failure conditions within a defined time interval and a range of status codes indicating failures.
You can find the code for this article in this GitHub repository.
Circuit breaker pattern
The circuit breaker pattern is a way to prevent an application from performing an operation that is likely to fail, such as calling an external API that is overloaded or unavailable. The circuit breaker monitors the success or failure of the operation and can open or close the circuit accordingly. When the circuit is open, the operation is not performed, and a fallback response is returned instead. When the circuit is closed, the operation is performed as normal. This way, the circuit breaker can avoid wasting resources and improve the user experience by reducing latency and errors.
API Management exposes a circuit breaker property on the backend resource to protect a backend service from being overwhelmed by too many requests. When the acceptRetryAfter value is set, it can trip the circuit dynamically for the period specified in the retry-after header returned by the backend.
Azure OpenAI enforces rate limiting on invocations, and requests receive a 429 response code when the service is rate limiting. The Azure OpenAI backends configured in API Management can be set up to break the circuit whenever rate limiting is observed; while the circuit is open, the backend is marked as unhealthy.
Below is a sample configuration that breaks the circuit, for the period the backend specifies in the retry-after header, whenever a 429 error occurs on the backend:
resource aoaiBackend1 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
  parent: apimService // existing API Management service (see the sketch below)
  name: 'aoai-backend-1'
  properties: {
    url: 'https://<openai endpoint>'
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          failureCondition: {
            count: 1
            interval: 'PT10S'
            statusCodeRanges: [
              {
                min: 429
                max: 429
              }
            ]
          }
          name: 'myBreakerRule'
          tripDuration: 'PT10S'
          acceptRetryAfter: true // honor the backend's retry-after header
        }
      ]
    }
  }
}
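The backend above is declared as a child of an API Management service, so the template also needs a reference to that service. Below is a minimal sketch, assuming an existing APIM instance; the symbolic name apimService and the service name my-apim-instance are placeholders:
// Hypothetical reference to the existing API Management instance
// that the backends in this article attach to.
resource apimService 'Microsoft.ApiManagement/service@2023-05-01-preview' existing = {
  name: 'my-apim-instance' // placeholder: your APIM service name
}
The same reference can be reused by the load-balancer pool resource shown later in this article.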
Load balance requests to multiple instances of the Azure OpenAI Service
To load balance requests to multiple instances of the Azure OpenAI Service, you need to create a backend resource for each instance and then use the load-balancing policy to distribute requests among them. The load-balancing policy uses the round-robin algorithm to select a backend resource for each request. Alternatively, you can use priority or weighted load balancing, or a combination of the two.
Create a backend resource for each Azure OpenAI endpoint, with circuit breaking enabled for when the endpoint throttles. A backend configured with circuit breaking is marked unhealthy while its circuit is open; load balancing ignores unhealthy backends and therefore skips the OpenAI endpoints that are throttling.
Below is a sample configuration that load balances multiple backend resources. Backends with the lowest priority number are tried first; within the same priority group, traffic is split according to the weights (3:1 here between aoai-backend-1 and aoai-backend-2), and aoai-backend-3 receives traffic only when the priority 1 backends are unhealthy:
resource aoaiLbPool 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
  parent: apimService // same API Management service as above
  name: 'aoai-lb-pool'
  properties: {
    description: 'Load balance openai instances'
    type: 'Pool'
    protocol: 'http'
    url: 'https://does-not-matter'
    pool: {
      services: [
        {
          id: '/backends/aoai-backend-1'
          priority: 1 // lower number = higher precedence
          weight: 3
        }
        {
          id: '/backends/aoai-backend-2'
          priority: 1
          weight: 1
        }
        {
          id: '/backends/aoai-backend-3'
          priority: 2 // used only when priority 1 backends are unhealthy
        }
      ]
    }
  }
}
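The pool above references aoai-backend-2 and aoai-backend-3, which are not shown in the snippet. Below is a sketch under the assumption that they mirror aoai-backend-1, each pointing at a different Azure OpenAI endpoint; the endpoint placeholders and symbolic names are illustrative:
// Assumption: the remaining pool members mirror aoai-backend-1,
// differing only in the Azure OpenAI endpoint they target.
var extraEndpoints = [
  'https://<second openai endpoint>'
  'https://<third openai endpoint>'
]

resource extraBackends 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = [for (endpoint, i) in extraEndpoints: {
  parent: apimService
  name: 'aoai-backend-${i + 2}' // yields aoai-backend-2 and aoai-backend-3
  properties: {
    url: endpoint
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          failureCondition: {
            count: 1
            interval: 'PT10S'
            statusCodeRanges: [
              {
                min: 429
                max: 429
              }
            ]
          }
          name: 'myBreakerRule'
          tripDuration: 'PT10S'
          acceptRetryAfter: true
        }
      ]
    }
  }
}]
Declaring the members in a loop keeps the breaker rule identical across endpoints, so any throttled instance is removed from rotation in the same way.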
Combining the circuit breaker and load-balancing options provides a basic and simple way to increase the availability and capacity of Azure OpenAI deployments using the built-in features of Azure API Management.
Rate Limiting and Metric Collection
Limiting the tokens consumed by each subscriber is a key aspect of controlling usage and cost for accountability reasons. The Azure OpenAI Token Limit policy allows you to manage and enforce limits per API consumer based on the usage of Azure OpenAI tokens.
It is equally important to observe consumption patterns by collecting usage metrics, which can easily be achieved using the Emit Token Metric policy. The metrics are sent to Azure Application Insights, capturing prompt, completion, and total token usage across specified dimensions such as subscription ID and client IP address.
Metric collection works with streaming responses as well.
Summing Up
With all the features combined, the policy definition for the API looks like the one below. The retry policy in the backend section retries a request that receives a 429 response, which gives the load balancer a chance to pick a different healthy backend once the throttled one trips its circuit breaker.
<policies>
    <inbound>
        <set-backend-service id="lb-backend" backend-id="aoai-lb-pool" />
        <azure-openai-token-limit tokens-per-minute="10000" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="true" tokens-consumed-header-name="consumed-tokens" remaining-tokens-header-name="remaining-tokens" />
        <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
        <azure-openai-emit-token-metric namespace="genaimetrics">
            <dimension name="Subscription ID" />
            <dimension name="Client IP" value="@(context.Request.IpAddress)" />
        </azure-openai-emit-token-metric>
        <base />
    </inbound>
    <backend>
        <retry condition="@(context.Response.StatusCode == 429)" count="2" interval="1" first-fast-retry="true">
            <forward-request />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
Sample Implementation
The GitHub repo here contains a sample implementation demonstrating the policies discussed in this article.