Azure OpenAI Service is a cloud-based platform that gives you access to powerful AI models such as GPT-4 and DALL-E. These models can generate natural language, images, code, and other content based on your inputs and parameters. To use them effectively, however, you need to account for the quotas and limits that apply to your Azure OpenAI resources, such as tokens per minute, requests per minute, and provisioned throughput units. You also need to handle the failures and errors that can occur when calling these models, such as network issues, timeouts, or service unavailability.
In this blog post, we will show you how to use Azure API Management to improve the resiliency and capacity of your Azure OpenAI Service. Azure API Management is a service that helps you create, publish, manage, and secure APIs. It provides features such as routing, caching, throttling, authentication, transformation, and more. By using Azure API Management, you can:
Load balance requests to multiple instances of the Azure OpenAI Service using the round-robin load balancing technique. This can help you distribute the load across different resources and regions and increase the availability and performance of your service.
Implement the circuit breaker pattern to protect your backend service from being overwhelmed by excessive requests. This can help you prevent cascading failures and improve the stability and resiliency of your service. You can configure the circuit breaker property in the backend resource and define rules for tripping it, such as the number or percentage of failure conditions within a defined time interval and a range of status codes that indicate failure.
Circuit breaker pattern
The circuit breaker pattern is a way to prevent an application from performing an operation that is likely to fail, such as calling an external API that is overloaded or unavailable. The circuit breaker monitors the success or failure of the operation and can open or close the circuit accordingly. When the circuit is open, the operation is not performed, and a fallback response is returned instead. When the circuit is closed, the operation is performed as normal. This way, the circuit breaker can avoid wasting resources and improve the user experience by reducing latency and errors.
API Management exposes a circuit breaker property in the backend resource to protect a backend service from being overwhelmed by too many requests.
Azure OpenAI enforces rate limits on invocations; when a limit is exceeded, requests receive a 429 (Too Many Requests) response code. The Azure OpenAI backends configured in API Management can be set up to trip the circuit breaker whenever this rate limiting is observed. While the circuit is open, the backend is marked as unhealthy.
A sample configuration breaks the circuit for a period of 10 seconds whenever a 429 error occurs on the backend.
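As a minimal sketch, such a backend could be declared in Bicep as follows. The resource names, the endpoint URL, and the parent `apimService` symbol are illustrative; the `circuitBreaker` property follows the API Management backend resource schema.

```bicep
// Assumes an existing API Management instance declared as 'apimService'.
resource openaiBackend 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
  parent: apimService
  name: 'openai-backend-1'
  properties: {
    url: 'https://my-openai-1.openai.azure.com/openai'
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'breakOn429'
          failureCondition: {
            count: 1                    // trip after a single failure...
            interval: 'PT10S'           // ...observed within a 10-second window
            statusCodeRanges: [
              { min: 429, max: 429 }    // treat 429 (rate limited) as a failure
            ]
          }
          tripDuration: 'PT10S'         // keep the circuit open for 10 seconds
        }
      ]
    }
  }
}
```

While the circuit is open, API Management reports this backend as unhealthy, which is what allows the load balancing described below to route around it.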
Load balance requests to multiple instances of the Azure OpenAI Service
To load balance requests to multiple instances of the Azure OpenAI Service, you need to create a backend resource for each instance, and then use the load-balancing policy to distribute the requests among them. The load-balancing policy uses the round-robin algorithm to select the backend resource for each request.
Create a backend resource for each Azure OpenAI endpoint, with circuit breaking enabled for when the endpoint is throttling. A backend configured with circuit breaking is marked as unhealthy while its circuit is open. Load balancing ignores unhealthy backends, so it skips the OpenAI endpoints that are throttling.
A sample configuration load-balances across multiple backend resources.
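A sketch of such a load-balanced pool, again in Bicep, assuming two member backends (`openaiBackend1`, `openaiBackend2`) have already been declared with circuit breaking as shown earlier; all names here are illustrative:

```bicep
// A backend of type 'Pool' distributes requests round-robin across its members.
resource openaiPool 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
  parent: apimService
  name: 'openai-lb-pool'
  properties: {
    description: 'Round-robin pool of Azure OpenAI backends'
    type: 'Pool'
    pool: {
      services: [
        { id: openaiBackend1.id }   // unhealthy members (open circuit) are skipped
        { id: openaiBackend2.id }
      ]
    }
  }
}
```

Requests can then be routed through the pool by referencing it from the API's inbound policy, for example with `<set-backend-service backend-id="openai-lb-pool" />`.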
Combining the circuit breaker and load-balancing capabilities provides a simple way to increase the availability and capacity of Azure OpenAI deployments using the built-in features of Azure API Management.