
AI Resilience: Strategies to Keep Your Intelligent App Running at Peak Performance

fabiopadua, Microsoft
Apr 24, 2025

In today's fast-paced digital landscape, the demand for intelligent applications is higher than ever. These applications, powered by artificial intelligence (AI), are revolutionizing industries by providing innovative solutions and enhancing user experiences. However, as the complexity and scale of AI systems grow, so do the challenges in maintaining their performance, especially when they reach their PTU (Provisioned Throughput Unit) limits.

Stay Online

Reliability: it's one of the five pillars of the Azure Well-Architected Framework.

When you implement and take to market a new product that integrates with the Azure OpenAI Service, you can face usage spikes in your workload. Even if everything scales correctly on your side, an Azure OpenAI Service deployment that uses PTUs can hit its provisioned throughput limit, and then you start to see 429 (Too Many Requests) response codes.

The response headers also tell you when you can retry the request, and you can use that information to build a solution into your business logic. In this article I will show how to use an API Management policy to handle this, and also explore the native cache to save some tokens!
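
If you want to handle this in your own code before introducing a gateway, a minimal client-side sketch could look like the following (the endpoint URL, headers, and payload are placeholders for your own deployment, and the `requests` package is assumed):

    import time
    import requests

    def call_openai_with_retry(url: str, headers: dict, payload: dict, max_retries: int = 3):
        """POST to an Azure OpenAI endpoint, honoring Retry-After on 429 responses."""
        response = None
        for attempt in range(max_retries + 1):
            response = requests.post(url, headers=headers, json=payload, timeout=60)
            if response.status_code != 429:
                return response
            # The service tells us how long to wait before retrying;
            # fall back to exponential backoff if the header is absent.
            wait_seconds = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait_seconds)
        return response  # still throttled after all retries

This works, but every caller has to repeat it; the gateway approach below centralizes the logic instead.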

Architecture Reference

The Azure Function on the left of the diagram just represents an app making requests; it could be any kind of resource (even in an on-premises environment). Our goal in this article is to show one of many possible ways to handle 429 responses. We are going to use an API Management policy to automatically redirect the backend to another Azure OpenAI Service instance in a second region, deployed in Standard mode, which means you are charged only for what you use.

First, we need to create an API in API Management that forwards requests to your main Azure OpenAI Service instance (region 1 in the diagram).

Now we are going to add this policy to the API request pipeline:

    <policies>
      <inbound>
        <base />
        <!-- Primary backend: your PTU deployment in region 1 (placeholder URL) -->
        <set-backend-service base-url="{your_open_ai_region1_endpoint}" />
      </inbound>
      <backend>
        <!-- retry re-executes its child statements; on a 429 from region 1 it
             switches the backend to region 2 and resends the request -->
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="5">
          <choose>
            <when condition="@(context.Response != null && context.Response.StatusCode == 429)">
              <set-backend-service base-url="{your_open_ai_region2_endpoint}" />
            </when>
          </choose>
          <forward-request buffer-request-body="true" />
        </retry>
      </backend>
      <outbound>
        <base />
      </outbound>
    </policies>

The first part of our job is done! We now have an automatic failover to the Azure OpenAI Service deployed in region 2 whenever the PTU threshold is reached in region 1.
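
To sanity-check the failover, you can call the API through the gateway. A minimal sketch, assuming your API exposes the usual chat-completions route; the gateway URL, deployment name, api-version, and subscription key below are placeholders to replace with your own values:

    import requests

    APIM_GATEWAY = "https://<your-apim-name>.azure-api.net/openai"  # placeholder
    DEPLOYMENT = "<your-deployment-name>"                           # placeholder
    APIM_KEY = "<your-apim-subscription-key>"                       # placeholder

    response = requests.post(
        f"{APIM_GATEWAY}/deployments/{DEPLOYMENT}/chat/completions",
        params={"api-version": "2024-02-01"},
        headers={"Ocp-Apim-Subscription-Key": APIM_KEY},
        json={"messages": [{"role": "user", "content": "Hello!"}]},
        timeout=60,
    )
    # With the policy in place, a 429 from region 1 triggers a retry against
    # region 2, so the caller should see a successful response instead.
    print(response.status_code)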

Cost considerations

So now you might ask: what about the extra cost of adding API Management?

Even if you don't want to use any other API Management feature, you can offset that cost by leveraging its cache: once again using policy and AI, you can store question/answer pairs in the Redis* cache using semantic caching for the Azure OpenAI Service, saving tokens on repeated questions.

Let's change our policy to take this into account:

    <policies>
      <inbound>
        <base />
        <!-- Return a cached completion when a semantically similar prompt
             was answered recently -->
        <azure-openai-semantic-cache-lookup score-threshold="0.05" embeddings-backend-id="azure-openai-backend" embeddings-backend-auth="system-assigned">
          <vary-by>@(context.Subscription.Id)</vary-by>
        </azure-openai-semantic-cache-lookup>
        <set-backend-service base-url="{your_open_ai_region1_endpoint}" />
      </inbound>
      <backend>
        <!-- Same failover as before: on a 429, resend against region 2 -->
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="5">
          <choose>
            <when condition="@(context.Response != null && context.Response.StatusCode == 429)">
              <set-backend-service base-url="{your_open_ai_region2_endpoint}" />
            </when>
          </choose>
          <forward-request buffer-request-body="true" />
        </retry>
      </backend>
      <outbound>
        <base />
        <!-- Store the completion for 60 seconds so similar prompts can be
             served from the cache -->
        <azure-openai-semantic-cache-store duration="60" />
      </outbound>
    </policies>

Now, API Management will compute an embedding for the incoming prompt, compare it for semantic equivalence with cached entries, and decide whether to answer from the cache or forward the request to your OpenAI endpoint. Sometimes this can help you avoid reaching the PTU threshold as well!
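
To build some intuition for the score-threshold knob, here is an illustrative sketch (not API Management's internal implementation) of how two prompts' embedding vectors can be compared with cosine similarity; the vectors and prompts are made up for the example:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors (1.0 = same direction)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy 4-dimensional "embeddings"; real models return vectors with
    # roughly 1,500 dimensions.
    cached_prompt = np.array([0.12, 0.85, 0.33, 0.41])  # "How do I reset my password?"
    new_prompt    = np.array([0.10, 0.88, 0.30, 0.45])  # "What's the way to reset my password?"

    print(f"similarity: {cosine_similarity(cached_prompt, new_prompt):.3f}")

The closer the two prompts are in embedding space, the more likely the lookup is to treat them as equivalent and serve the cached answer.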

* Check the tier and cache capabilities to validate your business solution's needs against the API Management cache feature: Compare API Management features across tiers and cache size across tiers.

Conclusion

API Management offers the key AI capabilities we explored in this article, as well as others you can leverage for your intelligent applications. Check them out in the awesome AI Gateway HUB repository.

Last but not least, dive into API Management features with experts in the field inside the API Management HUB.

Thanks for reading and Happy Coding!


1 Comment

  • Great to know that even without other APIM features, the native cache enables scalable solutions, handling common queries without any backend impact. 😁