Apps on Azure Blog

Manage your Generative AI APIs with Azure API Management and Azure OpenAI

Chris_Noring
Microsoft
Aug 08, 2024

 

This post is for you if you've started with generative AI APIs and you're looking to take those APIs into production. At a high level, there are things to consider like load balancing, error management, and cost management. We'll cover those in this article and guide you to an Azure Sample where you can get started deploying an enterprise-ready solution.

 

Scenario: you want to take your generative AI to production. 

So, you're starting to use Azure OpenAI, you like what you see, and you can see how to add this AI functionality to several of the apps in your company.

However, you have a big operation with many customers, many different apps, and enterprise-grade security requirements, and you know you must tackle all of that before you can adopt generative AI in your business.

 

Problems we must address 

You write up a list of problems that you need to address to fully implement generative AI: 

 

- Load balancing and circuit breaker: With many customers, it's key that you can distribute the load across multiple instances. Error management is just as important, to ensure that if one instance fails, the others can take over. A common approach to error management in the cloud is the circuit breaker, which helps by stopping requests to the failing instance and redirecting them to healthy ones (a minimal sketch of the idea follows this list).

 

- Monitoring and metrics: You want to monitor usage of the AI model: how many requests come in, how many fail, and how many succeed, as well as how many tokens are being used and how many are left. You may also want to cache responses to reduce the load on the AI model, save costs, and improve performance.

 

- Security: You want to secure the AI model; you don't want just anyone to access it. You have perhaps started by using API keys, but for enterprise scenarios you want to use managed identities.
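To make the circuit breaker idea from the first item concrete, here's a minimal sketch of the pattern in TypeScript. Note that the sample configures this behavior declaratively on APIM backends rather than in application code; the backend URLs and thresholds below are made-up placeholders, not values from the sample.

// Minimal circuit breaker sketch. The sample configures this behavior
// declaratively on APIM backends; this code only illustrates the concept.
// Backend URLs and thresholds are placeholders, not values from the sample.
type Backend = {
  url: string;
  failures: number;
  openedAt?: number; // timestamp at which the circuit was opened
};

const TRIP_THRESHOLD = 3;        // consecutive failures before the circuit opens
const TRIP_DURATION_MS = 60_000; // how long a tripped backend stays excluded

const backends: Backend[] = [
  { url: "https://aoai-instance-1.example.com", failures: 0 },
  { url: "https://aoai-instance-2.example.com", failures: 0 },
];

function isHealthy(b: Backend, now = Date.now()): boolean {
  if (b.openedAt === undefined) return true;
  // Half-open: let traffic through again once the trip duration has elapsed.
  if (now - b.openedAt > TRIP_DURATION_MS) {
    b.openedAt = undefined;
    b.failures = 0;
    return true;
  }
  return false;
}

async function sendWithFailover(body: unknown): Promise<Response> {
  // Try each healthy backend in turn; skip any whose circuit is open.
  for (const backend of backends.filter((b) => isHealthy(b))) {
    try {
      const res = await fetch(backend.url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
      });
      if (res.status === 429 || res.status >= 500) {
        throw new Error(`backend returned ${res.status}`);
      }
      backend.failures = 0; // success resets the failure counter
      return res;
    } catch {
      // Count the failure and open the circuit once the threshold is hit.
      if (++backend.failures >= TRIP_THRESHOLD) backend.openedAt = Date.now();
    }
  }
  throw new Error("No healthy backend available");
}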

You ask yourself: can a cloud service handle the above problems? It looks like Azure API Management (APIM) has an interesting approach to them. In fact, there's an Azure Sample that seems to implement all of the above; let's dive in to see how:

 

Resources 

Here are some great resources to get you started and to learn more about the features implemented in the Azure Sample.

- Azure Sample - APIM + Generative AI

- Azure API Management - Overview and key concepts | Microsoft Learn 

- Azure API Management policy reference - azure-openai-token-limit | Microsoft Learn 

- Azure API Management policy reference - azure-openai-emit-token-metric | Microsoft Learn 

- Azure API Management backends | Microsoft Learn 

- Use managed identities in Azure API Management | Microsoft Learn 

 

Introducing: an enterprise-grade sample using APIM + Generative AI

In this sample, we get a chat app (frontend and backend) and a set of cloud resources that can be deployed to Azure using the Azure Developer CLI, azd. Below is the user interface of the app included in the sample:

Architecture view of the sample 

OK, so first we get a chat window; great, that's a good start. But let's learn more about the architecture and how the sample is implemented:

 

The easiest way to describe how the architecture works is to follow an incoming web request and see what happens to it. In our case, we have a POST request carrying a prompt.

  1. Request: the request hits the API, which considers what to do with it.
  2. Authentication: first, the API checks whether you're allowed in, by checking the subscriber ID you provided in your request.
  3. Routing: next, the API checks its policies to determine whether this request is within token limits (the request is also logged). It's then sent to the load balancer, which determines which backend to send the request to (each backend has a 1:1 association with an Azure OpenAI endpoint).
    - There's an alternate scenario here: if a backend responds with a certain type of error within a certain time interval, the request is routed to a healthy resource instead.
  4. Response: the assigned Azure OpenAI endpoint responds, and the user sees the response rendered in the chat app.

Above is the happy path. If an endpoint throws errors with a certain frequency and/or error code, the circuit breaker logic is triggered and the request is routed to a healthy endpoint. Another reason for not getting a chat response back is that the token limits have been hit, i.e. rate limiting (you've, for example, made too many requests in a short time span).
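From the client's point of view, the whole pipeline above collapses into a single call to the gateway. Here's a hedged TypeScript sketch of what such a call could look like; the gateway URL, deployment name, and API version are placeholders, and Ocp-Apim-Subscription-Key is APIM's standard subscription key header rather than anything specific to this sample.

// Sketch of a client call through the APIM gateway. The URL, deployment
// name, and API version are placeholders; adjust to your own deployment.
const GATEWAY_URL =
  "https://<your-apim-instance>.azure-api.net/openai/deployments/<deployment>/chat/completions?api-version=2024-02-01";

async function chat(prompt: string): Promise<string> {
  const res = await fetch(GATEWAY_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // APIM's standard subscription key header; this is where the
      // authentication step in the flow above gets its input.
      "Ocp-Apim-Subscription-Key": process.env.APIM_SUBSCRIPTION_KEY ?? "",
    },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
  });

  // Hitting the token limit policy surfaces as 429 (rate limited).
  if (res.status === 429) throw new Error("Token limit reached, retry later");
  if (!res.ok) throw new Error(`Gateway returned ${res.status}`);

  const data = await res.json();
  return data.choices[0].message.content; // standard chat completions shape
}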

 

Also note how a semantic cache could be made to respond instead, if the prompt is similar enough to one that's already in the cache.
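As a rough illustration of that idea (not the sample's actual implementation), a semantic cache embeds the incoming prompt, compares it against embeddings of previously answered prompts, and serves the stored response on a close-enough match. The embed and callModel functions below are hypothetical stand-ins, and the threshold is a made-up value.

// Conceptual semantic cache sketch, not the sample's implementation.
type CacheEntry = { embedding: number[]; response: string };

const cache: CacheEntry[] = [];
const SIMILARITY_THRESHOLD = 0.95; // placeholder; tune for your use case

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// `embed` would call an embeddings model; `callModel` the chat model.
async function cachedChat(
  prompt: string,
  embed: (s: string) => Promise<number[]>,
  callModel: (s: string) => Promise<string>
): Promise<string> {
  const embedding = await embed(prompt);
  // Serve from cache when a semantically similar prompt was answered before.
  const hit = cache.find(
    (e) => cosineSimilarity(e.embedding, embedding) >= SIMILARITY_THRESHOLD
  );
  if (hit) return hit.response;

  const response = await callModel(prompt);
  cache.push({ embedding, response });
  return response;
}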

 

How to get started 

Ensure you have a terminal up and running and that you have the Azure Developer CLI, azd, installed. Then follow these steps:

 

Clone the repo (or start in Codespaces):

git clone https://github.com/Azure-Samples/genai-gateway-apim.git

Log in to Azure:

azd auth login

Deploy the app:

azd up

Run the app. At this point, you have your cloud resources deployed. To test them out, run the app locally (you need to have Node.js installed). From the repo directory, run the below commands in a terminal:

cd src
npm install
npm start

This will start the app on http://localhost:3000, and the API is available at http://localhost:1337.
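If you want to poke the API directly once it's running, a quick smoke test could look like the snippet below. Note that the /api/chat route and the payload shape are hypothetical; check the sample's source for the actual route and body.

// Hypothetical smoke test against the locally running API.
// The route and payload are assumptions, not taken from the sample.
const res = await fetch("http://localhost:1337/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "Hello" }),
});
console.log(res.status, await res.text());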

 

What next 

Our suggestion is that you go and check out the Azure Sample - APIM + Generative AI. Try deploying it and see how it works.

 

Let us know if you have any questions or feedback. 

Updated Aug 13, 2024
Version 5.0