This article is for you if you have started using generative AI APIs and are looking to take them into production. At a high level, there are things to consider such as load balancing, error management, and cost management. We’ll cover those in this article and point you to an Azure Sample where you can get started deploying an enterprise-ready solution.
Scenario: you want to take your generative AI to production.
So, you’re starting to use Azure OpenAI, you like what you see, and you see how you can add this AI functionality to several of the apps in your company.
However, you have a big operation with many customers, many different apps, and enterprise-grade security requirements, and you know you must tackle all that before you can adopt generative AI in your business.
Problems we must address
You write up a list of problems that you need to address to fully implement generative AI:
- Load balancing and circuit breaker: With many customers, it’s key that you can distribute the load across multiple instances. Error management is also important, to ensure that if one instance fails, the others can take over. In the cloud, a common approach to error management is the circuit breaker pattern, which stops requests to the failing instance and redirects them to the healthy ones.
- Monitoring and metrics: You want to monitor the usage of the AI model: how many requests are coming in, how many are failing, and how many are successful. You also want to know how many tokens are being used and how many are left. And what about caching responses to reduce the load on the AI model, save costs, and improve performance?
- Security: You want to secure the AI model; you don't want just anyone to be able to access it. You have perhaps started by using API keys, but for enterprise scenarios, you want to use managed identity.
You ask yourself: can a cloud service handle the above problems? It looks like Azure API Management (APIM) has an interesting approach to these problems. In fact, there’s an Azure sample that seems to implement the above. Let’s dive in to see how.
Resources
Here are some great resources to get you started and to learn more about the features implemented in the Azure Sample.
- Azure sample - APIM + Generative AI
- Azure API Management - Overview and key concepts | Microsoft Learn
- Azure API Management policy reference - azure-openai-token-limit | Microsoft Learn
- Azure API Management policy reference - azure-openai-emit-token-metric | Microsoft Learn
- Azure API Management backends | Microsoft Learn
- Use managed identities in Azure API Management | Microsoft Learn
Introducing: enterprise-grade sample using APIM + Generative AI
In this sample, we get a chat app (frontend and backend) and a set of cloud resources that can be deployed to Azure using the Azure Developer CLI (azd). Below is the user interface of the app included in the sample:
Architecture view of the sample
OK, so first we get a chat window. That’s a good start, but let’s learn more about the architecture and how the sample is implemented:
The easiest way to describe how the architecture works is to consider an incoming web request and what happens to it. In our case, we have a POST request with a prompt.
- The request hits the API, and the API decides what to do with it:
- Authentication. First, it checks whether you’re allowed in by checking the subscriberID you provided in your request.
- Routing. Next, the API checks the policies to determine whether this request is within token limits (and the request is logged). Thereafter, it’s sent to the loadBalancer, where the load balancer determines which backend to send it to (each backend has a 1:1 association with an Azure OpenAI endpoint).
- There's an alternate scenario here: if a backend responds with a certain type of error within a certain time interval, the request is routed to a healthy resource.
- Creating a response. The assigned Azure OpenAI endpoint responds, and the user sees the response rendered in the chat app.
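To make the token-limit check in the routing step more concrete, here is a minimal sketch of a tokens-per-minute limiter keyed by caller. It is not how APIM implements its policy. The class name `TokenLimiter`, the fixed 60-second window, and the caller keys are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TokenLimiter:
    """Hypothetical fixed-window tokens-per-minute limiter (illustration only)."""

    tokens_per_minute: int
    window_seconds: float = 60.0
    # Maps a caller key to (window start time, tokens consumed in that window).
    _usage: dict = field(default_factory=dict)

    def try_consume(self, key: str, tokens: int, now: float) -> bool:
        """Return True if the caller may spend `tokens` at time `now`."""
        start, used = self._usage.get(key, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            start, used = now, 0
        if used + tokens > self.tokens_per_minute:
            # Over budget: a gateway would typically answer with HTTP 429 here.
            self._usage[key] = (start, used)
            return False
        self._usage[key] = (start, used + tokens)
        return True


if __name__ == "__main__":
    limiter = TokenLimiter(tokens_per_minute=100)
    for second, tokens in [(0.0, 60), (10.0, 50), (61.0, 50)]:
        allowed = limiter.try_consume("app-a", tokens, now=second)
        print(f"t={second:>4}s tokens={tokens} allowed={allowed}")
```

Passing `now` explicitly keeps the sketch deterministic and easy to test. A real gateway would read the clock itself and count the tokens reported by the model.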
Above is the happy path. If an endpoint throws errors with a certain frequency and/or error code, the circuit breaker logic is triggered and the request is routed to a healthy endpoint. Another reason for not getting a chat response back is that the token limit has been hit, i.e. rate limiting (for example, you’ve made too many requests in a short time span).
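The circuit breaker behavior described above can be sketched as a round-robin load balancer that skips backends whose breaker is open. This is a simplified illustration, not the sample's or APIM's implementation: it counts consecutive failures rather than failures within a time interval, and the names `Backend`, `LoadBalancer`, `failure_threshold`, and `trip_seconds` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One backend, standing in for a single Azure OpenAI endpoint."""

    name: str
    failures: int = 0        # consecutive failures seen so far
    open_until: float = 0.0  # the breaker is open (backend skipped) until this time


class LoadBalancer:
    """Hypothetical round-robin balancer with a per-backend circuit breaker."""

    def __init__(self, backends, failure_threshold=3, trip_seconds=30.0):
        self.backends = list(backends)
        self.failure_threshold = failure_threshold
        self.trip_seconds = trip_seconds
        self._next = 0

    def pick(self, now):
        """Return the next backend whose breaker is closed, or None if all are open."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if now >= backend.open_until:
                return backend
        return None

    def report(self, backend, ok, now):
        """Record the outcome of a request and trip the breaker if needed."""
        if ok:
            backend.failures = 0
            return
        backend.failures += 1
        if backend.failures >= self.failure_threshold:
            # Too many failures in a row: stop sending traffic for a while.
            backend.open_until = now + self.trip_seconds
            backend.failures = 0
```

Once `trip_seconds` has passed, the backend becomes eligible again, so a recovered endpoint rejoins the rotation without manual intervention.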
Also note that a semantic cache could respond instead if the incoming prompt is similar to one that's already in the cache.
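As a rough illustration of the semantic cache idea, the sketch below returns a cached response when a new prompt is similar enough to a stored one. It rests on loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and the `SemanticCache` class and its `threshold` value are hypothetical, not taken from the sample.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache that matches prompts by similarity, not exact text."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self._entries = []  # list of (prompt vector, response)

    def store(self, prompt, response):
        self._entries.append((embed(prompt), response))

    def lookup(self, prompt):
        """Return the best cached response above the threshold, else None."""
        vector = embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A cache hit means the request never reaches the model, which is where the cost and latency savings come from. The threshold trades hit rate against the risk of answering a different question.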
How to get started
Ensure you have a terminal up and running and that you have the Azure Developer CLI (azd) installed. Then run the following steps:
Clone the repo (or start in codespaces)
git clone https://github.com/Azure-Samples/genai-gateway-apim.git
Log in to Azure
azd auth login
Deploy the app
azd up
Run the app. At this point, you have your cloud resources deployed. To test them, run the app locally (you need to have Node.js installed). From the repo directory, run the commands below in a terminal:
cd src
npm install
npm start
This will start the app on http://localhost:3000, and the API is available at http://localhost:1337.
What next
Our suggestion is that you go and check out the Azure Sample - APIM + Gen AI. Try deploying it and see how it works.
Let us know if you have any questions or feedback.