Smarter Azure OpenAI Usage

Smart Azure OpenAI Endpoints – “AI Central”

Many organizations are building intelligent applications on Azure OpenAI (AOAI) services. On the path to production, the same set of questions is often raised:

 

  • How many AOAI services should I have?
  • How do I monitor and log streaming quota usage?
  • How do I prioritize PTU-based (Provisioned Throughput Unit) AOAI services and fall back to PAYG (pay-as-you-go)?
  • How do I round-robin between multiple AOAI servers?
  • How do I handle OpenAI rate-limiting errors?
  • How do I enforce local rate limiting across a cluster of AI services?
  • How do I enforce rate limiting on a backend AI service?
  • How do I present a group of AOAI services as a single endpoint, for a seamless shift to PTU?
  • How do I reduce risk by leveraging both OpenAI and Azure OpenAI services, while presenting a single endpoint to consumers?
  • How do I put a circuit breaker over an AI service that I’ve over-used, to fall back to others?

 

To help with some of these issues, we can turn to services like API Management, Application Gateways, and reverse proxies. Each can solve a subset of the problems.

 

[Image: gateway options in front of AOAI services]

 

However, there are complexities hidden within these boxes that become difficult to solve:

 

  • Prioritization and failover across groups of AOAI servers rely on custom code running in a Layer 7 load balancer.
  • Layer 7 load balancers lack real-time retry functionality, relying instead on asynchronous downstream health monitors.
  • Server-Sent Events support makes it difficult to log quota whilst maintaining a streaming endpoint (sketched below).
  • Switching between Azure OpenAI, OpenAI, or other open-source LLMs requires manipulation of HTTP requests.
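
To make the streaming point concrete, here is a minimal sketch of what re-streaming a Server-Sent Events response while counting tokens involves. ITokenEstimator and the "data:" line handling are illustrative assumptions, not any particular library's API; a real implementation would parse each JSON delta before tokenizing, rather than counting raw lines.

// Minimal sketch: re-stream an SSE response while estimating token usage.
// ITokenEstimator is a hypothetical stand-in for a real tokenizer.
public interface ITokenEstimator
{
    int EstimateTokens(string text);
}

public static class SseRelay
{
    public static async Task<int> RelayAndCountAsync(
        Stream upstream,        // response body from the AOAI server
        Stream downstream,      // response body being written to the client
        ITokenEstimator tokenizer,
        CancellationToken ct)
    {
        var total = 0;
        using var reader = new StreamReader(upstream);
        await using var writer = new StreamWriter(downstream) { AutoFlush = true };

        while (await reader.ReadLineAsync(ct) is { } line)
        {
            // Forward every SSE line unchanged so the client keeps streaming...
            await writer.WriteLineAsync(line.AsMemory(), ct);

            // ...while counting the "data:" payloads as they pass through.
            if (line.StartsWith("data: ") && line != "data: [DONE]")
                total += tokenizer.EstimateTokens(line["data: ".Length..]);
        }

        return total; // logged once the stream completes
    }
}

A plain reverse proxy cannot do this without inspecting every chunk as it flows through, which is why streaming quota logging is awkward to bolt onto off-the-shelf gateways.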

Introducing AI Central - https://github.com/microsoft/AICentral

To help with these, I have published a reference implementation of an intelligent AI router, “AI Central”. AI Central lets you build configurable, extensible pipelines that allow you to govern and observe access to your AI services.

AI Central is an extensible, smart reverse proxy for Azure OpenAI and OpenAI services.

 

Out of the box it provides the following:

 

  • Consumer local rate limiting
  • Endpoint local rate limiting and circuit breakers
  • Randomized endpoint selection from a cluster of AI services
  • Prioritized endpoint selection from a priority cluster, with fallback to a secondary cluster
  • Bulkhead to hold and throttle load to a cluster of servers
  • Consumer Entra JWT auth (using Microsoft.Identity) with role authorization
  • Consumer Entra JWT pass-through
  • Client Key auth
  • Prompt / Token usage logging to Azure Monitor (including Streaming Endpoints)
  • Open Telemetry metrics

[Image: AI Central pipeline architecture]

 

Sample Scenarios

Here are some scenarios where AI Central might help you:

Scenario 1: PTU failover

  • A preferred PTU AOAI service, with a fallback PAYG AOAI service
  • A group of applications that need to access AOAI services
  • A requirement for prompt logging for audit and governance
  • Streaming quota logging for chargeback

AI Central can construct a pipeline to manage this for you:

[Image: Scenario 1 pipeline]

 

  • The pipeline listens on a host name, expecting Azure OpenAI-like requests.
  • The AAD check confirms that the client is permitted access to the pipeline.
  • The prioritized endpoint selector is configured to prioritize a PTU server.
    • It dispatches the request with a backoff/retry policy and circuit breaker.
    • If it fails to receive a response, it falls back to the second group of PAYG servers (see the conceptual sketch after this list).
    • If the response from AOAI is detected to be a streaming response, the results are streamed back to the client, using a tokenizer to estimate quota usage.
  • Finally, the Azure Monitor logger asynchronously sends quota usage and prompt information to Azure Monitor.
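
Conceptually, the prioritized selector behaves like the sketch below. This is not AI Central's actual code: the endpoint URIs, group layout, and the 429/5xx fallback rule are illustrative assumptions (the real pipeline also layers on backoff and a circuit breaker).

using System.Net;

// Conceptual sketch of a prioritized endpoint selector: try the PTU
// endpoints first, then fall back to the PAYG group.
public static class PrioritizedDispatch
{
    public static async Task<HttpResponseMessage> SendWithFallbackAsync(
        HttpClient http,
        Func<Uri, HttpRequestMessage> buildRequest, // a fresh message is required per attempt
        CancellationToken ct)
    {
        // Hypothetical endpoint groups: PTU first, PAYG as fallback.
        Uri[][] groups =
        [
            [new Uri("https://my-ptu.openai.azure.com")],
            [new Uri("https://my-payg-1.openai.azure.com"),
             new Uri("https://my-payg-2.openai.azure.com")]
        ];

        HttpResponseMessage? last = null;
        foreach (var group in groups)
        {
            foreach (var endpoint in group)
            {
                last = await http.SendAsync(buildRequest(endpoint),
                    HttpCompletionOption.ResponseHeadersRead, ct);

                // 429 (throttled) and 5xx mean "try the next endpoint";
                // anything else goes straight back to the caller.
                if (last.StatusCode != HttpStatusCode.TooManyRequests &&
                    (int)last.StatusCode < 500)
                {
                    return last;
                }
            }
        }

        return last!; // every endpoint exhausted: surface the last failure
    }
}

Treating only 429 and 5xx as "try the next endpoint" keeps genuine client errors (400s) flowing straight back to the caller instead of being retried pointlessly.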

 

Scenario 2: Token-based rate limiting of streaming consumers to an AOAI server

  • A single PTU service with models shared across multiple consumers
  • Streaming quota logging for chargeback purposes
  • A fair-use policy that restricts token use per consumer

 

[Image: Scenario 2 pipeline]

 

 

  • The pipeline listens on a specific hostname.
  • The AAD check confirms that the client is permitted access to the pipeline.
  • The token limit step checks whether the client (AAD identity) has reached their token limit.
  • If not, the request is dispatched to an AOAI server.
  • The AOAI response is re-streamed to the consumer.
  • The return pathway logs the prompt and updates the tokens consumed by the consumer.

NB: Token counting does not use a distributed algorithm; it is local to each AI Central server. Consider this if running multiple AI Central endpoints behind a load balancer (for example, in a PaaS like Azure Container Apps or Azure App Service).
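
That locality is easy to see in a sketch: a per-consumer limit can be as simple as an in-memory counter keyed by the caller's identity. The class below is a hypothetical illustration (the window semantics and naming are assumptions), not AI Central's implementation.

using System.Collections.Concurrent;

// Minimal sketch of local, per-consumer token accounting. The counters
// live in this process's memory, which is exactly why the limit is not
// distributed across multiple AI Central instances.
public sealed class LocalTokenLimiter(int tokensPerWindow, TimeSpan window)
{
    // consumerId -> (window start, tokens used in that window)
    private readonly ConcurrentDictionary<string, (long StartTicks, int Used)> _usage = new();

    // Records `tokens` against a consumer; false means they are over the limit.
    public bool TryConsume(string consumerId, int tokens)
    {
        var now = DateTime.UtcNow.Ticks;
        var updated = _usage.AddOrUpdate(
            consumerId,
            _ => (now, tokens),
            (_, cur) => now - cur.StartTicks > window.Ticks
                ? (now, tokens)                       // window elapsed: reset
                : (cur.StartTicks, cur.Used + tokens));

        return updated.Used <= tokensPerWindow;
    }
}

A pipeline step would call TryConsume with the caller's identity and the token count produced by the tokenizer on the return path; because the dictionary is in process memory, each instance enforces the limit independently.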

 

Try it out

The easiest way to start is to install AI Central into your own .NET API from the NuGet packages.

 

# Create a new project and bootstrap the AICentral NuGet package
dotnet new web -o MyAICentral
cd MyAICentral
dotnet add package AICentral
# optional, for logging: dotnet add package AICentral.Logging.AzureMonitor

// Program.cs
// Minimal API to configure AI Central
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddAICentral(
    builder.Configuration,
    additionalComponentAssemblies:
    [
        typeof(AzureMonitorLoggerFactory).Assembly // for Azure Monitor logging
    ]);

var app = builder.Build();

app.UseAICentral();

app.Run();

 

You'll need to add configuration to define your pipelines.
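
Pipelines are defined in your appsettings, wiring named endpoints to selectors and hostnames. The shape below is an illustrative sketch only; I may have mis-remembered individual property names, so treat the configuration docs linked below as the source of truth.

// appsettings.json (illustrative sketch; verify names against the docs)
{
  "AICentral": {
    "Endpoints": [
      {
        "Type": "AzureOpenAIEndpoint",
        "Name": "my-aoai",
        "Properties": {
          "LanguageEndpoint": "https://my-aoai.openai.azure.com",
          "AuthenticationType": "Entra"
        }
      }
    ],
    "EndpointSelectors": [
      {
        "Type": "SingleEndpoint",
        "Name": "default",
        "Properties": { "Endpoint": "my-aoai" }
      }
    ],
    "Pipelines": [
      {
        "Name": "my-pipeline",
        "Host": "my-pipeline.mydomain.com",
        "EndpointSelector": "default"
      }
    ]
  }
}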

 

The GitHub repository has some good examples: https://github.com/microsoft/AICentral for a quick start, and https://github.com/microsoft/AICentral/blob/main/docs/configuration.md for some more complex examples.

 

Give it a go and let us know how you find it!

 
