We are thrilled to announce GenAI Gateway capabilities in Azure API Management – a set of features designed specifically for GenAI use cases.
Azure OpenAI service offers a diverse set of tools, providing access to advanced models like GPT3.5-Turbo to GPT-4 and GPT-4 Vision, enabling developers to build intelligent applications that can understand, interpret, and generate human-like text and images.
One of the main resources you have in Azure OpenAI is tokens. Azure OpenAI assigns quota for your model deployments expressed in tokens-per-minute (TPMs) which is then distributed across your model consumers that can be represented by different applications, developer teams, departments within the company, etc.
Starting with a single application integration, Azure makes it easy to connect your app to Azure OpenAI. Your intelligent application connects to Azure OpenAI directly using API Key with a TPM limit configured directly on the model deployment level. However, when you start growing your application portfolio, you are presented with multiple apps calling single or even multiple Azure OpenAI endpoints deployed as Pay-as-you-go or Provisioned Throughput Units (PTUs) instances. That comes with certain challenges:
- How can we track token usage across multiple applications? How can we do cross charges for multiple applications/teams that use Azure OpenAI models?
- How can we make sure that a single app does not consume the whole TPM quota, leaving other apps with no option to use Azure OpenAI models?
- How can we make sure that the API key is securely distributed across multiple applications?
- How can we distribute load across multiple Azure OpenAI endpoints? How can we make sure that PTUs are used first before falling back to Pay-as-you-go instances?
To tackle these operational and scalability challenges, Azure API Management has built a set of GenAI Gateway capabilities:
- Azure OpenAI Token Limit Policy
- Azure OpenAI Emit Token Metric Policy
- Load Balancer and Circuit Breaker
- Import Azure OpenAI as an API
- Azure OpenAI Semantic Caching Policy (in public preview)
Azure OpenAI Token Limit Policy
Azure OpenAI Token Limit policy allows you to manage and enforce limits per API consumer based on the usage of Azure OpenAI tokens. With this policy you can set limits, expressed in tokens-per-minute (TPM).
This policy provides flexibility to assign token-based limits on any counter key, such as Subscription Key, IP Address or any other arbitrary key defined through policy expression. Azure OpenAI Token Limit policy also enables pre-calculation of prompt tokens on the Azure API Management side, minimizing unnecessary request to the Azure OpenAI backend if the prompt already exceeds the limit.
Learn more about this policy here.
Azure OpenAI Emit Token Metric Policy
Azure OpenAI enables you to configure token usage metrics to be sent to Azure Applications Insights, providing overview of the utilization of Azure OpenAI models across multiple applications or API consumers.
This policy captures prompt, completions, and total token usage metrics and sends them to Application Insights namespace of your choice. Moreover, you can configure or select from pre-defined dimensions to split token usage metrics, enabling granular analysis by Subscription ID, IP Address, or any custom dimension of your choice.
Learn more about this policy here.
Load Balancer and Circuit Breaker
Load Balancer and Circuit Breaker features allow you to spread the load across multiple Azure OpenAI endpoints.
With support for round-robin, weighted (new), and priority-based (new) load balancing, you can now define your own load distribution strategy according to your specific requirements.
Define priorities within the load balancer configuration to ensure optimal utilization of specific Azure OpenAI endpoints, particularly those purchased as PTUs. In the event of any disruption, a circuit breaker mechanism kicks in, seamlessly transitioning to lower-priority instances based on predefined rules.
Our updated circuit breaker now features dynamic trip duration, leveraging values from the retry-after header provided by the backend. This ensures precise and timely recovery of the backends, maximizing the utilization of your priority backends to their fullest.
Learn more about load balancer and circuit breaker here.
Import Azure OpenAI as an API
New Import Azure OpenAI as an API in Azure API management provides an easy single click experience to import your existing Azure OpenAI endpoints as APIs.
We streamline the onboarding process by automatically importing the OpenAPI schema for Azure OpenAI and setting up authentication to the Azure OpenAI endpoint using managed identity, removing the need for manual configuration. Additionally, within the same user-friendly experience, you can pre-configure Azure OpenAI policies, such as token limit and emit token metric, enabling swift and convenient setup.
Learn more about Import Azure OpenAI as an API here.
Azure OpenAI Semantic Caching policy
Azure OpenAI Semantic Caching policy empowers you to optimize token usage by leveraging semantic caching, which stores completions for prompts with similar meaning.
Our semantic caching mechanism leverages Azure Redis Enterprise or any other external cache compatible with RediSearch and onboarded to Azure API Management. By leveraging the Azure OpenAI Embeddings model, this policy identifies semantically similar prompts and stores their respective completions in the cache. This approach ensures completions reuse, resulting in reduced token consumption and improved response performance.
Learn more about semantic caching policy here.
Get Started with GenAI Gateway Capabilities in Azure API Management
We’re excited to introduce these GenAI Gateway capabilities in Azure API Management, designed to empower developers to efficiently manage and scale their applications leveraging Azure OpenAI services. Get started today and bring your intelligent application development to the next level with Azure API Management.