
Azure Integration Services Blog

Expanding GenAI Gateway Capabilities in Azure API Management

akamenev (Microsoft)
Aug 09, 2024

In May 2024, we introduced GenAI Gateway capabilities – a set of features designed specifically for GenAI use cases. Today, we are happy to announce new policies that support a wider range of large language models through the Azure AI Model Inference API. These new policies work in the same way as the previously announced capabilities but can now be applied to a broader set of LLMs.

The Azure AI Model Inference API enables you to consume the capabilities of models available in the Azure AI model catalog in a uniform and consistent way, letting you talk to different models in Azure AI Studio without changing the underlying code.

 

Working with large language models presents unique challenges, particularly around managing token resources. Token consumption affects both the cost and the performance of intelligent apps calling the same model, making it crucial to have robust mechanisms for monitoring and controlling token usage. The new policies address these challenges by providing detailed insight into and control over token resources, ensuring efficient and cost-effective use of models deployed in Azure AI Studio.

 

LLM Token Limit Policy 

LLM Token Limit policy (preview) provides the flexibility to define and enforce token limits when interacting with large language models available through the Azure AI Model Inference API. 

Key Features 

  • Configurable Token Limits: Set token limits for requests to control costs and manage resource usage effectively 
  • Prevents Overuse: Automatically blocks requests that exceed the token limit, ensuring fair use and eliminating the noisy neighbour problem 
  • Seamless Integration: Works seamlessly with existing applications, requiring no changes to your application configuration 

Learn more about this policy here. 
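
For example, here is a minimal sketch of how the policy might be placed in an API's inbound policy section. The counter key, limit value, and variable name are illustrative placeholders; refer to the policy documentation for the exact attributes and defaults.

<policies>
    <inbound>
        <base />
        <!-- Illustrative only: cap each subscription at 5,000 tokens per minute -->
        <llm-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="true"
            remaining-tokens-variable-name="remainingTokens" />
    </inbound>
    <outbound>
        <base />
    </outbound>
</policies>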

LLM Emit Token Metric Policy 

LLM Emit Token Metric policy (preview) provides detailed metrics on token usage, enabling better cost management and insights into model usage across your application portfolio. 

 

Key Features 

  • Real-Time Monitoring: Emit metrics in real time to monitor token consumption
  • Detailed Insights: Gain insights into token usage patterns to identify and mitigate high-usage scenarios
  • Cost Management: Split token usage by any custom dimension to attribute cost to different teams, departments, or applications

Learn more about this policy here. 
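
As an illustration, the sketch below shows how the policy might emit token counts with custom dimensions for chargeback. The namespace and dimension names are example choices, not required values.

<policies>
    <inbound>
        <base />
        <!-- Illustrative: emit token usage metrics, split by subscription and API -->
        <llm-emit-token-metric namespace="genai-usage">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
            <dimension name="API ID" value="@(context.Api.Id)" />
        </llm-emit-token-metric>
    </inbound>
    <outbound>
        <base />
    </outbound>
</policies>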

 

LLM Semantic Caching Policy 

LLM Semantic Caching policy (preview) is designed to reduce latency and token consumption by caching responses based on the semantic content of prompts.

Key Features 

  • Reduced Latency: Cache responses to frequently requested queries to decrease response times
  • Improved Efficiency: Optimize resource utilization by reducing redundant model inferences
  • Content-Based Caching: Leverage semantic similarity to determine which response to retrieve from the cache

Learn more about this policy here. 
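
For illustration, the sketch below pairs the cache lookup policy in the inbound section with a matching cache store policy in the outbound section. The score threshold, the embeddings backend name ("embeddings-backend"), and the cache duration are placeholder values chosen for this example.

<policies>
    <inbound>
        <base />
        <!-- Illustrative: return a cached response when a semantically similar prompt was answered before -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Illustrative: cache the model response for 60 seconds -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>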

 

Get Started with Azure AI Model Inference API and Azure API Management 

We are committed to continuously improving our platform and providing the tools you need to leverage the full potential of large language models. Stay tuned as we roll out these new policies across all regions, and watch for further updates and enhancements as we continue to expand our capabilities. Get started today and take your intelligent application development to the next level with Azure API Management.

Updated Aug 08, 2024
Version 1.0
  • Steveatco (Copper Contributor)

    Hi, 
    Could you please explain how we can use the Azure AI Model Inference API and Foundry SDK with an APIM API? I'm not sure how this will work with the OpenAI API without doing rewrites, as the paths are different, and the Inference SDK probably needs to support this as well. I haven't come across any documentation, blogs, or YouTube content demonstrating this working via APIM either, so it would be appreciated if this could be clarified.
    Regards, Steve

  • Jayendran, thank you for your feedback! We'll investigate how we can address this challenge with token counts.

  • Jayendran (Iron Contributor)

    Great updates! I was experimenting with the `llm-semantic-cache-lookup` policy and would like to share my feedback with the PG team.

    Currently, we use `log-to-eventhub` to send the output response downstream to calculate the completion_prompt, total_prompt and input_prompt counts for chargeback. Once we start using `llm-semantic-cache-lookup`, the data may be returned from the cache, so we need to skip those calculations; otherwise it may look as though the usage came from the OpenAI instances. It would be helpful for `llm-semantic-cache-lookup` to expose an output property (as a variable) that identifies whether the request was a cache hit or a cache miss. With this output variable, we could update the downstream logic appropriately based on the hit/miss result.

     

    e.g.,

    <llm-semantic-cache-lookup
        score-threshold="0.05"
        embeddings-backend-id="embeddings-backend"
        output-variable-cache-hit="name of a variable that is set to true on a cache hit and false on a cache miss">
        <vary-by>"expression to partition caching"</vary-by>
    </llm-semantic-cache-lookup>

    This would be helpful not only for chargeback but also for many different use cases

     

    Thank you for considering this feedback!

    Jay