
Azure Integration Services Blog

Expanding GenAI Gateway Capabilities in Azure API Management

akamenev (Microsoft)
Aug 09, 2024

In May 2024, we introduced GenAI Gateway capabilities – a set of features designed specifically for GenAI use cases. Today, we are happy to announce new policies that support a wider range of large language models through the Azure AI Model Inference API. These new policies work in the same way as the previously announced capabilities but can now be applied to a broader set of LLMs.

The Azure AI Model Inference API enables you to consume the capabilities of models available in the Azure AI model catalog in a uniform and consistent way, letting you talk to different models in Azure AI Studio without changing the underlying code.

 

Working with large language models presents unique challenges, particularly around managing token resources. Token consumption affects both the cost and the performance of intelligent apps calling the same model, making it crucial to have robust mechanisms for monitoring and controlling token usage. The new policies address these challenges by providing detailed insight into and control over token resources, ensuring efficient and cost-effective use of models deployed in Azure AI Studio.

 

LLM Token Limit Policy 

LLM Token Limit policy (preview) provides the flexibility to define and enforce token limits when interacting with large language models available through the Azure AI Model Inference API. 

Key Features 

  • Configurable Token Limits: Set token limits for requests to control costs and manage resource usage effectively 
  • Prevents Overuse: Automatically blocks requests that exceed the token limit, ensuring fair use and eliminating the noisy neighbour problem 
  • Seamless Integration: Works seamlessly with existing applications, requiring no changes to your application configuration 

Learn more about this policy here. 
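
For example, here is a minimal sketch of how the policy might be placed in an API's inbound policy section. The counter key, limit value, and variable name are illustrative placeholders; refer to the policy documentation for the exact attributes and defaults.

<policies>
    <inbound>
        <base />
        <!-- Illustrative only: cap each subscription at 5,000 tokens per minute -->
        <llm-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="true"
            remaining-tokens-variable-name="remainingTokens" />
    </inbound>
    <outbound>
        <base />
    </outbound>
</policies>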

LLM Emit Token Metric Policy 

LLM Emit Token Metric policy (preview) provides detailed metrics on token usage, enabling better cost management and insights into model usage across your application portfolio. 

 

Key Features 

  • Real-Time Monitoring: Emit metrics in real time to monitor token consumption
  • Detailed Insights: Gain insights into token usage patterns to identify and mitigate high-usage scenarios
  • Cost Management: Split token usage by any custom dimension to attribute cost to different teams, departments, or applications

Learn more about this policy here. 
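
As an illustration, the sketch below shows how the policy might emit token counts with custom dimensions for chargeback. The namespace and dimension names are example choices, not required values.

<policies>
    <inbound>
        <base />
        <!-- Illustrative: emit token usage metrics, split by subscription and API -->
        <llm-emit-token-metric namespace="genai-usage">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
            <dimension name="API ID" value="@(context.Api.Id)" />
        </llm-emit-token-metric>
    </inbound>
    <outbound>
        <base />
    </outbound>
</policies>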

 

LLM Semantic Caching Policy 

LLM Semantic Caching policy (preview) is designed to reduce latency and token consumption by caching responses based on the semantic content of prompts.

Key Features 

  • Reduced Latency: Cache responses to frequently requested queries to decrease response times
  • Improved Efficiency: Optimize resource utilization by reducing redundant model inferences
  • Content-Based Caching: Leverage semantic similarity to determine which response to retrieve from the cache

Learn more about this policy here. 
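
For illustration, the sketch below pairs the cache lookup policy in the inbound section with a matching cache store policy in the outbound section. The score threshold, the embeddings backend name ("embeddings-backend"), and the cache duration are placeholder values chosen for this example.

<policies>
    <inbound>
        <base />
        <!-- Illustrative: return a cached response when a semantically similar prompt was answered before -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Illustrative: cache the model response for 60 seconds -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>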

 

Get Started with Azure AI Model Inference API and Azure API Management 

We are committed to continuously improving our platform and providing the tools you need to leverage the full potential of large language models. Stay tuned as we roll out these new policies across all regions, and watch for further updates and enhancements as we continue to expand our capabilities. Get started today and take your intelligent application development to the next level with Azure API Management.

Updated Aug 08, 2024
Version 1.0
  • Steveatco (Copper Contributor)

    Hi, 
    Could you please explain how we can use the Azure AI Model Inference API and Foundry SDK with an APIM API? I'm not sure how this will work with the OpenAI API without doing rewrites, as the paths are different, and the Inference SDK probably needs to support this as well. I haven't come across any documentation, blogs, or YouTube content demonstrating this working via APIM either, so it would be appreciated if this could be clarified.
    Regards, Steve

  • Jayendran, thank you for your feedback! We'll investigate how we can address this challenge with token counts.

  • Jayendran (Iron Contributor)

    Great updates! I was experimenting with the `llm-semantic-cache-lookup` policy and would like to share my feedback with the PG team.

    Currently, we use `log-to-eventhub` to send the output response downstream to calculate the completion_prompt, total_prompt and input_prompt counts for chargeback. Once we start using `llm-semantic-cache-lookup`, the data may be returned from the cache, so we need to skip those calculations; otherwise it may look as though the usage came from the OpenAI instances. It would be helpful for `llm-semantic-cache-lookup` to expose an output property (as a variable) that identifies whether the request was a cache hit or a cache miss. With this output variable, we could update the downstream logic appropriately based on the hit/miss result.

     

    e.g.,

    <llm-semantic-cache-lookup
        score-threshold="0.05"
        embeddings-backend-id="embeddings-backend"
        output-variable-cache-hit="name of a variable that is set to true on a cache hit and false on a cache miss">
        <vary-by>"expression to partition caching"</vary-by>
    </llm-semantic-cache-lookup>

    This would be helpful not only for chargeback but also for many different use cases

     

    Thank you for considering this feedback!

    Jay