api design
Cut Costs and Speed Up AI API Responses with Semantic Caching in Azure API Management
This article is part of a series on API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. We previously covered the hidden risks of AI APIs in today's AI-driven technological landscape. In this article, we dive deeper into one of the supported Gen AI policies in API Management, which helps you minimize Azure OpenAI costs and make your applications more performant by reducing the number of calls sent to your LLM service.

How does it currently work without the semantic caching policy?

For simplicity, let's look at a scenario with a single client app, a single user, and a single model deployment. This of course does not represent most real-world use cases, where you often have multiple users talking to different services. Consider the following sequence:

- A user lands on your application and sends in a query (query 1).
- They then send essentially the same query again, phrased almost identically, in the same session (query 2).
- The user changes the wording of the query, but it is still relevant and related to the original query (query 3).
- The last query (query 4) is completely different and unrelated to the previous queries.

In a typical implementation, every one of these queries costs you tokens, counted against your tokens-per-minute (TPM) quota, driving up your bill. Your users are also likely to experience some latency as they wait for the LLM to build a response on each call. As the user base grows, you can expect these expenses to grow sharply, making the system progressively more expensive to run.

How does semantic caching in Azure API Management fix this?

Let's look at the same scenario as described above, at a high level first, and walk through how you can cut costs and boost your app's performance with the semantic caching policy. When the user sends in the first query, the LLM is used to generate a response, which is then stored in the cache. Queries 2 and 3 are related to query 1: they may be semantically similar, an exact match, or contain a specified keyword, e.g. "price". In all these cases, a lookup is performed and the appropriate response is retrieved from the cache, without waiting for the LLM to regenerate it. Query 4, which is different from the previous prompts, is passed through to the LLM, and the generated response is then stored in the cache for future lookups.

Okay. Tell me more - how does this work and how do I set it up?

Think about this: how likely are your users to ask related or exactly comparable questions in your app? I'd argue that the odds are quite high.

Semantic caching for Azure OpenAI API requests

To start, you will need to add Azure OpenAI Service APIs to your Azure API Management instance with semantic caching enabled. Luckily, this step has been reduced to a single click. I'll link a tutorial on this in the 'Resources' section. Before you configure the policies, you first need to set up a backend for the embeddings API. As part of your deployments, you will need an embedding model to convert your input to the corresponding vector representation, allowing Azure Cache for Redis to perform the vector similarity search. This step also lets you set a score_threshold, a parameter that determines how similar user queries need to be for responses to be retrieved from the cache.
Next, add the two policies that you need: azure-openai-semantic-cache-store / llm-semantic-cache-store and azure-openai-semantic-cache-lookup / llm-semantic-cache-lookup.

The azure-openai-semantic-cache-store policy caches completions and requests to the configured cache service. You can use the internal Azure Redis Enterprise cache or any other external cache, as long as it is Redis-compatible and supported by Azure API Management. The second policy, azure-openai-semantic-cache-lookup, performs a cache lookup over the cached requests and completions, based on the result of the similarity search and the score_threshold. In addition to the score_threshold attribute, you also specify the id of the embeddings backend created in the earlier step, and you can choose to omit system messages from the prompt. Together, these two policies improve your system's efficiency and performance by reusing completions, increasing response speed, and making your API calls much cheaper. A configuration sketch follows at the end of this article.

Alright, so what should be my next steps?

This article introduced just one of the many Generative AI capabilities supported in Azure API Management. We have more policies that you can use to better manage your AI APIs, covered in other articles in this series. Do check them out.

Do you have any resources I can look at in the meantime to learn more?

Absolutely! Check out:

- Using external Redis-compatible cache in Azure API Management documentation
- Use Azure Cache for Redis as a semantic cache tutorial
- Enable semantic caching for Azure OpenAI APIs in Azure API Management article
- Improve the performance of an API by adding a caching policy in Azure API Management Learn module
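To make the setup concrete, here is a minimal sketch of how the two caching policies might be applied to an imported Azure OpenAI API, deployed as a raw-XML policy document through Bicep. The instance name, API name, embeddings backend id, score threshold, cache duration, and API version are illustrative assumptions rather than values from the article, and an external Redis-compatible cache is assumed to already be configured on the gateway.

```bicep
// Minimal sketch: apply semantic caching policies to an imported Azure OpenAI API.
// Resource names and attribute values are assumptions for illustration only.
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim-instance' // assumed APIM instance name
}

resource openAiApi 'Microsoft.ApiManagement/service/apis@2023-09-01-preview' existing = {
  parent: apim
  name: 'azure-openai-api' // assumed name of the imported Azure OpenAI API
}

resource semanticCachingPolicy 'Microsoft.ApiManagement/service/apis/policies@2023-09-01-preview' = {
  parent: openAiApi
  name: 'policy'
  properties: {
    format: 'rawxml'
    value: '''
<policies>
  <inbound>
    <base />
    <!-- Look up semantically similar prompts in the cache before calling the model.
         "embeddings-backend" is the backend created earlier for the embeddings deployment. -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.05"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned" />
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
    <!-- Store the completion in the cache for 120 seconds so similar prompts can reuse it. -->
    <azure-openai-semantic-cache-store duration="120" />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>
'''
  }
}
```

Tune the score-threshold and duration values for your own traffic; a stricter threshold returns cached answers only for very close matches, while a looser one reuses completions more aggressively.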
Improve LLM backend resiliency with load balancer and circuit breaker rules in Azure API Management

This article is part of a series on Azure API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. We previously covered the hidden risks of AI APIs in today's AI-driven technological landscape. In this article, we dive deeper into one of the supported Gen AI capabilities in API Management, which allows your applications to switch the effective Gen AI backend in response to specified or unexpected events.

In Azure API Management, you can set up your different LLMs as backends, define structures to route requests to prioritized backends, and add automatic circuit breaker rules to protect backends from too many requests.

Under normal conditions, if your Azure OpenAI service fails, users of your application keep receiving error messages until the backend issue is resolved and the service is ready to serve requests again. Similarly, managing multiple Azure OpenAI resources can be cumbersome, as manual URL changes are required in your API settings to switch between backend entities. This approach is inefficient and does not account for dynamic conditions, preventing seamless switching to the optimal backend for better performance and reliability.

How load balancing will work

First, configure your Azure OpenAI resources as referenceable backends, defining the base URL and assigning a backend id for each. As an example, let's assume we have three different Azure OpenAI resources: openai1, openai2, and openai3. To set up load balancing across the backends, you can use one of the supported strategies, or combine them, to ensure optimal use of your Azure OpenAI resources.

1. Round robin. As the name suggests, API Management evenly distributes requests across the available backends in the pool.

2. Priority-based. For this approach, you organize backends into priority groups, and API Management sends requests to these backends in order of priority. Back to our example: we assign openai1 the top priority (priority 1), openai2 priority 2, and openai3 priority 3. Requests are forwarded to openai1 first, and if that service is unreachable, calls are rerouted to openai2 in the next priority group, and so on.

3. Weighted. Here, you assign weights to your backends, and requests are distributed based on these relative weights. For our example, suppose we instead place openai2 and openai3 together in priority group 2 and give them equal weights: all requests default to openai1, and if it fails, traffic is split 50/50 across the two priority 2 backends.

Now, configure your circuit breaker rules

The next step is to define rules that listen to events in your API and trip when specified conditions are met. Let's look at an example to see how this works. Inside your circuitBreaker property configuration, you define an array that can hold multiple rules, and each rule's failure condition defines when the circuit breaker trips:

a. The circuit breaker will trip if there is at least one failure (the count).
b. The number of failures specified in count is monitored within 5-minute intervals.
c. We are looking for errors that return a status code of 429 (Too Many Requests); you can also define a range of status codes here.
d. The circuit remains tripped for 1 minute, after which it resets and traffic is routed to the endpoint again.

A Bicep sketch that puts the backends, priorities, weights, and circuit breaker rule together follows at the end of this article.

Alright, so what should be my next steps?

This article introduced just one of the many Generative AI capabilities supported in Azure API Management. We have more policies that you can use to better manage your AI APIs, covered in other articles in this series. Do check them out.

Do you have any resources I can look at in the meantime to learn more?

Absolutely! Check out:

- https://learn.microsoft.com/en-us/azure/api-management/set-backend-service-policy
- https://learn.microsoft.com/en-us/azure/api-management/backends?tabs=bicep
- https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/backend-pool-load-balancing
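As promised, here is a minimal Bicep sketch of one backend carrying a circuit breaker rule that matches the conditions above, plus a load-balanced pool combining the priority and weighted strategies. The instance name, endpoint URLs, and API version are illustrative assumptions, and openai2 and openai3 are assumed to already exist as backends declared the same way as openai1.

```bicep
// Minimal sketch: Azure OpenAI backends with a circuit breaker rule and a load-balanced pool.
// Names, URLs, and API versions are assumptions for illustration only.
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim-instance' // assumed APIM instance name
}

resource openai1 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'openai1'
  properties: {
    url: 'https://openai1.openai.azure.com/openai' // assumed endpoint of the first Azure OpenAI resource
    protocol: 'http'
    circuitBreaker: {
      rules: [
        {
          name: 'openai-breaker'
          failureCondition: {
            count: 1                     // trip after a single failure...
            interval: 'PT5M'             // ...observed within a 5-minute window
            statusCodeRanges: [
              { min: 429, max: 429 }     // watch for 429 Too Many Requests
            ]
          }
          tripDuration: 'PT1M'           // stay tripped for 1 minute, then reset
        }
      ]
    }
  }
}

// openai2 and openai3 are assumed to be declared the same way, pointing at their own endpoints.
// The pool below combines the priority-based and weighted strategies from the article:
// openai1 takes all traffic while healthy; openai2 and openai3 split traffic 50/50 if it trips.
resource openaiPool 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  parent: apim
  name: 'openai-backend-pool'
  properties: {
    type: 'Pool'
    pool: {
      services: [
        { id: '/backends/openai1', priority: 1 }
        { id: '/backends/openai2', priority: 2, weight: 50 }
        { id: '/backends/openai3', priority: 2, weight: 50 }
      ]
    }
  }
}
```

In your API policy, a set-backend-service policy pointing at the pool's backend id would then route traffic through this configuration, as shown in the linked backend-pool-load-balancing lab.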
Announcing: Azure API Center Hands-on Workshop 🚀

What is the Azure API Center Workshop?

The Azure API Center flash workshop is a resource designed to expand your understanding of how organizations can enhance and streamline their API management and governance strategies using Azure API Center. With this practical knowledge, you will be able to streamline secure API integration and enforce security and compliance with tools that evolve to meet your growing business needs.

Azure API Center is a centralized inventory designed to track all your APIs, regardless of their type, lifecycle stage, or deployment location. It enhances discoverability, development, and reuse of APIs. While Azure API Management focuses on API deployment and runtime governance, Azure API Center complements it by centralizing APIs and streamlining the registration of new APIs and design-time governance.

Who can go through the workshop?

The Azure API Center workshop benefits anyone interested in improving their API development, at-scale governance, discovery, and consumption workflows. Throughout the workshop, we reference three key personas heavily involved in the API ecosystem:

- API producers - Individual developers or teams who consolidate API specifications and requirements and design API architectures to fit defined goals. They also develop, secure, publish, and test APIs to ensure they meet functional and performance requirements, and document them.
- API platform engineers / API admins - Establish and enforce API best practices and design standards across teams and the entire organization. They also monitor and analyze API definitions for adherence to organizational style rules, generating both individual and summary reports. This ensures timely correction of common errors and inconsistencies in your API definitions.
- API consumers - Consume APIs to build systems and applications that use services provided by the organization, or use APIs directly to satisfy business needs.

To put these personas into perspective, we use a fictitious company, Contoso Airlines, to demonstrate how API Center integrates with existing API development, deployment, governance, and discovery workflows. If you are passionate about teaching and skilling others on API best practices and concepts, visit the workshop's GitHub repository for session delivery resources you can use for a step-by-step presentation on API Center based on this workshop. Don't forget to like and star the repo ⭐

What will you learn after going through the workshop?

The workshop is divided into three key pillars as follows:

API Inventory

Under the API Inventory pillar, you will go through guided steps to install the Azure API Center extension in VS Code, connect to your Azure account, and create an API Center resource with custom API metadata properties defined. You will then use GitHub Copilot to register existing APIs in your API Center resource, configure environments and deployments for your APIs, and set up automatic synchronization as you import more sample APIs from an API Management service.

API Governance

You will then go through the API governance experience, first from an API producer's perspective, defining and applying a custom API style ruleset in VS Code, and then from an API admin's perspective, deploying the ruleset to Azure and viewing API analysis reports in the Azure portal to determine the health and compliance of all APIs across the organization.
API Discovery & Consumption

Here, you will learn how to discover all APIs in VS Code and in the Azure portal, a key step before creating any new APIs to avoid duplication and promote reuse. You will also quickly load API documentation and test your APIs.

Where do I go to get started with the workshop?

To get started, go directly to https://aka.ms/APICenter/Workshop or visit our GitHub repository, where you can open issues, leave feedback, and leave a star 😉
Take full control of your AI APIs with Azure API Management Gateway

This article is part of a series on API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. In this article, we shed some light on capabilities in API Management designed to help you govern and manage Generative AI APIs, ensuring that you are building resilient and secure intelligent applications.

But why exactly do I need API Management for my AI APIs?

Common challenges when implementing Gen AI-powered solutions include:

- Allocating quota, calculated in tokens per minute (TPM), across multiple client apps
- Controlling and tracking token consumption for all users
- Attributing costs to specific client apps, activities, or users
- Keeping your system resilient to backend failures when one or more limits are hit

And the list goes on with more challenges and questions. Well, let's find some answers, shall we?

Quota allocation

Take a scenario where you have more than one client application, and they are talking to one or more models from Azure OpenAI Service or Azure AI Foundry. With this complexity, you want control over the quota distribution for each of the applications.

Tracking token usage & security

I bet you agree that it would be unfortunate if one of your applications (most likely the one that gets the highest traffic) hogged all the TPM quota, leaving zero tokens for your other applications, right? If this occurs, there is also a chance it is a DDoS attack, with bad actors bombarding your system with purposeless traffic and causing service downtime. Yet another reason why you need more control and tracking mechanisms to ensure this doesn't happen.

Token metrics

As a data-driven company, having additional insights with the flexibility to dissect and examine usage data down to dimensions like subscription ID or API ID is extremely valuable. These metrics go a long way in informing capacity and budget planning decisions.

Automatic failovers

This is a common one. You want your users to experience zero service downtime, so if one of your backends is down, does your system architecture allow automatic rerouting and forwarding to healthy services?

So, how will API Management help address these challenges?

API Management has a set of policies and metrics called Generative AI (Gen AI) gateway capabilities, which give you full control of all these moving pieces of your intelligent systems.

Minimize cost with token-based limits and semantic caching

How can you minimize operational costs for AI applications as much as possible? By leveraging the `llm-token-limit` policy in Azure API Management, you can enforce token-based limits per consumer on identifiers such as subscription keys and requesting IP addresses. When a caller surpasses their allocated tokens-per-minute quota, they receive an HTTP "Too Many Requests" error along with 'retry-after' instructions. This mechanism ensures fair usage and prevents any single user from monopolizing resources. A policy sketch appears at the end of this article.

To optimize cost consumption for Large Language Models (LLMs), it is crucial to minimize the number of API calls made to the model. Implementing the `llm-semantic-cache-store` and `llm-semantic-cache-lookup` policies allows you to store and retrieve similar completions.
This method performs a cache lookup to reuse completions, thereby reducing the number of calls sent to the LLM backend. Consequently, this strategy helps significantly lower operational costs.

Ensure reliability with load balancing and circuit breakers

Azure API Management lets you use load balancers to distribute the workload effectively across prioritized LLM backends. Additionally, you can set up circuit breaker rules that redirect requests to a responsive backend if the prioritized one fails, minimizing recovery time and enhancing system reliability. Implementing the semantic caching policy not only saves costs but also reduces system latency by minimizing the number of calls processed by the backend.

Okay. What next?

This article mentions these capabilities at a high level, but in the coming weeks, we will publish articles that go deeper into each of these generative AI capabilities in API Management, with examples of how to set up each policy. Stay tuned!

Do you have any resources I can look at in the meantime to learn more?

Absolutely! Check out:

- Manage your Azure OpenAI APIs with Azure API Management
- http://aka.ms/apimlove
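As a rough illustration of the token-limit and token-metric capabilities described above, here is a minimal sketch that applies the Azure OpenAI variants of these policies to an imported Azure OpenAI API via Bicep. The instance name, API name, per-minute limit, metric namespace, and dimensions are illustrative assumptions, and emitting metrics assumes Application Insights is already configured for the instance.

```bicep
// Minimal sketch: per-subscription token limits plus token metrics for an Azure OpenAI API.
// Resource names and values are assumptions for illustration only.
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim-instance' // assumed APIM instance name
}

resource openAiApi 'Microsoft.ApiManagement/service/apis@2023-09-01-preview' existing = {
  parent: apim
  name: 'azure-openai-api' // assumed name of the imported Azure OpenAI API
}

resource tokenGovernancePolicy 'Microsoft.ApiManagement/service/apis/policies@2023-09-01-preview' = {
  parent: openAiApi
  name: 'policy'
  properties: {
    format: 'rawxml'
    value: '''
<policies>
  <inbound>
    <base />
    <!-- Cap each subscription at 5000 tokens per minute; callers that exceed it
         receive 429 Too Many Requests with retry-after guidance. -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="true"
        remaining-tokens-header-name="x-remaining-tokens" />
    <!-- Emit token-consumption metrics, split by subscription and API,
         so usage can be attributed and charged back. -->
    <azure-openai-emit-token-metric namespace="genai-usage">
      <dimension name="Subscription ID" />
      <dimension name="API ID" />
    </azure-openai-emit-token-metric>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>
'''
  }
}
```

Switching the counter-key expression to, say, the caller's IP address or a custom header lets you scope the limit to whatever identifier best represents a "consumer" in your system.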
How To: Retrieve from CosmosDB using Azure API Management

In this How To, I will show a simple mechanism for reading items from Cosmos DB using Azure API Management (APIM). There are many scenarios where you might want to do this in order to leverage the capabilities of APIM while having a highly scalable, flexible data store.
Essentials for building and modernizing AI apps on Azure

Building and modernizing AI applications is complex, but Azure Essentials simplifies the journey. With a structured, three-stage approach (Readiness and Foundation; Design and Govern; Manage and Optimize), it provides tools, best practices, and expert guidance to tackle key challenges like skilled resource gaps, modernization, and security. Discover how to streamline AI app development, enhance scalability, and achieve cost efficiency while driving business value. Ready to transform your AI journey? Explore the Azure Essentials Hub today.
How To: Send requests to Azure Storage from Azure API Management

In this How To, I will show a simple mechanism for writing a payload to Azure Blob Storage from Azure API Management. Some examples of where this is useful are implementing a Claim-Check pattern for large messages, or supporting message logging when Application Insights is not suitable.
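The full post walks through its own implementation; as a hedged sketch of one common approach (not necessarily the author's), the operation-level policy below authenticates to Blob Storage with the API Management managed identity and forwards the request body as a Put Blob call. The instance, API, operation, storage account, container, storage API version, and the {blobName} template parameter are all assumptions, and the managed identity would also need a data-plane role such as Storage Blob Data Contributor on the container.

```bicep
// Hedged sketch (not the original post's implementation): write the incoming payload to
// Azure Blob Storage from an APIM operation using the gateway's managed identity.
// All names are assumptions for illustration only.
resource apim 'Microsoft.ApiManagement/service@2023-09-01-preview' existing = {
  name: 'my-apim-instance' // assumed APIM instance name
}

resource storageApi 'Microsoft.ApiManagement/service/apis@2023-09-01-preview' existing = {
  parent: apim
  name: 'storage-facade-api' // assumed facade API
}

resource putBlobOperation 'Microsoft.ApiManagement/service/apis/operations@2023-09-01-preview' existing = {
  parent: storageApi
  name: 'put-blob' // assumed operation whose URL template includes {blobName}
}

resource putBlobPolicy 'Microsoft.ApiManagement/service/apis/operations/policies@2023-09-01-preview' = {
  parent: putBlobOperation
  name: 'policy'
  properties: {
    format: 'rawxml'
    value: '''
<policies>
  <inbound>
    <base />
    <!-- Obtain a bearer token for Azure Storage using the APIM managed identity;
         by default the token is placed in the Authorization header. -->
    <authentication-managed-identity resource="https://storage.azure.com/" />
    <!-- Headers required by the Blob service Put Blob operation. -->
    <set-header name="x-ms-version" exists-action="override">
      <value>2021-08-06</value>
    </set-header>
    <set-header name="x-ms-blob-type" exists-action="override">
      <value>BlockBlob</value>
    </set-header>
    <!-- Forward the request body to the assumed storage account and container. -->
    <set-backend-service base-url="https://mystorageaccount.blob.core.windows.net/claim-check" />
    <rewrite-uri template="/{blobName}" />
    <set-method>PUT</set-method>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>
'''
  }
}
```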
Build a no-code GraphQL service with Azure API Management

Synthetic GraphQL allows you to build a GraphQL API on top of your existing REST, SOAP, or other HTTP APIs. It lets your existing APIs coexist while you migrate your client applications to GraphQL. Learn how you can leverage Synthetic GraphQL with Azure API Management.
Helm charts managed through Terraform to deploy an Azure SecretProviderClass on AKS

Using Terraform and Helm charts together lets you reap the benefits of both worlds: make full use of your teams' skills, pass calculated values from your cloud provider without hard-coding them, and review the changes new git commits will make before applying them in production.