Azure Cache for Redis
AI Resilience: Strategies to Keep Your Intelligent App Running at Peak Performance
Stay Online

Reliability is one of the five pillars of the Azure Well-Architected Framework. When you take a new product to market that integrates with Azure OpenAI Service, you can face usage spikes in your workload. Even if everything scales correctly on your side, an Azure OpenAI Service deployment using provisioned throughput units (PTUs) can hit its PTU threshold, and you will start to see 429 response codes. The response headers also carry important information about when you can retry the request, and you can use that information to build a retry into your own business logic (a client-side sketch appears at the end of this article, just before the conclusion). Here I will show how to use an API Management policy to handle this, and also explore the native cache to save some tokens!

Architecture Reference

The Azure Function on the left of the diagram just represents an app request and can be any kind of resource (even in an on-premises environment). Our goal in this article is to show one of many possible ways to handle 429 responses. We are going to use an API Management policy to automatically redirect the backend to another Azure OpenAI Service instance in a second region deployed in Standard mode, which means you are charged only for what you use.

First, we need to create an API in API Management that forwards requests to your main Azure OpenAI Service instance (region 1 in the diagram). Then we create this policy on the API request:

<policies>
    <inbound>
        <base />
        <set-backend-service base-url="<your_open_ai_region1_endpoint>" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="5">
            <set-backend-service base-url="<your_open_ai_region2_endpoint>" />
        </retry>
    </on-error>
</policies>

The first part of our job is done! Requests are now automatically redirected to the Azure OpenAI Service deployment in region 2 whenever the PTU threshold is reached.

Cost consideration

So now you might ask: what about the added cost of API Management? Even if you don't want to use any other API Management feature, you can take advantage of the API Management native cache and, once again using policy and AI, store questions and answers in the built-in Redis* cache using semantic caching for Azure OpenAI Service. Let's change our policy to consider this:

<policies>
    <inbound>
        <base />
        <azure-openai-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="azure-openai-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </azure-openai-semantic-cache-lookup>
        <set-backend-service base-url="<your_open_ai_region1_endpoint>" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
        <azure-openai-semantic-cache-store duration="60" />
    </outbound>
    <on-error>
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="5">
            <set-backend-service base-url="<your_open_ai_region2_endpoint>" />
        </retry>
    </on-error>
</policies>

Now API Management evaluates the incoming prompt for semantic equivalence and decides whether it matches cached information or needs to be forwarded to your Azure OpenAI endpoint. Sometimes this can help you avoid reaching the PTU threshold in the first place!

* Check the tier and cache capabilities to validate your business solution's needs against the API Management cache feature: compare API Management features across tiers and cache size across tiers.
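The policy above retries at the gateway, but the same Retry-After information mentioned earlier can also be honored directly in your application code. Here is a minimal client-side sketch under stated assumptions: it uses the requests library, and the gateway URL, deployment name, API version, and key are placeholders you would replace with your own values.

import time
import requests

# Placeholders: substitute your own gateway URL, deployment, and key.
APIM_URL = ("https://<your-apim>.azure-api.net/openai/deployments/"
            "<deployment>/chat/completions?api-version=2024-02-01")
HEADERS = {"api-key": "<your-key>", "Content-Type": "application/json"}

def call_with_retry(payload: dict, max_attempts: int = 3) -> dict:
    """POST to the gateway, honoring Retry-After on 429 responses."""
    for attempt in range(max_attempts):
        resp = requests.post(APIM_URL, headers=HEADERS, json=payload, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-throttling errors
            return resp.json()
        # Throttled: the response header tells us when it is safe to retry.
        wait_seconds = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait_seconds)
    raise RuntimeError("Still throttled after all retry attempts")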
Conclusion

API Management offers key capabilities for AI that we explored in this article, along with others you can leverage for your intelligent applications. Check them out in the awesome AI Gateway HUB repository. Last but not least, dive into API Management features with experts in the field inside the API Management HUB. Thanks for reading and Happy Coding!

Cut Costs and Speed Up AI API Responses with Semantic Caching in Azure API Management
This article is part of a series on API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. We previously covered the hidden risks of AI APIs in today's AI-driven technological landscape. In this article, we dive deeper into one of the supported Gen AI policies in API Management, which allows you to minimize Azure OpenAI costs and make your applications more performant by reducing the number of calls sent to your LLM service.

How does it currently work without the semantic caching policy?

For simplicity, let's look at a scenario where we have only a single client app, a single user, and a single model deployment. This of course does not represent most real-world use cases, as you often have multiple users talking to different services. Take the following cases into consideration:

- A user lands on your application and sends in a query (query 1).
- They then send the same query again, with similar verbiage, in the same session (query 2).
- The user changes the wording of the query, but it is still relevant and related to the original query (query 3).
- The last query (query 4) is completely different and unrelated to the previous queries.

In a naive implementation, all of these queries cost you tokens (TPM), driving up your bill. Your users are also likely to experience some latency as they wait for the LLM to build a response on each call. As the user base grows, you can expect expenses to grow with it, eventually making the system more and more expensive to run.

How does semantic caching in Azure API Management fix this?

Let's look at the same scenario as described above (at a high level first), with a flow diagram representing how you can cut costs and boost your app's performance with the semantic cache policy. When the user sends in the first query, the LLM is used to generate a response, which is then stored in the cache. Queries 2 and 3 are related to query 1: the relationship could be semantic similarity, an exact match, or a specified keyword, e.g., "price". In all these cases, a lookup is performed and the appropriate response is retrieved from the cache, without waiting for the LLM to regenerate a response. Query 4, which is different from the previous prompts, passes through to the LLM; the generated response is then stored in the cache for future lookups.

Okay. Tell me more - how does this work and how do I set it up?

Think about this: how likely are your users to ask related or nearly identical questions in your app? I'd argue that the odds are quite high.

Semantic caching for Azure OpenAI API requests

To start, you will need to add Azure OpenAI Service APIs to your Azure API Management instance with semantic caching enabled. Luckily, this step has been reduced to just one click. I'll link a tutorial on this in the 'Resources' section. Before you configure the policies, you first need to set up a backend for the embeddings API. Oh yes, as part of your deployments, you will need an embedding model to convert your input to its corresponding vector representation, allowing Azure Redis cache to perform the vector similarity search. This step also allows you to set a score_threshold, a parameter used to determine how similar user queries need to be to retrieve responses from the cache.
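To build some intuition for what score_threshold controls, here is a tiny illustrative sketch; it is not APIM's actual implementation. The embedding vectors are stand-ins for your embeddings backend, and the threshold is treated here as a cosine distance where lower means more similar (consistent with a strict value like 0.05); check the policy documentation for the exact semantics.

import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def lookup(query_vec, cache, score_threshold=0.05):
    """Return the cached completion closest to query_vec, but only
    if it falls within score_threshold; otherwise report a miss."""
    best_completion, best_dist = None, float("inf")
    for cached_vec, completion in cache:
        dist = cosine_distance(query_vec, cached_vec)
        if dist < best_dist:
            best_completion, best_dist = completion, dist
    return best_completion if best_dist <= score_threshold else None  # miss -> call the LLM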
Next is to add the two policies that you need: azure-openai-semantic-cache-store / llm-semantic-cache-store and azure-openai-semantic-cache-lookup / llm-semantic-cache-lookup.

The azure-openai-semantic-cache-store policy caches the completions and requests to the configured cache service. You can use the internal Azure Redis Enterprise cache or any other external cache, as long as it is Redis-compatible, in Azure API Management. The second policy, azure-openai-semantic-cache-lookup, performs a cache lookup over the collection of cached requests and completions, based on the proximity result of the similarity search and the score_threshold. In addition to the score_threshold attribute, you also specify the id of the embeddings backend created in the earlier step, and you can choose to omit the system messages from the prompt at this step. These two policies enhance your system's efficiency and performance by reusing completions, increasing response speed, and making your API calls much cheaper (a quick way to sanity-check the cache in action is sketched after the resource list below).

Alright, so what should be my next steps?

This article just introduced you to one of the many Generative AI capabilities supported in Azure API Management. We have more policies that you can use to better manage your AI APIs, covered in other articles in this series. Do check them out.

Do you have any resources I can look at in the meantime to learn more?

Absolutely! Check out:

- Using external Redis-compatible cache in Azure API Management documentation
- Use Azure Cache for Redis as a semantic cache tutorial
- Enable semantic caching for Azure OpenAI APIs in Azure API Management article
- Improve the performance of an API by adding a caching policy in Azure API Management Learn module
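One rough way to see the store and lookup policies working together is to time two semantically similar calls through the gateway: the second should come back noticeably faster when it is served from the cache. This sketch assumes the requests library and placeholder APIM values (URL, deployment, subscription key) that you would replace with your own.

import time
import requests

# Placeholders for your own APIM-fronted Azure OpenAI endpoint.
URL = ("https://<your-apim>.azure-api.net/openai/deployments/"
       "<deployment>/chat/completions?api-version=2024-02-01")
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}

def timed_call(prompt: str) -> float:
    """Send one chat request and return the elapsed time in seconds."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, json=body, timeout=60).raise_for_status()
    return time.perf_counter() - start

print(f"cold:  {timed_call('What does your premium plan cost?'):.2f}s")
print(f"warm:  {timed_call('How much is the premium plan?'):.2f}s")  # similar prompt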
Take full control of your AI APIs with Azure API Management Gateway

This article is part of a series on API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. In this article, we shed some light on capabilities in API Management designed to help you govern and manage Generative AI APIs, ensuring that you are building resilient and secure intelligent applications.

But why exactly do I need API Management for my AI APIs?

Common challenges when implementing Gen AI-powered solutions include:

- Quota (calculated in tokens per minute, or TPM) allocation across multiple client apps
- Controlling and tracking token consumption for all users
- Mechanisms to attribute costs to specific client apps, activities, or users
- Your system's resilience to backend failures when hitting one or more limits

And the list goes on with more challenges and questions. Well, let's find some answers, shall we?

Quota allocation

Take a scenario where you have more than one client application, and they are talking to one or more models from Azure OpenAI Service or Azure AI Foundry. With this complexity, you want to have control over the quota distribution for each of the applications.

Tracking token usage & security

I bet you agree with me that it would be unfortunate if one of your applications (most likely the one getting the highest traffic) hogged the entire TPM quota, leaving zero tokens for your other applications, right? If this occurs, there is also a chance it is a DDoS attack, with bad actors bombarding your system with purposeless traffic and causing service downtime. Yet another reason why you need control and tracking mechanisms to ensure this doesn't happen (a tiny sketch of the tracking idea appears at the end of this article).

Token metrics

As a data-driven company, having additional insights with the flexibility to dissect and examine usage data down to dimensions like subscription ID or API ID is extremely valuable. These metrics go a long way in informing capacity and budget planning decisions.

Automatic failovers

This is a common one. You want to ensure that your users experience zero service downtime, so if one of your backends is down, does your system architecture allow automatic rerouting and forwarding to healthy services?

So, how will API Management help address these challenges?

API Management has a set of policies and metrics called Generative AI (Gen AI) gateway capabilities, which give you full control over all of these moving pieces and components of your intelligent systems.

Minimize cost with token-based limits and semantic caching

How can you minimize operational costs for AI applications as much as possible? By leveraging the llm-token-limit policy in Azure API Management, you can enforce token-based limits per user on identifiers such as subscription keys and requesting IP addresses. When a caller surpasses their allocated tokens-per-minute quota, they receive an HTTP 429 "Too Many Requests" error along with retry-after instructions. This mechanism ensures fair usage and prevents any single user from monopolizing resources. To optimize cost consumption for Large Language Models (LLMs), it is crucial to minimize the number of API calls made to the model. Implementing the llm-semantic-cache-store and llm-semantic-cache-lookup policies allows you to store and retrieve similar completions.
This method performs a cache lookup for reused completions, reducing the number of calls sent to the LLM backend and thereby significantly lowering operational costs.

Ensure reliability with load balancing and circuit breakers

Azure API Management allows you to leverage load balancers to distribute the workload effectively across prioritized LLM backends. Additionally, you can set up circuit breaker rules that redirect requests to a responsive backend if the prioritized one fails, minimizing recovery time and enhancing system reliability. Implementing the semantic caching policy not only saves costs but also reduces system latency by minimizing the number of calls processed by the backend.

Okay. What next?

This article mentions these capabilities at a high level, but in the coming weeks we will publish articles that go deeper into each of these generative AI capabilities in API Management, with examples of how to set up each policy. Stay tuned!

Do you have any resources I can look at in the meantime to learn more?

Absolutely! Check out:

- Manage your Azure OpenAI APIs with Azure API Management: http://aka.ms/apimlove
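As a concrete illustration of the token-tracking idea above: non-streaming chat completion responses include a usage object with prompt, completion, and total token counts, which a gateway or client can aggregate per application. A simplified, illustrative sketch (the app identifier and response below are hypothetical):

from collections import defaultdict

# Running total of tokens consumed, keyed by client app.
token_usage: dict[str, int] = defaultdict(int)

def record_usage(app_id: str, response_json: dict) -> None:
    """Accumulate the total_tokens reported in a completion response."""
    usage = response_json.get("usage", {})
    token_usage[app_id] += usage.get("total_tokens", 0)

# e.g. after each call made on behalf of an app:
record_usage("mobile-app", {"usage": {"prompt_tokens": 12,
                                      "completion_tokens": 30,
                                      "total_tokens": 42}})
print(dict(token_usage))  # {'mobile-app': 42}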
Unlock New AI and Cloud Potential with .NET 9 & Azure: Faster, Smarter, and Built for the Future

.NET 9, now available to developers, marks a significant milestone in the evolution of the .NET platform, pushing the boundaries of performance, cloud-native development, and AI integration. This release, shaped by contributions from over 9,000 community members worldwide, introduces thousands of improvements that set the stage for the future of application development. With seamless integration with Azure and a focus on cloud-native development and AI capabilities, .NET 9 empowers developers to build scalable, intelligent applications with unprecedented ease.

Expanding Azure PaaS Support for .NET 9

With the release of .NET 9, a comprehensive range of Azure Platform as a Service (PaaS) offerings now fully supports the platform's new capabilities, including the latest .NET SDK for any Azure developer. This extensive support allows developers to build, deploy, and scale .NET 9 applications with optimal performance and adaptability on Azure. Additionally, developers can access a wealth of architecture references and sample solutions to guide them in creating high-performance .NET 9 applications on Azure's powerful cloud services:

- Azure App Service: Run, manage, and scale .NET 9 web applications efficiently. Check out this blog to learn more about what's new in Azure App Service.
- Azure Functions: Leverage serverless computing to build event-driven .NET 9 applications with improved runtime capabilities.
- Azure Container Apps: Deploy microservices and containerized .NET 9 workloads with integrated observability.
- Azure Kubernetes Service (AKS): Run .NET 9 applications in a managed Kubernetes environment with expanded ARM64 support.
- Azure AI Services and Azure OpenAI Service: Integrate advanced AI and OpenAI capabilities directly into your .NET 9 applications.
- Azure API Management, Azure Logic Apps, Azure Cognitive Services, and Azure SignalR Service: Ensure seamless integration and scaling for .NET 9 solutions.

These services provide developers with a robust platform to build high-performance, scalable, and cloud-native applications while leveraging Azure's optimized environment for .NET.

Streamlined Cloud-Native Development with .NET Aspire

.NET Aspire is a game-changer for cloud-native applications, enabling developers to build distributed, production-ready solutions efficiently. Available in preview with .NET 9, Aspire streamlines app development, with cloud efficiency and observability at its core. The latest updates in Aspire include secure defaults, Azure Functions support, and enhanced container management. Key capabilities include:

- Optimized Azure integrations: Aspire works seamlessly with Azure, enabling fast deployments, automated scaling, and consistent management of cloud-native applications.
- Easier deployments to Azure Container Apps: Designed for containerized environments, .NET Aspire integrates with Azure Container Apps (ACA) to simplify the deployment process. Using the Azure Developer CLI (azd), developers can quickly provision and deploy .NET Aspire projects to ACA, with built-in support for Redis caching, application logging, and scalability.
- Built-in observability: A real-time dashboard provides insights into logs, distributed traces, and metrics, enabling local and production monitoring with Azure Monitor.

With these capabilities, .NET Aspire allows developers to deploy microservices and containerized applications effortlessly on ACA, streamlining the path from development to production in a fully managed, serverless environment.
Integrating AI into .NET: A Seamless Experience

In our ongoing effort to empower developers, we've made integrating AI into .NET applications simpler than ever. Our strategic partnerships, including collaborations with OpenAI, LlamaIndex, and Qdrant, have enriched the AI ecosystem and strengthened .NET's capabilities. This year alone, usage of Azure OpenAI services has surged to nearly a billion API calls per month, illustrating the growing impact of AI-powered .NET applications.

Real-world AI solutions with .NET: .NET has been pivotal in driving AI innovations. From internal teams like Microsoft Copilot creating AI experiences with .NET Aspire to tools like GitHub Copilot, developed with .NET to enhance productivity in Visual Studio and VS Code, the platform showcases AI at its best. KPMG Clara is a prime example, developed to enhance audit quality and efficiency for 95,000 auditors worldwide. By leveraging .NET and scaling securely on Azure, KPMG implemented robust AI features aligned with strict industry standards, underscoring .NET and Azure as the backbone for high-performing, scalable AI solutions.

Performance Enhancements in .NET 9: Raising the Bar for Azure Workloads

.NET 9 introduces substantial performance upgrades, with over 7,500 merged pull requests focused on speed and efficiency, ensuring .NET 9 applications run optimally on Azure. These improvements contribute to reduced cloud costs and provide a high-performance experience across Windows, Linux, and macOS. To see how significant these performance gains can be for cloud services, take a look at what past .NET upgrades achieved for Microsoft's high-scale internal services:

- Bing achieved a major reduction in startup times, enhanced efficiency, and decreased latency across its high-performance search workflows.
- Microsoft Teams improved efficiency by 50%, reduced latency by 30-45%, and achieved up to 100% gains in CPU utilization for key services, resulting in faster user interactions.
- Microsoft Copilot and other AI-powered applications benefited from optimized runtime performance, enabling scalable, high-quality experiences for users.

Upgrading to the latest .NET version offers similar benefits for cloud apps, optimizing both performance and cost efficiency. For more information on updating your applications, check out the .NET Upgrade Assistant. For additional details on ASP.NET Core, .NET MAUI, NuGet, and more enhancements across the .NET platform, check out the full Announcing .NET 9 blog post.

Conclusion: Your Path to the Future with .NET 9 and Azure

.NET 9 isn't just an upgrade; it's a leap forward, combining cutting-edge AI integration, cloud-native development, and unparalleled performance. Paired with Azure's scalability, these advancements provide a trusted, high-performance foundation for modern applications. Get started by downloading .NET 9 and exploring its features. Leverage .NET Aspire for streamlined cloud-native development, deploy scalable apps with Azure, and embrace new productivity enhancements to build for the future. For additional insights on ASP.NET, .NET MAUI, NuGet, and more, check out the full Announcing .NET 9 blog post. Explore the future of cloud-native and AI development with .NET 9 and Azure: your toolkit for creating the next generation of intelligent applications.

Azure Managed Redis (Preview): The Next Generation of Redis on Azure at Microsoft Ignite 2024
We were excited to announce the preview of Azure Managed Redis at Microsoft Ignite 2024, a first-party, in-memory database solution designed for developers building the next generation of GenAI applications.

Exciting Updates Coming to Conversational Diagnostics (Public Preview)
Last year, at Ignite 2023, we unveiled Conversational Diagnostics (Preview), a revolutionary tool with integrated AI-powered capabilities to enhance problem-solving for Windows Web Apps. This year, we're thrilled to share what's new and forthcoming for Conversational Diagnostics (Preview). Get ready to experience a broader range of functionalities and expanded support across various Azure products, making your troubleshooting journey even more seamless and intuitive.

Introducing Azure Managed Redis, cost-effective caching for your AI apps
Azure Managed Redis, announced at Microsoft's Ignite conference, is a new service that brings the latest Redis innovations to the hyperscale cloud. It features four tiers (Memory Optimized, Balanced, Compute Optimized, and Flash Optimized) designed to enhance performance and scalability for GenAI applications. With an up to 99.999% availability SLA, a cost-effective total cost of ownership, and seamless interoperability with Azure services, it supports high-performance, scalable AI workloads.

Azure API Center: Centralizing API Management for Enhanced Discovery and Governance
Discover how Azure API Center can revolutionize your API management by centralizing control, improving discovery, and enhancing governance. Learn more about its key features and benefits, and access free training on Microsoft Learn today!