You Can Build a Framework-Agnostic AI Gateway on Azure App Service: Here's How
The agent infrastructure conversation moved this year. In October 2025, AWS shipped Amazon Bedrock AgentCore: a managed agent runtime with per-session microVM isolation, built-in long-term memory, native MCP support, and an opinionated policy engine. A few months earlier, Cloudflare shipped its Agents SDK on top of Durable Objects, betting that edge-native stateful agents are the future. Both bets are real, both are interesting, and both arrive as closed, proprietary runtimes.

So: what's Azure's answer? It's a question I've heard a couple of times from architects in the last six months. The honest answer is that Azure already has the pieces. They don't ship as one product called AgentRuntime, and that's actually the point. Azure's pitch is composable: App Service + API Management + MCP, three services you already have access to, glued together with open standards.

This post walks through a runnable sample of that composition: one App Service hosting both an agent (built with the Microsoft Agent Framework) and the stateless MCP server it calls, fronted by Azure API Management with the AI Gateway policy set (semantic caching, token rate limiting, per-subscription token emission for chargeback). One azd up deploys the lot. Repo: app-service-ai-gateway-mcp-apim-python.

The headline claim is in the title. The point I actually want to make is the one underneath it: the framework is replaceable, the gateway is the contribution. Swap the Agent Framework module for Pydantic AI or LangGraph and the rest of the architecture is unchanged. That's what "run anything" means, made literal.
The composable stack

              ┌────────────────────────────────────────────────┐
              │ Azure API Management                           │
MCP / Agent ──┤ AI Gateway policies:                           │
client        │   • llm-token-limit                            │
              │   • llm-semantic-cache-lookup / store          │
              │   • llm-emit-token-metric                      │
              │   • rate-limit-by-key (MCP API)                │
              └─────────────┬───────────────────┬──────────────┘
                            │                   │
           ┌────────────────▼──┐   ┌────────────▼──────────────┐
           │ Azure OpenAI      │   │ Azure App Service         │
           │  • chat model     │   │ FastAPI app:              │
           │  • embedding      │   │  • /mcp (stateless)       │
           │    model          │   │  • /agent/chat            │
           └───────────────────┘   │ Managed identity →        │
                                   │ APIM (via subscription)   │
                                   └────────────┬──────────────┘
                                                ▼
                                      Application Insights
                            (cloud_RoleInstance, APIM token metrics)

Three observations that drive everything else:

- APIM is the only thing that talks to Azure OpenAI. The App Service agent doesn't have an AOAI key. It has an APIM subscription key. Every LLM call passes through the gateway, picks up the policies, and gets logged with consistent dimensions. That's where the governance part lives.
- The agent runtime is App Service. Linux, Python, FastAPI. Any language. Any framework. Pick your tool. We use Microsoft Agent Framework because it just GA'd and the API is clean, but the agent module is the easiest thing in the stack to swap.
- The MCP server is co-located with the agent. Same App Service, different route. The agent calls its own tools either in-process (fast path) or back out through APIM (so MCP traffic gets rate-limited and observed too). That choice is one environment variable.

What the sample actually does

The FastAPI app exposes three routes that matter:

- /mcp: a stateless HTTP MCP server (protocol revision 2025-11-25), implementing four tools: whoami, echo, lookup_fact, and summarize_app_service_doc. Any MCP client (Claude, VS Code, your own agent runtime) can connect.
- /agent/chat: a Microsoft Agent Framework agent that uses those same MCP tools as its tool set, and calls AOAI through APIM.
- /health and /: the boring but essential supporting cast (health check for App Service probes, status page showing the serving instance ID).

Here's the agent definition. The key line is the endpoint:

    import os
    from agent_framework.openai import OpenAIChatCompletionClient

    client = OpenAIChatCompletionClient(
        azure_endpoint=os.environ["APIM_GATEWAY_URL"],  # ← APIM, not AOAI
        model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
        api_version="2024-10-21",
        api_key=os.environ["APIM_SUBSCRIPTION_KEY"],
        default_headers={"Ocp-Apim-Subscription-Key": os.environ["APIM_SUBSCRIPTION_KEY"]},
    )

    agent = client.as_agent(
        name="AppServiceExpert",
        instructions=SYSTEM_INSTRUCTIONS,
        tools=build_tools(),
    )

That's it. The agent has no idea APIM exists. It thinks it's talking to AOAI. APIM is doing every interesting thing (auth, caching, throttling, metric emission) without the agent code knowing or caring.

The policy that does the heavy lifting

The AOAI API in APIM has one policy attached at the API scope. The full XML is in infra/apim/policies/aoai-policy.xml; here's the bones of it:

    <policies>
        <inbound>
            <base />
            <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="aoai-token" />
            <set-header name="Authorization" exists-action="override">
                <value>@("Bearer " + (string)context.Variables["aoai-token"])</value>
            </set-header>
            <azure-openai-token-limit counter-key="@(context.Subscription?.Id ?? "anonymous")" tokens-per-minute="50000" estimate-prompt-tokens="true" />
            <azure-openai-semantic-cache-lookup score-threshold="0.85" embeddings-backend-id="aoai-embeddings-backend" embeddings-backend-auth="system-assigned">
                <vary-by>@(context.Subscription?.Id ?? "anonymous")</vary-by>
            </azure-openai-semantic-cache-lookup>
            <set-backend-service backend-id="aoai-backend" />
            <azure-openai-emit-token-metric namespace="ai-gateway">
                <dimension name="Subscription ID" value="@(context.Subscription?.Id ?? "anonymous")" />
                <dimension name="API ID" value="@(context.Api.Id)" />
                <dimension name="Operation ID" value="@(context.Operation.Id)" />
                <dimension name="Client IP" value="@(context.Request.IpAddress)" />
            </azure-openai-emit-token-metric>
        </inbound>
        <outbound>
            <base />
            <azure-openai-semantic-cache-store duration="3600" />
        </outbound>
    </policies>

Four things are happening here that would otherwise be your problem:

- Auth to AOAI. APIM's managed identity holds the Cognitive Services OpenAI User role on the AOAI account. No keys.
- Token rate limiting. Each APIM subscription gets a tokens-per-minute budget. One runaway team can't starve everyone else.
- Semantic caching. The inbound policy embeds the prompt using the embedding deployment, queries the Redis-backed APIM cache for a vector match above the 0.85 threshold, and short-circuits the AOAI call on a hit. The outbound azure-openai-semantic-cache-store writes successful misses back.
- Per-call metric emission. Every call pushes PromptTokens, CompletionTokens, and TotalTokens to Application Insights as custom metrics tagged with the APIM subscription, the API, the operation, and the client IP. That's your chargeback dashboard, ready to query.

The whole thing is XML. None of it is in your agent code.

Deploying it

    azd auth login
    azd up

azd up provisions a P0v3 App Service Plan with the web app and a staging slot, an AOAI account with gpt-4o-mini + text-embedding-3-small deployments, an APIM Developer SKU service with the two APIs and the policy XML wired up, an Azure Cache for Redis Basic C0 as the semantic-cache store, and a Log Analytics workspace + Application Insights. The postprovision hook fetches the APIM subscription key for the AI Gateway product and writes it into the App Service's APIM_SUBSCRIPTION_KEY setting (and the staging slot's, so slot swaps are clean).

Be patient. Developer SKU APIM takes 30–45 minutes the first time.
If you want to prototype faster, the sample supports Consumption SKU as a one-line flip:

    azd env set APIM_SKU Consumption
    azd provision

Consumption provisions in about a minute and is great for sketching. Verify your specific policies are supported there before you ship it.

Governing it like a grown-up

The toy version of this post stops at "look, semantic cache works." The version your platform engineering lead wants to see goes further.

Per-team chargeback. The token-emit policy tags every call with the APIM subscription ID. Issue one subscription per team, hand it over with a quota, and your monthly chargeback report is a KQL query:

    customMetrics
    | where timestamp > startofmonth(now())
    | where name == "TotalTokens"
    | summarize Tokens=sum(valueSum) by Team=tostring(customDimensions["Subscription ID"])
    | extend USD = Tokens * 0.00015 / 1000 // gpt-4o-mini blended rate
    | order by USD desc

Content safety as a policy plug-in. Add an llm-content-safety block to the inbound policy and point it at an Azure AI Content Safety resource; every prompt and response gets moderated before reaching agents or end users. The sample doesn't deploy Content Safety by default (to keep the demo cost-free), but the README has the one-line Bicep + one-block policy delta.

Circuit breaker + multi-region failover. Add a second AOAI backend in a different region and an APIM backend pool, give the pool a circuit-breaker rule, and your agents inherit failover with zero code changes.

Rate-limit MCP traffic too. The MCP API has its own policy with rate-limit-by-key, so a runaway agent can't pin the MCP server with a hot loop.

None of these are gymnastics. They're one policy block each. The pattern is the same every time: write policy at the gateway, leave the agent code alone.

Proving it works

After azd up finishes, two checks.
First, hit the agent endpoint:

    curl -sS -X POST "$(azd env get-value WEB_URI)/agent/chat" \
      -H 'Content-Type: application/json' \
      -d '{"message": "How does App Service horizontally scale an MCP server?"}' | jq

You should see a reply that cites the instance ID (the agent calls whoami and summarize_app_service_doc to ground its answer) and a tool_calls array showing the agent's reasoning trace.

Second, run the k6 load test:

    export BASE_URL="$(azd env get-value WEB_URI)"
    export APIM_SUBSCRIPTION_KEY="$(azd env get-value APIM_SUBSCRIPTION_KEY)"
    k6 run loadtest/k6-gateway.js

The script hits /agent/chat with a small pool of semantically similar prompts. After a 30-second warmup, the steady phase should report a cache-hit ratio above 30%:

    APIM AI Gateway — k6 summary
    ─────────────────────────────
    Cache hits   : 412
    Cache misses : 88
    Hit ratio    : 82.4%

Cross-check in App Insights:

    ApiManagementGatewayLogs
    | where TimeGenerated > ago(15m)
    | where ApiId == "aoai"
    | extend cache = tostring(parse_json(ResponseHeaders)["x-llm-cache-status"])
    | summarize count() by cache, bin(TimeGenerated, 1m)
    | render columnchart

A solid bar of hits next to a smaller bar of misses is the gateway earning its keep.

"Run anything": the proof

Here's the part where I cash the check the title wrote. The agent module is the easiest thing in this stack to replace. Three changes to ship the same demo on Pydantic AI:

    # requirements.txt
    - agent-framework-core==1.5.0
    - agent-framework-openai==1.5.0
    + pydantic-ai==0.4.0

    # agent/agent.py
    import os
    from pydantic_ai import Agent
    from pydantic_ai.models.openai import OpenAIModel

    def build_agent():
        model = OpenAIModel(
            "gpt-4o-mini",
            base_url=f"{os.environ['APIM_GATEWAY_URL']}/openai/deployments/gpt-4o-mini",
            api_key=os.environ["APIM_SUBSCRIPTION_KEY"],
        )
        return Agent(model, system_prompt=SYSTEM_INSTRUCTIONS, tools=build_tools())

That's it. build_tools() returns the same list of async callables (Pydantic AI accepts plain Python functions as tools, same as Agent Framework).
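For illustration, here is a minimal sketch of what a framework-neutral build_tools() could look like. The sample's real implementation is not shown in this post, so the function bodies are assumptions: whoami reads WEBSITE_INSTANCE_ID (an environment variable App Service sets on each instance) with a made-up "local" fallback, echo returns its input, and lookup_fact and summarize_app_service_doc are left out.

```python
import asyncio
import os
from typing import Awaitable, Callable, List


async def whoami() -> str:
    """Return the ID of the App Service instance serving this call."""
    # WEBSITE_INSTANCE_ID is set by App Service; "local" is a hypothetical
    # fallback for running the sketch outside Azure.
    return os.environ.get("WEBSITE_INSTANCE_ID", "local")


async def echo(text: str) -> str:
    """Return the input unchanged (handy as a connectivity check)."""
    return text


def build_tools() -> List[Callable[..., Awaitable[str]]]:
    """Plain async callables: nothing here imports an agent framework."""
    return [whoami, echo]


if __name__ == "__main__":
    print([tool.__name__ for tool in build_tools()])
    print(asyncio.run(echo("ping")))
```

Because the list holds ordinary coroutine functions with type hints and docstrings, the same objects can be handed to whichever framework sits above the gateway.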
LangGraph works the same way — wire build_tools() into a ToolNode , point ChatOpenAI at the APIM gateway URL, done. Every APIM policy still fires. Every token metric still emits. Every cache hit still hits. The gateway is the boundary; the runtime above it is fungible. What AgentCore gets right I want to land this without spin. AgentCore's per-session microVM isolation is genuinely interesting — it's a stronger sandboxing story than running multiple agents in shared App Service workers, and it matters for multi-tenant SaaS where agents execute arbitrary user code or call third-party tools you don't fully trust. The managed long-term memory primitive is also a real convenience; Azure has the building blocks (Cosmos DB, AI Search, Cognitive Search) but they aren't pre-wired into a single "agent memory" API the way AgentCore's are. Where the App Service + APIM + MCP composition genuinely wins: Open standards. MCP is a public protocol with implementations across the industry. AgentCore's tool layer is AWS-native. No new runtime to learn. App Service is the same App Service. Your existing CI/CD, your existing security review, your existing monitoring all apply. Bring your own framework. Pydantic AI, LangGraph, Agent Framework, Semantic Kernel, AutoGen, CrewAI — they all work, because the App Service doesn't care what's running inside the container. Existing enterprise footprint. VNet integration, private endpoints, managed identity, deployment slots, sidecars, Easy Auth. None of it is new for App Service. You inherit a decade of platform work. The right framing isn't "Azure's answer to AgentCore." It's that Azure is making a different bet: that enterprises will value the composability of services they already trust over the convenience of a new proprietary runtime. For some, that bet is probably correct. For a few — multi-tenant agent marketplaces, untrusted code execution — AgentCore's isolation model is a better fit. Pick the one that matches your threat model. 
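One gateway behavior is worth seeing in miniature before wrapping up: the semantic cache. The sketch below is a toy, in-memory stand-in and not how APIM implements it. It follows this post's description (a match above the 0.85 threshold is a hit, and entries are partitioned by the vary-by value), uses cosine similarity as an assumed scoring function, and takes hand-written two-dimensional vectors where the real policy would use embeddings from the embedding deployment. The class and method names are invented for the example.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy in-memory stand-in for the Redis-backed cache (illustration only)."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        # partition key (the vary-by value) -> list of (embedding, response)
        self._entries: Dict[str, List[Tuple[List[float], str]]] = {}

    def lookup(self, partition: str, embedding: List[float]) -> Optional[str]:
        """Return the closest cached response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries.get(partition, []):
            score = cosine(embedding, cached_embedding)
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, response
        return best_response

    def store(self, partition: str, embedding: List[float], response: str) -> None:
        """Record a response produced on a cache miss."""
        self._entries.setdefault(partition, []).append((embedding, response))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.store("team-a", [1.0, 0.0], "cached answer")
    print(cache.lookup("team-a", [0.9, 0.1]))  # similar prompt, same subscription
    print(cache.lookup("team-a", [0.0, 1.0]))  # unrelated prompt
    print(cache.lookup("team-b", [1.0, 0.0]))  # same prompt, other subscription
```

Partitioning by subscription is the design choice to notice: one team's cached answers are never served to another team, at the cost of a lower overall hit ratio.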
What's next

If you ship the sample and want to compare notes, the repo is at app-service-ai-gateway-mcp-apim-python.

Transforming Retry-After Headers in Azure APIM: A Step-by-Step Guide
In this blog post, you'll learn how to customize the Retry-After response header in Azure API Management (APIM) rate-limiting policies, enhancing your API's flexibility and user experience. While it does not delve into the specifics of the rate-limit or rate-limit-by-key policies, it provides a practical guide for altering the Retry-After header. For detailed information on the rate-limit policy, please visit Azure API Management policy reference - rate-limit | Microsoft Learn.

Understanding Rate Limiting: Protecting Your API from Overuse

Rate limiting is a technique used to control how often requests are made to a resource. It helps prevent excessive or abusive use and ensures the resource is available to all users. Rate limiting is often used to protect against denial-of-service (DoS) attacks, which aim to overwhelm a network or server with too many requests, making it unavailable to legitimate users. It can also limit the number of requests from individual users to prevent a single user or group from monopolizing the resource.

Azure API Management Rate Limit Policies

In Azure, access to APIs is controlled using the following API Management policies:

- rate-limit
- rate-limit-by-key

The implementation of these policies is straightforward but somewhat limited and less flexible, in my opinion.

The Default Retry-After Header: What You Need to Know

In the Azure APIM rate-limit policy documentation, it is mentioned that once the client's requests are throttled, the service starts returning a response header containing the time interval (in seconds) after which the client should retry the request. The default name of the header is Retry-After, and this name can be customized. For example:

    Retry-After: 60

However, in one use case for a customer, there was a requirement to provide a timestamp instead of a time interval as a header value.
For example:

    Retry-After: 2020-05-04T12:23:41.6181792Z

To implement this, the header value needs to change, but this is something that the rate-limit policy does not support.

Customizing the Retry-After Header

The basis for changing the response header value lies in the on-error scope. You can implement a policy like the following:

    <inbound>
        <base />
        <rate-limit-by-key calls="1000"
                           renewal-period="60"
                           counter-key="@(context.Request.IpAddress)"
                           increment-condition="@(context.Response.StatusCode == 200)"
                           remaining-calls-variable-name="remainingCallsPerIP"
                           retry-after-header-name="Retry-After"
                           remaining-calls-header-name="Requests-Remaining"
                           retry-after-variable-name="retryAfter" />
    </inbound>
    <on-error>
        <choose>
            <when condition="@(context.LastError.Reason == "RateLimitExceeded")">
                <set-header name="Retry-After" exists-action="override">
                    <value>@(DateTime.UtcNow.AddSeconds(context.Variables.GetValueOrDefault<int>("retryAfter")).ToString("o"))</value>
                </set-header>
            </when>
        </choose>
        <base />
    </on-error>

Please refer to APIM predefined errors for policies here: Error handling in Azure API Management policies | Microsoft Learn

Here, the key point is that whenever the APIM rate limit is reached, an error occurs, which is then captured in the on-error scope. To set or override the response header only in rate-limiting scenarios, you need to filter using the RateLimitExceeded error reason. The new header value is then computed by adding the value of the retryAfter variable (in seconds) to the current UTC time and formatting the result as a round-trip ("o") timestamp.

With this, you have now customized the Retry-After header with a timestamp instead of a time interval (in seconds).

Conclusion

In conclusion, customizing the Retry-After response header in Azure API Management can significantly enhance the flexibility and user experience of your API services.
By leveraging the on-error scope and handling the RateLimitExceeded error, you can provide a more informative and user-friendly response to clients when rate limits are exceeded. This approach not only meets specific customer requirements but also demonstrates the adaptability of Azure APIM in handling various scenarios. With these steps, you can ensure that your API remains robust, efficient, and user-centric.

Exciting Updates Coming to Conversational Diagnostics (Public Preview)
Last year, at Ignite 2023, we unveiled Conversational Diagnostics (Preview), a revolutionary tool integrated with AI-powered capabilities to enhance problem-solving for Windows Web Apps. This year, we're thrilled to share what’s new and forthcoming for Conversational Diagnostics (Preview). Get ready to experience a broader range of functionalities and expanded support across various Azure Products, making your troubleshooting journey even more seamless and intuitive.

Azure Logic App workflow (Standard) Resubmit and Retry
Hello Experts,

A workflow is scheduled to run daily at a specific time and retrieves data from different systems using 8-9 REST API calls. The data is then sent to another system through API calls using multiple child flows. We receive more than 1,500 input records, and an API call needs to be made for each one.

During the API invocation process, there is a possibility of failure due to server errors (5xx) and client errors (4xx). To handle this, we have implemented a "Retry" mechanism with a fixed interval. However, there is still a chance of flow failure due to various reasons. Although there is a "Resubmit" feature available at the action level, I cannot apply it in this case because we are using multiple child workflows and the response is sent back from one flow to another. Is it necessary to utilize the "Resubmit" functionality?

The Retry functionality has been developed to handle any server API errors (5xx) that may occur with connectors (both Custom and Standard), as well as client API errors 408 and 429. In this specific scenario, it is reasonable to attempt retrying or resubmitting the API call from the Azure Logic Apps workflow. Nevertheless, there are other situations where retrying or resubmitting would simply produce the same error. Is it acceptable to proceed with the Retry functionality in this particular scenario?

It would be highly appreciated if you could provide guidance on the appropriate methodology.

Thanks
-Sri

Reimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach - drawing from proven practices in internal Azure operations - the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams operations, delivering strong ROI for teams seeking sustainable AIOps. An Operations Agent that adapts to your playbooks Azure SRE Agent is an AI powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues, to automating operational workflows and seamless integration with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update. What’s New: Automation, Integration, and Extensibility Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release: No-code Sub-Agent Builder: Rapidly create custom automations without writing code. Flexible, event-driven triggers: Instantly respond to incidents and operational changes. Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources. Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP. 
Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box. Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs. Sub-Agent Builder: Custom Automation, No Code Required Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions. Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage. Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more. Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires. Flexible Triggers: Automate on Your Terms Invoke the agent to respond automatically to mission-critical events, not wait for manual commands. This feature helps speed up incident response and eliminate missed opportunities for efficiency. Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to Incident Management and Ticketing systems like PagerDuty and ServiceNow, Observability Alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks. 
Additional trigger sources such as GitHub issues, Azure DevOps pipelines, email, etc. will be added over time. This means automation can start exactly when and where you need it. Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events - like deployments, incidents, or customer requests. Vital for reducing downtime, it ensures that business-critical actions happen automatically and promptly. Expanded Data Connectivity: Unified Observability and Troubleshooting Integrate data, enabling comprehensive diagnostics and troubleshooting and faster, more informed decision-making by eliminating silos and speeding up issue resolution. Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation. Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach utilizes your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents. Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments. 
Custom Actions: Automate Anything, Anywhere Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration. Out-of-the-Box Actions: Instantly automate common tasks like running azcli, kubectl, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead. Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication. Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence. Prebuilt Operations Scenarios Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction. Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure CosmosDB, Azure VMs, etc. Support for additional resource types is being added continually, please see product documentation for the latest information. Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance. 
Handle Complex Investigations: Enable the deep investigation mode that uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues. Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management. Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs - capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting, as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment. GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle. Ready to get started? 
- Azure SRE Agent home page
- Product overview
- Pricing Page
- Pricing Calculator
- Pricing Blog
- Demo recordings
- Deployment samples

What’s Next?

Give us feedback: Your feedback is critical. You can Thumbs Up / Thumbs Down each interaction or thread, use the “Give Feedback” button in the agent to give us in-product feedback, or create issues and share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent.

We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!

Running Self-hosted APIM Gateways in Azure Container Apps with VNet Integration
With Azure Container Apps we can run containerized applications, completely serverless. The platform itself handles all the orchestration needed to dynamically scale based on your set triggers (such as KEDA) and even scale-to-zero! I have been working a lot with customers recently on using Azure API Management (APIM) and the topic of how we can leverage Azure APIM to manage our internal APIs without having to expose a public IP and stay within compliance from a security standpoint, which leads to the use of a Self-Hosted Gateway. This offers a managed gateway deployed within their network, allowing a unified approach in managing their APIs while keeping all API communication in-network. The self-hosted gateway is deployed as a container and in this article, we will go through how to provision a self-hosted gateway on Azure Container Apps specifically. I assume there is already an Azure APIM instance provisioned and will dive into creating and configuring the self-hosted gateway on ACA. Prerequisites As mentioned, ensure you have an existing Azure API Management instance. We will be using the Azure CLI to configure the container apps in this walkthrough. To run the commands, you need to have the Azure CLI installed on your local machine and ensure you have the necessary permissions in your Azure subscription. Retrieve Gateway Deployment Settings from APIM First, we need to get the details for our gateway from APIM. Head over to the Azure portal and navigate to your API Management instance. - In the left menu, under Deployment and infrastructure, select Gateways. - Here, you'll find the gateway resource you provisioned. Click on it and go to Deployment. - You'll need to copy the Gateway Token and Configuration endpoint values. (these tell the self-hosted gateway which APIM instance and Gateway to register under) Create a Container Apps Environment Next, we need to create a Container Apps environment. 
This is where we will create the container app that hosts our self-hosted gateway.

Using Azure CLI:

Create our VNet and Subnet for our ACA Environment
Because we want access to our internal APIs, the VNet and an available subnet must exist before we create the Container Apps environment.
Note: If we’re using Workload Profiles (we will in this walkthrough), then we need to delegate the subnet to Microsoft.App/environments.

# Create the vnet
az network vnet create --resource-group rgContosoDemo \
  --name vnet-contoso-demo \
  --location centralUS \
  --address-prefix 10.0.0.0/16

# Create the subnet
az network vnet subnet create --resource-group rgContosoDemo \
  --vnet-name vnet-contoso-demo \
  --name infrastructure-subnet \
  --address-prefixes 10.0.0.0/23

# If you are using a workload profile (we are for this walkthrough) then delegate the subnet
az network vnet subnet update --resource-group rgContosoDemo \
  --vnet-name vnet-contoso-demo \
  --name infrastructure-subnet \
  --delegations Microsoft.App/environments

Create the Container App Environment in our VNet

# Look up the subnet resource ID so the environment is placed in the VNet
SUBNET_ID=$(az network vnet subnet show --resource-group rgContosoDemo \
  --vnet-name vnet-contoso-demo \
  --name infrastructure-subnet \
  --query id --output tsv)

az containerapp env create --name aca-contoso-env \
  --resource-group rgContosoDemo \
  --location centralUS \
  --enable-workload-profiles \
  --infrastructure-subnet-resource-id "$SUBNET_ID"

Deploy the Self-Hosted Gateway to a Container App
Creating the environment takes about 10 minutes. Once it completes, the fun part begins: deploying the self-hosted gateway container image to a container app.

Using Azure CLI:

Create the Container App:

az containerapp create --name aca-apim-demo-gateway \
  --resource-group rgContosoDemo \
  --environment aca-contoso-env \
  --workload-profile-name "Consumption" \
  --image "mcr.microsoft.com/azure-api-management/gateway:2.5.0" \
  --target-port 8080 \
  --ingress 'external' \
  --env-vars "config.service.endpoint"="<YOUR_ENDPOINT>" "config.service.auth"="<YOUR_TOKEN>" "net.server.http.forwarded.proto.enabled"="true"

Here, you'll replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with the values you copied earlier.
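If you script this step rather than editing the command by hand, one option is to assemble the argument list in code so that a forgotten placeholder is caught before anything is deployed. The following is a minimal sketch, not part of the original walkthrough: the helper names `gateway_env_vars` and `create_command` are hypothetical, the example endpoint and token are made up, and the resource names and flags simply mirror the command above. It only builds and prints the command; it does not call the Azure CLI.

```python
import shlex

# Placeholder strings from the walkthrough that must be replaced before deploying.
PLACEHOLDERS = ("<YOUR_ENDPOINT>", "<YOUR_TOKEN>")


def gateway_env_vars(endpoint: str, token: str) -> list[str]:
    """Return the KEY=VALUE pairs passed to --env-vars (hypothetical helper)."""
    for value in (endpoint, token):
        if not value or value in PLACEHOLDERS:
            raise ValueError("replace the placeholder with the value copied from APIM")
    return [
        f"config.service.endpoint={endpoint}",
        f"config.service.auth={token}",
        "net.server.http.forwarded.proto.enabled=true",
    ]


def create_command(endpoint: str, token: str) -> list[str]:
    """Build the argument list for `az containerapp create` without running it."""
    return [
        "az", "containerapp", "create",
        "--name", "aca-apim-demo-gateway",
        "--resource-group", "rgContosoDemo",
        "--environment", "aca-contoso-env",
        "--workload-profile-name", "Consumption",
        "--image", "mcr.microsoft.com/azure-api-management/gateway:2.5.0",
        "--target-port", "8080",
        "--ingress", "external",
        "--env-vars", *gateway_env_vars(endpoint, token),
    ]


if __name__ == "__main__":
    # Example values are invented for illustration only.
    cmd = create_command("contoso.configuration.azure-api.net", "GatewayKey example-token")
    print(shlex.join(cmd))
```

Keeping the arguments as a list (instead of one long string) means a token that contains spaces is passed through as a single value, and `shlex.join` produces a correctly quoted line if you want to paste it into a shell.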
Configure Ingress for the Container App:

az containerapp ingress enable --name aca-apim-demo-gateway \
  --resource-group rgContosoDemo \
  --type external \
  --target-port 8080

This command ensures that your container app is accessible externally.

Verify the Deployment
Finally, let's make sure everything is running smoothly. Navigate to the Azure portal and go to your Container Apps environment. Select the container app you created (aca-apim-demo-gateway) and navigate to Replicas to verify that it's running. You can also use the status endpoint of the self-hosted gateway to check whether your gateway is running:

curl -i https://aca-apim-demo-gateway.sillytreats-abcd1234.centralus.azurecontainerapps.io/status-0123456789abcdef

Verify Gateway Health in APIM
In the Azure portal, navigate to APIM to verify that the gateway is showing up as healthy. Under Deployment and infrastructure, select Gateways, then choose your gateway. On the Overview page you’ll see the status of your gateway deployment.

And that’s it! You've successfully deployed an Azure APIM self-hosted gateway in Azure Container Apps with VNet integration, allowing access to your internal APIs with easy management from the APIM portal in Azure. This setup allows you to manage your APIs efficiently while leveraging the scalability and flexibility of Azure Container Apps. If you have any questions or need further assistance, feel free to ask. How are you feeling about this setup? Does it make sense, or is there anything you'd like to dive deeper into?
2.4K Views, 3 likes, 3 Comments

From Timeouts to Triumph: Optimizing GPT-4o-mini for Speed, Efficiency, and Reliability
The Challenge
Large-scale generative AI deployments can stretch system boundaries, especially when thousands of concurrent requests require both high throughput and low latency. In one such production environment, GPT-4o-mini deployments running under Provisioned Throughput Units (PTUs) began showing sporadic 408 (timeout) and 429 (throttling) errors. Requests that normally completed in seconds were occasionally hitting the 60-second timeout window, causing degraded experiences and unnecessary retries. Initial suspicion pointed toward PTU capacity limitations, but deeper telemetry revealed a different cause.

What the Data Revealed
Using Azure Data Explorer (Kusto), API Management (APIM) logs, and OpenAI billing telemetry, a detailed investigation uncovered several insights:
- Latency was not correlated with PTU utilization: PTU resources were healthy and performing within SLA even during spikes.
- Time-Between-Tokens (TBT) stayed consistently low (~8–10 ms): The model was generating tokens steadily.
- Excessive token output was the real bottleneck: Requests generating 6K–8K tokens simply required more time than allowed in the 60-second completion window.
In short, the model wasn’t slow; the workload was oversized.

The Optimization Opportunity
The analysis opened a broader optimization opportunity:
- Balance token length with throughput targets.
- Introduce architectural patterns to prevent timeout or throttling cascades under load.
- Enforce automatic token governance instead of relying on client-side discipline.

The Solution
Three engineering measures delivered immediate impact: token optimization, spillover routing, and policy enforcement.

Right-size the Token Budget
- Empirical throughput for GPT-4o-mini: ~33 tokens/sec → ~2K tokens in 60s.
- Enforced max_tokens = 2000 for synchronous requests.
- Enabled streaming responses for longer outputs, allowing incremental delivery without hitting timeout limits.
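The arithmetic behind the token budget can be sketched in a few lines. This is an illustrative calculation only, using the approximate figures quoted above (~33 tokens/sec, a 60-second window, a 2000-token cap); the function names are hypothetical, and the clamp shows the idea of a token cap in plain Python rather than the team's actual APIM policy.

```python
import math

TOKENS_PER_SECOND = 33   # approximate empirical throughput quoted above
TIMEOUT_SECONDS = 60     # synchronous completion window quoted above
MAX_TOKENS_CAP = 2000    # cap enforced for synchronous requests


def token_budget(timeout_seconds: float = TIMEOUT_SECONDS,
                 tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Largest number of output tokens that fits inside the timeout window."""
    return math.floor(timeout_seconds * tokens_per_second)


def seconds_to_generate(tokens: int,
                        tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Estimated generation time for a given number of output tokens."""
    return tokens / tokens_per_second


def clamp_max_tokens(requested, cap: int = MAX_TOKENS_CAP) -> int:
    """Apply the cap; a request with no max_tokens gets the cap as its default."""
    if requested is None:
        return cap
    return min(requested, cap)


if __name__ == "__main__":
    print(token_budget())                    # 1980, i.e. roughly 2K tokens
    print(round(seconds_to_generate(6000)))  # 182 seconds, far beyond the 60 s window
    print(round(seconds_to_generate(8000)))  # 242 seconds
    print(clamp_max_tokens(8000))            # 2000
```

At this throughput, a 6K–8K token response needs roughly three to four minutes, which is why those requests could not finish inside a 60-second window regardless of PTU capacity.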
Enable Spillover for Continuity
- Implemented multi-region spillover using Azure Front Door and APIM Premium gateways.
- When PTU queues reached capacity or 429s appeared, requests were routed to Standard deployments in secondary regions.
- The result: graceful degradation and uninterrupted user experience.

Govern with APIM Policies
- Added inbound policies to inspect and adjust max_tokens dynamically.
- On 408/429 responses, APIM retried and rerouted traffic based on spillover logic.

The Results
After optimization, improvements were immediate and measurable:
- Latency Reduction: Significant improvement in end-to-end response times across high-volume workloads.
- Reliability Gains: 408/429 errors fell from >1% to near zero.
- Cost Efficiency: Average token generation decreased by ~60%, reducing per-request costs.
- Scalability: Spillover routing ensured consistent performance during regional or capacity surges.
- Governance: APIM policies established a reusable token-control framework for future AI workloads.

Lessons Learned
- Latency isn’t always about capacity: Investigate workload patterns before scaling hardware.
- Token budgets define the user experience: Over-generation can quietly break SLA compliance.
- Design for elasticity: Spillover and multi-region routing maintain continuity during spikes.
- Measure everything: Combine KQL telemetry, latency and token tracking for faster diagnostics.

The Outcome
By applying data-driven analysis, architectural tuning, and automated governance, the team turned an operational bottleneck into a model of consistent, scalable performance. The result: Faster responses. Lower costs. Higher trust. A blueprint for building resilient, high-throughput AI systems on Azure.
585 Views, 4 likes, 0 Comments

Fixed ip address for outbound calls from Azure APIM Standard V2
Hi, I recently ran a PoC deployment of the Azure APIM Standard V2 SKU instead of our current Premium Classic instance. This worked well! Performance is great and I am able to route calls to an on-prem network OK using VNet integration. However, one of the features we currently make use of with the Premium Classic instance is a fixed IP address for calls from APIM to 3rd parties. Is there a way to achieve this using Standard V2? We have tried a NAT gateway with a fixed IP on the same VNet, but this does not seem to help.
Solved
401 Views, 0 likes, 1 Comment