azure api management
119 TopicsControlling Tool Access with APIM MCP Gateway
If you've started working with MCP servers in GitHub Copilot, Claude, or any other agent host in an enterprise environment, you've probably hit a similar problem. You want to give your developers access to a useful MCP server, but not every tool it ships with. Maybe one tool is noisy and burns context for no good reason. Maybe one tool calls a paid API. Maybe one tool does something your security team is not comfortable with. The MCP server is all-or-nothing: install it and you get the lot. Most MCP servers don't give you a way to switch individual tools off. The MCP spec doesn't define one either. So if you want fine-grained control, you need to put something in front of the server that can see the JSON-RPC traffic and make decisions about it. That something is an MCP gateway, and Azure API Management (APIM) can do the job. What an MCP gateway actually is An MCP gateway sits between the agent (the client) and the MCP server (the backend) and inspects the MCP protocol traffic flowing between them. Because MCP is just JSON-RPC over HTTP or SSE, anything that can do reverse-proxy plus payload inspection can in theory act as a gateway. The interesting bit is what you do with that position in the network. If you've worked with API gateways before, the mental model is the same: you're putting a managed entry point in front of one or more backends, and using policy to control what gets through. The differences are that the protocol is JSON-RPC over a long-lived connection, and the things you're filtering aren't routes but tools, prompts, and resources. Out of the box, MCP gives you very little operational control. The protocol assumes a trusted one-to-one relationship between a client and a server. There's no built-in authentication story beyond what the transport gives you, no rate limiting, no audit trail beyond the agent's own logs, and no way to share one MCP server safely across multiple users or teams. A gateway is how you fix all of that without modifying the MCP server itself. MCP Gateway Functionality The tool access problem is not the only reason to put a gateway in front of an MCP server. The list looks a lot like the list of reasons you'd put any API behind a gateway, with a few MCP-specific twists: Centralised authentication and authorisation. Add Entra ID, mTLS, or scoped tokens in front of MCP servers that ship with little more than an API key. Rate limiting and quota. Stop a runaway agent loop from hammering a paid upstream API and racking up a bill in minutes. Logging, audit and observability. Capture which user invoked which tool with what arguments and ship it to Log Analytics or your SIEM. Network isolation. Keep developer machines off the public internet by fronting external MCP servers with private endpoints and fixed egress IPs. Sharing one MCP server across many clients. Turn a single-tenant MCP server into a multi-tenant front door with per-user identity and per-team limits. Policy and governance. Control which tools are exposed, redact fields, validate arguments, or transform responses before they reach the agent. Aggregating multiple MCP servers. Present one logical MCP endpoint that fans out to several backends so agent hosts have a single connection point. Failure handling and resilience. Add retries, circuit breakers, and caching of tools/list rather than have every agent host grow its own. Cost management. Put caps, alerts, and per-team chargeback on MCP servers that wrap something paid by the call. APIM as the gateway APIM gives you all of the above, and Microsoft has been adding MCP-specific support over the last few releases. The relevant bits for this post: There's a dedicated MCP servers section in APIM, separate from the standard APIs blade. This is where MCP-aware features live. You can register an existing MCP server as an external MCP server. APIM proxies the protocol traffic and applies policies to it. You can also expose a normal REST API as an MCP server, by adding an MCP layer on top of an existing API. The standard APIM policy engine works on MCP traffic. The shape of the problem The MCP protocol has two methods that matter here: tools/list is what the client calls when it connects. The server returns the catalogue of tools available, with names, descriptions, and input schemas. The agent uses this to decide what it can do. tools/call is what the client sends when the agent actually wants to invoke a tool. If you want to hide a tool, you have to deal with both. Filtering tools/list stops the agent ever knowing the tool exists, which is what you usually want. Blocking tools/call stops a determined client (or a tool the agent guessed at) from calling it directly. You need both to be confident the tool is genuinely off-limits. For the rest of this post I'll use the Microsoft Learn MCP server as the example, because it's public, useful, and ships with three tools: microsoft_docs_search microsoft_docs_fetch microsoft_code_sample_search Say I want to allow the first two and block the third. Here's how to do it. Setting up APIM as an MCP gateway I'm going to skip the bit where you provision APIM. The MCP-specific setup is: In your APIM instance, go to MCP Servers in the left-hand nav (it's its own section, not under APIs). Add a new MCP server. Choose External MCP server as the type. Point it at the Microsoft Learn MCP server endpoint. The transport will be HTTP or SSE depending on the upstream, which APIM handles for you. Save. APIM now proxies the MCP traffic. Configure the agent to point at the APIM-exposed URL instead of the upstream. At this point you have a working passthrough. The next step is the policy. Two ways to control tool access There are two patterns that work, and the right choice depends on how much maintenance you want to take on. Option A: Static allowlist You hardcode the list of tools APIM will expose. APIM intercepts tools/list and returns your fixed list, ignoring whatever the backend says. It also blocks tools/call for anything not on the list. Pros: predictable, doesn't read the response body, doesn't care what the upstream changes. Cons: you have to maintain the schemas yourself. If Microsoft adds a useful new tool to the Learn MCP server, you won't see it until you update the policy. If they change the input schema for an existing tool, you'll need to update that too. This is the option I'd reach for first if I wanted strict control and wasn't expecting the upstream to change much. Option B: Dynamic deny-list You let tools/list flow through, then rewrite the response on its way back out, removing the tools you don't want. You also block tools/call for those tools. Pros: lower maintenance. New tools appear automatically. Schema changes are picked up. Cons: this reads and rewrites context.Response.Body, and Microsoft's own guidance for MCP policies warns that response-body access can interfere with streaming. In practice, tools/list is non-streaming JSON-RPC and this works fine, but you need to test it carefully in your environment, particularly if your agent host is fussy about response handling. This method also requires you to keep on top of when new tools are added and add them to the list if you don't want them used, they will automatically be available to users if you do not. Pick this one when you're confident in the upstream and want to minimise toil. Option A: the allowlist policy Here's the full policy for the static allowlist. It goes into the MCP Server policy editor in APIM. You'll notice there's no <base /> in the sections. Some MCP Server policy editors don't include them, and the policy works fine without them. <policies> <inbound> <choose> <when condition="@{ var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); if (body == null) { return false; } var rpcMethod = (string)body["method"]; return string.Equals(rpcMethod, "tools/list", System.StringComparison.OrdinalIgnoreCase); }"> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ var req = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); var id = req != null ? req["id"] : Newtonsoft.Json.Linq.JValue.CreateNull(); var tools = new Newtonsoft.Json.Linq.JArray( new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("name", "microsoft_docs_search"), new Newtonsoft.Json.Linq.JProperty("description", "Search official Microsoft/Azure documentation"), new Newtonsoft.Json.Linq.JProperty("inputSchema", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("type", "object"), new Newtonsoft.Json.Linq.JProperty("properties", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("query", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("type", "string") )) )) )) ), new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("name", "microsoft_docs_fetch"), new Newtonsoft.Json.Linq.JProperty("description", "Fetch a Microsoft documentation page as markdown"), new Newtonsoft.Json.Linq.JProperty("inputSchema", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("type", "object"), new Newtonsoft.Json.Linq.JProperty("properties", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("url", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("type", "string") )) )), new Newtonsoft.Json.Linq.JProperty("required", new Newtonsoft.Json.Linq.JArray("url")) )) ) ); var response = new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("jsonrpc", "2.0"), new Newtonsoft.Json.Linq.JProperty("id", id), new Newtonsoft.Json.Linq.JProperty("result", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("tools", tools) )) ); return response.ToString(Newtonsoft.Json.Formatting.None); }</set-body> </return-response> </when> <when condition="@{ var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); if (body == null) { return false; } var rpcMethod = (string)body["method"]; var toolName = (string)body["params"]?["name"]; return string.Equals(rpcMethod, "tools/call", System.StringComparison.OrdinalIgnoreCase) && string.Equals(toolName, "microsoft_code_sample_search", System.StringComparison.OrdinalIgnoreCase); }"> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ var req = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); var id = req != null ? req["id"] : Newtonsoft.Json.Linq.JValue.CreateNull(); var error = new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("jsonrpc", "2.0"), new Newtonsoft.Json.Linq.JProperty("id", id), new Newtonsoft.Json.Linq.JProperty("error", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("code", -32601), new Newtonsoft.Json.Linq.JProperty("message", "Tool disabled by API Management policy: microsoft_code_sample_search") )) ); return error.ToString(Newtonsoft.Json.Formatting.None); }</set-body> </return-response> </when> </choose> </inbound> <backend /> <outbound /> <on-error /> </policies> It's a chunk of XML, but it's doing only three things. Here's what each part is for. The outer <choose> and the two <when> branches. This is just an if/else if. APIM peeks at the inbound JSON-RPC request and decides which branch to fire. The first branch handles tools/list. The second handles tools/call for the specific tool I want to block. If neither matches, nothing happens in the policy and the request flows through to the backend as normal. So tools/call for the allowed tools, plus initialize, ping, and anything else MCP throws at the gateway, all pass through untouched. The tools/list branch: synthesise the catalogue. When the agent asks for the tool list, APIM never forwards the call. Instead, the policy reads the id off the incoming request (so the response correlates correctly), builds a JArray containing only the tools I want to expose, with their names, descriptions, and inputSchema blocks, wraps that in a JSON-RPC result envelope, and returns it directly with <return-response>. The backend never sees the request. That's the bit that makes Option A bulletproof: there is no upstream behaviour that can leak through, because the upstream isn't involved. The tools/call branch: hard-block the disallowed tool. When the agent tries to call microsoft_code_sample_search directly (whether it guessed at it, cached it from a previous run, or someone added it to a config file), the policy short-circuits with a JSON-RPC error. The error code is -32601, which the spec defines as "Method not found", and the message says plainly that the tool was disabled by APIM. Well-behaved agents will surface that to the user and move on. Without this branch, hiding the tool from tools/list would only stop honest clients. A couple of incidental things worth knowing. The preserveContent: true argument on context.Request.Body.As<...>() is important: without it, reading the body consumes it and the rest of the pipeline gets nothing. And the OrdinalIgnoreCase comparisons are belt-and-braces: MCP method names are case-sensitive in the spec, but agents in the wild are inconsistent. Option B: the deny-list policy Same scenario, different approach. Let tools/list flow through to the backend and rewrite the response on the way out. Still block tools/call for the disallowed tool. ```xml <policies> <inbound> <choose> <when condition="@{ var body = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); if (body == null) { return false; } var rpcMethod = (string)body["method"]; var toolName = (string)body["params"]?["name"]; return string.Equals(rpcMethod, "tools/call", System.StringComparison.OrdinalIgnoreCase) && string.Equals(toolName, "microsoft_code_sample_search", System.StringComparison.OrdinalIgnoreCase); }"> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ var req = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); var id = req != null ? req["id"] : Newtonsoft.Json.Linq.JValue.CreateNull(); var error = new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("jsonrpc", "2.0"), new Newtonsoft.Json.Linq.JProperty("id", id), new Newtonsoft.Json.Linq.JProperty("error", new Newtonsoft.Json.Linq.JObject( new Newtonsoft.Json.Linq.JProperty("code", -32601), new Newtonsoft.Json.Linq.JProperty("message", "Tool disabled by API Management policy: microsoft_code_sample_search") )) ); return error.ToString(Newtonsoft.Json.Formatting.None); }</set-body> </return-response> </when> </choose> </inbound> <backend /> <outbound> <choose> <when condition="@{ var req = context.Request.Body.As<Newtonsoft.Json.Linq.JObject>(preserveContent: true); if (req == null) { return false; } var rpcMethod = (string)req["method"]; return string.Equals(rpcMethod, "tools/list", System.StringComparison.OrdinalIgnoreCase); }"> <set-body>@{ var bodyText = context.Response.Body.As<string>(preserveContent: true); if (string.IsNullOrEmpty(bodyText)) { return bodyText; } var resp = Newtonsoft.Json.Linq.JObject.Parse(bodyText); var tools = resp["result"]?["tools"] as Newtonsoft.Json.Linq.JArray; if (tools == null) { return bodyText; } var filtered = new Newtonsoft.Json.Linq.JArray(); foreach (var t in tools) { var name = (string)t?["name"]; if (!string.Equals(name, "microsoft_code_sample_search", System.StringComparison.OrdinalIgnoreCase)) { filtered.Add(t); } } ((Newtonsoft.Json.Linq.JObject)resp["result"])["tools"] = filtered; return resp.ToString(Newtonsoft.Json.Formatting.None); }</set-body> </when> </choose> </outbound> <on-error /> </policies> This policy splits the work across inbound and outbound. The shape is different from Option A, but again it's only doing a few things. Inbound: block direct calls to the disallowed tool. This block is identical to the one in Option A. If the agent tries to invoke microsoft_code_sample_search directly, APIM returns the JSON-RPC -32601 error and the request never reaches the backend. Everything else (including tools/list and calls to the allowed tools) carries on through to the upstream MCP server. Backend: nothing custom. The empty <backend /> is deliberate. We want the upstream server to handle every request that wasn't blocked in inbound, including tools/list. APIM forwards it, the server returns its full catalogue, and we get a chance to rewrite the response on the way back. Outbound: filter the catalogue on the way out. This is the bit that does the dynamic work. The <when> condition checks the original request (not the response) to see if this was a tools/list call, because we only want to rewrite responses to that specific method. If it was, the policy reads the response body as a string, parses it as JSON, walks the result.tools array, builds a new array containing everything except the deny-listed tool, swaps it back into the response object, and writes the modified JSON back with <set-body>. For any other method, the response flows through untouched. The reason this option is more fragile than Option A is right there in that last paragraph: it reads and rewrites context.Response.Body. Microsoft's own guidance for MCP policies flags that response-body access can interfere with streaming. For non-streaming JSON-RPC like tools/list, this works fine in practice, but if you ever extend this pattern to filter streamed responses (resource updates, long-running tool calls), you'll need to think much harder about it. For the narrow case of trimming tools/list, it's a reasonable trade for the lower maintenance cost. Validating it works Once the policy is saved, hit the APIM-exposed MCP endpoint with the agent of your choice and check three things: tools/list returns only microsoft_docs_search and microsoft_docs_fetch. The third tool should not appear. tools/call for microsoft_code_sample_search returns a JSON-RPC error with code -32601. The two allowed tools still actually work end-to-end. Which one should you use Default to Option A (the static allowlist) unless you have a specific reason not to. It's the most predictable and it doesn't touch the response body, which keeps you well clear of the streaming caveat. You will have to update it when the upstream tool catalogue changes, but this does give you greater control of when you make those changes available and creates a conscious decision to allow them, or not. Reach for Option B when the upstream changes frequently, or when you'd rather block specific known-bad tools and let everything else through. Test it carefully and watch the agent's behaviour for any sign of streaming or framing issues. If you've got a fully internal MCP server you control, the cleanest answer is to fix the tool list at the source and skip the gateway changes altogether. But for third-party MCP servers, this is an approach you can apply without changing the underlying MCP server.296Views1like0CommentsYou Can Build a Framework-Agnostic AI Gateway on Azure App Service — Here's How
The agent infrastructure conversation moved this year. In October 2025, AWS shipped Amazon Bedrock AgentCore — a managed agent runtime with per-session microVM isolation, built-in long-term memory, native MCP support, and an opinionated policy engine. A few months earlier, Cloudflare shipped its Agents SDK on top of Durable Objects, betting that edge-native stateful agents are the future. Both bets are real, both are interesting, and both arrive as closed, proprietary runtimes. So: what's Azure's answer? It's a question I've heard a couple times from architects in the last six months. The honest answer is that Azure already has the pieces. They don't ship as one product called AgentRuntime, and that's actually the point. Azure's pitch is composable: App Service + API Management + MCP, three services you already have access to, glued together with open standards. This post walks through a runnable sample of that composition. One App Service hosting both an agent (built with the Microsoft Agent Framework) and the stateless MCP server it calls, fronted by Azure API Management with the AI Gateway policy set — semantic caching, token rate limiting, per-subscription token emission for chargeback. One azd up deploys the lot. Repo: app-service-ai-gateway-mcp-apim-python. The headline claim is in the title. The point I actually want to make is the one underneath it: the framework is replaceable, the gateway is the contribution. Swap the Agent Framework module for Pydantic AI or LangGraph and the rest of the architecture is unchanged. That's what "run anything" means, made literal. The composable stack ┌────────────────────────────────────────────────┐ │ Azure API Management │ MCP / Agent ──┤ AI Gateway policies: │ client │ • llm-token-limit │ │ • llm-semantic-cache-lookup / store │ │ • llm-emit-token-metric │ │ • rate-limit-by-key (MCP API) │ └─────────────┬───────────────────┬──────────────┘ │ │ ┌────────────────▼──┐ ┌────────────▼──────────────┐ │ Azure OpenAI │ │ Azure App Service │ │ • chat model │ │ FastAPI app: │ │ • embedding │ │ • /mcp (stateless) │ │ model │ │ • /agent/chat │ └───────────────────┘ │ Managed identity → │ │ APIM (via subscription) │ └────────────┬──────────────┘ ▼ Application Insights (cloud_RoleInstance, APIM token metrics) Three observations that drive everything else: APIM is the only thing that talks to Azure OpenAI. The App Service agent doesn't have an AOAI key. It has an APIM subscription key. Every LLM call passes through the gateway, picks up the policies, and gets logged with consistent dimensions. That's where the governance part lives. The agent runtime is App Service. Linux, Python, FastAPI. Any language. Any framework. Pick your tool. We use Microsoft Agent Framework because it just GA'd and the API is clean, but the agent module is the easiest thing in the stack to swap. The MCP server is co-located with the agent. Same App Service, different route. The agent calls its own tools either in-process (fast path) or back out through APIM (so MCP traffic gets rate-limited and observed too). That choice is one environment variable. What the sample actually does The FastAPI app exposes three routes that matter: /mcp — a stateless HTTP MCP server (protocol revision 2025-11-25 ), implementing four tools: whoami , echo , lookup_fact , and summarize_app_service_doc . Any MCP client (Claude, VS Code, your own agent runtime) can connect. /agent/chat — a Microsoft Agent Framework agent that uses those same MCP tools as its tool set, and calls AOAI through APIM. /health and / — the boring but essential supporting cast (health check for App Service probes, status page showing the serving instance ID). Here's the agent definition. The key line is the endpoint: from agent_framework.openai import OpenAIChatCompletionClient client = OpenAIChatCompletionClient( azure_endpoint=os.environ["APIM_GATEWAY_URL"], # ← APIM, not AOAI model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"], api_version="2024-10-21", api_key=os.environ["APIM_SUBSCRIPTION_KEY"], default_headers={"Ocp-Apim-Subscription-Key": os.environ["APIM_SUBSCRIPTION_KEY"]}, ) agent = client.as_agent( name="AppServiceExpert", instructions=SYSTEM_INSTRUCTIONS, tools=build_tools(), ) That's it. The agent has no idea APIM exists. It thinks it's talking to AOAI. APIM is doing every interesting thing — auth, caching, throttling, metric emission — without the agent code knowing or caring. The policy that does the heavy lifting The AOAI API in APIM has one policy attached at the API scope. The full XML is in infra/apim/policies/aoai-policy.xml; here's the bones of it: <policies> <inbound> <base /> <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="aoai-token" /> <set-header name="Authorization" exists-action="override"> <value>@("Bearer " + (string)context.Variables["aoai-token"])</value> </set-header> <azure-openai-token-limit counter-key="@(context.Subscription?.Id ?? "anonymous")" tokens-per-minute="50000" estimate-prompt-tokens="true" /> <azure-openai-semantic-cache-lookup score-threshold="0.85" embeddings-backend-id="aoai-embeddings-backend" embeddings-backend-auth="system-assigned"> <vary-by>@(context.Subscription?.Id ?? "anonymous")</vary-by> </azure-openai-semantic-cache-lookup> <set-backend-service backend-id="aoai-backend" /> <azure-openai-emit-token-metric namespace="ai-gateway"> <dimension name="Subscription ID" value="@(context.Subscription?.Id ?? "anonymous")" /> <dimension name="API ID" value="@(context.Api.Id)" /> <dimension name="Operation ID" value="@(context.Operation.Id)" /> <dimension name="Client IP" value="@(context.Request.IpAddress)" /> </azure-openai-emit-token-metric> </inbound> <outbound> <base /> <azure-openai-semantic-cache-store duration="3600" /> </outbound> </policies> Four things are happening here that would otherwise be your problem: Auth to AOAI. APIM's managed identity holds the Cognitive Services OpenAI User role on the AOAI account. No keys. Token rate limiting. Each APIM subscription gets a tokens-per-minute budget. One runaway team can't starve everyone else. Semantic caching. The inbound policy embeds the prompt using the embedding deployment, queries the Redis-backed APIM cache for a vector match above the 0.85 threshold, and short-circuits the AOAI call on a hit. The outbound azure-openai-semantic-cache-store writes successful misses back. Per-call metric emission. Every call pushes PromptTokens , CompletionTokens , and TotalTokens to Application Insights as custom metrics tagged with the APIM subscription, the API, the operation, and the client IP. That's your chargeback dashboard, ready to query. The whole thing is XML. None of it is in your agent code. Deploying it azd auth login azd up azd up provisions a P0v3 App Service Plan with the web app and a staging slot, an AOAI account with gpt-4o-mini + text-embedding-3-small deployments, an APIM Developer SKU service with the two APIs and the policy XML wired up, an Azure Cache for Redis Basic C0 as the semantic-cache store, and a Log Analytics workspace + Application Insights. The postprovision hook fetches the APIM subscription key for the AI Gateway product and writes it into the App Service's APIM_SUBSCRIPTION_KEY setting (and the staging slot's, so slot swaps are clean). Be patient. Developer SKU APIM takes 30–45 minutes the first time. If you want to prototype faster, the sample supports Consumption SKU as a one-line flip: azd env set APIM_SKU Consumption azd provision Consumption provisions in about a minute and is great for sketching. Verify your specific policies are supported there before you ship it. Governing it like a grown-up The toy version of this post stops at "look, semantic cache works." The version your platform engineering lead wants to see goes further. Per-team chargeback. The token-emit policy tags every call with the APIM subscription ID. Issue one subscription per team, hand it over with a quota, and your monthly chargeback report is a KQL query: customMetrics | where timestamp > startofmonth(now()) | where name == "TotalTokens" | summarize Tokens=sum(valueSum) by Team=tostring(customDimensions["Subscription ID"]) | extend USD = Tokens * 0.00015 / 1000 // gpt-4o-mini blended rate | order by USD desc Content safety as a policy plug-in. Add an llm-content-safety block to the inbound policy and point it at an Azure AI Content Safety resource — every prompt and response gets moderated before reaching agents or end users. The sample doesn't deploy Content Safety by default (to keep the demo cost-free), but the README has the one-line bicep + one-block policy delta. Circuit breaker + multi-region failover. Add a second AOAI backend in a different region and an APIM backend pool, give the pool a circuit-breaker rule, and your agents inherit failover with zero code changes. Rate-limit MCP traffic too. The MCP API has its own policy with rate-limit-by-key , so a runaway agent can't pin the MCP server with a hot loop. None of these are gymnastics. They're one policy block each. The pattern is the same every time: write policy at the gateway, leave the agent code alone. Proving it works After azd up finishes, two checks. First, hit the agent endpoint: curl -sS -X POST "$(azd env get-value WEB_URI)/agent/chat" \ -H 'Content-Type: application/json' \ -d '{"message": "How does App Service horizontally scale an MCP server?"}' | jq You should see a reply that cites the instance ID (the agent calls whoami and summarize_app_service_doc to ground its answer) and a tool_calls array showing the agent's reasoning trace. Second, run the k6 load test: export BASE_URL="$(azd env get-value WEB_URI)" export APIM_SUBSCRIPTION_KEY="$(azd env get-value APIM_SUBSCRIPTION_KEY)" k6 run loadtest/k6-gateway.js The script hits /agent/chat with a small pool of semantically-similar prompts. After a 30-second warmup, the steady phase should report a cache-hit ratio above 30%: APIM AI Gateway — k6 summary ───────────────────────────── Cache hits : 412 Cache misses : 88 Hit ratio : 82.4% Cross-check in App Insights: ApiManagementGatewayLogs | where TimeGenerated > ago(15m) | where ApiId == "aoai" | extend cache = tostring(parse_json(ResponseHeaders)["x-llm-cache-status"]) | summarize count() by cache, bin(TimeGenerated, 1m) | render columnchart A solid bar of hits next to a smaller bar of misses is the gateway earning its keep. "Run anything" — the proof Here's the part where I cash the check the title wrote. The agent module is the easiest thing in this stack to replace. Three changes to ship the same demo on Pydantic AI: # requirements.txt - agent-framework-core==1.5.0 - agent-framework-openai==1.5.0 + pydantic-ai==0.4.0 # agent/agent.py from pydantic_ai import Agent from pydantic_ai.models.openai import OpenAIModel def build_agent(): model = OpenAIModel( "gpt-4o-mini", base_url=f"{os.environ['APIM_GATEWAY_URL']}/openai/deployments/gpt-4o-mini", api_key=os.environ["APIM_SUBSCRIPTION_KEY"], ) return Agent(model, system_prompt=SYSTEM_INSTRUCTIONS, tools=build_tools()) That's it. build_tools() returns the same list of async callables (Pydantic AI accepts plain Python functions as tools, same as Agent Framework). LangGraph works the same way — wire build_tools() into a ToolNode , point ChatOpenAI at the APIM gateway URL, done. Every APIM policy still fires. Every token metric still emits. Every cache hit still hits. The gateway is the boundary; the runtime above it is fungible. What AgentCore gets right I want to land this without spin. AgentCore's per-session microVM isolation is genuinely interesting — it's a stronger sandboxing story than running multiple agents in shared App Service workers, and it matters for multi-tenant SaaS where agents execute arbitrary user code or call third-party tools you don't fully trust. The managed long-term memory primitive is also a real convenience; Azure has the building blocks (Cosmos DB, AI Search, Cognitive Search) but they aren't pre-wired into a single "agent memory" API the way AgentCore's are. Where the App Service + APIM + MCP composition genuinely wins: Open standards. MCP is a public protocol with implementations across the industry. AgentCore's tool layer is AWS-native. No new runtime to learn. App Service is the same App Service. Your existing CI/CD, your existing security review, your existing monitoring all apply. Bring your own framework. Pydantic AI, LangGraph, Agent Framework, Semantic Kernel, AutoGen, CrewAI — they all work, because the App Service doesn't care what's running inside the container. Existing enterprise footprint. VNet integration, private endpoints, managed identity, deployment slots, sidecars, Easy Auth. None of it is new for App Service. You inherit a decade of platform work. The right framing isn't "Azure's answer to AgentCore." It's that Azure is making a different bet: that enterprises will value the composability of services they already trust over the convenience of a new proprietary runtime. For some, that bet is probably correct. For a few — multi-tenant agent marketplaces, untrusted code execution — AgentCore's isolation model is a better fit. Pick the one that matches your threat model. What's next If you ship the sample and want to compare notes, the repo is at app-service-ai-gateway-mcp-apim-python.329Views0likes0CommentsTransforming Retry-After Headers in Azure APIM: A Step-by-Step Guide
In this blog post, you'll learn how to customize the Retry-After response header in Azure API Management (APIM) rate-limiting policies, enhancing your API's flexibility and user experience. While it does not delve into the specifics of the rate-limit or rate-limit-by-key policies, it provides a practical guide for altering the Retry-After header. For detailed information on the rate-limit policy, please visit Azure API Management policy reference - rate-limit | Microsoft Learn. Understanding Rate Limiting: Protecting Your API from Overuse Rate limiting is a technique used to control how often requests are made to a resource. It helps prevent excessive or abusive use and ensures the resource is available to all users. Rate limiting is often used to protect against denial-of-service (DoS) attacks, which aim to overwhelm a network or server with too many requests, making it unavailable to legitimate users. It can also limit the number of requests from individual users to prevent a single user or group from monopolizing the resource. Azure API Management Rate Limit Policies In Azure, access to APIs is controlled using the following API Management policies: rate-limit rate-limit-by-key The implementation of these policies is straightforward but somewhat limited and less flexible, in my opinion. The Default Retry-After Header: What You Need to Know In the Azure APIM rate-limit policy documentation, it is mentioned that once the client's requests are throttled, the service starts returning a response header containing the time interval (in seconds) after which the client should retry the request. The default name of the header is Retry-After, and this name can be customized. For example: Retry-After: 60 However, in one use case for a customer, there was a requirement to provide a timestamp instead of a time interval as a header value. For example: Retry-After: 2020-05-04T12:23:41.6181792Z To implement this, the header value needs to change, but this is something that the rate-limit policy does not support. Customizing the Retry-After Header The basis for changing the response header value lies in the on-error scope. You can implement a policy like the following: <inbound> <base> <rate-limit-by-key calls="1000" renewal-period="60" counter-key="@(context.Request.IpAddress)" increment-condition="@(context.Response.StatusCode == 200)" remaining-calls-variable-name="remainingCallsPerIP" retry-after-header-name="Retry-After" remaining-calls-header-name="Requests-Remaining" retry-after-variable-name="retryAfter"> </rate-limit-by-key></inbound> <on-error> <choose> <when condition="@(context.LastError.Reason == " ratelimitexceeded")"=""> <set-header name="Retry-After" exists-action="override"> <value>@(DateTime.UtcNow.AddSeconds(context.Variables.GetValueOrDefault<int>("retryAfter")).ToString("o"))</int></value> </set-header> </when> </choose> <base> </on-error> Please refer to APIM predefined errors for policies here: Error handling in Azure API Management policies | Microsoft Learn Here, the key point is that whenever the APIM rate limit is reached, an error occurs, which is then captured in the on-error scope. To set or override the response header only in rate-limiting scenarios, you need to filter using the RateLimitExceeded error reason. After that, the exact error value is determined by adding the current UTC timestamp with the value of the retryAfter variable in seconds. With this, you have now customized the Retry-After header with a timestamp instead of a time interval (in seconds). Conclusion In conclusion, customizing the Retry-After response header in Azure API Management can significantly enhance the flexibility and user experience of your API services. By leveraging the on-error scope and handling the RateLimitExceeded error, you can provide a more informative and user-friendly response to clients when rate limits are exceeded. This approach not only meets specific customer requirements but also demonstrates the adaptability of Azure APIM in handling various scenarios. With these steps, you can ensure that your API remains robust, efficient, and user centric.521Views1like1CommentExciting Updates Coming to Conversational Diagnostics (Public Preview)
Last year, at Ignite 2023, we unveiled Conversational Diagnostics (Preview), a revolutionary tool integrated with AI-powered capabilities to enhance problem-solving for Windows Web Apps. This year, we're thrilled to share what’s new and forthcoming for Conversational Diagnostics (Preview). Get ready to experience a broader range of functionalities and expanded support across various Azure Products, making your troubleshooting journey even more seamless and intuitive.400Views0likes0CommentsAzure Logic App workflow (Standard) Resubmit and Retry
Hello Experts, A workflow is scheduled to run daily at a specific time and retrieves data from different systems using REST API Calls (8-9). The data is then sent to another system through API calls using multiple child flows. We receive more than 1500 input data, and for each data, an API call needs to be made. During the API invocation process, there is a possibility of failure due to server errors (5xx) and client errors (4xx). To handle this, we have implemented a "Retry" mechanism with a fixed interval. However, there is still a chance of flow failure due to various reasons. Although there is a "Resubmit" feature available at the action level, I cannot apply it in this case because we are using multiple child workflows and the response is sent back from one flow to another. Is it necessary to utilize the "Resubmit" functionality? The Retry Functionality has been developed to handle any Server API errors (5xx) that may occur with Connectors (both Custom and Standard), including client API errors 408 and 429. In this specific scenario, it is reasonable to attempt retrying or resubmitting the API Call from the Azure Logic Apps workflow. Nevertheless, there are other situations where implementing the retry and resubmit logic would result in the same error outcome. Is it acceptable to proceed with the Retry functionality in this particular scenario? It would be highly appreciated if you could provide guidance on the appropriate methodology. Thanks -Sri1.1KViews0likes1CommentReimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach - drawing from proven practices in internal Azure operations - the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams operations, delivering strong ROI for teams seeking sustainable AIOps. An Operations Agent that adapts to your playbooks Azure SRE Agent is an AI powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues, to automating operational workflows and seamless integration with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update. What’s New: Automation, Integration, and Extensibility Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release: No-code Sub-Agent Builder: Rapidly create custom automations without writing code. Flexible, event-driven triggers: Instantly respond to incidents and operational changes. Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources. Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP. Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box. Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs. Sub-Agent Builder: Custom Automation, No Code Required Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions. Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage. Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more. Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires. Flexible Triggers: Automate on Your Terms Invoke the agent to respond automatically to mission-critical events, not wait for manual commands. This feature helps speed up incident response and eliminate missed opportunities for efficiency. Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to Incident Management and Ticketing systems like PagerDuty and ServiceNow, Observability Alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks. Additional trigger sources such as GitHub issues, Azure DevOps pipelines, email, etc. will be added over time. This means automation can start exactly when and where you need it. Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events - like deployments, incidents, or customer requests. Vital for reducing downtime, it ensures that business-critical actions happen automatically and promptly. Expanded Data Connectivity: Unified Observability and Troubleshooting Integrate data, enabling comprehensive diagnostics and troubleshooting and faster, more informed decision-making by eliminating silos and speeding up issue resolution. Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation. Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach utilizes your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents. Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments. Custom Actions: Automate Anything, Anywhere Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration. Out-of-the-Box Actions: Instantly automate common tasks like running azcli, kubectl, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead. Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication. Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence. Prebuilt Operations Scenarios Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction. Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure CosmosDB, Azure VMs, etc. Support for additional resource types is being added continually, please see product documentation for the latest information. Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance. Handle Complex Investigations: Enable the deep investigation mode that uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues. Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management. Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs - capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting, as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment. GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle. Ready to get started? Azure SRE Agent home page Product overview Pricing Page Pricing Calculator Pricing Blog Demo recordings Deployment samples What’s Next? Give us feedback: Your feedback is critical - You can Thumbs Up / Thumbs Down each interaction or thread, or go to the “Give Feedback” button in the agent to give us in-product feedback - or you can create issues or just share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent. We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!4.7KViews1like0CommentsRunning Self-hosted APIM Gateways in Azure Container Apps with VNet Integration
With Azure Container Apps we can run containerized applications, completely serverless. The platform itself handles all the orchestration needed to dynamically scale based on your set triggers (such as KEDA) and even scale-to-zero! I have been working a lot with customers recently on using Azure API Management (APIM) and the topic of how we can leverage Azure APIM to manage our internal APIs without having to expose a public IP and stay within compliance from a security standpoint, which leads to the use of a Self-Hosted Gateway. This offers a managed gateway deployed within their network, allowing a unified approach in managing their APIs while keeping all API communication in-network. The self-hosted gateway is deployed as a container and in this article, we will go through how to provision a self-hosted gateway on Azure Container Apps specifically. I assume there is already an Azure APIM instance provisioned and will dive into creating and configuring the self-hosted gateway on ACA. Prerequisites As mentioned, ensure you have an existing Azure API Management instance. We will be using the Azure CLI to configure the container apps in this walkthrough. To run the commands, you need to have the Azure CLI installed on your local machine and ensure you have the necessary permissions in your Azure subscription. Retrieve Gateway Deployment Settings from APIM First, we need to get the details for our gateway from APIM. Head over to the Azure portal and navigate to your API Management instance. - In the left menu, under Deployment and infrastructure, select Gateways. - Here, you'll find the gateway resource you provisioned. Click on it and go to Deployment. - You'll need to copy the Gateway Token and Configuration endpoint values. (these tell the self-hosted gateway which APIM instance and Gateway to register under) Create a Container Apps Environment Next, we need to create a Container Apps environment. This is where we will create the container app in which our self-hosted gateway will be hosted. Using Azure CLI: Create our VNet and Subnet for our ACA Environment As we want access to our internal APIs, when we create the container apps environment, we need to have the VNet created with a subnet available. Note: If we’re using Workload Profiles (we will in this walkthrough), then we need to delegate the subnet to Microsoft.App/environments. # Create the vnet az network vnet create --resource-group rgContosoDemo \ --name vnet-contoso-demo \ --location centralUS \ --address-prefix 10.0.0.0/16 # Create the subnet az network vnet subnet create --resource-group rgContosoDemo \ --vnet-name vnet-contoso-demo \ --name infrastructure-subnet \ --address-prefixes 10.0.0.0/23 # If you are using a workload profile (we are for this walkthrough) then delegate the subnet az network vnet subnet update --resource-group rgContosoDemo \ --vnet-name vnet-contoso-demo \ --name infrastructure-subnet \ --delegations Microsoft.App/environments Create the Container App Environment in out VNet az containerapp env create --name aca-contoso-env \ --resource-group rgContosoDemo \ --location centralUS \ --enable-workload-profiles Deploy the Self-Hosted Gateway to a Container App Creating the environment takes about 10 minutes and once complete, then comes the fun part—deploying the self-hosted gateway container image to a container app. Using Azure CLI: Create the Container App: az containerapp create --name aca-apim-demo-gateway \ --resource-group rgContosoDemo \ --environment aca-contoso-env \ --workload-profile-name "Consumption" \ --image "mcr.microsoft.com/azure-api-management/gateway:2.5.0" \ --target-port 8080 \ --ingress 'external' \ ---env-vars "config.service.endpoint"="<YOUR_ENDPOINT>" "config.service.auth"="<YOUR_TOKEN>" "net.server.http.forwarded.proto.enabled"="true" Here, you'll replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with the values you copied earlier. Configure Ingress for the Container App: az containerapp ingress enable --name aca-apim-demo-gateway --resource-group rgContosoDemo --type external --target-port 8080 This command ensures that your container app is accessible externally. Verify the Deployment Finally, let's make sure everything is running smoothly. Navigate to the Azure portal and go to your Container Apps environment. Select the container app you created (aca-apim-demo-gateway) and navigate to Replicas to verify that it's running. You can use the status endpoint of the self-hosted gateway to determine if your gateway is running as well: curl -i https://aca-apim-demo-gateway.sillytreats-abcd1234.centralus.azurecontainerapps.io/status-012345678990abcdef Verify Gateway Health in APIM You can navigate in the Azure Portal to APIM and verify the gateway is showing up as healthy. Navigate to Deployment and Infrastructure, select Gateways then choose your Gateway. On the Overview page you’ll see the status of your gateway deployment. And that’s it! You've successfully deployed an Azure APIM self-hosted gateway in Azure Container Apps with VNet integration allowing access to your internal APIs with easy management from the APIM portal in Azure. This setup allows you to manage your APIs efficiently while leveraging the scalability and flexibility of Azure Container Apps. If you have any questions or need further assistance, feel free to ask. How are you feeling about this setup? Does it make sense, or is there anything you'd like to dive deeper into?2.5KViews3likes3CommentsFrom Timeouts to Triumph: Optimizing GPT-4o-mini for Speed, Efficiency, and Reliability
The Challenge Large-scale generative AI deployments can stretch system boundaries — especially when thousands of concurrent requests require both high throughput and low latency. In one such production environment, GPT-4o-mini deployments running under Provisioned Throughput Units (PTUs) began showing sporadic 408 (timeout) and 429 (throttling) errors. Requests that normally completed in seconds were occasionally hitting the 60-second timeout window, causing degraded experiences and unnecessary retries. Initial suspicion pointed toward PTU capacity limitations, but deeper telemetry revealed a different cause. What the Data Revealed Using Azure Data Explorer (Kusto), API Management (APIM) logs, and OpenAI billing telemetry, a detailed investigation uncovered several insights: Latency was not correlated with PTU utilization: PTU resources were healthy and performing within SLA even during spikes. Time-Between-Tokens (TBT) stayed consistently low (~8–10 ms): The model was generating tokens steadily. Excessive token output was the real bottleneck: Requests generating 6K–8K tokens simply required more time than allowed in the 60-second completion window. In short — the model wasn’t slow; the workload was oversized. The Optimization Opportunity The analysis opened a broader optimization opportunity: Balance token length with throughput targets. Introduce architectural patterns to prevent timeout or throttling cascades under load. Enforce automatic token governance instead of relying on client-side discipline. The Solution Three engineering measures delivered immediate impact: token optimization, spillover routing, and policy enforcement. Right-size the Token Budget Empirical throughput for GPT-4o-mini: ~33 tokens/sec → ~2K tokens in 60s. Enforced max_tokens = 2000 for synchronous requests. Enabled streaming responses for longer outputs, allowing incremental delivery without hitting timeout limits. Enable Spillover for Continuity Implemented multi-region spillover using Azure Front Door and APIM Premium gateways. When PTU queues reached capacity or 429s appeared, requests were routed to Standard deployments in secondary regions. The result: graceful degradation and uninterrupted user experience. Govern with APIM Policies Added inbound policies to inspect and adjust max_tokens dynamically. On 408/429 responses, APIM retried and rerouted traffic based on spillover logic. The Results After optimization, improvements were immediate and measurable: Latency Reduction: Significant improvement in end-to-end response times across high-volume workloads Reliability Gains: 408/429 errors fell from >1% to near zero. Cost Efficiency: Average token generation decreased by ~60%, reducing per-request costs. Scalability: Spillover routing ensured consistent performance during regional or capacity surges. Governance: APIM policies established a reusable token-control framework for future AI workloads. Lessons Learned Latency isn’t always about capacity: Investigate workload patterns before scaling hardware. Token budgets define the user experience: Over-generation can quietly break SLA compliance. Design for elasticity: Spillover and multi-region routing maintain continuity during spikes. Measure everything: Combine KQL telemetry, latency and token tracking for faster diagnostics. The Outcome By applying data-driven analysis, architectural tuning, and automated governance, the team turned an operational bottleneck into a model of consistent, scalable performance. The result: Faster responses. Lower costs. Higher trust. A blueprint for building resilient, high-throughput AI systems on Azure.595Views4likes0Comments