python

109 Topics

MCP Just Went Stateless — What the 2026 Spec Changes About Scaling on App Service
A couple of months ago I wrote about scaling MCP servers behind App Service's built-in load balancer. The trick back then was to lean on stateless HTTP transport so any instance could serve any request — and to make sure you turned off ARR affinity so the load balancer was actually free to spread traffic around. That post still works. But the MCP spec just caught up to it in a big way. The 2026-07-28 release candidate is the largest revision of the Model Context Protocol since it launched, and the headline change is exactly the thing we were working around: MCP is now stateless at the protocol layer. The handshake is gone, the session header is gone, and the sticky-routing-and-shared-session-store dance that horizontal deployments used to need is no longer part of the protocol at all. If you're hosting an MCP server on App Service, this is good news — and it means a few of the steps from my last post are now things the protocol does for you. Here's what changed, and what (if anything) you need to do about it. Here's the before and after, straight from the spec. In 2025-11-25 , the client POST s an initialize call to /mcp first and gets a session ID back: {"jsonrpc":"2.0","id":1,"method":"initialize", "params":{"protocolVersion":"2025-11-25","capabilities":{}, "clientInfo":{"name":"my-app","version":"1.0"}}} Heads up on timing: 2026-07-28 is a release candidate as I write this; the final spec ships July 28, 2026. It contains breaking changes, so treat this as "get ready" guidance rather than "rip everything out today." Quick recap: how we scaled MCP before In the original post, the recipe looked like this: Run the MCP server in stateless HTTP mode (the 2025-11-25 transport). Scale App Service out to N instances (the sample used three). Set clientAffinityEnabled: false so there's no ARR affinity cookie pinning a client to one instance. If you genuinely needed cross-request state, externalize it — typically into Azure Cache for Redis — so every instance saw the same data. Watch traffic spread across instances in Application Insights via cloud_RoleInstance . The catch: even in "stateless HTTP" mode, the 2025-11-25 protocol still started every connection with an initialize handshake and handed back an Mcp-Session-Id that the client had to send on every follow-up request. That session ID pinned a client to whichever instance issued it — so to scale cleanly you either kept affinity on (and gave up even load balancing) or did real work to share session state across instances. That's the part the 2026 spec deletes. What the 2026 spec actually changes The handshake and the session are gone Two proposals do the heavy lifting: SEP-2575 removes the initialize / initialized handshake. The protocol version, client info, and client capabilities that used to be exchanged once at connect time now ride along in _meta on every request. A new server/discover method lets a client ask for server capabilities when it actually wants them. SEP-2567 removes the Mcp-Session-Id header and the protocol-level session that came with it. With both gone, any MCP request can land on any instance. The sticky routing and shared session stores that horizontal deployments needed before just aren't required at the protocol layer anymore. Here's the before and after, straight from the spec. In 2025-11-25 , the client POST s an initialize call to /mcp first and gets a session ID back: {"jsonrpc":"2.0","id":1,"method":"initialize", "params":{"protocolVersion":"2025-11-25","capabilities":{}, "clientInfo":{"name":"my-app","version":"1.0"}}} …then every later call has to carry the Mcp-Session-Id header the server handed back, which pins it to that instance: {"jsonrpc":"2.0","id":2,"method":"tools/call", "params":{"name":"search","arguments":{"q":"otters"}}} In 2026-07-28 , the same tool call is one self-contained request that any instance can answer. The routing info rides in headers — MCP-Protocol-Version , Mcp-Method , and Mcp-Name — and the body carries everything else: {"jsonrpc":"2.0","id":1,"method":"tools/call", "params":{"name":"search","arguments":{"q":"otters"}, "_meta":{"io.modelcontextprotocol/clientInfo":{"name":"my-app","version":"1.0"}}}} No handshake, no session ID, nothing to pin. Traffic you can route and cache at the edge A few smaller changes make this traffic much friendlier to the infrastructure App Service already gives you: Routable headers (SEP-2243): Streamable HTTP now requires Mcp-Method and Mcp-Name headers, so load balancers, gateways, and rate-limiters can route or throttle on the operation without cracking open the request body. (Servers reject requests where the headers and body disagree.) Cacheable lists (SEP-2549): tools/list and resource-read results now carry ttlMs and cacheScope , modeled on HTTP Cache-Control . Clients know exactly how long a tool list is fresh and whether it's safe to share across users — no more holding an SSE stream open just to learn the list changed. Traceable calls (SEP-414): W3C Trace Context ( traceparent , tracestate , baggage ) propagation in _meta is now documented with fixed key names. A trace that starts in the host app can follow a tool call through the client SDK, your MCP server, and whatever it calls downstream — and show up as one span tree in any OpenTelemetry backend, including Application Insights. That last one pairs really nicely with the App Insights setup from the original sample, which already tags spans with cloud_RoleInstance . Why this is easier on App Service now App Service's built-in load balancer has always wanted to round-robin your requests. The thing stopping it from doing that cleanly with MCP was the protocol's own session affinity. Now that the protocol is stateless: No affinity tuning to reason about. You still want clientAffinityEnabled: false , but there's no longer a protocol session fighting it. Any instance serves any request, for real. Scale from 3 to 10 instances and the load balancer just spreads the work — no shared session store required for protocol state. Less Redis glue. In the old model, Redis was often there to share protocol session state. That reason is gone (see the next section for what Redis is still great for). "Stateless protocol" doesn't mean "stateless app" This is the part I want to be really clear about, because it's easy to over-read the headline. Removing the protocol session does not mean your application can't have state. It means the protocol stops carrying state for you. If your server needs to remember something across calls, you do what HTTP APIs have always done: mint an explicit handle and let the model pass it back as an argument. The spec calls this the explicit-handle pattern. A tool returns a basket_id (or browser_id , or whatever), and later calls include that ID as a normal parameter: // 1) create returns a handle {"name": "create_basket", "arguments": {}} // -> { "basket_id": "b_12345" } // 2) later calls pass it back as an ordinary argument {"name": "add_item", "arguments": {"basket_id": "b_12345", "sku": "ABC"}} The nice side effect: the model can see the handle, compose it across tools, and hand it off between steps — in ways that session state hidden in transport metadata never really allowed. So where does Redis fit now? Exactly where it always belonged — your application's data, not the protocol's plumbing: Backing store for those explicit handles (what's actually in basket b_12345 ). Caching expensive lookups or model responses across instances. App-level conversation memory or rate-limit counters. Stateless protocol, stateful application. You externalize state because your app needs it shared, not because the transport forces you to. Migrating an existing MCP server on App Service If you deployed the original sample (or something like it), here's the punch list to get to the 2026 model. The good news: the App Service / infra side barely changes — most of the work is in the protocol layer your SDK handles for you. App Service config — mostly already done: Keep clientAffinityEnabled: false . (Still the right call.) Keep scaling out to N instances. Nothing here changes. Keep Application Insights + OpenTelemetry — and lean into the new Trace Context key names for cleaner end-to-end traces. Protocol layer — the real work: Update to an SDK build that speaks 2026-07-28 . The handshake and session handling go away; your server reads protocol version and client info from _meta per request instead of from an initialize exchange. Emit ttlMs / cacheScope on tools/list and resource reads so clients (and your gateway) can cache them. Make sure your server honors / validates the Mcp-Method and Mcp-Name headers. If you were storing anything keyed off Mcp-Session-Id , move it to the explicit-handle pattern (handle in, handle out, state in Redis/Cosmos/etc.). Audit for the breaking bits: tasks/list is removed, Roots/Sampling/Logging are deprecated, and the "resource not found" error code moves from -32002 to the standard -32602 . I built a standalone companion sample for exactly this — the 2026-07-28 version of the original, with the handshake gone, everything read from _meta , server/discover implemented, and the explicit-handle pattern shown in a real tool. Link below. Try it yourself I built a companion sample for this post: a FastAPI MCP server that speaks 2026-07-28 natively — no handshake, no session — running on three App Service instances behind the built-in load balancer, with a staging slot, App Insights, a spec-compliant client, and a k6 load test: 👉 seligj95/app-service-mcp-stateless-scale-2026-python azd auth login azd up That provisions a Premium v3 plan with capacity: 3 , the web app with clientAffinityEnabled: false , a staging slot, and Log Analytics + Application Insights. No initialize , no Mcp-Session-Id anywhere — discovery is a single server/discover call, and every request carries its own protocol version and client info in _meta . The part I like best is the tally tool. It keeps a running total across calls using an explicit, signed handle instead of a session — so you can watch the total stay correct even as the load balancer routes each call to a different instance: +10 -> total=10 served_by=2103650c... +5 -> total=15 served_by=08fc7022... (different instance, total still right) +100 -> total=115 served_by=08fc7022... That's the stateless handle pattern from earlier, made concrete: state travels with the request, not the connection. Then watch the load spread in Application Insights: requests | where timestamp > ago(15m) | where name contains "/mcp" | summarize count() by cloud_RoleInstance Want the 2025-11-25 version for comparison? That's the original Part 1 sample: seligj95/app-service-mcp-stateless-scale-python. Diff the two main.py files and you can see the handshake and session handling simply disappear. The takeaway When I wrote the first post, "make MCP stateless so App Service can load-balance it" was a pattern you had to apply. With the 2026 spec, it's just how MCP works. The protocol deleted the exact friction we were routing around — which means hosting a horizontally scaled MCP server on App Service is now closer to "deploy a normal web app and scale it out" than ever. If you're already running MCP on App Service: you did the hard part early. The spec just made it official. Got an MCP server running on App Service? I'd love to hear how the migration goes — drop a comment.
jordanselig
Jun 23, 2026 Place Apps on Azure Blog
772Views
0likes
0Comments
What's new in Azure App Service at #MSBuild 2026
At Microsoft Build 2026, Azure App Service introduced a powerful set of updates designed to help organizations accelerate their journey into AI, without increasing complexity or cost. These innovations focus on one clear business outcome: enabling teams to build, deploy, and scale AI-powered applications and agents faster, more securely, and with greater operational efficiency. A key highlight is the new Easy AI experience, which allows existing web apps to become AI-ready with no rearchitecting required. With capabilities like built-in Model Context Protocol (MCP), developers can instantly expose app functionality as agent-ready endpoints, enabling AI agents to interact with business logic securely and seamlessly. This dramatically reduces development time, allowing teams to move from idea to intelligent application in a fraction of the usual effort. Security and compliance are also strengthened with the general availability of Isolated v4 for Azure App Service Environments, delivering improved performance for customers that need single-tenant isolation and strong data residency guarantees. For enterprises operating in regulated industries, this ensures AI applications meet strict governance requirements without sacrificing scalability or speed. For modernization scenarios, Managed Instance on Azure App Service simplifies the migration of legacy applications, including those with OS-level dependencies. Faster restarts, enhanced diagnostics, and AI-assisted migration workflows help organizations modernize existing systems cost-effectively—avoiding expensive rewrites while unlocking AI capabilities. Recent updates include an AI-assisted approach to migrating legacy IIS applications using a multi-agent workflow powered by MCP. Managed Instance is supported on both Premium v4 and Isolated v4, laying the foundation for a modern compute infrastructure across the board. Operational efficiency is further enhanced through platform and CLI improvements designed for the “agent era.” From structured deployment diagnostics to optimized Python pipelines delivering faster deployments, these updates reduce friction and infrastructure overhead, lowering total cost of ownership. Together, these innovations position Azure App Service as a future-ready platform where businesses can rapidly build intelligent, agent-driven applications securely, efficiently, and at scale. 👉 Learn more in the full announcement: Deep dive into Azure App Service Build 2026 updates
Mayunk_Jain
Jun 08, 2026 Place Apps on Azure Blog
1.4KViews
0likes
0Comments
Anyscale on Azure: Powering Enterprise AI at Massive Scale on Azure Kubernetes Service
Somewhere on your AI platform team, an engineer is on call this weekend — not for the model, not for the training run, but for the integration code holding five separate AI processing systems together. Data preparation on one. Training on a second. Evaluation on a third. Serving on a fourth. Observability bolted on across all of it. The glue between them has quietly evolved into a production system of its own, complete with its own failure modes and its own pager. This is what running AI at scale looks like for most enterprises in 2026. To process the full breadth of AI workloads, teams don’t have one platform, but a stack of multiple compute engines — stitched together and monitored around the clock. Training failures become increasingly costly as multi-node GPU clusters remain underutilized and difficult to operate. Inference costs climb in a straight line when they should be bending the other way. And the accelerators underneath, at six figures a year per node, sit at 30–40% utilization. None of this is a model problem. It is a systems problem, and it exposes a divide that is widening across the industry. The AI shift: Moving from API inference calls only to end-to-end AI Most enterprises start an AI journey by calling hosted model APIs. It’s the fastest way to experiment and ship. But as adoption grows, inference costs scale in a straight line while differentiation remains limited. The organizations pulling ahead are doing more than consuming models. They are customizing them with proprietary data, operating them at scale, and owning the infrastructure between their data and their models. Their unit economics improve as they scale. The dividing line isn’t budget. It isn’t ambition. It is a single architectural decision: whether the layer between your data and your models is something you rent in pieces or run as a single system. That unified system for end-to-end AI, almost without exception, is built on one runtime: Ray, an open-source framework widely adopted by AI-natives such as Cursor, Mistral and xAI to act as the engine that powers many of their workloads from multimodal data processing to reinforcement learning. Anyscale on Azure: Build and run end-to-end AI on your Azure subscription Anyscale on Azure brings the distributed compute runtime the AI industry has converged on — Ray— into your Azure tenant as an Azure Native service, that includes purpose-built developer tooling and unified pane for cluster management, built through deep engineering collaboration between Anyscale and Microsoft. Unlike other processing engines which either only support one hardware type (e.g. CPUs) or focus on a single workload (e.g. inference), Ray turns a heterogeneous cluster of CPUs and GPUs into a single Python runtime composing data preparation, distributed training, fine-tuning, reinforcement learning, high-throughput inference, and agentic execution as one program, not five interlocking systems. Anyscale created Ray and stewards the open-source Ray project, now governed by the PyTorch Foundation; the Anyscale Runtime is the production-grade layer that enterprises can utilize on critical paths from day one, bringing managed cluster operations, enterprise-grade support, and the operational reliability needed to run AI and data workloads at scale. On Azure, that runtime executes on your Azure Kubernetes Service (AKS) clusters, inside your subscription, and under Microsoft Entra ID workload identity. Your data, models, and weights never leave your cloud, and consumption is billed through Azure with drawdown against your existing Azure commitment (MACC). Sovereignty isn't a label bolted on after the fact. It is the architectural starting point: customer-owned data and models in the customer-owned tenant and governance boundary. The variable per-token economics of hosted APIs are replaced with compute you govern directly. Your proprietary data becomes a compounding advantage rather than a payload shipped to a third-party endpoint. A single runtime for the full AI lifecycle The cost profile of enterprise AI is largely architectural. Fragmented stacks — separate systems for prep, training, evaluation, and serving — produce a predictable set of failure modes such as Idle GPU time, Integration code and cross-system data movement. The result: production GPU utilization only in the 30–40% range, against accelerators that cost six figures per node per year. On the same fleet, Anyscale customers run those accelerators at 80%+ sustained utilization and report 40–60% lower GPU spend versus static, single-tenant clusters — driven by fractional GPU allocation (down to 0.2 of a device), bin-packing across complementary memory and compute profiles, gang scheduling for distributed training, priority-aware preemption that lets production inference take precedence over ad-hoc training, and spot integration with checkpoint-aware preemption so long-running jobs survive reclamation without lost work. Anyscale on Azure replaces this with a single Ray-powered runtime that spans the lifecycle as one distributed computation graph: Ray Data (distributed preparation) → Ray Train (fault-tolerant training) → Ray Tune (hyperparameter search) → Ray Serve (inference) — under one managed control plane. On top of open-source Ray, the Anyscale Runtime adds fault-tolerant training with checkpoint/restart, optimized scheduling, faster cluster bring-up, inference-aware autoscaling, and per-stage observability. Ray is the unifying layer that, rather than replacing, streamlines distributed processing of the framework stack the AI industry already uses: PyTorch, Hugging Face Transformers, FSDP, DeepSpeed, and Megatron for training, vLLM and SGLang for high-throughput inference with continuous batching, paged attention, and speculative decoding. Ray Train orchestrates the three parallelism patterns modern training requires — data parallel, model parallel, and hybrid 3D parallel (data + tensor + pipeline) — for trillion-parameter models, without requiring teams to write custom distributed code. The architectural payoff is direct: a single Python program defines a graph spanning CPU-heavy preparation and GPU-heavy training. The model produced by Ray Train is served by Ray Serve in the same cluster, against the same storage. The operational, identity, and observability surface is unified instead of fragmented. What enterprises deploy with Anyscale on Azure There are five workloads that power the development of modern AI systems, spanning data processing, training, inference, and simulation. But in most environments, each depends on separate engines, frameworks, and orchestration layers. The resulting fragmentation drives up infrastructure spend, latency, and engineering complexity. This makes a single Ray-based runtime under Anyscale’s managed control plane the operationally rational choice. Anyscale on Azure provides a complete platform to build and deploy AI applications using the same APIs as open-source Ray. While the data plane runs inside the customer’s AKS cluster, the managed control plane provides a unified interface for development, debugging, and cluster operations. AI in your trust boundary by design: the architecture Anyscale on Azure is an Azure Native product — discoverable via the Azure portal and provisioned through Azure Resource Manager with every resource tagged, scoped, and policy‑bound like any other in your subscription. Anyscale on Azure is a split-plane deployment: Control plane (managed by Anyscale) — scheduling, jobs, services, workspaces, and observability. Data plane (your Azure subscription) — Ray clusters run on your AKS, in your VNet, on your storage (Azure Blob / ADLS Gen2 via BlobFuse2). The trust boundary is what matters — more than any individual data plane feature — for regulated workloads (financial services, healthcare, public sector) and any enterprise where proprietary data is the differentiation. The execution model: Workloads run inside your AKS cluster — your subscription, your VNet. Model weights, training data, KV caches, checkpoints, and inference traffic never leave the boundary. Provisioning is ARM-native — resources tag, scope, and inherit Azure Policy like anything else in the subscription. Identity is Microsoft Entra ID end to end — workload identity issues pod credentials; RBAC governs access. No long-lived keys, no parallel secret store. Network controls are yours — Private Link, NSGs, Cilium-based Azure CNI policies, and customer-managed encryption keys via Key Vault. Audit is the Azure Activity Log — the same surface your compliance team already monitors. The Anyscale Operator is the only Anyscale-controlled component in your environment — it runs inside your AKS, communicates with the control plane via egress only, and accepts no inbound access from Anyscale. The result: code and data stay in your Azure subscription. Your existing compliance posture, audit surface, and data residency certifications carry forward — nothing new to attest. Billing rolls through the same Azure invoice with MACC drawdown — no second invoice, no parallel procurement. Production evidence Xoople planetary‑scale satellite imagery on Anyscale on Azure; multimodal AI turns spectral data into operational intelligence. "Anyscale lets our teams focus on models and outcomes rather than infrastructure, dramatically accelerating the path from experimentation to deployment," — Milos Colic, VP of Engineering, Xoople. Wayve trains the next generation of autonomous‑driving foundation models on Anyscale on Azure, running distributed ML and data pipelines across large CPU and GPU fleets. The operational driver is GPU‑capacity aggregation at a scale that no single region or cluster can deliver. Beyond Anyscale on Azure, the same Ray runtime is used in production at Cursor, Physical Intelligence, xAI, Coinbase, Bedrock Robotics, and Runway. Bedrock Robotics scaled compute 85x on Anyscale without linearly increasing costs. Currently with 12M+ weekly downloads (+400% YoY) and 42K+ GitHub stars and now openly governed under the PyTorch Foundation (Linux Foundation), Ray is becoming the de-factor open-source standard and is not a single-vendor runtime. Pricing Pricing is usage‑based and consolidates onto the same Azure invoice as the rest of the customer's subscription, including drawdown against existing Azure commitment (MACC): Azure infrastructure — standard Azure compute and GPU charges for the AKS substrate the workload runs on, scaling directly with actual usage. Anyscale service layer — pay‑as‑you‑go through Azure service meters with no upfront commitment, priced by CPU, memory, and GPU type. Where Anyscale on Azure fits Base-model intelligence is converging. Enterprises can buy access to the same frontier models, so the model itself is no longer the moat. What separates the enterprises pulling ahead is the layer underneath: how efficiently they run the full AI lifecycle at scale, how much compounding leverage they extract from their proprietary data, and whether they own the runtime that ties it all together. Anyscale on Azure is the Azure Native runtime layer for that posture — bringing the open-source distributed compute standard the AI industry has converged on into the same Azure governance, identity, and procurement model as the rest of the tenant. The shape of enterprise AI is settling. The teams pulling ahead are not the ones renting the most intelligence through APIs — they are the ones building and operating AI systems inside their own cloud, on their own data, under their own governance, and scaling those systems on the open distributed runtime the industry has already converged on. Anyscale on Azure is that runtime, delivered as an Azure Native product: Ray, productionized — the open‑source distributed compute standard for AI, hardened with the Anyscale Runtime, a managed control plane, and observability designed for foundation‑model‑scale workloads. One runtime, the full AI lifecycle — data preparation, training, fine‑tuning, reinforcement learning, inference, and agentic workloads in a single Python program, on a single substrate, with no cross‑system glue. Inside your Azure tenant, on the AKS you already run — customer‑owned data, customer‑owned models, customer‑owned governance. Entra identity, Azure RBAC, Private Link, Activity Log audit, and customer‑managed keys end to end. One Azure invoice — usage‑based pricing through the Marketplace with MACC drawdown; no parallel procurement, no second vendor contract. If your team is wrestling with GPU utilization, fragmented data‑to‑serving stacks, training jobs that exceed any single region's capacity, or hosted‑API costs that scale faster than your usage — this is the runtime built for that problem. Try it now Provision your first Anyscale Cloud by navigating to the Azure portal. Click on "Create" to begin creating the Anyscale cloud resource and link the necessary Azure resources. your Anyscale Cloud directly from Azure Portal. e. Explore the quickstart guides and documentation on Microsoft Learn to get started. For architectural deep‑dives, capacity planning, or a hands‑on workshop with the Anyscale on Azure solution architects, reach out through your Microsoft account team. Deepen your expertise and deep dive on best practices in the upcoming virtual webinar. Register here. The infrastructure for the next decade of enterprise AI is here. Build on it. Links and Resources Press Release: Anyscale Launches on Microsoft Azure as a Native Integration for Enterprises Announcing Anyscale on Azure public preview: Powered by Ray on AKS Youtube Video: Anyscale on Azure: Scale Python AI workloads with managed Ray on AKS Azure on Anyscale overview Architecture Create an Anyscale Cloud in Azure Portal Pricing Support model Terms and Conditions Frequently asked questions
bobmital
Jun 02, 2026 Place Apps on Azure Blog
285Views
0likes
0Comments
Azure Functions MCP Extension: What's New at Build 2026
The Azure Functions MCP extension has had a breakout year! Since its initial preview, the extension has grown from a single trigger type into a full-featured platform for building remote MCP servers: with tool, resource, and prompt triggers across multiple languages, MCP Apps for interactive UIs, built-in MCP authentication, and feature enhancements. Here's what's new and what it means for developers building MCP servers on Azure Functions. The full MCP primitive set: Tools, resources, and prompts When the MCP extension first shipped, it supported tool triggers. Declare a function as an MCP tool, and any MCP client can discover and call it. That was the starting point. Since then, we've shipped the remaining MCP primitives: Resource triggers: expose a function as an MCP resource. Prompt triggers: expose a function as an MCP prompt, letting clients request structured prompt templates from your server. Like tool triggers, resource and prompt triggers are supported in multiple languages including .NET, Java, Python, TypeScript, and JavaScript. MCP Apps: interactive UI from your MCP server MCP Apps let your tools return interactive user interfaces instead of plain text. Combine tool triggers with resource triggers, and your MCP server can serve rich, rendered experiences to MCP-aware clients. The Azure Functions MCP extension supports MCP Apps natively, meaning the same function app that exposes tools and resources can also serve UI components. The launch blog post on the Azure Apps Blog walked through the pattern in detail. For .NET developers, the new fluent builder API (available in the latest NuGet release) makes it easier to compose MCP Apps by chaining tool and resource definitions in a declarative style. MCP authentication The extension supports built-in MCP authentication, implementing the requirements of the MCP auth spec. All samples in the aka.ms/remote-mcp repo enable built-in MCP auth by default with Microsoft Entra ID as the identity provider. Samples have also been updated to demonstrate how to exchange tokens in the On-Behalf-Of (OBO) flow, so your MCP tools can access downstream APIs using the invoking user's identity. Auth configuration in the Azure portal: Preview at Build is a one-click experience in the Azure portal for configuring built-in MCP auth. No more manual app registration creating, configuration and wiring to the server. Just open your server app on the portal and click to enable MCP auth. Try it out! Feature enhancements Beyond the headline primitives and auth, the extension has shipped a steady stream of capabilities the past few months. The following are the notable additions. Structured content Structured content lets you return machine-readable JSON metadata alongside your tool's response via the `structuredContent` field. Clients that support it can programmatically consume the data (e.g. parse fields, render tables, drive downstream logic) rather than just displaying text. Clients that don't support it still get the regular content blocks as a fallback. Rich content types Tools aren't limited to returning plain text. The extension supports the full set of MCP content block types, e.g. `TextContent`, `ImageContent`, `AudioContent`, `ResourceLink`, and `EmbeddedResource`, so your tools can return images, audio clips, references to resources, and inline file content alongside text. Input and output schemas `WithInputSchema` and `WithOutputSchema` give you explicit control over the JSON schemas advertised for your tools. This is especially useful when the auto-generated schema from function parameters doesn't capture the full contract. For example, when your tool accepts a complex nested object or returns a specific shape that clients depend on. Input and output schemas are currently supported in .NET, with support for other languages coming soon. builder.ConfigureMcpTool("SearchDocs") .WithOutputSchema(""" { "type": "object", "properties": { "results": { "type": "array", "items": { "type": "string" } }, "query": { "type": "string" } }, "required": ["results", "query"] } """); Fluent configuration APIs in .NET A set of fluent builder APIs that let you configure MCP primitives declaratively in `Program.cs`: ConfigureMcpTool: add properties, metadata, input/output schemas, or promote a tool to an MCP App ConfigureMcpResource: attach metadata to resources ConfigureMcpPrompt: define prompt arguments and metadata builder.ConfigureMcpTool("sayhello") .WithProperty("name", McpToolPropertyType.String, "Name of the user", required: true) .WithMetadata("ui", new { resourceUri = "ui://index.html" }); What's next Usage of the MCP extension has grown steadily since its preview launch. Tool execution volume has increased 15x over the past several months as more customers move from experimentation to production. As adoption grows, so do the expectations. Developers building production MCP servers are hitting real friction around auth complexity, client configuration, and observability. We're continuing to invest in the extension to address these gaps and help customers be more successful building and hosting MCP servers on Azure Functions. Here's where we're focusing next. Continued auth simplification Auth remains the biggest barrier to getting an MCP server into production. We'll work on: Smoother client setup: making it easier to connect any MCP client to an authenticated Azure Functions MCP server, not just VS Code. Simplified OBO flow: streamlining the experience of On-Behalf-Of authentication so developers can delegate user identity to downstream services with less configuration. Our goal: the secure path should be the easy path. Deeper integration with Microsoft Foundry We'll build tighter integration between Azure Functions MCP servers and Microsoft Foundry. This includes surfacing MCP servers in Foundry Toolbox, a new feature introduced to help Foundry agents discover and consume tools from a single endpoint. Developers will be able to publish an MCP server from Functions and have it available to Foundry agents through Toolbox without manual endpoint configuration. Continued feature enhancement We prioritize based on feedback from the community raised in our GitHub repo. For example, support for streaming output and pagination are top items in our backlog today based on user demand. We also track the MCP spec's evolution closely and will continue shipping support for strategic features as they land. Examples of proposals we're following: MCP Tasks: the Tasks extension (SEP-2663) defines a standard pattern for async, long-running tool calls with durable task handles. This replaces hand-rolled polling patterns and aligns well with Functions' execute-and-return model. Stateless MCP: SEP-2575 proposes removing the mandatory initialization handshake, which is a natural fit for serverless platforms like Azure Functions where fresh instances can handle any request. Have something you'd like us to prioritize? Let us know by filing a request on GitHub. Get started Samples: Samples showcasing most up-to-date features: aka.ms/remote-mcp Documentation: Model Context Protocol for Azure Functions MCP Extension GitHub repo: Azure Functions MCP Extension
lily-ma
Jun 02, 2026 Place Apps on Azure Blog
506Views
1like
0Comments
Announcing Anyscale on Azure public preview: Powered by Ray on AKS
Today, I’m excited to announce the public preview of Anyscale on Azure, bringing Anyscale’s managed Ray platform and the Anyscale Runtime natively to Azure, all running on Azure Kubernetes Service (AKS). It is the fastest path I have seen from a single notebook to a multi-region distributed AI job, running on the AKS clusters your platform team already operates.
Brendan Burns
Jun 02, 2026 Place Apps on Azure Blog
608Views
1like
0Comments
Introducing Azure Container Apps Sandboxes: Secure Infrastructure for Agentic Workloads
Today we are announcing the public preview of Azure Container Apps Sandboxes - a new first-class resource type that gives you fast, secure, ephemeral compute environments with built-in suspend and resume. This is the underlying infrastructure on which products like Cloud sandboxes in GitHub Copilot, Foundry Hosted Agents, and Azure Container Apps Express are built, you now have the opportunity to build your solutions leveraging this infrastructure. Azure Container Apps Sandboxes unlocks two massive opportunities. For platform developers and ISVs, sandboxes give you the same isolated compute fabric that powers many Microsoft products. You get the building blocks to create your own multi-tenant platform on proven, enterprise-scale infrastructure. For AI agents, sandboxes become a self-configurable tool that lets agents extend their own capabilities on the fly. An agent can spin up a fresh sandbox in milliseconds and use it to execute untrusted code, compile source, test HTTP requests against a live app, launch a browser session, or tackle whatever needs a quick and scalable infrastructure. On one side it empowers humans to build platforms, on the other it empowers agents to build their own capabilities. Both get enterprise-grade isolation, instant startup, and snapshot-based persistence out of the box. We'll walk through the resource model, sandbox lifecycle, the features that set Sandboxes apart - like snapshots, lifecycle policies, network egress controls, volumes, and managed identities - and show you how to get started with the portal and CLI. What Are Container Apps Sandboxes? Container Apps Sandboxes are secure, isolated compute environments that start in sub-second time, scale to thousands, and cost nothing when idle. Each sandbox runs in its own hardware-isolated microVM boundary - fully separated from the host, the platform, and every other sandbox. You bring your own Open Container Initiative (OCI) image, and Sandboxes handle the rest: provisioning from prewarmed pools, strong multi-tenant isolation, and snapshot-based suspend/resume that preserves full memory and disk state across sessions. There are many ways Sandboxes can help you build your next project - here are a few: Your own build & test systems - wire a Sandbox into your CI/CD flow to run builds while your laptop stays cool. Agents that can run anything safely - an agent spawns a sandbox, executes work inside it, and returns the output with no agent host privileges required. Agent swarms - decompose a research question, spawn N sandbox workers in parallel (each pinned to its own image and egress policy), and synthesize the result. Early access customers are already unlocking significant benefits by leveraging Azure Container Apps Sandboxes. "With Azure Container Apps sandboxes, SitecoreAI can safely enable agents to take real action. The combination of multi-tenant isolation, rapid scale-out, and full automation allows Sitecore to run long-lived, autonomous agents that securely execute code, manage workflows, and interact with enterprise systems within secure, governed environments. With this foundation, we can build agents that do real work: assembling content, personalizing experiences, and optimizing campaigns in production. Agents that operate continuously, learn from results, and improve over time, so our customers get better outcomes without giving up control." - Mo Cherif, VP of AI and Innovation, Sitecore "We got early access to Azure Container Apps Sandboxes, and got the first prototype integrated with Atlas AI in hours, and it's already shaping a new Atlas AI capability that we plan to launch in preview in Q3. It gives every Atlas AI agent a safe, sandboxed workspace (file system, terminal, code execution) on a customer's live data in Cognite Data Fusion. The value: Industrial process, reliability, and production engineers spend days and weeks on questions like "which wells are underperforming and why?" These questions are tractable but expensive, so they are asked rarely and decisions are made on gut feel. With this, an agent pulls the data, runs the analysis, cross-references maintenance and inspection records, and returns a cited draft in minutes. Sandboxes make it practical: Aligned feature set, per-customer isolation, pause/resume across multi-day investigations, scale-to-zero economics." - Kelvin Sundli, Product manager, Atlas AI, Cognite Resource Model: Sandbox Groups and Sandboxes The top-level ARM resource is Microsoft.App/SandboxGroups. A Sandbox Group is the management boundary for a collection of sandboxes that share configuration - think of it like a Container Apps Environment, but purpose-built for sandboxes. When you create a Sandbox Group, you specify: Subscription, Resource Group, and Region Sandbox defaults (optional): default CPU, memory, disk, max sandbox count, and default idle timeout Networking: optionally deploy into a custom VNet with a dedicated subnet for private networking Identity: System or user assigned Entra identity. Individual sandboxes are created within a Sandbox Group. Each sandbox has its own source (disk image or snapshot), resource tier, lifecycle policy, network egress policy, environment variables, ports, volumes, and connections. Sandbox Lifecycle Sandboxes have a well-defined lifecycle with the following states: State Description Creating Provisioning the sandbox from a disk image or snapshot Running Actively executing - backed by a live microVM Idle System-suspended after inactivity; can auto-resume on the next request Suspended Full state (memory + disk) preserved as a snapshot; no compute costs Resuming Restoring from a suspended or idle state - sub-second for most workloads Stopped User-initiated stop; can be resumed Stopping Graceful shutdown in progress Deleting Teardown in progress The key insight here is the distinction between Idle and Suspended. When a sandbox goes idle (e.g., no traffic for a configured timeout), the system can automatically suspend it and capture a snapshot. When a new request arrives, the sandbox resumes transparently. This gives you scale-to-zero economics with stateful compute - something that wasn't possible before without significant custom engineering. Disk Images: Bring Your Own Container Sandboxes boot from Disk Images - Open Container Initiative (OCI) images converted into an optimized root filesystem format. You point to any OCI image (public or private registry), and the platform builds a bootable disk image from it. You can start with public, pre-built images maintained by the platform (for example, Ubuntu base images), or bring your own private images. For private registries, you can authenticate with username/token or use a user-assigned managed identity for Azure Container Registry (ACR) – integrated with Azure as you expect. Snapshots: Full-State Persistence Snapshots capture the complete state of a running sandbox - memory, disk, and all running processes. When you resume a sandbox from a snapshot, every process, open file handle, and in-memory data structure is restored exactly as it was. A snapshot captures the full state of a running sandbox: memory pages, disk, processes. Two ways to make one - automatically on suspend, or manually on demand. Three things they're great for: Checkpointing mid-task so a long-running agent can resume exactly where it left off Cloning an environment that's already warm - dependencies installed, caches populated, services running Shipping a "ready-to-go" state that resumes in sub-second instead of cold-booting Snapshots are free during the preview, after which they will be stored as Azure Blob Storage at standard rates. Each snapshot records the source sandbox, resource allocation (CPU, memory, disk), and container metadata - so what you get back is exactly what you snapshotted. Resource Tiers Every sandbox is assigned to a resource tier that determines its CPU, memory, and disk allocation: Tier CPU Memory Disk XS 0.25 vCPU 0.5 GB 5 GB S 0.5 vCPU 1 GB 10 GB M (default) 1vCPU 2 GB 20 GB L 2 vCPU 4 GB 40 GB XL 4 vCPU 8 GB 80 GB When creating a sandbox from a snapshot, the resource tier is inherited from the snapshot and cannot be changed - this ensures the restored environment has the exact resources it was running with when the snapshot was taken. Lifecycle Policies: Auto-Suspend and Auto-Delete Every sandbox can be configured with lifecycle policies that automate state transitions and cleanup: Auto-Suspend Idle timeout: How long a sandbox can sit idle before being suspended (configurable: 1m, 2m, 5m, 10m, 30m, 60m) Suspend mode: Disk + Memory (default): Full snapshot including memory state - resume picks up exactly where you left off, with all processes and in-memory data intact. Disk: Only the disk is preserved; the VM restarts fresh on resume. Useful when you only need file persistence, not process continuity. Auto-Delete Automatically delete sandboxes after a configurable number of days of inactivity Prevents accumulation of abandoned sandboxes that consume snapshot storage These lifecycle policies are what make Sandboxes economically viable at scale. A platform serving thousands of tenants can configure aggressive idle timeouts (say, 60 seconds) with Memory suspend mode, and each tenant's sandbox disappears from the billing meter almost immediately - but resumes in sub-second time the moment they return. Network Egress Policy For scenarios involving untrusted code - AI agents executing LLM-generated scripts, multi-tenant SaaS with user-submitted workloads - controlling outbound network access is critical. Sandboxes provide a per-sandbox Network Egress Policy: Default action: Allow or Deny all outbound traffic Host rules: Domain-pattern rules (e.g., *.github.com → Allow) to permit specific destinations Custom CIDR rules: Network-level rules for IP ranges (e.g., 10.0.0.0/8 → Deny) Skip egress proxy: Option to bypass the egress proxy entirely when custom VNet routing handles policy enforcement This means you can run a sandbox in a deny-by-default posture and allowlist only the specific endpoints it needs (your API server, a package registry, etc.) - without setting up NSGs or firewall appliances. Managed Volumes: Persistent and Shared Storage Sandboxes support two types of mountable volumes, both managed by Microsoft: Volume Type Backed By Best For Managed Azure Blob Azure Blob Storage Shared data across sandboxes, file uploads/downloads, persistent artifacts Managed Data Disk Azure Disk Storage High-performance storage for databases, build caches, large working sets - only available to one sandbox at a time Blob volumes come with a built-in file explorer in the portal - you can browse, upload, download, create folders, and drag-and-drop files directly. Data Disk volumes provide dedicated block storage with configurable sizes. Secrets and Identity Secrets Sandbox Groups support key-value secrets scoped to the group. Secrets can be created, edited, and referenced by sandboxes within the group. These secrets can be used in egress policies to modify requests with transform or header-injection rules, without exposing the secrets to code running inside the sandbox. Managed Identity Sandbox Groups support both system-assigned and user-assigned managed identities, with full RBAC role assignment management. This means your sandboxes can authenticate to Azure services (Key Vault, Storage, Cosmos DB, etc.) without managing credentials - the same identity model you use everywhere else in Azure. MCP Connectors and Triggers ACA Sandboxes now supports managed connectors through the Model Context Protocol (MCP), giving sandboxes access to external APIs - including Microsoft 365, Salesforce, ServiceNow, GitHub, and 1,400+ other systems - without managing credentials directly. Attach a Connector Gateway to your sandbox group, and every sandbox in the group can call external APIs through a standardized MCP interface at runtime. Pair connectors with triggers to build event-driven automation: route an Outlook email to a sandbox that triages it with an AI agent, or react to a SharePoint file upload by extracting and processing the document all without writing glue code. Triggers can fire a shell command inside a sandbox or invoke an HTTP endpoint the sandbox exposes, so your automation shapes fit naturally around your workload. The integration is built on the new Connector Namespace service (az connector-namespace), the same runtime behind Logic Apps and Power Platform connectors, now available as a programmable layer for sandboxes. See the end-to-end samples for runnable azd up-deployable examples covering email triage and document automation scenarios. The Portal Experience Azure Container Apps Sandboxes are only available in the new Azure Container Apps portal that provides a rich, IDE-like experience for working with sandboxes. Creating a Sandbox The portal offers multiple creation paths: Standard Sandbox - full configuration control over source, resources, lifecycle, networking, and volumes GitHub Copilot Sandbox - preset, Copilot CLI ready to go, GitHub credentials can be wired through the Access Token before the sandbox is created Claude Sandbox - Claude CLI pre-installed, ready for agentic coding inside the sandbox Using Coding Agents (Copilot CLI / Claude Code) If you live inside Copilot CLI or Claude Code, you don't need to learn a new CLI. Install the azure-sandbox skill once and your agent picks up the right skills: # GitHub Copilot CLI # Add as a plugin marketplace /plugin marketplace add microsoft/azure-container-apps # Install all skills /plugin install sandboxes@Azure-Container-Apps # Claude Code claude plugin add microsoft/azure-container-apps The skill runs prerequisite checks silently (az --version, az account show, node --version, aca --version), prompts only if something's missing, and maps natural-language asks to the right aca commands. Bundled runbooks cover Copilot CLI BYOK (bring your own Azure OpenAI key), the deploy-a-web-app walkthrough, and shell setup. Sandbox Detail Page Once your sandbox is running, the detail page gives you immediate access to the sandbox terminal and additional details, such as - Network Audit - real-time egress traffic log showing allowed and denied requests Monitor - live CPU, memory, disk, and network utilization charts Connectors - attached connections with an "Add" action Volumes - mounted volumes with an "Add" action Log Stream - streaming container logs Processes - running process list inside the sandbox Files - file explorer to browse the sandbox filesystem The toolbar actions let you manage the state of the sandbox - Resume or Stop. In the Ellipsis menu (⁝) you can find additional settings to manage network Egress Policy and ingress (Add port), take a Snapshot of the sandbox, Commit (save disk state as a new disk image), set Lifecycle Policy or permanently Delete the sandbox. Finally, you can see additional Details in a side panel. Getting Started with the CLI and Python SDK All sandbox and sandbox-group operations go through the  aca  CLI. There are no az containerapp sandbox commands, - az is only used for az login, az account show, and resource-group management. Install (CLI) # Mac, Linux curl -fsSL https://aka.ms/aca-cli-install | sh # Windows irm https://aka.ms/aca-cli-install-ps | iex Run aca --help to get started. Install (Python SDK) pip install azure-containerapps-sandbox For more details, quick start and examples on ACA CLI and Python SDK, please go to https://sandboxes.azure.com Evolution from Dynamic Sessions If you've used Azure Container Apps Dynamic Sessions, Sandboxes are the next evolution of that capability. Everything Sessions can do, Sandboxes can do - and significantly more: Capability Dynamic Sessions Sandboxes Sub-second startup ✓ ✓ Strong isolation ✓ ✓ Custom container images ✓ ✓ Custom VNet integration ✓ (Partial) ✓ Suspend/resume with Memory and Disk snapshots - ✓ Lifecycle policies (auto-suspend, auto-delete) - ✓ Network egress policy (per-sandbox) - ✓ Persistent managed volumes (Blob, Data Disk) - ✓ Managed identity (system + user-assigned) - ✓ Secrets management - ✓ Configurable resource tiers - ✓ Direct access to sandbox in Portal experience - ✓ We will continue to support Dynamic Sessions, but all new investment goes into Sandboxes. If you're building new workloads on isolated ephemeral compute, start with Sandboxes. How It All Fits Together ACA Sandboxes is a platform primitive. It's the foundation on which multiple Microsoft products are already built - including ACA Express, Cloud sandboxes in GitHub Copilot, and Foundry Hosted Agents. When you build on Sandboxes, you're building on the same infrastructure that powers Microsoft's own portfolio. This is the evolution of what we shared with Project Legion in 2024. Legion described the internal infrastructure; Sandboxes exposes it as a customer-facing primitive that you can use directly. What's Next • Deeper Azure integrations - first-class connectivity with Azure networking, identity, storage, and AI services • Enhanced SDK and CLI - richer programmatic experiences for managing sandboxes at scale • More Microsoft services built on Sandboxes - this is just the beginning Get Started Today • Portal: https://sandboxes.azure.com/ • Documentation: Azure Container Apps Sandboxes • Pricing: Azure Container Apps Pricing (per-second vCPU/memory billing, scale-to-zero, snapshots at Blob Storage rates) We'd love to hear your feedback. You can ask questions, or file issues on the Azure Container Apps GitHub (prefix with [Sandbox] for Sandboxes-specific issues).
vyomnagrani
Jun 02, 2026 Place Apps on Azure Blog
4.6KViews
3likes
1Comment
Azure Functions MCP extension now supports MCP Prompts
We are thrilled to announce that the MCP prompt trigger is now available in public preview in the Azure Functions MCP extension! With this release, the extension now supports all three core MCP server primitives - tools, resources, and prompts, giving you a complete platform for building rich MCP servers on Azure Functions. In case you missed it, the MCP resource trigger is generally available for serving resources and building interactive UIs in MCP Apps. What are MCP Prompts In the Model Context Protocol (MCP), prompts are reusable templates that allow server authors to provide parameterized prompts for a domain, or showcase how to best use the MCP server. Prompts are user-controlled in that they require explicit invocation rather than automatic triggering, and can be context-aware, referencing available resources and tools to create comprehensive workflows. Unlike tools (which are model-controlled) and resources (which are application-controlled), prompts are exposed from servers to clients so users can explicitly select them. Applications typically expose prompts through slash commands, command palettes, dedicated UI buttons, or context menus. How It Works In Python, defining a prompt is as simple as decorating a function. Here's a prompt that returns a code review checklist: app.mcp_prompt_trigger( arg_name="context", prompt_name="code_review_checklist", description="Returns a structured code review checklist prompt for evaluating code changes." ) def code_review_checklist(context: func.PromptInvocationContext) -> str: logging.info("Code review checklist prompt invoked.") return """You are a senior software engineer performing a code review. Use the following checklist to evaluate the code: 1. **Correctness** — Does the code do what it's supposed to? 2. **Error Handling** — Are edge cases and failures handled? 3. **Security** — Are there any vulnerabilities (injection, auth, secrets)? 4. **Performance** — Are there obvious inefficiencies? 5. **Readability** — Is the code clear and well-named? 6. **Tests** — Are there adequate tests for the changes? Provide your feedback in a structured format with a severity level (critical, warning, suggestion) for each finding.""" Prompts can accept arguments, allowing clients to customize the generated message. Here's a prompt that generates documentation with configurable parameters: app.mcp_prompt_trigger( arg_name="context", prompt_name="generate_documentation", prompt_arguments=[ func.PromptArgument("function_name", "The name of the function to document.", required=False), func.PromptArgument("style", "Documentation style: 'concise', 'detailed', or 'tutorial'.", required=False) ], description="Generates API documentation for a function. Arguments are configured in Program.cs." ) def generate_documentation(context: func.PromptInvocationContext) -> str: function_name = context.arguments.get("function_name", "(unknown)") style = context.arguments.get("style", "concise") logging.info(f"Generate docs prompt invoked for function: {function_name}") return f"""Generate API documentation for the function named **{function_name}**. Documentation style: **{style}** Include the following sections: - **Description** — What the function does. - **Parameters** — List each parameter with its type and purpose. - **Return Value** — What the function returns. - **Example Usage** — A short code example showing how to call it.""" Checkout the Get Started section for the complete sample and samples in different languages. Why Azure Functions Azure Functions is the ideal platform for hosting remote MCP servers because of its built-in MCP authentication, event-driven scaling from 0 to N, and serverless billing. This ensures your agentic tools are secure, cost-effective, and ready to handle any load. With the MCP extension, you focus on implementing the primitives you want to expose, tools, resources, and prompts, instead of worrying about MCP protocol details and server logistics. Get Started You can start building today using our quickstarts and samples: Python TypeScript .NET Java Documentation Azure Functions MCP extension overview Prompt trigger We'd Love to Hear from You! Let us know your thoughts about the new prompt trigger. What kinds of prompts are you building for your MCP servers? What would you like us to prioritize next? Share your feedback in our GitHub repo.
lily-ma
May 29, 2026 Place Apps on Azure Blog
483Views
0likes
0Comments
Debugging Python apps on App Service with the new SSH helper aliases
You shipped a Python app to App Service. It worked in the demo. It works locally. In production, /chat is returning 502s — but /health is green, the deployment succeeded, the logs are quiet, and your laptop can't reproduce it. What you actually need is a shell on the running container so you can poke at DNS, env vars, installed packages, the listening port, and the AI endpoint your app is calling. The platform has had SSH for a while, but the playbook of "open SSH, then remember which 14 commands to run" was tribal knowledge. We just shipped a set of SSH helper aliases that turn that tribal knowledge into one-word commands. apphelp shows you everything; appconfig , showpkgs , and appcurl cover the app side; ai-test , ai-diagnose , ai-curl , ai-latency , ai-dns , and ai-access-check cover the Azure AI Foundry side. This post is a hands-on tour. We built a deliberately fragile FastAPI sample with six different fault modes, deployed it, broke it, and SSH'd in to watch the aliases drive each one to root cause. Every transcript below is real output from the deployed sample. 📦 Sample repo: seligj95/app-service-ssh-diagnostics-python — azd up and you have a fault-injectable Python + Foundry app live in your subscription in about 4 minutes. The sample, in one breath FastAPI app, Python 3.14, App Service Linux on P0v3 — uses the new Oryx FastAPI auto-detection so no custom startup command is needed Calls Azure OpenAI (gpt-4o-mini) via managed identity — no keys POST /admin/fault toggles one of seven modes: off , bad-creds , wrong-endpoint , dns-fail , port-mismatch , dep-import-error , latency-spike GET / is a landing page with a built-in cheat sheet of the SSH aliases The endpoints are intentionally boring. The point is to give the aliases something realistic to chew on. A quick note on Azure OpenAI vs. AI Foundry. This sample provisions an Azure OpenAI account ( kind: OpenAI ). The new ai-* aliases speak the OpenAI chat-completions API ( /openai/deployments/<model>/chat/completions ), which is identical on Azure OpenAI and on Azure AI Foundry projects — both expose *.openai.azure.com endpoints, both accept managed-identity bearer tokens, both speak the same schema. The aliases work against either; the env-var name AZURE_AI_FOUNDRY_ENDPOINT is just the alias contract. Drop a Foundry endpoint into it and the same walkthrough applies. Shout-out to the new FastAPI auto-detect on Python 3.14. This sample also benefits from another recent App Service change: on Python 3.14+, App Service automatically detects FastAPI apps and starts them with gunicorn -k uvicorn_worker.UvicornWorker — no custom startup command needed. Our Bicep ships an empty appCommandLine and lets Oryx do the right thing. The whole sample is a nice tour of recent App Service Python improvements landing together. Step zero: apphelp After azd up finishes, the first thing to do over SSH is: az webapp ssh -g rg-ssh-diag-demo -n app-web-<token> Then inside the container: $ apphelp apphelp prints every alias the image ships with, grouped by category. You don't need to memorize anything — when you forget what checkport does, you run apphelp and it's right there. We'll lean on most of these: App info: showpkgs , appconfig , appenv Logs: applogs , deploylogs , logfiles Reachability: appcurl , checkport , gohome , gosrc AI/Foundry: ai-test , ai-dns , ai-access-check , ai-curl , ai-latency , ai-diagnose Network tools: install-nettools The healthy baseline Before breaking anything, run ai-diagnose . This is the one-shot "is my AI path healthy?" check, and it's the alias we reach for most: $ ai-diagnose ──────────────────────────────────────────────────────────────── AI Foundry Diagnostics ──────────────────────────────────────────────────────────────── [✓] Managed identity token [✓] DNS resolution (d8f9grasb7ewc7h8.ai-gateway.eastus2-01.azure-api.net. - public) [✓] Foundry connectivity (761ms) ──────────────────────────────────────────────────────────────── Three green checks tell you three different things: the managed identity is issuing tokens, the Foundry hostname resolves, and the endpoint responded in a reasonable time. If any of these are red, you already know which layer the fault is in. For more detail, the individual aliases are worth knowing: $ ai-test ✓ Connected | 1009ms | Model: gpt-4o-mini | Auth: Managed Identity $ ai-access-check ✓ Foundry endpoint: https://cog-ftirxupt2yjoe.openai.azure.com/ ✓ Model: gpt-4o-mini ✓ Using auth mode: Managed Identity ✓ Access check passed: authorized to call Foundry $ ai-latency Running 5 requests to gpt-4o-mini... Request 1: 679ms ✓ Request 2: 826ms ✓ Request 3: 758ms ✓ Request 4: 641ms ✓ Request 5: 664ms ✓ Results (5/5 successful): Avg: 713ms | Min: 641ms | Max: 826ms And the app side: $ checkport ✓ App is listening on port 8000 $ appcurl /health HTTP Status: 200 Time: 0.002417s Size: 5423 bytes That's our "everything is fine" reference. Now let's break things. One trick: applying a fault inside the SSH shell A subtle thing trips people up the first time. POST /admin/fault mutates the app process's environment — but your SSH shell is a separate process. It inherited the container's env when you opened the session, so ai-test will still see the healthy values. The sample handles this by also writing a small file to the persistent share: # app/faults.py def _write_env_file() -> None: """Write fault env to /home/site/diagnostics/fault.env so SSH can `source` it.""" diag = Path("/home/site/diagnostics") diag.mkdir(parents=True, exist_ok=True) snap = _snapshot_unlocked() lines = [f"# Active fault: {snap['mode']}", ""] for k, v in snap["env"].items(): lines.append(f"export {k}={shlex.quote(v) if v else "''"}") (diag / "fault.env").write_text("\n".join(lines) + "\n") After toggling a fault, run this once in your SSH session: source /home/site/diagnostics/fault.env Now the aliases see the same env the broken app sees. This pattern — flip a flag from outside, source the change inside — is worth stealing for your own debugging workflows. Group A: faults the AI aliases catch directly Some faults are in the path between App Service and Foundry — wrong endpoint, broken DNS, network. The ai-* aliases reproduce the failure end-to-end, and they tell you exactly which layer. Fault 1: wrong-endpoint — a typo in the AOAI endpoint The most common AI-side incident: someone fat-fingers an app setting. The endpoint resolves to something (it's still *.openai.azure.com ) but it's not your resource. curl -X POST $URL/admin/fault -H 'content-type: application/json' \ -d '{"mode":"wrong-endpoint"}' curl $URL/chat -H 'content-type: application/json' \ -d '{"prompt":"hi"}' # HTTP 502 # {"detail":"APIConnectionError: Connection error."} SSH in, source the fault env, run the AI aliases: $ source /home/site/diagnostics/fault.env $ ai-dns Resolving: this-resource-does-not-exist.openai.azure.com ✗ DNS resolution failed for this-resource-does-not-exist.openai.azure.com $ ai-curl Request: POST https://this-resource-does-not-exist.openai.azure.com//openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-01 Authorization: Bearer [hidden] Content-Type: application/json curl: (6) Could not resolve host: this-resource-does-not-exist.openai.azure.com $ ai-diagnose [✓] Managed identity token [✗] DNS resolution failed for this-resource-does-not-exist.openai.azure.com [✗] Foundry connectivity (HTTP 000) ai-diagnose collapses the whole story into three lines: token works, DNS fails, connectivity fails. The fault is unambiguously a bad endpoint — check appconfig and your Bicep parameters. Fault 2: dns-fail — NXDOMAIN A subtler variant of the same failure mode is when the endpoint is structurally wrong (private endpoint misconfigured, hosts file mishap, custom domain expired). ai-dns calls it out the same way: $ ai-dns Resolving: no-such-host.invalid.example ✗ DNS resolution failed for no-such-host.invalid.example If you need deeper diagnostics — say, you suspect a flaky resolver rather than the hostname itself — install-nettools gives you dig , nslookup , and friends without rebuilding the container. $ install-nettools $ dig openai.azure.com $ nslookup cog-ftirxupt2yjoe.openai.azure.com Group B: faults that pass ai-test but break your app Here's the most useful thing we learned building this sample: ai-test can be green while your app is on fire, and that's a signal, not a bug. The ai-* aliases call Foundry directly. If they're green and your app is red, the platform-to-Foundry path is fine — the divergence is in your app. Time to pivot to appenv , applogs , showpkgs . Fault 3: bad-creds — wrong AZURE_CLIENT_ID This one is the classic user-assigned managed identity mishap: you scoped your code to a user-assigned managed identity, but the GUID in AZURE_CLIENT_ID doesn't actually exist (or wasn't granted RBAC). curl -X POST $URL/admin/fault -d '{"mode":"bad-creds"}' curl $URL/chat -d '{"prompt":"hi"}' # HTTP 502 # {"detail":"ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token..."} Now SSH in and try the AI aliases: $ source /home/site/diagnostics/fault.env $ ai-test ✓ Connected | 734ms | Model: gpt-4o-mini | Auth: Managed Identity $ ai-access-check ✓ Foundry endpoint: https://cog-ftirxupt2yjoe.openai.azure.com/ ✓ Using auth mode: Managed Identity ✓ Access check passed: authorized to call Foundry Both green. That looks like a contradiction, but it's not. The aliases authenticate using the system-assigned managed identity directly (via IMDS), and they pass. Your Python app uses DefaultAzureCredential , which honors AZURE_CLIENT_ID to pick a user-assigned identity — and that one is broken. The takeaway: when ai-test is green but /chat is red, the platform's identity is fine. Pivot to appenv to see exactly what env your app process sees, and check AZURE_CLIENT_ID : $ appenv | grep AZURE_CLIENT_ID AZURE_CLIENT_ID=00000000-0000-0000-0000-000000000000 There's the bug. The aliases didn't fail — they told you the fault isn't in the platform. That's diagnosis by elimination, and it's faster than guessing. Fault 4: dep-import-error — your code throws Same pattern. The app raises an ImportError on /chat , the AI aliases are green: curl -X POST $URL/admin/fault -d '{"mode":"dep-import-error"}' curl $URL/chat -d '{"prompt":"hi"}' # HTTP 500 # {"detail":"ImportError: No module named 'tiktoken'..."} This is where the app-side aliases earn their keep: $ showpkgs | head -20 ────────────────────────────────────────────────────── Virtual environment packages (antenv) ────────────────────────────────────────────────────── Package Version -------------------------------------- --------- annotated-types 0.7.0 anyio 4.13.0 azure-core 1.41.0 azure-identity 1.19.0 azure-monitor-opentelemetry 1.8.8 ... No tiktoken in that list. Confirmation in one command — no need to remember pip list or where the virtualenv lives. deploylogs then tells you what the last deployment actually built: $ deploylogs 10 Latest deployment: b8a64ed4-b6b7-4419-91eb-6d8e4e7ef323 Log file: /home/site/deployments/b8a64ed4-b6b7-4419-91eb-6d8e4e7ef323/log.log 2026-05-18T19:10:52.3844297Z,Parsing the build logs,abc3cf97-... 2026-05-18T19:10:52.5414396Z,Found 0 issue(s),7d11d013-... 2026-05-18T19:10:52.7913394Z,Build Summary :,... 2026-05-18T19:10:53.5643089Z,Deployment successful. deployer = Push-Deployer ... Build was clean. The package just isn't in requirements.txt . Two aliases, one minute, root cause. Fault 5: port-mismatch — uvicorn binds the wrong port A real-world bug: someone sets WEBSITES_PORT=9999 in app settings to expose a different port, but the app still binds to 8000. curl -X POST $URL/admin/fault -d '{"mode":"port-mismatch"}' The aliases tell you exactly which port everything sees: $ checkport Checking if app is listening on port 8000... ✓ App is listening on port 8000 $ appcurl /health Testing app at localhost:8000 ... HTTP Status: 200 Time: 0.002417s $ appconfig PORT Value: 8000 Note: The port your Python app should listen on. Default is 8000. The app is healthy from inside the container. The mismatch is between what the platform tries to forward to and what uvicorn is bound to. This is the kind of fault where curling the public URL fails but appcurl /health succeeds — and the contrast is itself the diagnosis. Fault 6: latency-spike — the alias bench is fast, your app is slow The app injects 4 seconds of asyncio.sleep before each Foundry call. /chat is now ~4.5 seconds. ai-latency : $ ai-latency Running 5 requests to gpt-4o-mini... Request 1: 715ms ✓ Request 2: 588ms ✓ Request 3: 578ms ✓ Request 4: 669ms ✓ Request 5: 643ms ✓ Results (5/5 successful): Avg: 638ms | Min: 578ms | Max: 715ms Foundry, from this instance, averages 638ms. If your app is taking 5 seconds end-to-end and ai-latency says the model is sub-second, the slowness is in your code — not in Foundry, not in the network. Time to look at App Insights end-to-end transactions, or at any pre-call work (retrieval, vector lookup, your own sleep). What this changes about the debugging workflow Before these aliases, the SSH playbook for a Python AI app went something like: open SSH, dig around /home/site/wwwroot/antenv , grep applicationHost.config for ports, write a curl by hand against the AOAI endpoint with a manually-fetched managed identity token, hope you got the API version right. Now it's ai-diagnose . If that's red, you know exactly which layer. If it's green, you know the fault is in your code or your settings, and appenv , appconfig , showpkgs , applogs walk you the rest of the way. Three patterns we'd lean on going forward: Start with apphelp and ai-diagnose every time. Don't try to remember the right command — let the aliases tell you. Treat ai-test being green as a signal, not a finish line. If /chat is red and ai-test is green, the platform path is fine; pivot to app-side aliases. Use source /home/site/diagnostics/fault.env as a pattern. Any time you want your SSH shell to see what the app process sees, write env to a file and source it. It's a small thing that removes a huge class of "but it worked when I tested it" confusions. We want feedback The aliases are GA today on Python images and we have ideas for where they go next — Node, .NET, more ai-* checks (Foundry agents, vector indexes), tighter integration with azd diagnose . If you have a Python app on App Service and you want a specific alias added, tell us by dropping a comment on this post. Try the sample git clone https://github.com/seligj95/app-service-ssh-diagnostics-python cd app-service-ssh-diagnostics-python azd auth login azd up Four minutes later you'll have the whole thing live. Then curl -X POST $URL/admin/fault -d '{"mode":"<pick one>"}' , SSH in, and walk through any of the six faults above. The README has the full alias-to-fault map.
jordanselig
May 19, 2026 Place Apps on Azure Blog
151Views
0likes
0Comments
Turn Your App Service Web App Into a Self-Healing Agent: LLMOps Best Practices for Production
A user submits a prompt. The agent burns through 50,000 tokens looping on a malformed tool response. Another user trips a model rate limit and the agent silently fails. A bad prompt update goes out at 4 PM Friday and degrades success rate to 60%. Your APM dashboard shows green the entire time because none of that is a 500. This post walks through the LLMOps stack we built into a working reference sample on Azure App Service: the SLIs that matter for agents, a budget circuit breaker, prompt-repair retries, and a fully automated slot-swap rollback when things go sideways. Every code snippet is from the deployable sample at the end of the post. 📦 Sample repo: seligj95/app-service-self-healing-agent-python — azd up and you've got the whole stack live in your subscription in under 10 minutes. Why agent ops ≠ web-app SRE Your web app's reliability model assumes a request maps to bounded work — a SQL query, a cache hit, a templated response. You alert on Http5xx, p95 latency, and dependency failures. Done. An agent breaks that model in four ways: Cost is unbounded per request. An agent that loops on a flaky tool can spend $5 on one user prompt. The HTTP response is still 200. Failure can be silent. A model can hallucinate confident JSON, a tool can return malformed args, and the agent dutifully returns a wrong answer to the user. Zero exceptions logged. Latency is non-deterministic. A "simple" prompt that normally finishes in 2 seconds can blow out to 30s when the model picks an expensive plan. p95 latency tells you nothing. Quality regresses on prompt changes, not code changes. A prompt tweak that ships in seconds can crater tool-call accuracy by 30%. Your CI/CD pipeline didn't catch it because there were no failing tests. Web-app SLOs (uptime, latency, error rate) are necessary but not sufficient. Agents need agent-shaped SLOs. Define your agent SLOs first Before instrumenting anything, write down what "healthy" means. Here are the four SLIs we chose for the sample. None of them are Http5xx. SLI What it measures Why it matters Task success rate % of /chat requests that the agent self-classifies as completed Catches silent failures the HTTP layer misses Cost per task $ spent (input + output tokens × model rate) per /chat The unbounded-loop problem in one number Tool success rate % of tool invocations that didn't raise Tool layer is where most agent failures live Repair retries Times we re-prompted the model after a schema-validation failure Leading indicator of prompt drift In our reference middleware these come out as agent.task.success , agent.cost.usd , agent.tool.success , and agent.repair.retry — eleven custom metrics in total. We emit them via OpenTelemetry so they land in App Insights customMetrics and the included KQL workbook visualizes them as SLO tiles. Observability stack on App Service App Service makes the observability story unusually easy because you get App Insights wired up automatically by azd — no agent install, no DaemonSet, no sidecar. The only thing you bring is the SDK init for your custom metrics: # llmops_middleware/sli.py from azure.monitor.opentelemetry import configure_azure_monitor from opentelemetry import metrics def configure_azure_monitor_if_available() -> bool: if not os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"): return False configure_azure_monitor() return True meter = metrics.get_meter("agent") tokens_in = meter.create_counter("agent.tokens.in") cost_usd = meter.create_counter("agent.cost.usd") task_latency = meter.create_histogram("agent.task.latency") tool_success = meter.create_counter("agent.tool.success") # ... We compute cost from a per-model rate card so the metric is in real dollars, not abstract tokens: COST_PER_1K_TOKENS = { "gpt-4o": {"in": 0.0025, "out": 0.01}, "gpt-4o-mini": {"in": 0.00015, "out": 0.0006}, } def record_cost(model: str, tokens_in_count: int, tokens_out_count: int, tenant: str) -> float: rate = COST_PER_1K_TOKENS[model] cost = (tokens_in_count * rate["in"] + tokens_out_count * rate["out"]) / 1000 cost_usd.add(cost, {"model": model, "tenant": tenant}) return cost Once those flow, the KQL queries write themselves: // Top cost-burning tenants in the last hour customMetrics | where timestamp > ago(1h) | where name == "agent.cost.usd" | extend tenant = tostring(customDimensions["tenant"]) | summarize spend_usd = sum(valueSum) by tenant | top 10 by spend_usd desc The sample ships a 6-tile workbook ( observability/workbook.json ) deployed via Bicep. It renders SLO compliance, cost burn-down, tool failure breakdown, latency percentiles, budget breaches, and healing signals out of the box. The deployed workbook in App Insights. The SLO panel dips during a chaos run and recovers as the agent self-heals — exactly the signal you want on a glass-pane dashboard. Cost guardrails with a budget circuit breaker Custom metrics tell you about cost after you spent it. To prevent runaways, you need a circuit breaker that bites before the model call happens. The middleware in llmops_middleware/budget.py keeps a per-tenant counter in memory (per month) and returns a decision: class BudgetDecision(Enum): ALLOW = "allow" # under budget DOWNSHIFT = "downshift" # ≥80% — switch to cheaper model BLOCK = "block" # ≥100% — refuse the request def evaluate(tenant: str) -> BudgetDecision: spent = _spend.get((tenant, _current_period()), 0.0) if spent >= BUDGET_USD_PER_TENANT: return BudgetDecision.BLOCK if spent >= BUDGET_USD_PER_TENANT * 0.80: return BudgetDecision.DOWNSHIFT return BudgetDecision.ALLOW The agent loop reads that decision and downshifts from gpt-4o to gpt-4o-mini — a 16× cost reduction ($0.0025 / 1K input tokens vs $0.00015) — when a tenant crosses 80% of their monthly budget. The user keeps getting answers; the bill stops climbing. def _pick_model(tenant: str) -> str: decision = budget.evaluate(tenant) if decision == BudgetDecision.DOWNSHIFT: sli.model_downshift.add(1, {"tenant": tenant}) return DOWNSHIFT_MODEL return PRIMARY_MODEL For the demo we keep state in memory; production should swap the dict for Redis (atomic INCRBY ) or Cosmos with optimistic concurrency. The interface in budget.py is intentionally tiny so this is a 10-line change. Self-healing patterns There are three patterns in the sample, each addressing a different failure class. 1. Retry with prompt-repair The most common agent failure isn't a tool exception — it's the model returning malformed JSON that fails schema validation on tool args. The fix is to feed the validation error back into the model and ask it to repair the call: # llmops_middleware/repair.py async def retry_with_repair(call_fn, args, *, max_attempts=2): for attempt in range(max_attempts): try: return await call_fn(args) except (ValidationError, RepairableError) as exc: sli.repair_retry.add(1, {"attempt": str(attempt)}) args = await _ask_model_to_repair(args, str(exc)) raise This single pattern recovers 50–70% of "the agent returned garbage" cases without escalating. 2. Tool fallback chains When a primary tool times out or fails open, try a cheaper or simpler one: async def tool_fallback_chain(primary, *fallbacks, args): for fn in (primary, *fallbacks): try: return await fn(args) except ToolUnavailable: sli.tool_success.add(1, {"tool": fn.__name__, "status": "fallback"}) raise NoToolAvailable() Lookup-style tools especially benefit: web search → cached snapshot → static knowledge base. 3. Slot-swap auto-rollback Here's the killer feature App Service brings that's a slog on K8s: deployment slots. You always have a known-good previous version warmed up and one ARM API call away from production traffic. We wire that up to fire automatically when our SLI breaches. The chain is: Metric alert on Http5xx > 5 in 5 minutes (the platform metric, free) Action Group that POSTs to a Logic App webhook (SAS-signed callback URL) Logic App that calls POST /sites/{name}/slots/staging/slotsswap via its managed identity (granted Website Contributor on the target web app) The whole healer is one trigger + two actions: receive the alert webhook, call ARM slotsswap, return a status payload to the caller. The two actions in Bicep: SwapSlots: { type: 'Http' inputs: { method: 'POST' uri: '${environment().resourceManager}@{parameters(\'targetSiteId\')}/slots/staging/slotsswap?api-version=2024-04-01' body: { targetSlot: 'production' } authentication: { type: 'ManagedServiceIdentity' audience: environment().resourceManager } } } No code to deploy, no secrets to manage, no second runtime to babysit. From alert-fire to swapped-slot is about 4 minutes in our tests — under the SLA most agent products have for "user-visible degraded mode." Why not a Function App? We started there. The Logic App is 60 lines of Bicep and zero application code. For a one-action workflow like "swap a slot," the Function adds packaging, deployment, and a runtime to monitor for no benefit. Chaos testing for agents You can't trust a self-healing system you haven't broken. The sample ships a chaos CLI and an in-process injection point so you can practice failures on demand. In-process: llmops_middleware/chaos.py exposes four modes ( off , throttle , malformed , outage ) togglable via POST /admin/chaos . When set, tool calls roll a die and raise the matching exception with the configured probability: class ChaosController: def maybe_inject(self) -> None: if random.random() > self.probability: return if self.mode == "outage": raise ChaosOutage("simulated tool outage") if self.mode == "throttle": raise ChaosThrottled("simulated 429") if self.mode == "malformed": raise ChaosMalformed("simulated bad tool output") External: chaos/inject.py is a small async load driver that sets /admin/chaos then drives /chat at a target RPS, tallying response codes: python chaos/inject.py \ --base-url https://my-agent.azurewebsites.net \ --mode outage --probability 1.0 --rps 10 --duration 300 Running that for 5 minutes against the deployed sample reliably: Drives customMetrics(name="agent.task.failure") over 50/min Trips the Http5xx > 5 metric alert (~90 seconds after threshold breach) Fires the Logic App run (succeeded in 1.2 seconds in our test) Flips the slot — /health instance ID changes The repo's observability/queries.kql has the canonical KQL for each of these signals, and observability/workbook.json is the deployable workbook that visualizes them. The reference middleware Everything in this post is in seligj95/app-service-self-healing-agent-python. The Python package llmops_middleware/ is the part you'd vendor into your own agent — sli.py , budget.py , repair.py , chaos.py . The agent loop and the Bicep are demo-quality but production-shaped. Run it yourself: git clone https://github.com/seligj95/app-service-self-healing-agent-python cd app-service-self-healing-agent-python azd auth login azd up You'll have an agent + AOAI + workbook + healer running in about 8 minutes. Then run the chaos script and watch the slot flip. The KQL workbook Deployable workbook JSON, dropped into the resource group by Bicep. Six panels: SLO tile — % of tasks where agent.task.success was emitted (grouped by tenant) Cost burn-down — running spend per tenant against the monthly budget Top failing tools — failure count by tool, broken down by error class Latency p50/p95/p99 — agent.task.latency histogram Budget breaches — count and tenant list Healing signals — agent.repair.retry + agent.model.downshift + agent.chaos.injected over time It's observability/workbook.json — loadTextContent -ed into infra/shared/monitoring.bicep so you get it deployed automatically. Why App Service for LLMOps After building this, the appeal of App Service for agents is clearer than I expected going in: Slots are an unfair advantage. A pre-warmed previous version, one ARM call from production. K8s blue/green needs you to build it. Managed identity to Azure OpenAI removes the entire key-rotation problem. The sample sets disableLocalAuth: true on the AOAI account — there literally is no key. App Insights is auto-wired so your custom metrics land in customMetrics and your KQL queries work day one. Bicep + azd lets you ship a full LLMOps stack in one repo: app, infra, healing, observability, chaos. If you're standing up a new agent and you don't already have a Kubernetes platform you love, App Service is a strong default. Wrap-up If you take three things from this post: Define agent SLOs in your own terms — task success, cost per task, tool reliability — not just web-app SLOs. Put a circuit breaker between the user and the model. A budget breaker that downshifts to a cheaper model is the highest-ROI middleware you can ship. Make rollback boring. Slot swap + a one-action Logic App + a metric alert is a self-healing system you can build in an afternoon and trust at 3 AM. The sample has all of it wired up. We're considering baking these into App Service — tell us what you'd want The middleware in this sample (SLIs + telemetry, cost guardrails, policy/audit hooks) is exactly the kind of thing we're evaluating as first-class App Service platform features — opt-in sidecars or built-in capabilities so you don't have to vendor a middleware package into every agent you ship. Concretely, we're tracking ideas like: Agent Observatory — a sidecar that intercepts SDK calls (Semantic Kernel, LangChain, Crew AI, AutoGen) and captures full reasoning traces with zero code changes AI Cost Guardian — platform-level quotas and spend caps across Azure OpenAI, Anthropic, and other model providers, with real-time enforcement Policy Guard — governance, PII masking, model-approval lists, and an immutable audit log for regulated workloads If any of those would land for your team — or if you're solving these problems differently and want to push back on the shape — we want to hear it. Drop a comment on this post: the roadmap is genuinely shaped by feedback at this stage.
jordanselig
May 18, 2026 Place Apps on Azure Blog
228Views
0likes
0Comments
You Can Scale MCP Servers Behind a Load Balancer on App Service — Here's How
Most MCP servers in the wild are single-instance processes. That's fine when they're driving a local Claude or VS Code session — but it's the wrong shape for a production agent fleet that has to absorb traffic spikes, ride through deploys, and survive instance failures. The good news: the MCP spec already grew up. The 2025-06-18 revision formalizes stateless HTTP transport (and the current 2025-11-25 revision keeps it), which means a single request carries everything the server needs to answer. No long-lived connection, no in-process session table, no sticky-session hacks to keep a client glued to one box. That tiny protocol change unlocks something big: you can stick an MCP server behind App Service's built-in load balancer and scale it like any other web API. This post walks through how, with a runnable sample. Sample: seligj95/app-service-mcp-stateless-scale-python. One azd up and you have a stateless FastAPI MCP server running on three App Service instances behind the platform load balancer, with a staging slot, Application Insights, and a k6 script that visualizes load distribution from the client side. Why "stateless" is the whole story Earlier MCP transports leaned on persistent connections — SSE channels and WebSocket-style sessions where the server held per-client state in memory (open tools, subscriptions, partial streams). That model is great for a local IDE talking to a local process. It's hostile to load balancing, because routing a follow-up request to a different instance breaks the session. The stateless HTTP transport flips that. Each request is a complete JSON-RPC envelope ( initialize , tools/list , tools/call ), every response is self-contained, and the server is allowed to forget the client between requests. Any instance can serve any call. That is the property a load balancer needs. In the sample, every tool is a pure function of its arguments — whoami reports the serving instance, lookup_fact reads a static dictionary, compute_primes runs a sieve. None of them touches per-client memory. That's not a constraint of the protocol; it's a discipline you adopt to keep statelessness intact. Why App Service, and not Functions or AKS Functions and AKS are a couple of the many great options for MCP server hosting depending on what the MCP server is used for. The use case we are discussing here is a scaled MCP server, i.e. an MCP server that must reach a large and broad audience. Here are a few defaults that make App Service a solid option for this scenario: Always On. Reasoning tools call into LLMs and external APIs; latencies routinely sit in the multi-second range. Functions caps a single execution at ten minutes by default (and aggressively scales workers to zero between bursts, which kills warm caches). App Service keeps the process resident. Horizontal scale is one parameter. Pick a Premium SKU, set the plan's capacity to N, and you have N instances behind a managed load balancer. No VMSS to declare, no ingress controller to wire up, no Service to reconcile. Deployment slots. Swap a warmed-up staging slot into production for zero-downtime deploys. Critical when your "API" is an LLM tool surface that an agent is actively driving. Easy Auth. OAuth 2.1 in front of the MCP endpoint without writing the flow yourself — turn on the App Service authentication blade and point it at Entra ID. The sample leaves this off so the deploy is one command, but the wiring is a checkbox away. The TL;DR: it's PaaS that already knows how to run a stateful long-lived process at horizontal scale, which is exactly the shape of a scaled MCP server. The FastAPI MCP server, end-to-end stateless The whole transport is one POST handler. The full source is in main.py , but here are the load-bearing pieces: @app.post("/mcp") async def mcp_endpoint(request: Request): body = await request.json() method = body.get("method", "") msg_id = body.get("id") if method == "initialize": return {"jsonrpc": "2.0", "id": msg_id, "result": _server_info()} if method == "tools/list": return {"jsonrpc": "2.0", "id": msg_id, "result": {"tools": [...]}} if method == "tools/call": params = body.get("params", {}) result = await MCP_TOOLS[params["name"]]["function"](**params.get("arguments", {})) return { "jsonrpc": "2.0", "id": msg_id, "result": {"content": [{"type": "text", "text": json.dumps(result)}]}, } There is no session table. There is no client_id cookie. There is no AsyncIterator held open between requests. initialize , tools/list , and tools/call all return in a single round trip, which is the shape App Service's load balancer expects. The most useful debugging tool in the sample is whoami : async def tool_whoami() -> Dict[str, Any]: return { "instance_id": os.environ.get("WEBSITE_INSTANCE_ID", "local"), "hostname": socket.gethostname(), ... } WEBSITE_INSTANCE_ID is unique per App Service worker. Call whoami a few times from your MCP client and the value rotates — that's the load balancer working. If it doesn't rotate, something is pinning your traffic (almost always the ARR Affinity cookie; we'll get there). The Bicep that actually makes it scale The infra is a P0v3 plan with capacity: 3 , a web app with affinity disabled, and a staging slot on the same plan: resource appServicePlan 'Microsoft.Web/serverfarms@2024-04-01' = { name: name sku: { name: 'P0v3' capacity: instanceCount // 3 by default } properties: { reserved: true } } resource web 'Microsoft.Web/sites@2024-04-01' = { name: name properties: { serverFarmId: appServicePlanId httpsOnly: true clientAffinityEnabled: false // ← the one line that matters siteConfig: { linuxFxVersion: 'PYTHON|3.11' alwaysOn: true healthCheckPath: '/health' appCommandLine: 'python -m uvicorn main:app --host 0.0.0.0 --port 8000' } } } resource staging 'Microsoft.Web/sites/slots@2024-04-01' = { parent: web name: 'staging' properties: { /* same shape — separate hostname, same plan */ } } The single most important line in that template is clientAffinityEnabled: false . App Service defaults to on, which sets the ARRAffinity cookie and pins every subsequent request from a given client to the instance that handled the first one. That default exists because legacy ASP.NET apps used in-process session state. Stateless MCP does not. Leaving affinity on silently undoes everything we just built. Premium v3 (P0v3) is the floor for two reasons: it gives Always On and unlocks deployment slots. Below that tier you don't get either. Application Insights without writing telemetry code The sample drops one line of bootstrap into main.py : from azure.monitor.opentelemetry import configure_azure_monitor if os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"): configure_azure_monitor(logger_name="mcp") The Azure Monitor OpenTelemetry distro auto-instruments FastAPI and outbound HTTP. Every request span App Service emits is tagged with cloud_RoleInstance , which Application Insights populates from WEBSITE_INSTANCE_ID . That makes the question "is traffic actually spreading across my instances?" a one-liner in Logs: requests | where timestamp > ago(15m) | where name contains "/mcp" | summarize count() by cloud_RoleInstance | order by count_ desc If you see three roughly-equal rows, you're done. If you see one row, your client is sending ARRAffinity cookies — turn affinity off and redeploy. Deploy azd auth login azd up That provisions the resource group, plan, web app, staging slot, Log Analytics workspace, and Application Insights resource, then deploys the Python app via Oryx. The output prints both WEB_URI and WEB_STAGING_URI . Open the production URI — the home page renders the instance ID that served it. Refresh. The ID changes. To swap the staging slot into production with no downtime: az webapp deployment slot swap \ --resource-group <rg> --name <app> \ --slot staging --target-slot production App Service warms the staging instances, redirects traffic, and the old production becomes the new staging — the classic blue-green pattern, but free. Prove it scales The sample ships a k6 script that hammers /mcp with tools/call requests and tags every response with the instance_id the server returned: BASE_URL=https://<your-app>.azurewebsites.net \ k6 run --summary-export=summary.json loadtest/k6-mcp.js jq '.metrics.mcp_instance_hits.values' summary.json The output groups hits per instance tag. On a three-instance plan with a 60-second steady load you should see something close to: { "count": 1842, "instance0d3e2f...": 614, "instance7a91bc...": 612, "instance19f0c4...": 616 } Roughly 33% on each box — the App Service load balancer round-robining new connections, with no help from the application. What I'd do next The sample is intentionally a starting point. Two extensions are the obvious next moves: Add Easy Auth. Turn on App Service authentication, pick Entra ID, require auth on /mcp . The token surfaces as headers; your tool handlers can use it to identify the calling agent without you owning any of the OAuth machinery. Autoscale on CPU. instanceCount: 3 is a starting point. Wire up Microsoft.Insights/autoscalesettings against the plan and let it scale 3 → 10 on the prime-counting tool. The architecture already supports it — that's the whole point of stateless. Try it Sample repo: github.com/seligj95/app-service-mcp-stateless-scale-python MCP spec: modelcontextprotocol.io/specification/2025-11-25 App Service docs: learn.microsoft.com/azure/app-service/overview If you ship something with it, I'd love to hear how it held up.
jordanselig
May 18, 2026 Place Apps on Azure Blog
258Views
0likes
0Comments