azure app service
492 TopicsYou Can Build a Framework-Agnostic AI Gateway on Azure App Service — Here's How
The agent infrastructure conversation moved this year. In October 2025, AWS shipped Amazon Bedrock AgentCore — a managed agent runtime with per-session microVM isolation, built-in long-term memory, native MCP support, and an opinionated policy engine. A few months earlier, Cloudflare shipped its Agents SDK on top of Durable Objects, betting that edge-native stateful agents are the future. Both bets are real, both are interesting, and both arrive as closed, proprietary runtimes. So: what's Azure's answer? It's a question I've heard a couple times from architects in the last six months. The honest answer is that Azure already has the pieces. They don't ship as one product called AgentRuntime, and that's actually the point. Azure's pitch is composable: App Service + API Management + MCP, three services you already have access to, glued together with open standards. This post walks through a runnable sample of that composition. One App Service hosting both an agent (built with the Microsoft Agent Framework) and the stateless MCP server it calls, fronted by Azure API Management with the AI Gateway policy set — semantic caching, token rate limiting, per-subscription token emission for chargeback. One azd up deploys the lot. Repo: app-service-ai-gateway-mcp-apim-python. The headline claim is in the title. The point I actually want to make is the one underneath it: the framework is replaceable, the gateway is the contribution. Swap the Agent Framework module for Pydantic AI or LangGraph and the rest of the architecture is unchanged. That's what "run anything" means, made literal. The composable stack ┌────────────────────────────────────────────────┐ │ Azure API Management │ MCP / Agent ──┤ AI Gateway policies: │ client │ • llm-token-limit │ │ • llm-semantic-cache-lookup / store │ │ • llm-emit-token-metric │ │ • rate-limit-by-key (MCP API) │ └─────────────┬───────────────────┬──────────────┘ │ │ ┌────────────────▼──┐ ┌────────────▼──────────────┐ │ Azure OpenAI │ │ Azure App Service │ │ • chat model │ │ FastAPI app: │ │ • embedding │ │ • /mcp (stateless) │ │ model │ │ • /agent/chat │ └───────────────────┘ │ Managed identity → │ │ APIM (via subscription) │ └────────────┬──────────────┘ ▼ Application Insights (cloud_RoleInstance, APIM token metrics) Three observations that drive everything else: APIM is the only thing that talks to Azure OpenAI. The App Service agent doesn't have an AOAI key. It has an APIM subscription key. Every LLM call passes through the gateway, picks up the policies, and gets logged with consistent dimensions. That's where the governance part lives. The agent runtime is App Service. Linux, Python, FastAPI. Any language. Any framework. Pick your tool. We use Microsoft Agent Framework because it just GA'd and the API is clean, but the agent module is the easiest thing in the stack to swap. The MCP server is co-located with the agent. Same App Service, different route. The agent calls its own tools either in-process (fast path) or back out through APIM (so MCP traffic gets rate-limited and observed too). That choice is one environment variable. What the sample actually does The FastAPI app exposes three routes that matter: /mcp — a stateless HTTP MCP server (protocol revision 2025-11-25 ), implementing four tools: whoami , echo , lookup_fact , and summarize_app_service_doc . Any MCP client (Claude, VS Code, your own agent runtime) can connect. /agent/chat — a Microsoft Agent Framework agent that uses those same MCP tools as its tool set, and calls AOAI through APIM. /health and / — the boring but essential supporting cast (health check for App Service probes, status page showing the serving instance ID). Here's the agent definition. The key line is the endpoint: from agent_framework.openai import OpenAIChatCompletionClient client = OpenAIChatCompletionClient( azure_endpoint=os.environ["APIM_GATEWAY_URL"], # ← APIM, not AOAI model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"], api_version="2024-10-21", api_key=os.environ["APIM_SUBSCRIPTION_KEY"], default_headers={"Ocp-Apim-Subscription-Key": os.environ["APIM_SUBSCRIPTION_KEY"]}, ) agent = client.as_agent( name="AppServiceExpert", instructions=SYSTEM_INSTRUCTIONS, tools=build_tools(), ) That's it. The agent has no idea APIM exists. It thinks it's talking to AOAI. APIM is doing every interesting thing — auth, caching, throttling, metric emission — without the agent code knowing or caring. The policy that does the heavy lifting The AOAI API in APIM has one policy attached at the API scope. The full XML is in infra/apim/policies/aoai-policy.xml; here's the bones of it: <policies> <inbound> <base /> <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="aoai-token" /> <set-header name="Authorization" exists-action="override"> <value>@("Bearer " + (string)context.Variables["aoai-token"])</value> </set-header> <azure-openai-token-limit counter-key="@(context.Subscription?.Id ?? "anonymous")" tokens-per-minute="50000" estimate-prompt-tokens="true" /> <azure-openai-semantic-cache-lookup score-threshold="0.85" embeddings-backend-id="aoai-embeddings-backend" embeddings-backend-auth="system-assigned"> <vary-by>@(context.Subscription?.Id ?? "anonymous")</vary-by> </azure-openai-semantic-cache-lookup> <set-backend-service backend-id="aoai-backend" /> <azure-openai-emit-token-metric namespace="ai-gateway"> <dimension name="Subscription ID" value="@(context.Subscription?.Id ?? "anonymous")" /> <dimension name="API ID" value="@(context.Api.Id)" /> <dimension name="Operation ID" value="@(context.Operation.Id)" /> <dimension name="Client IP" value="@(context.Request.IpAddress)" /> </azure-openai-emit-token-metric> </inbound> <outbound> <base /> <azure-openai-semantic-cache-store duration="3600" /> </outbound> </policies> Four things are happening here that would otherwise be your problem: Auth to AOAI. APIM's managed identity holds the Cognitive Services OpenAI User role on the AOAI account. No keys. Token rate limiting. Each APIM subscription gets a tokens-per-minute budget. One runaway team can't starve everyone else. Semantic caching. The inbound policy embeds the prompt using the embedding deployment, queries the Redis-backed APIM cache for a vector match above the 0.85 threshold, and short-circuits the AOAI call on a hit. The outbound azure-openai-semantic-cache-store writes successful misses back. Per-call metric emission. Every call pushes PromptTokens , CompletionTokens , and TotalTokens to Application Insights as custom metrics tagged with the APIM subscription, the API, the operation, and the client IP. That's your chargeback dashboard, ready to query. The whole thing is XML. None of it is in your agent code. Deploying it azd auth login azd up azd up provisions a P0v3 App Service Plan with the web app and a staging slot, an AOAI account with gpt-4o-mini + text-embedding-3-small deployments, an APIM Developer SKU service with the two APIs and the policy XML wired up, an Azure Cache for Redis Basic C0 as the semantic-cache store, and a Log Analytics workspace + Application Insights. The postprovision hook fetches the APIM subscription key for the AI Gateway product and writes it into the App Service's APIM_SUBSCRIPTION_KEY setting (and the staging slot's, so slot swaps are clean). Be patient. Developer SKU APIM takes 30–45 minutes the first time. If you want to prototype faster, the sample supports Consumption SKU as a one-line flip: azd env set APIM_SKU Consumption azd provision Consumption provisions in about a minute and is great for sketching. Verify your specific policies are supported there before you ship it. Governing it like a grown-up The toy version of this post stops at "look, semantic cache works." The version your platform engineering lead wants to see goes further. Per-team chargeback. The token-emit policy tags every call with the APIM subscription ID. Issue one subscription per team, hand it over with a quota, and your monthly chargeback report is a KQL query: customMetrics | where timestamp > startofmonth(now()) | where name == "TotalTokens" | summarize Tokens=sum(valueSum) by Team=tostring(customDimensions["Subscription ID"]) | extend USD = Tokens * 0.00015 / 1000 // gpt-4o-mini blended rate | order by USD desc Content safety as a policy plug-in. Add an llm-content-safety block to the inbound policy and point it at an Azure AI Content Safety resource — every prompt and response gets moderated before reaching agents or end users. The sample doesn't deploy Content Safety by default (to keep the demo cost-free), but the README has the one-line bicep + one-block policy delta. Circuit breaker + multi-region failover. Add a second AOAI backend in a different region and an APIM backend pool, give the pool a circuit-breaker rule, and your agents inherit failover with zero code changes. Rate-limit MCP traffic too. The MCP API has its own policy with rate-limit-by-key , so a runaway agent can't pin the MCP server with a hot loop. None of these are gymnastics. They're one policy block each. The pattern is the same every time: write policy at the gateway, leave the agent code alone. Proving it works After azd up finishes, two checks. First, hit the agent endpoint: curl -sS -X POST "$(azd env get-value WEB_URI)/agent/chat" \ -H 'Content-Type: application/json' \ -d '{"message": "How does App Service horizontally scale an MCP server?"}' | jq You should see a reply that cites the instance ID (the agent calls whoami and summarize_app_service_doc to ground its answer) and a tool_calls array showing the agent's reasoning trace. Second, run the k6 load test: export BASE_URL="$(azd env get-value WEB_URI)" export APIM_SUBSCRIPTION_KEY="$(azd env get-value APIM_SUBSCRIPTION_KEY)" k6 run loadtest/k6-gateway.js The script hits /agent/chat with a small pool of semantically-similar prompts. After a 30-second warmup, the steady phase should report a cache-hit ratio above 30%: APIM AI Gateway — k6 summary ───────────────────────────── Cache hits : 412 Cache misses : 88 Hit ratio : 82.4% Cross-check in App Insights: ApiManagementGatewayLogs | where TimeGenerated > ago(15m) | where ApiId == "aoai" | extend cache = tostring(parse_json(ResponseHeaders)["x-llm-cache-status"]) | summarize count() by cache, bin(TimeGenerated, 1m) | render columnchart A solid bar of hits next to a smaller bar of misses is the gateway earning its keep. "Run anything" — the proof Here's the part where I cash the check the title wrote. The agent module is the easiest thing in this stack to replace. Three changes to ship the same demo on Pydantic AI: # requirements.txt - agent-framework-core==1.5.0 - agent-framework-openai==1.5.0 + pydantic-ai==0.4.0 # agent/agent.py from pydantic_ai import Agent from pydantic_ai.models.openai import OpenAIModel def build_agent(): model = OpenAIModel( "gpt-4o-mini", base_url=f"{os.environ['APIM_GATEWAY_URL']}/openai/deployments/gpt-4o-mini", api_key=os.environ["APIM_SUBSCRIPTION_KEY"], ) return Agent(model, system_prompt=SYSTEM_INSTRUCTIONS, tools=build_tools()) That's it. build_tools() returns the same list of async callables (Pydantic AI accepts plain Python functions as tools, same as Agent Framework). LangGraph works the same way — wire build_tools() into a ToolNode , point ChatOpenAI at the APIM gateway URL, done. Every APIM policy still fires. Every token metric still emits. Every cache hit still hits. The gateway is the boundary; the runtime above it is fungible. What AgentCore gets right I want to land this without spin. AgentCore's per-session microVM isolation is genuinely interesting — it's a stronger sandboxing story than running multiple agents in shared App Service workers, and it matters for multi-tenant SaaS where agents execute arbitrary user code or call third-party tools you don't fully trust. The managed long-term memory primitive is also a real convenience; Azure has the building blocks (Cosmos DB, AI Search, Cognitive Search) but they aren't pre-wired into a single "agent memory" API the way AgentCore's are. Where the App Service + APIM + MCP composition genuinely wins: Open standards. MCP is a public protocol with implementations across the industry. AgentCore's tool layer is AWS-native. No new runtime to learn. App Service is the same App Service. Your existing CI/CD, your existing security review, your existing monitoring all apply. Bring your own framework. Pydantic AI, LangGraph, Agent Framework, Semantic Kernel, AutoGen, CrewAI — they all work, because the App Service doesn't care what's running inside the container. Existing enterprise footprint. VNet integration, private endpoints, managed identity, deployment slots, sidecars, Easy Auth. None of it is new for App Service. You inherit a decade of platform work. The right framing isn't "Azure's answer to AgentCore." It's that Azure is making a different bet: that enterprises will value the composability of services they already trust over the convenience of a new proprietary runtime. For some, that bet is probably correct. For a few — multi-tenant agent marketplaces, untrusted code execution — AgentCore's isolation model is a better fit. Pick the one that matches your threat model. What's next If you ship the sample and want to compare notes, the repo is at app-service-ai-gateway-mcp-apim-python.110Views0likes0CommentsYou Can Scale MCP Servers Behind a Load Balancer on App Service — Here's How
Most MCP servers in the wild are single-instance processes. That's fine when they're driving a local Claude or VS Code session — but it's the wrong shape for a production agent fleet that has to absorb traffic spikes, ride through deploys, and survive instance failures. The good news: the MCP spec already grew up. The 2025-06-18 revision formalizes stateless HTTP transport (and the current 2025-11-25 revision keeps it), which means a single request carries everything the server needs to answer. No long-lived connection, no in-process session table, no sticky-session hacks to keep a client glued to one box. That tiny protocol change unlocks something big: you can stick an MCP server behind App Service's built-in load balancer and scale it like any other web API. This post walks through how, with a runnable sample. Sample: seligj95/app-service-mcp-stateless-scale-python. One azd up and you have a stateless FastAPI MCP server running on three App Service instances behind the platform load balancer, with a staging slot, Application Insights, and a k6 script that visualizes load distribution from the client side. Why "stateless" is the whole story Earlier MCP transports leaned on persistent connections — SSE channels and WebSocket-style sessions where the server held per-client state in memory (open tools, subscriptions, partial streams). That model is great for a local IDE talking to a local process. It's hostile to load balancing, because routing a follow-up request to a different instance breaks the session. The stateless HTTP transport flips that. Each request is a complete JSON-RPC envelope ( initialize , tools/list , tools/call ), every response is self-contained, and the server is allowed to forget the client between requests. Any instance can serve any call. That is the property a load balancer needs. In the sample, every tool is a pure function of its arguments — whoami reports the serving instance, lookup_fact reads a static dictionary, compute_primes runs a sieve. None of them touches per-client memory. That's not a constraint of the protocol; it's a discipline you adopt to keep statelessness intact. Why App Service, and not Functions or AKS Functions and AKS are a couple of the many great options for MCP server hosting depending on what the MCP server is used for. The use case we are discussing here is a scaled MCP server, i.e. an MCP server that must reach a large and broad audience. Here are a few defaults that make App Service a solid option for this scenario: Always On. Reasoning tools call into LLMs and external APIs; latencies routinely sit in the multi-second range. Functions caps a single execution at ten minutes by default (and aggressively scales workers to zero between bursts, which kills warm caches). App Service keeps the process resident. Horizontal scale is one parameter. Pick a Premium SKU, set the plan's capacity to N, and you have N instances behind a managed load balancer. No VMSS to declare, no ingress controller to wire up, no Service to reconcile. Deployment slots. Swap a warmed-up staging slot into production for zero-downtime deploys. Critical when your "API" is an LLM tool surface that an agent is actively driving. Easy Auth. OAuth 2.1 in front of the MCP endpoint without writing the flow yourself — turn on the App Service authentication blade and point it at Entra ID. The sample leaves this off so the deploy is one command, but the wiring is a checkbox away. The TL;DR: it's PaaS that already knows how to run a stateful long-lived process at horizontal scale, which is exactly the shape of a scaled MCP server. The FastAPI MCP server, end-to-end stateless The whole transport is one POST handler. The full source is in main.py , but here are the load-bearing pieces: @app.post("/mcp") async def mcp_endpoint(request: Request): body = await request.json() method = body.get("method", "") msg_id = body.get("id") if method == "initialize": return {"jsonrpc": "2.0", "id": msg_id, "result": _server_info()} if method == "tools/list": return {"jsonrpc": "2.0", "id": msg_id, "result": {"tools": [...]}} if method == "tools/call": params = body.get("params", {}) result = await MCP_TOOLS[params["name"]]["function"](**params.get("arguments", {})) return { "jsonrpc": "2.0", "id": msg_id, "result": {"content": [{"type": "text", "text": json.dumps(result)}]}, } There is no session table. There is no client_id cookie. There is no AsyncIterator held open between requests. initialize , tools/list , and tools/call all return in a single round trip, which is the shape App Service's load balancer expects. The most useful debugging tool in the sample is whoami : async def tool_whoami() -> Dict[str, Any]: return { "instance_id": os.environ.get("WEBSITE_INSTANCE_ID", "local"), "hostname": socket.gethostname(), ... } WEBSITE_INSTANCE_ID is unique per App Service worker. Call whoami a few times from your MCP client and the value rotates — that's the load balancer working. If it doesn't rotate, something is pinning your traffic (almost always the ARR Affinity cookie; we'll get there). The Bicep that actually makes it scale The infra is a P0v3 plan with capacity: 3 , a web app with affinity disabled, and a staging slot on the same plan: resource appServicePlan 'Microsoft.Web/serverfarms@2024-04-01' = { name: name sku: { name: 'P0v3' capacity: instanceCount // 3 by default } properties: { reserved: true } } resource web 'Microsoft.Web/sites@2024-04-01' = { name: name properties: { serverFarmId: appServicePlanId httpsOnly: true clientAffinityEnabled: false // ← the one line that matters siteConfig: { linuxFxVersion: 'PYTHON|3.11' alwaysOn: true healthCheckPath: '/health' appCommandLine: 'python -m uvicorn main:app --host 0.0.0.0 --port 8000' } } } resource staging 'Microsoft.Web/sites/slots@2024-04-01' = { parent: web name: 'staging' properties: { /* same shape — separate hostname, same plan */ } } The single most important line in that template is clientAffinityEnabled: false . App Service defaults to on, which sets the ARRAffinity cookie and pins every subsequent request from a given client to the instance that handled the first one. That default exists because legacy ASP.NET apps used in-process session state. Stateless MCP does not. Leaving affinity on silently undoes everything we just built. Premium v3 (P0v3) is the floor for two reasons: it gives Always On and unlocks deployment slots. Below that tier you don't get either. Application Insights without writing telemetry code The sample drops one line of bootstrap into main.py : from azure.monitor.opentelemetry import configure_azure_monitor if os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"): configure_azure_monitor(logger_name="mcp") The Azure Monitor OpenTelemetry distro auto-instruments FastAPI and outbound HTTP. Every request span App Service emits is tagged with cloud_RoleInstance , which Application Insights populates from WEBSITE_INSTANCE_ID . That makes the question "is traffic actually spreading across my instances?" a one-liner in Logs: requests | where timestamp > ago(15m) | where name contains "/mcp" | summarize count() by cloud_RoleInstance | order by count_ desc If you see three roughly-equal rows, you're done. If you see one row, your client is sending ARRAffinity cookies — turn affinity off and redeploy. Deploy azd auth login azd up That provisions the resource group, plan, web app, staging slot, Log Analytics workspace, and Application Insights resource, then deploys the Python app via Oryx. The output prints both WEB_URI and WEB_STAGING_URI . Open the production URI — the home page renders the instance ID that served it. Refresh. The ID changes. To swap the staging slot into production with no downtime: az webapp deployment slot swap \ --resource-group <rg> --name <app> \ --slot staging --target-slot production App Service warms the staging instances, redirects traffic, and the old production becomes the new staging — the classic blue-green pattern, but free. Prove it scales The sample ships a k6 script that hammers /mcp with tools/call requests and tags every response with the instance_id the server returned: BASE_URL=https://<your-app>.azurewebsites.net \ k6 run --summary-export=summary.json loadtest/k6-mcp.js jq '.metrics.mcp_instance_hits.values' summary.json The output groups hits per instance tag. On a three-instance plan with a 60-second steady load you should see something close to: { "count": 1842, "instance0d3e2f...": 614, "instance7a91bc...": 612, "instance19f0c4...": 616 } Roughly 33% on each box — the App Service load balancer round-robining new connections, with no help from the application. What I'd do next The sample is intentionally a starting point. Two extensions are the obvious next moves: Add Easy Auth. Turn on App Service authentication, pick Entra ID, require auth on /mcp . The token surfaces as headers; your tool handlers can use it to identify the calling agent without you owning any of the OAuth machinery. Autoscale on CPU. instanceCount: 3 is a starting point. Wire up Microsoft.Insights/autoscalesettings against the plan and let it scale 3 → 10 on the prime-counting tool. The architecture already supports it — that's the whole point of stateless. Try it Sample repo: github.com/seligj95/app-service-mcp-stateless-scale-python MCP spec: modelcontextprotocol.io/specification/2025-11-25 App Service docs: learn.microsoft.com/azure/app-service/overview If you ship something with it, I'd love to hear how it held up.121Views0likes0CommentsAnnouncing Public Preview of Argo CD extension in AKS Azure Portal Experience
We are excited to announce the public preview of Argo CD in the Azure Portal for Azure Kubernetes Service. As GitOps becomes the standard for deploying and operating applications at scale, customers need a way to adopt GitOps with simpler onboarding, secure defaults, and integrated workflows. With Argo CD now available directly in the Portal, teams can enable and manage GitOps without the complexity of manual setup. Bringing GitOps into the AKS experience Argo CD is widely used across Kubernetes environments, but setup often requires manual configuration across identity, networking, and registry integrations. With the Azure Portal experience, customers can: Enable Argo CD directly from the AKS cluster Configure identity, access, ingress, and registry integration in a guided flow Manage and monitor GitOps workflows through Argo CD UI This reduces onboarding friction and helps you reach your first successful GitOps deployment faster. Trusted identity and secure access The Argo CD experience integrates with Microsoft Entra ID to provide a secure, enterprise-ready foundation: Secure authentication using Workload Identity federation to Azure Container Registry (ACR) and Azure DevOps, removing long-lived credentials and hard-coded secrets Single Sign-On (SSO) using existing Azure identities Enterprise-grade hardening and security This preview includes built-in improvements to strengthen security posture: Images built on Azure Linux for reduced CVEs and improved baseline security Optional automatic patch updates to stay current while maintaining control over change management Parity with upstream Argo CD Argo CD in AKS remains aligned with the upstream open-source project, supporting: High availability (HA) configurations for production workloads Hub-and-spoke architectures for multi-cluster GitOps Application and ApplicationSet for scalable deployment across fleets Getting Started We invite you to explore the Argo CD experience in the Azure Portal and share feedback. To get started, go to your AKS cluster in the Azure Portal, navigate to the GitOps experience, and select Enable Argo CD. Follow the guided setup to configure identity, access, ingress, and registry integration with secure defaults. Once enabled, you can monitor your deployment and view application health and sync status from the Argo CD UI linked in the GitOps blade. For customers who prefer automation and scripting, the Argo CD extension is also available via Azure CLI public preview. NOTE: You can choose between Flux and Argo CD as your GitOps solution based on your needs. The Argo CD option is available during the initial GitOps setup experience, while existing Flux users will continue to see their current configuration.227Views0likes0Comments[Retired] Strapi on App Service: Quick start
In this quick start guide, you will learn how to create and deploy your first Strapi site on Azure App Service Linux, using Azure Database for MySQL or PostgreSQL, along with other necessary Azure resources. This guide utilizes an ARM template to install the required resources for hosting your Strapi application.3.7KViews1like4Comments[Retired] Strapi on App Service: Overview
Looking to self-host Strapi? Deploy Strapi on Azure App Service to gain greater customization control, global region availability, and seamless integration with other Azure services. Hosting Strapi on Azure App Service simplifies infrastructure management while ensuring high availability, security, and performance. Whether you're searching for Strapi hosting, Strapi deployment, or where to host Strapi, Azure App Service provides the ideal solution for deploying Strapi efficiently and securely.1.3KViews2likes0Comments[Retired] Strapi on App Service: FAQ
This comprehensive guide is designed to answer all your questions about hosting and deploying Strapi on Azure App Service. Whether you're looking to understand the best platforms for Strapi integration, how to run Strapi in different modes, or how to deploy Strapi on various hosting services, this FAQ has you covered.1KViews0likes0CommentsDebugging Python apps on App Service with the new SSH helper aliases
You shipped a Python app to App Service. It worked in the demo. It works locally. In production, /chat is returning 502s — but /health is green, the deployment succeeded, the logs are quiet, and your laptop can't reproduce it. What you actually need is a shell on the running container so you can poke at DNS, env vars, installed packages, the listening port, and the AI endpoint your app is calling. The platform has had SSH for a while, but the playbook of "open SSH, then remember which 14 commands to run" was tribal knowledge. We just shipped a set of SSH helper aliases that turn that tribal knowledge into one-word commands. apphelp shows you everything; appconfig , showpkgs , and appcurl cover the app side; ai-test , ai-diagnose , ai-curl , ai-latency , ai-dns , and ai-access-check cover the Azure AI Foundry side. This post is a hands-on tour. We built a deliberately fragile FastAPI sample with six different fault modes, deployed it, broke it, and SSH'd in to watch the aliases drive each one to root cause. Every transcript below is real output from the deployed sample. 📦 Sample repo: seligj95/app-service-ssh-diagnostics-python — azd up and you have a fault-injectable Python + Foundry app live in your subscription in about 4 minutes. The sample, in one breath FastAPI app, Python 3.14, App Service Linux on P0v3 — uses the new Oryx FastAPI auto-detection so no custom startup command is needed Calls Azure OpenAI (gpt-4o-mini) via managed identity — no keys POST /admin/fault toggles one of seven modes: off , bad-creds , wrong-endpoint , dns-fail , port-mismatch , dep-import-error , latency-spike GET / is a landing page with a built-in cheat sheet of the SSH aliases The endpoints are intentionally boring. The point is to give the aliases something realistic to chew on. A quick note on Azure OpenAI vs. AI Foundry. This sample provisions an Azure OpenAI account ( kind: OpenAI ). The new ai-* aliases speak the OpenAI chat-completions API ( /openai/deployments/<model>/chat/completions ), which is identical on Azure OpenAI and on Azure AI Foundry projects — both expose *.openai.azure.com endpoints, both accept managed-identity bearer tokens, both speak the same schema. The aliases work against either; the env-var name AZURE_AI_FOUNDRY_ENDPOINT is just the alias contract. Drop a Foundry endpoint into it and the same walkthrough applies. Shout-out to the new FastAPI auto-detect on Python 3.14. This sample also benefits from another recent App Service change: on Python 3.14+, App Service automatically detects FastAPI apps and starts them with gunicorn -k uvicorn_worker.UvicornWorker — no custom startup command needed. Our Bicep ships an empty appCommandLine and lets Oryx do the right thing. The whole sample is a nice tour of recent App Service Python improvements landing together. Step zero: apphelp After azd up finishes, the first thing to do over SSH is: az webapp ssh -g rg-ssh-diag-demo -n app-web-<token> Then inside the container: $ apphelp apphelp prints every alias the image ships with, grouped by category. You don't need to memorize anything — when you forget what checkport does, you run apphelp and it's right there. We'll lean on most of these: App info: showpkgs , appconfig , appenv Logs: applogs , deploylogs , logfiles Reachability: appcurl , checkport , gohome , gosrc AI/Foundry: ai-test , ai-dns , ai-access-check , ai-curl , ai-latency , ai-diagnose Network tools: install-nettools The healthy baseline Before breaking anything, run ai-diagnose . This is the one-shot "is my AI path healthy?" check, and it's the alias we reach for most: $ ai-diagnose ──────────────────────────────────────────────────────────────── AI Foundry Diagnostics ──────────────────────────────────────────────────────────────── [✓] Managed identity token [✓] DNS resolution (d8f9grasb7ewc7h8.ai-gateway.eastus2-01.azure-api.net. - public) [✓] Foundry connectivity (761ms) ──────────────────────────────────────────────────────────────── Three green checks tell you three different things: the managed identity is issuing tokens, the Foundry hostname resolves, and the endpoint responded in a reasonable time. If any of these are red, you already know which layer the fault is in. For more detail, the individual aliases are worth knowing: $ ai-test ✓ Connected | 1009ms | Model: gpt-4o-mini | Auth: Managed Identity $ ai-access-check ✓ Foundry endpoint: https://cog-ftirxupt2yjoe.openai.azure.com/ ✓ Model: gpt-4o-mini ✓ Using auth mode: Managed Identity ✓ Access check passed: authorized to call Foundry $ ai-latency Running 5 requests to gpt-4o-mini... Request 1: 679ms ✓ Request 2: 826ms ✓ Request 3: 758ms ✓ Request 4: 641ms ✓ Request 5: 664ms ✓ Results (5/5 successful): Avg: 713ms | Min: 641ms | Max: 826ms And the app side: $ checkport ✓ App is listening on port 8000 $ appcurl /health HTTP Status: 200 Time: 0.002417s Size: 5423 bytes That's our "everything is fine" reference. Now let's break things. One trick: applying a fault inside the SSH shell A subtle thing trips people up the first time. POST /admin/fault mutates the app process's environment — but your SSH shell is a separate process. It inherited the container's env when you opened the session, so ai-test will still see the healthy values. The sample handles this by also writing a small file to the persistent share: # app/faults.py def _write_env_file() -> None: """Write fault env to /home/site/diagnostics/fault.env so SSH can `source` it.""" diag = Path("/home/site/diagnostics") diag.mkdir(parents=True, exist_ok=True) snap = _snapshot_unlocked() lines = [f"# Active fault: {snap['mode']}", ""] for k, v in snap["env"].items(): lines.append(f"export {k}={shlex.quote(v) if v else "''"}") (diag / "fault.env").write_text("\n".join(lines) + "\n") After toggling a fault, run this once in your SSH session: source /home/site/diagnostics/fault.env Now the aliases see the same env the broken app sees. This pattern — flip a flag from outside, source the change inside — is worth stealing for your own debugging workflows. Group A: faults the AI aliases catch directly Some faults are in the path between App Service and Foundry — wrong endpoint, broken DNS, network. The ai-* aliases reproduce the failure end-to-end, and they tell you exactly which layer. Fault 1: wrong-endpoint — a typo in the AOAI endpoint The most common AI-side incident: someone fat-fingers an app setting. The endpoint resolves to something (it's still *.openai.azure.com ) but it's not your resource. curl -X POST $URL/admin/fault -H 'content-type: application/json' \ -d '{"mode":"wrong-endpoint"}' curl $URL/chat -H 'content-type: application/json' \ -d '{"prompt":"hi"}' # HTTP 502 # {"detail":"APIConnectionError: Connection error."} SSH in, source the fault env, run the AI aliases: $ source /home/site/diagnostics/fault.env $ ai-dns Resolving: this-resource-does-not-exist.openai.azure.com ✗ DNS resolution failed for this-resource-does-not-exist.openai.azure.com $ ai-curl Request: POST https://this-resource-does-not-exist.openai.azure.com//openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-02-01 Authorization: Bearer [hidden] Content-Type: application/json curl: (6) Could not resolve host: this-resource-does-not-exist.openai.azure.com $ ai-diagnose [✓] Managed identity token [✗] DNS resolution failed for this-resource-does-not-exist.openai.azure.com [✗] Foundry connectivity (HTTP 000) ai-diagnose collapses the whole story into three lines: token works, DNS fails, connectivity fails. The fault is unambiguously a bad endpoint — check appconfig and your Bicep parameters. Fault 2: dns-fail — NXDOMAIN A subtler variant of the same failure mode is when the endpoint is structurally wrong (private endpoint misconfigured, hosts file mishap, custom domain expired). ai-dns calls it out the same way: $ ai-dns Resolving: no-such-host.invalid.example ✗ DNS resolution failed for no-such-host.invalid.example If you need deeper diagnostics — say, you suspect a flaky resolver rather than the hostname itself — install-nettools gives you dig , nslookup , and friends without rebuilding the container. $ install-nettools $ dig openai.azure.com $ nslookup cog-ftirxupt2yjoe.openai.azure.com Group B: faults that pass ai-test but break your app Here's the most useful thing we learned building this sample: ai-test can be green while your app is on fire, and that's a signal, not a bug. The ai-* aliases call Foundry directly. If they're green and your app is red, the platform-to-Foundry path is fine — the divergence is in your app. Time to pivot to appenv , applogs , showpkgs . Fault 3: bad-creds — wrong AZURE_CLIENT_ID This one is the classic user-assigned managed identity mishap: you scoped your code to a user-assigned managed identity, but the GUID in AZURE_CLIENT_ID doesn't actually exist (or wasn't granted RBAC). curl -X POST $URL/admin/fault -d '{"mode":"bad-creds"}' curl $URL/chat -d '{"prompt":"hi"}' # HTTP 502 # {"detail":"ClientAuthenticationError: DefaultAzureCredential failed to retrieve a token..."} Now SSH in and try the AI aliases: $ source /home/site/diagnostics/fault.env $ ai-test ✓ Connected | 734ms | Model: gpt-4o-mini | Auth: Managed Identity $ ai-access-check ✓ Foundry endpoint: https://cog-ftirxupt2yjoe.openai.azure.com/ ✓ Using auth mode: Managed Identity ✓ Access check passed: authorized to call Foundry Both green. That looks like a contradiction, but it's not. The aliases authenticate using the system-assigned managed identity directly (via IMDS), and they pass. Your Python app uses DefaultAzureCredential , which honors AZURE_CLIENT_ID to pick a user-assigned identity — and that one is broken. The takeaway: when ai-test is green but /chat is red, the platform's identity is fine. Pivot to appenv to see exactly what env your app process sees, and check AZURE_CLIENT_ID : $ appenv | grep AZURE_CLIENT_ID AZURE_CLIENT_ID=00000000-0000-0000-0000-000000000000 There's the bug. The aliases didn't fail — they told you the fault isn't in the platform. That's diagnosis by elimination, and it's faster than guessing. Fault 4: dep-import-error — your code throws Same pattern. The app raises an ImportError on /chat , the AI aliases are green: curl -X POST $URL/admin/fault -d '{"mode":"dep-import-error"}' curl $URL/chat -d '{"prompt":"hi"}' # HTTP 500 # {"detail":"ImportError: No module named 'tiktoken'..."} This is where the app-side aliases earn their keep: $ showpkgs | head -20 ────────────────────────────────────────────────────── Virtual environment packages (antenv) ────────────────────────────────────────────────────── Package Version -------------------------------------- --------- annotated-types 0.7.0 anyio 4.13.0 azure-core 1.41.0 azure-identity 1.19.0 azure-monitor-opentelemetry 1.8.8 ... No tiktoken in that list. Confirmation in one command — no need to remember pip list or where the virtualenv lives. deploylogs then tells you what the last deployment actually built: $ deploylogs 10 Latest deployment: b8a64ed4-b6b7-4419-91eb-6d8e4e7ef323 Log file: /home/site/deployments/b8a64ed4-b6b7-4419-91eb-6d8e4e7ef323/log.log 2026-05-18T19:10:52.3844297Z,Parsing the build logs,abc3cf97-... 2026-05-18T19:10:52.5414396Z,Found 0 issue(s),7d11d013-... 2026-05-18T19:10:52.7913394Z,Build Summary :,... 2026-05-18T19:10:53.5643089Z,Deployment successful. deployer = Push-Deployer ... Build was clean. The package just isn't in requirements.txt . Two aliases, one minute, root cause. Fault 5: port-mismatch — uvicorn binds the wrong port A real-world bug: someone sets WEBSITES_PORT=9999 in app settings to expose a different port, but the app still binds to 8000. curl -X POST $URL/admin/fault -d '{"mode":"port-mismatch"}' The aliases tell you exactly which port everything sees: $ checkport Checking if app is listening on port 8000... ✓ App is listening on port 8000 $ appcurl /health Testing app at localhost:8000 ... HTTP Status: 200 Time: 0.002417s $ appconfig PORT Value: 8000 Note: The port your Python app should listen on. Default is 8000. The app is healthy from inside the container. The mismatch is between what the platform tries to forward to and what uvicorn is bound to. This is the kind of fault where curling the public URL fails but appcurl /health succeeds — and the contrast is itself the diagnosis. Fault 6: latency-spike — the alias bench is fast, your app is slow The app injects 4 seconds of asyncio.sleep before each Foundry call. /chat is now ~4.5 seconds. ai-latency : $ ai-latency Running 5 requests to gpt-4o-mini... Request 1: 715ms ✓ Request 2: 588ms ✓ Request 3: 578ms ✓ Request 4: 669ms ✓ Request 5: 643ms ✓ Results (5/5 successful): Avg: 638ms | Min: 578ms | Max: 715ms Foundry, from this instance, averages 638ms. If your app is taking 5 seconds end-to-end and ai-latency says the model is sub-second, the slowness is in your code — not in Foundry, not in the network. Time to look at App Insights end-to-end transactions, or at any pre-call work (retrieval, vector lookup, your own sleep). What this changes about the debugging workflow Before these aliases, the SSH playbook for a Python AI app went something like: open SSH, dig around /home/site/wwwroot/antenv , grep applicationHost.config for ports, write a curl by hand against the AOAI endpoint with a manually-fetched managed identity token, hope you got the API version right. Now it's ai-diagnose . If that's red, you know exactly which layer. If it's green, you know the fault is in your code or your settings, and appenv , appconfig , showpkgs , applogs walk you the rest of the way. Three patterns we'd lean on going forward: Start with apphelp and ai-diagnose every time. Don't try to remember the right command — let the aliases tell you. Treat ai-test being green as a signal, not a finish line. If /chat is red and ai-test is green, the platform path is fine; pivot to app-side aliases. Use source /home/site/diagnostics/fault.env as a pattern. Any time you want your SSH shell to see what the app process sees, write env to a file and source it. It's a small thing that removes a huge class of "but it worked when I tested it" confusions. We want feedback The aliases are GA today on Python images and we have ideas for where they go next — Node, .NET, more ai-* checks (Foundry agents, vector indexes), tighter integration with azd diagnose . If you have a Python app on App Service and you want a specific alias added, tell us by dropping a comment on this post. Try the sample git clone https://github.com/seligj95/app-service-ssh-diagnostics-python cd app-service-ssh-diagnostics-python azd auth login azd up Four minutes later you'll have the whole thing live. Then curl -X POST $URL/admin/fault -d '{"mode":"<pick one>"}' , SSH in, and walk through any of the six faults above. The README has the full alias-to-fault map.112Views0likes0CommentsTurn Your App Service Web App Into a Self-Healing Agent: LLMOps Best Practices for Production
A user submits a prompt. The agent burns through 50,000 tokens looping on a malformed tool response. Another user trips a model rate limit and the agent silently fails. A bad prompt update goes out at 4 PM Friday and degrades success rate to 60%. Your APM dashboard shows green the entire time because none of that is a 500. This post walks through the LLMOps stack we built into a working reference sample on Azure App Service: the SLIs that matter for agents, a budget circuit breaker, prompt-repair retries, and a fully automated slot-swap rollback when things go sideways. Every code snippet is from the deployable sample at the end of the post. 📦 Sample repo: seligj95/app-service-self-healing-agent-python — azd up and you've got the whole stack live in your subscription in under 10 minutes. Why agent ops ≠ web-app SRE Your web app's reliability model assumes a request maps to bounded work — a SQL query, a cache hit, a templated response. You alert on Http5xx, p95 latency, and dependency failures. Done. An agent breaks that model in four ways: Cost is unbounded per request. An agent that loops on a flaky tool can spend $5 on one user prompt. The HTTP response is still 200. Failure can be silent. A model can hallucinate confident JSON, a tool can return malformed args, and the agent dutifully returns a wrong answer to the user. Zero exceptions logged. Latency is non-deterministic. A "simple" prompt that normally finishes in 2 seconds can blow out to 30s when the model picks an expensive plan. p95 latency tells you nothing. Quality regresses on prompt changes, not code changes. A prompt tweak that ships in seconds can crater tool-call accuracy by 30%. Your CI/CD pipeline didn't catch it because there were no failing tests. Web-app SLOs (uptime, latency, error rate) are necessary but not sufficient. Agents need agent-shaped SLOs. Define your agent SLOs first Before instrumenting anything, write down what "healthy" means. Here are the four SLIs we chose for the sample. None of them are Http5xx. SLI What it measures Why it matters Task success rate % of /chat requests that the agent self-classifies as completed Catches silent failures the HTTP layer misses Cost per task $ spent (input + output tokens × model rate) per /chat The unbounded-loop problem in one number Tool success rate % of tool invocations that didn't raise Tool layer is where most agent failures live Repair retries Times we re-prompted the model after a schema-validation failure Leading indicator of prompt drift In our reference middleware these come out as agent.task.success , agent.cost.usd , agent.tool.success , and agent.repair.retry — eleven custom metrics in total. We emit them via OpenTelemetry so they land in App Insights customMetrics and the included KQL workbook visualizes them as SLO tiles. Observability stack on App Service App Service makes the observability story unusually easy because you get App Insights wired up automatically by azd — no agent install, no DaemonSet, no sidecar. The only thing you bring is the SDK init for your custom metrics: # llmops_middleware/sli.py from azure.monitor.opentelemetry import configure_azure_monitor from opentelemetry import metrics def configure_azure_monitor_if_available() -> bool: if not os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"): return False configure_azure_monitor() return True meter = metrics.get_meter("agent") tokens_in = meter.create_counter("agent.tokens.in") cost_usd = meter.create_counter("agent.cost.usd") task_latency = meter.create_histogram("agent.task.latency") tool_success = meter.create_counter("agent.tool.success") # ... We compute cost from a per-model rate card so the metric is in real dollars, not abstract tokens: COST_PER_1K_TOKENS = { "gpt-4o": {"in": 0.0025, "out": 0.01}, "gpt-4o-mini": {"in": 0.00015, "out": 0.0006}, } def record_cost(model: str, tokens_in_count: int, tokens_out_count: int, tenant: str) -> float: rate = COST_PER_1K_TOKENS[model] cost = (tokens_in_count * rate["in"] + tokens_out_count * rate["out"]) / 1000 cost_usd.add(cost, {"model": model, "tenant": tenant}) return cost Once those flow, the KQL queries write themselves: // Top cost-burning tenants in the last hour customMetrics | where timestamp > ago(1h) | where name == "agent.cost.usd" | extend tenant = tostring(customDimensions["tenant"]) | summarize spend_usd = sum(valueSum) by tenant | top 10 by spend_usd desc The sample ships a 6-tile workbook ( observability/workbook.json ) deployed via Bicep. It renders SLO compliance, cost burn-down, tool failure breakdown, latency percentiles, budget breaches, and healing signals out of the box. The deployed workbook in App Insights. The SLO panel dips during a chaos run and recovers as the agent self-heals — exactly the signal you want on a glass-pane dashboard. Cost guardrails with a budget circuit breaker Custom metrics tell you about cost after you spent it. To prevent runaways, you need a circuit breaker that bites before the model call happens. The middleware in llmops_middleware/budget.py keeps a per-tenant counter in memory (per month) and returns a decision: class BudgetDecision(Enum): ALLOW = "allow" # under budget DOWNSHIFT = "downshift" # ≥80% — switch to cheaper model BLOCK = "block" # ≥100% — refuse the request def evaluate(tenant: str) -> BudgetDecision: spent = _spend.get((tenant, _current_period()), 0.0) if spent >= BUDGET_USD_PER_TENANT: return BudgetDecision.BLOCK if spent >= BUDGET_USD_PER_TENANT * 0.80: return BudgetDecision.DOWNSHIFT return BudgetDecision.ALLOW The agent loop reads that decision and downshifts from gpt-4o to gpt-4o-mini — a 16× cost reduction ($0.0025 / 1K input tokens vs $0.00015) — when a tenant crosses 80% of their monthly budget. The user keeps getting answers; the bill stops climbing. def _pick_model(tenant: str) -> str: decision = budget.evaluate(tenant) if decision == BudgetDecision.DOWNSHIFT: sli.model_downshift.add(1, {"tenant": tenant}) return DOWNSHIFT_MODEL return PRIMARY_MODEL For the demo we keep state in memory; production should swap the dict for Redis (atomic INCRBY ) or Cosmos with optimistic concurrency. The interface in budget.py is intentionally tiny so this is a 10-line change. Self-healing patterns There are three patterns in the sample, each addressing a different failure class. 1. Retry with prompt-repair The most common agent failure isn't a tool exception — it's the model returning malformed JSON that fails schema validation on tool args. The fix is to feed the validation error back into the model and ask it to repair the call: # llmops_middleware/repair.py async def retry_with_repair(call_fn, args, *, max_attempts=2): for attempt in range(max_attempts): try: return await call_fn(args) except (ValidationError, RepairableError) as exc: sli.repair_retry.add(1, {"attempt": str(attempt)}) args = await _ask_model_to_repair(args, str(exc)) raise This single pattern recovers 50–70% of "the agent returned garbage" cases without escalating. 2. Tool fallback chains When a primary tool times out or fails open, try a cheaper or simpler one: async def tool_fallback_chain(primary, *fallbacks, args): for fn in (primary, *fallbacks): try: return await fn(args) except ToolUnavailable: sli.tool_success.add(1, {"tool": fn.__name__, "status": "fallback"}) raise NoToolAvailable() Lookup-style tools especially benefit: web search → cached snapshot → static knowledge base. 3. Slot-swap auto-rollback Here's the killer feature App Service brings that's a slog on K8s: deployment slots. You always have a known-good previous version warmed up and one ARM API call away from production traffic. We wire that up to fire automatically when our SLI breaches. The chain is: Metric alert on Http5xx > 5 in 5 minutes (the platform metric, free) Action Group that POSTs to a Logic App webhook (SAS-signed callback URL) Logic App that calls POST /sites/{name}/slots/staging/slotsswap via its managed identity (granted Website Contributor on the target web app) The whole healer is one trigger + two actions: receive the alert webhook, call ARM slotsswap, return a status payload to the caller. The two actions in Bicep: SwapSlots: { type: 'Http' inputs: { method: 'POST' uri: '${environment().resourceManager}@{parameters(\'targetSiteId\')}/slots/staging/slotsswap?api-version=2024-04-01' body: { targetSlot: 'production' } authentication: { type: 'ManagedServiceIdentity' audience: environment().resourceManager } } } No code to deploy, no secrets to manage, no second runtime to babysit. From alert-fire to swapped-slot is about 4 minutes in our tests — under the SLA most agent products have for "user-visible degraded mode." Why not a Function App? We started there. The Logic App is 60 lines of Bicep and zero application code. For a one-action workflow like "swap a slot," the Function adds packaging, deployment, and a runtime to monitor for no benefit. Chaos testing for agents You can't trust a self-healing system you haven't broken. The sample ships a chaos CLI and an in-process injection point so you can practice failures on demand. In-process: llmops_middleware/chaos.py exposes four modes ( off , throttle , malformed , outage ) togglable via POST /admin/chaos . When set, tool calls roll a die and raise the matching exception with the configured probability: class ChaosController: def maybe_inject(self) -> None: if random.random() > self.probability: return if self.mode == "outage": raise ChaosOutage("simulated tool outage") if self.mode == "throttle": raise ChaosThrottled("simulated 429") if self.mode == "malformed": raise ChaosMalformed("simulated bad tool output") External: chaos/inject.py is a small async load driver that sets /admin/chaos then drives /chat at a target RPS, tallying response codes: python chaos/inject.py \ --base-url https://my-agent.azurewebsites.net \ --mode outage --probability 1.0 --rps 10 --duration 300 Running that for 5 minutes against the deployed sample reliably: Drives customMetrics(name="agent.task.failure") over 50/min Trips the Http5xx > 5 metric alert (~90 seconds after threshold breach) Fires the Logic App run (succeeded in 1.2 seconds in our test) Flips the slot — /health instance ID changes The repo's observability/queries.kql has the canonical KQL for each of these signals, and observability/workbook.json is the deployable workbook that visualizes them. The reference middleware Everything in this post is in seligj95/app-service-self-healing-agent-python. The Python package llmops_middleware/ is the part you'd vendor into your own agent — sli.py , budget.py , repair.py , chaos.py . The agent loop and the Bicep are demo-quality but production-shaped. Run it yourself: git clone https://github.com/seligj95/app-service-self-healing-agent-python cd app-service-self-healing-agent-python azd auth login azd up You'll have an agent + AOAI + workbook + healer running in about 8 minutes. Then run the chaos script and watch the slot flip. The KQL workbook Deployable workbook JSON, dropped into the resource group by Bicep. Six panels: SLO tile — % of tasks where agent.task.success was emitted (grouped by tenant) Cost burn-down — running spend per tenant against the monthly budget Top failing tools — failure count by tool, broken down by error class Latency p50/p95/p99 — agent.task.latency histogram Budget breaches — count and tenant list Healing signals — agent.repair.retry + agent.model.downshift + agent.chaos.injected over time It's observability/workbook.json — loadTextContent -ed into infra/shared/monitoring.bicep so you get it deployed automatically. Why App Service for LLMOps After building this, the appeal of App Service for agents is clearer than I expected going in: Slots are an unfair advantage. A pre-warmed previous version, one ARM call from production. K8s blue/green needs you to build it. Managed identity to Azure OpenAI removes the entire key-rotation problem. The sample sets disableLocalAuth: true on the AOAI account — there literally is no key. App Insights is auto-wired so your custom metrics land in customMetrics and your KQL queries work day one. Bicep + azd lets you ship a full LLMOps stack in one repo: app, infra, healing, observability, chaos. If you're standing up a new agent and you don't already have a Kubernetes platform you love, App Service is a strong default. Wrap-up If you take three things from this post: Define agent SLOs in your own terms — task success, cost per task, tool reliability — not just web-app SLOs. Put a circuit breaker between the user and the model. A budget breaker that downshifts to a cheaper model is the highest-ROI middleware you can ship. Make rollback boring. Slot swap + a one-action Logic App + a metric alert is a self-healing system you can build in an afternoon and trust at 3 AM. The sample has all of it wired up. We're considering baking these into App Service — tell us what you'd want The middleware in this sample (SLIs + telemetry, cost guardrails, policy/audit hooks) is exactly the kind of thing we're evaluating as first-class App Service platform features — opt-in sidecars or built-in capabilities so you don't have to vendor a middleware package into every agent you ship. Concretely, we're tracking ideas like: Agent Observatory — a sidecar that intercepts SDK calls (Semantic Kernel, LangChain, Crew AI, AutoGen) and captures full reasoning traces with zero code changes AI Cost Guardian — platform-level quotas and spend caps across Azure OpenAI, Anthropic, and other model providers, with real-time enforcement Policy Guard — governance, PII masking, model-approval lists, and an immutable audit log for regulated workloads If any of those would land for your team — or if you're solving these problems differently and want to push back on the shape — we want to hear it. Drop a comment on this post: the roadmap is genuinely shaped by feedback at this stage.150Views0likes0CommentsNew SSH helper aliases for Python apps on Azure App Service for Linux
Troubleshooting a running application often starts with SSH. To make that experience simpler for Python apps on Azure App Service for Linux, we have added new SSH helper aliases for common app, log, networking, and Azure AI Foundry diagnostics. When you SSH into your application, you will now see two helper commands: View available SSH helpers with apphelp Run apphelp to see the full list of available aliases. These helpers are grouped by common tasks, including app information, logs, diagnostics, testing, and Azure AI Foundry connectivity. For example: applogs Tails your application logs directly from the SSH session. appcurl Tests your application locally using localhost:$PORT, which is useful when checking whether the app is listening correctly inside the container. Other useful helpers include: showpkgs # List installed Python packages appconfig # Show common App Service settings deploylogs # Show recent deployment logs checkport # Verify the app is listening on the configured port gohome # Go to /home/site/wwwroot gosrc # Go to the app source directory Azure AI Foundry diagnostics from SSH We have also added helpers for Azure AI Foundry scenarios. These are useful when your app calls Azure AI services and you need to quickly validate identity, DNS, connectivity, or response latency from inside the App Service environment. For a quick end-to-end connectivity test, run: ai-test Example output: ✓ Connected | 3706ms | Model: gpt-4.1-mini | Auth: Managed Identity For a broader diagnostic check, run: ai-diagnose Example output: AI Foundry Diagnostics [✓] Managed identity token [✓] DNS resolution [✓] Foundry connectivity Additional AI helpers include: ai-dns # Check DNS resolution for the Foundry endpoint ai-access-check # Check RBAC for Foundry calls ai-curl # Verbose HTTP debug for Foundry ai-latency # Benchmark Foundry response times Install networking tools with install-nettools For deeper connectivity troubleshooting, you can run: install-nettools This installs commonly used networking utilities that can help diagnose DNS resolution, TCP connectivity, routing, packet capture, listening ports, and HTTP endpoint access. Why this helps These aliases are intended to reduce the number of manual steps needed during troubleshooting. Instead of remembering log paths, port checks, curl commands, or AI connectivity validation steps, you can run a focused helper command directly from the SSH session. If there are other common SSH commands or troubleshooting workflows you would like us to add as aliases, please share your feedback with us.143Views1like0CommentsSimplifying FastAPI Deployments on Azure App Service for Linux
Deploying FastAPI apps to Azure App Service for Linux is now simpler. Previously, when you deployed a FastAPI application, you needed to configure a custom startup command so the app could run correctly with an ASGI server. This added an extra step to the deployment flow and could create friction, especially when getting started with FastAPI on App Service. We have now added built-in detection logic that identifies FastAPI applications automatically and configures the appropriate startup behavior for you. What changed When you deploy a Python app, App Service now scans common entry point files such as: main.py, app.py, application.py, server.py, asgi.py, api.py, index.py, and run.py If one of these files imports FastAPI using from fastapi or import fastapi, App Service detects the app as a FastAPI application. To avoid false positives, files that also import Flask are skipped during FastAPI detection. How your app is started When a FastAPI app is detected, App Service automatically starts the application using Gunicorn with the Uvicorn worker class: gunicorn -k uvicorn_worker.UvicornWorker This provides the ASGI support required by FastAPI applications. Framework detection priority If multiple framework indicators are present, App Service uses the following priority order: Django > FastAPI > Flask Django continues to take precedence when a wsgi.py file is found in a subdirectory. What this means for you When you deploy a FastAPI app to Azure App Service for Linux, you no longer need to configure a custom startup command in the supported runtime flow. Oryx detects the FastAPI app and configures the startup behavior automatically. This removes an extra setup step and makes it easier to get your FastAPI app running on App Service. Availability This improvement is currently enabled for Python 3.14 and later. Support for additional Python versions will be enabled in an upcoming rollout. For more details on deploying Python apps to App Service, see the Azure App Service Python quickstart documentation.166Views0likes0Comments