best practices
252 TopicsHow ACR Runs Multi-Tenancy at Scale: Stamp Rebalancing and Why You Never See It Happen
By Johnson Shi, Richard Yuan, Yi Zha, Susan Shi, Jeanine Burke, Bin Du, Clark Porter, Bernie Harris, Eric Du Introduction Two of the most common questions we hear from teams running container workloads at scale on Azure Container Registry (ACR) are: "How does ACR keep my registry's performance predictable when I'm sharing infrastructure with thousands of other tenants?" — Cloud services are inherently multi-tenant. What does ACR actually do to keep my workload from competing with my neighbors? "What happens when one tenant's workload grows large enough to affect the shared infrastructure?" — Is there an active intervention, or does the system just absorb the noise? In this post, we clarify how ACR runs its multi-tenant fleet: the stamp architecture that underpins ACR's infrastructure in every Azure region, the practice of proactively rebalancing registries between stamps when one stamp gets hot, and the additional stamp isolation options available for exceptional workloads. Running multi-tenancy well at scale isn't passive — it's an active operational practice, and customers benefit from it every day without seeing it happen. Key Takeaways An ACR registry can be geo-replicated: a registry can have geo-replicas (which are both read and write-enabled) in multiple Azure regions. Each geo-replica is served by an ACR stamp — independent deployment units that underpin ACR regional infrastructure, each made up of VMSS-backed compute pools and a pool of storage accounts, that together serve many registries belonging to many tenants. Stamps are simultaneously a capacity pool, a fault domain, and an update domain. When a stamp gets hot, ACR proactively rebalances by moving registries to a less-utilized stamp in the same region. The registry endpoint does not change; the move is transparent to the customer. For exceptional workloads where rebalancing alone would just transfer the problem, ACR can provide additional stamp isolation — placing registries on stamps with fewer co-tenants, providing better traffic isolation, fault domain separation, and update domain independence. This also structurally improves the stamps the tenant used to share with everyone else. ACR engineering uses a mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals (operational telemetry) to decide when to rebalance stamps. Hot-node P95 CPU, discussed in this post, is one of the proactive signals we use — for each 1-minute bin, take the hottest node's average CPU, then percentile across bins. Pool-average hides per-node hot-spotting; single-sample Max is too noisy. All of this is currently manual. Rebalancing decisions, migrations, and isolation provisioning are operator-driven today. We are actively investing in standardizing and automating the practice — automated stamp rebalancing and lifecycle management are on the roadmap. Background What is a stamp? A stamp is ACR's unit of deployment within a region. At a high level, ACR has the following components within a region to serve registry data plane operations: VMSS-backed compute pools. Virtual Machine Scale Sets are Azure's primitive for running a managed group of identical VMs that autoscale together. Each stamp has a pool of VMs that handle authentication, manifest operations, tag resolution, and registry-side metadata — the coordination layer of a container pull — plus a separate pool of VMs running the dataproxy component, which sits between clients and storage. For private endpoint pulls, when a client pulls a layer, the dataproxy fetches from storage (or its local cache) and streams the bytes back; it is effectively a private endpoint and streaming cache layered together. A pool of storage accounts. Each ACR region has its own set of Azure Storage accounts that hold the actual blob (layer) data and manifest content for the geo-replicas on residing them. Storage accounts are multi-tenant within a stamp and region — multiple registries' blobs may land in the same group of accounts, with strict multi-tenant isolation controls and authorization enforcement. Each ACR region typically contains multiple stamps serving many tenants' registries. For geo-replicated registries, a geo-replica in a region is bound to exactly one underlying ACR stamp. A geo-replicated registry's global endpoint (<registry>.azurecr.io), geo-replica regional endpoints, and geo-replica dedicated data endpoints are resolved via DNS — backed by ACR's own Traffic Manager profile — to a specific stamp serving that region's geo-replica. The key conceptual point: a stamp is simultaneously a capacity pool (autoscale operates on it), a fault domain (incidents on the stamp affect all its tenants), and an update domain (rollouts progress through update domains within the stamp). When we move a registry between stamps in the same region, we are moving it between all three at once — and the customer's endpoint URLs do not change. From the customer's perspective, the migration is fully seamless: there are no endpoint changes, no DNS updates to make, and no action required on their part. The registry continues to work exactly as before, and the customer does not need to know or care that the underlying stamp has changed. Why multi-tenancy at scale is an active practice The naive picture is: provision enough capacity, autoscale handles the rest. This works in steady state. It does not work when one tenant's workload grows enough to systematically influence stamp behavior, when traffic shape is bursty enough that averages understate peaks, or when a single large tenant's blast radius becomes uncomfortably concentrated on a shared stamp. None of these is something a passive autoscaler will fix. They require an operator decision: this registry would be better served on that stamp. ACR engineering does this continuously — from routine rebalancing to providing additional isolation for exceptional workloads. How We Do It: Stamp Rebalancing Stamp rebalancing — a recurring practice Several signals can trigger a stamp rebalancing decision — reactive signals such as sustained errors, outages, throttling that customers observe or that we observe in our own telemetry, low throughput on a stamp, or proactive signals like hot-node P95 CPU (described in this post below) breaching a threshold. The most recent rebalancing work used hot-node P95 as the proactive trigger; other rebalancing decisions have been driven by the reactive signals just listed. When any of these fires, ACR engineering identifies the registries contributing most to the problem and picks one or more to move to a less-utilized stamp in the same region. The mechanism is straightforward: we initiate elevated operator actions, the control plane re-binds the registry's home_stamp field, DNS routing follows, in-flight requests on the source stamp drain in 30–60 seconds, and new traffic lands on the destination stamp. The cutover takes minutes. The customer's registry endpoint does not change. Most customers never know it happened; the ones whose registry moved typically see better latency afterward. Rebalancing to an existing cooler stamp is a recurring practice that resolves most multi-tenant pressure. For exceptional workloads where rebalancing to another shared stamp would just transfer the problem, ACR may provide additional stamp isolation — placing registries on stamps with fewer co-tenants, giving the tenant better traffic isolation, fault domain separation, and update domain independence while also structurally improving the stamps that tenant used to share with everyone else. Rebalancing at different scales ACR applies rebalancing across a spectrum of scenarios, from moving a handful of registries to a cooler stamp to providing additional stamp isolation for exceptional workloads. The decision criterion is workload size relative to the shared fleet — if moving a tenant to a different shared stamp would just transfer the hot-stamp problem to the destination, additional stamp isolation is the right answer. For everyone else, rebalancing to an existing stamp is sufficient. Both are manual today; both stamp provisioning and rebalancing mechanisms described are on ACR's roadmap to be automated with less operator involvement. Hot-node P95: one of the signals we use proactively Rebalancing decisions are driven by a mix of reactive and proactive signals. Reactive signals — outages, sustained error rates, frequent throttling, low throughput that customers report or that we see in our own telemetry — are the obvious triggers. But waiting for these means waiting for a customer-visible problem. Proactive signals let us intervene before that happens. Hot-node P95 CPU, showcased in this post, is one of the proactive signals we use, and it was the primary signal for the most recent rebalancing work described in the example below. The choice of CPU metric matters. Three candidates: Pool-average CPU. Averages every node in the pool. Hides per-node hot-spotting — a pool with 6% average CPU can still have one node at 99%. Single-sample Max CPU. The highest 1-minute sample. Captures spikes, but is dominated by single-bin noise that doesn't represent sustained load. Hot-node P95 CPU. For each 1-minute bin, take the hottest node's average CPU. Then percentile across bins over a representative 12-hour peak window. This is "how hot is the worst node, most of the time." Hot-node P95 captures sustained per-node load without being noisy, and it tracks customer-visible behavior more closely than either alternative. A concrete illustration from a recent regional resize: on one shared stamp's dataproxy pool, Max CPU touched 96% — alarming if read alone. But hot-node P95 was 43%, meaning most of the time even the hottest node was comfortably loaded; the 96% was a single 1-minute spike. Using Max as the operating signal would have triggered an unnecessary intervention. Using pool-average would have missed real hot-spotting elsewhere. Hot-node P95 is the right operating point for this particular signal — and it is one input among several that feed the broader rebalancing decision. A Recent Example: Rebalancing Large AI Workloads for Additional Isolation We recently completed the rebalancing of registries belonging to one of the largest AI workloads in the region, providing additional isolation to address the scale of their traffic. The customer's workload had grown to the point where its presence on the shared stamps was systematically influencing stamp behavior — variability that affected their own pull latency, and variability that affected every other tenant on the same shared stamps. The customer had 40 registries homed across two shared stamps in the region, with a severely long-tailed traffic distribution: the top four registries carried 96.7% of the customer's traffic. When that much load is concentrated in four registries, the migration cannot proceed as one batch. We moved them in phases, smallest to largest, with observation windows between phases: Idle and small-traffic tail first — about thirty low-traffic registries, used to validate the cutover tooling against the destination stamp. Medium-traffic registries next — in sub-batches with 24 hours of observation between them. The top four, one at a time — each individually with 48 hours of observation between cutovers. Order: smallest to largest, so each cutover was a sanity check at increasing load. The cumulative effect on the shared stamps the customer had previously occupied: Shared stamp + pool Hot-Node P95 CPU change Max CPU change Stamp A — registry pool -7% flat Stamp A — dataproxy pool -34% 96% → 64% Stamp B — registry pool -33% -3 percentage points Stamp B — dataproxy pool -44% -5 percentage points Stamp A dataproxy is the headline. The hottest node went from briefly touching 96% to maxing out at 64%, with sustained hot-node P95 dropping from 43% to 28.5%. Every other tenant homed on Stamp A — most with no idea this rebalancing happened — now runs on a structurally healthier pool, with more headroom, lower tail latency under load, and lower risk of CPU-driven incidents during traffic spikes. Stamp B saw similar relief. After the rebalancing, we right-sized the shared stamps downward — lowering the VMSS minimum instance count on each to match the new traffic level. Hot-node P95 was the primary signal driving this resize work, the same proactive signal that motivated the rebalancing in the first place: when hot traffic leaves a shared stamp, capacity right-sizing follows. Findings ACR runs this recurring stamp rebalancing practice for one reason: to give customers more guaranteed performance — higher and more predictable pull throughput, lower tail latency, better fault and update isolation — whether through routine rebalancing or additional isolation for exceptional workloads. Every tenant on the rebalanced stamps gets more headroom, more predictable behavior under load, and a smaller blast radius for any single incident or rollout. Three things happen continuously in any ACR region to make this real: registries get rebalanced between stamps as load patterns shift, exceptional workloads get additional stamp isolation when no shared stamp can absorb them sustainably, and stamps get continuously right-sized when load enters or leaves. All three are operator-driven today, all three are being invested in for automation, and all three are guided by a combination of reactive signals (outages, errors, throttling) and proactive signals (hot-node P95 CPU is one of them). The thesis is straightforward: cloud multi-tenancy at scale is not a passive property of the architecture. It is an active operational practice that exists to give customers guaranteed performance and predictable behavior. The customers who benefit most from it are usually the customers who never notice it's happening. Summary Question Answer How does ACR keep multi-tenant performance predictable at scale? By actively moving registries between stamps as load shifts — rebalancing in the common case, providing additional isolation for exceptional workloads. What is a stamp? An ACR deployment unit within a region's geo-replica: VMSS-backed registry and dataproxy compute pools plus a pool of storage accounts. Simultaneously a capacity pool, fault domain, and update domain. A region typically contains multiple stamps. Do customers see when their registry moves between stamps? No. Stamps are within a region; the global endpoint and any regional endpoint URLs do not change. The cutover takes minutes; in-flight requests drain in 30–60 seconds. Does providing additional isolation only help the isolated tenant? No — every other tenant who was sharing a stamp with that workload also benefits, because the largest source of variability has been removed from the shared fleet. What signals drive these decisions? A mix of reactive signals (outages, sustained errors, throttling, low throughput) and proactive signals from our own telemetry. Hot-node P95 CPU — the 95th percentile, across a 12-hour peak window, of the hottest node's CPU in each 1-minute bin — is one of the proactive signals, and it was the primary signal for the most recent rebalancing work. Is all of this automated? Not yet. Rebalancing, isolation provisioning, and migrations are operator-driven today. Standardizing and automating these practices is an active investment.113Views0likes0CommentsThe Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide: Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation. Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors. Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them. Memory as a filesystem, not a vector store Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In SRE Context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface. The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carry just enough context to orient the agent, with links to deeper files when needed. The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically. One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it. The sandbox as epistemic boundary The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed. No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free. Bet 2: Context Layering Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action. Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration. Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions. Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed. Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope. Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start. Bet 3: Frugal Context Management Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%. Over the last month, we have successfully used our agent across a wide range of scenarios: Analyzed our user churn rate and built dashboards we now review weekly. Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase. Ran security analysis and found vulnerabilities in the read path. Helped fill out parts of its own Responsible AI review, with strict human review. Handles customer-reported issues and LiveSite alerts end to end. Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users. What We Learned We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space. Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.13KViews6likes0CommentsBuilding AI Agents with Microsoft Foundry: A Progressive Lab from Hello World to Self-Hosted
AI agent development has a steep on-ramp. The combination of new SDKs, tool-calling patterns, model selection decisions, retrieval-augmented generation, and deployment concerns means most developers spend more time wiring things together than actually building anything useful. The Microsoft Foundry Agent Lab is a structured, open-source demo series designed to change that — nine self-contained demos, each adding exactly one new concept, all built on the same Microsoft Foundry SDK and a single model deployment. This post walks through what the lab contains, how each demo works under the hood, and the architectural decisions that make it a useful reference for AI engineers building production agents. Why a Progressive Lab? Agent frameworks can be overwhelming. A developer who opens a rich example with RAG, tool-calling, streaming, and a custom UI all at once has no clear line of sight to which parts are essential and which are embellishments. The Foundry Agent Lab takes the opposite approach: start with the absolute minimum and introduce one new primitive per demo. By the time you reach Demo 8, you have seen every major capability — not in one monolithic sample, but in a layered sequence where each addition is visible and understandable. # Demo New Concept Tool Used UX 0 hello-demo Agent creation, Responses API, conversations None Terminal 1 tools-demo Function calling, tool-calling loop, live API FunctionTool Terminal 2 desktop-demo UI decoupling — same agent, different surface None Desktop (Tkinter) 3 websearch-demo Server-side built-in tools, no client loop WebSearchTool Terminal 4 code-demo Code execution in sandbox, Gradio web UI CodeInterpreterTool Web (Gradio) 5 rag-demo Document upload, vector stores, RAG grounding FileSearchTool Terminal 6 mcp-demo MCP servers, human-in-the-loop approval MCPTool Terminal 7 toolbox-demo Centralized tool governance, Toolbox versioning Toolbox Terminal 8 hosted-demo Self-hosted agent with Responses protocol Custom server Terminal + Agent Inspector The Model Router: One Deployment to Rule Them All Before diving into the demos, it is worth understanding the one architectural decision that ties the entire lab together: every agent uses model-router as its model deployment. MODEL_DEPLOYMENT=model-router Model Router is a Microsoft Foundry capability that inspects each request at inference time and routes it to the optimal available model — weighing task complexity, cost, and latency. A simple factual question goes to a fast, cheap model. A complex tool-calling chain with code generation gets routed to a frontier model. You write zero routing logic. The lab's MODEL-ROUTER.md file contains empirical observations from running all nine demos. A sample of what the router selected: Demo Query Task Type Model Selected hello "What's the capital of WA state?" Factual recall grok-4-1-fast-reasoning hello "Summarize our conversation" Summarization gpt-5.2-chat-2025-12-11 tools "What's the weather in Seattle?" Tool-using gpt-5.4-mini-2026-03-17 code Data analysis with code generation Code generation + execution gpt-5.4-2026-03-05 rag HR policy document question Retrieval + synthesis gpt-5.3-chat-2026-03-03 This is the strongest signal in the lab: you do not need to reason about model selection. You declare what your agent needs to do; the router handles the rest, and it chooses correctly. Demo 0: The Minimum Viable Agent The hello-demo establishes the baseline pattern used by every subsequent demo. Two files: one to register the agent, one to chat with it. Registering the agent from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient from azure.ai.projects.models import PromptAgentDefinition credential = DefaultAzureCredential() project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, instructions="You are a helpful, friendly assistant.", ), ) Authentication uses DefaultAzureCredential , which works with az login locally and with managed identity in production — no API keys anywhere in the code. Chatting with the agent # Create a server-side conversation (persists history across turns) conversation = openai.conversations.create() # Each turn sends the user message; the agent sees full history response = openai.responses.create( input=user_input, conversation=conversation.id, extra_body={"agent_reference": {"name": AGENT_NAME, "type": "agent_reference"}}, ) print(response.output_text) The conversation object is server-side. You pass its ID on every turn; the history lives in Foundry, not in a local list. This is the Responses API pattern — distinct from the older Completions or Chat Completions APIs. Demo 1: Function Tools and the Tool-Calling Loop Demo 1 adds function calling against a real weather API. The key insight here is that the model does not execute the function — it requests the execution, and your code executes it locally, then feeds the result back. Declaring a function tool from azure.ai.projects.models import FunctionTool, PromptAgentDefinition func_tool = FunctionTool( name="get_weather", description="Get the current weather for a given city.", parameters={ "type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"], }, strict=True, ) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[func_tool], instructions="You are a weather assistant...", ), ) The tool-calling loop response = openai.responses.create(input=user_input, conversation=conversation.id, ...) # Loop while the model is requesting tool calls while any(item.type == "function_call" for item in response.output): input_list = [] for item in response.output: if item.type == "function_call": args = json.loads(item.arguments) result = get_weather(args["city"]) # execute locally input_list.append(FunctionCallOutput(call_id=item.call_id, output=result)) # Send results back to the agent response = openai.responses.create(input=input_list, conversation=conversation.id, ...) print(response.output_text) The strict=True parameter on FunctionTool enforces structured outputs — the model must return arguments that match the declared JSON schema exactly. This eliminates argument parsing errors in production. Demo 2: UI Is Not Your Agent Demo 2 runs the exact same agent as Demo 1 but surfaces it in a Tkinter desktop window. The point is pedagogical: your agent definition, conversation management, and tool-calling logic are entirely independent of your UI layer. Swapping from terminal to desktop requires changing only the presentation code — nothing in the agent or conversation path changes. This is a principle worth internalising early: agent logic and UI logic should never be entangled. The lab enforces this separation structurally. Demo 3: Server-Side Built-In Tools The web search demo introduces a sharp contrast with Demo 1. With WebSearchTool , the tool-calling loop disappears entirely from client code: from azure.ai.projects.models import WebSearchTool agent = project.agents.create_version( agent_name="Search-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[WebSearchTool()], instructions="You are a research assistant...", ), ) The agent decides when to search, executes the search server-side, and returns a grounded response with citations. Your client code looks identical to Demo 0 — a simple responses.create() call with no tool loop. The distinction matters architecturally: Function tools (Demo 1) — tool execution happens on your client; you control the code, the API call, the error handling. Built-in tools (Demo 3+) — tool execution happens inside Foundry; you get results without managing execution. Demo 4: Code Interpreter and the Gradio Web UI Demo 4 attaches CodeInterpreterTool , which gives the agent a sandboxed Python execution environment inside Foundry. The agent can write code, run it, observe output, and iterate — all server-side. Combined with a Gradio web interface, this demo shows an agent that can perform data analysis, generate charts, and explain results through a browser UI. Model Router is particularly interesting here: the empirical data shows it selects a more capable frontier model ( gpt-5.4-2026-03-05 ) for code-generation tasks, while simpler conversational turns stay on lighter models. Demo 5: Retrieval-Augmented Generation with FileSearchTool Demo 5 introduces RAG. The setup phase uploads a document, creates a vector store, and attaches it to the agent: # Upload document and create a vector store vector_store = openai.vector_stores.create(name="employee-handbook-store") with open("data/employee-handbook.md", "rb") as f: openai.vector_stores.files.upload_and_poll( vector_store_id=vector_store.id, file=f ) # Attach the vector store to the agent agent = project.agents.create_version( agent_name="RAG-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[FileSearchTool(vector_store_ids=[vector_store.id])], instructions="Answer questions using only the provided documents...", ), ) At query time, the agent embeds the question, searches the vector store semantically, retrieves matching chunks, and generates an answer grounded in the retrieved content — entirely server-side. The client code remains a plain responses.create() call. An important detail: the .vector_store_id file is written to disk during setup and read back during the chat session, so the demo survives process restarts without re-uploading the document. The .gitignore excludes this file from source control. Demo 6: Model Context Protocol Demo 6 connects the agent to a GitHub MCP server, giving it access to repository and issue data via the open Model Context Protocol standard. MCP servers expose tools over a standardised wire protocol; the agent discovers and calls them without any client-side function declarations. The demo also demonstrates human-in-the-loop approval: before executing any MCP tool call, the agent surfaces the proposed action and waits for the user to confirm. This is an important safety pattern for agents that can trigger side effects on external systems. Demo 7: Toolbox — Centralised Tool Governance Where Demo 6 connects to a single MCP server directly, Demo 7 uses a Toolbox — a managed Microsoft Foundry resource that bundles multiple tools into a single, versioned, MCP-compatible endpoint. The Toolbox in this demo exposes both GitHub Issues and GitHub Repos tools, curated into an immutable versioned snapshot. This pattern is significant for production multi-agent systems: Centralised governance — one team owns the tool definitions; all agents consume them via a single endpoint. Versioned snapshots — promoting a new Toolbox version is explicit; agents pin to a version and upgrade intentionally. MCP compatibility — any MCP-capable agent or framework can connect, not just Foundry SDK agents. from azure.ai.projects.models import McpTool toolbox_tool = McpTool( server_label="toolbox", server_url=TOOLBOX_ENDPOINT, allowed_tools=[], # empty = all tools in the Toolbox version headers={"Authorization": f"Bearer {token}"}, ) Demo 8: Self-Hosted Agent with the Responses Protocol The final demo departs from the prompt-agent pattern. Instead of registering a declarative agent in Foundry, Demo 8 implements a custom agent server using the Responses protocol. The server exposes a streaming HTTP endpoint; Foundry's Agent Inspector can connect to it and route user turns to it just as it would to a hosted prompt agent. This demo includes a Dockerfile and an agent.yaml , enabling deployment to Foundry's container hosting service. It uses gpt-4.1-mini directly rather than the model router, because the custom server owns the entire inference path. When to consider this pattern: Your agent requires custom pre- or post-processing logic that cannot be expressed in a system prompt. You need to integrate with infrastructure that is not reachable through MCP or built-in tools. You want to own the inference call for cost control, A/B testing, or compliance reasons. You are building a multi-agent orchestrator that needs to expose itself as an agent to other orchestrators. Getting Started The lab requires Python 3.10 or higher, an Azure subscription with a Microsoft Foundry project, and the Azure CLI. 1. Clone and set up the virtual environment git clone https://github.com/microsoft-foundry/Foundry-Agent-Lab.git cd Foundry-Agent-Lab # Create and activate the virtual environment python -m venv .venv # Windows Command Prompt .venv\Scripts\activate.bat # Windows PowerShell .venv\Scripts\Activate.ps1 # macOS / Linux source .venv/bin/activate pip install -r requirements.txt 2. Configure a demo copy hello-demo\.env.sample hello-demo\.env # Edit hello-demo\.env and set PROJECT_ENDPOINT Your PROJECT_ENDPOINT is on the Overview page of your Foundry project in the Azure portal. It takes the form https://your-resource.ai.azure.com/api/projects/your-project . 3. Run the demo az login 0-hello-demo Each numbered batch file at the root activates the virtual environment, runs create_agent.py , and launches chat.py . Append log to capture the full session transcript: 0-hello-demo log Reset between runs hello-demo\reset.bat Every demo includes a reset.bat that deletes the registered agent and any associated resources (vector stores, uploaded files). Demos are fully repeatable. Architecture Principles Demonstrated Across the nine demos, the lab illustrates a set of design principles that apply directly to production agent systems: Keyless authentication throughout Every demo uses DefaultAzureCredential . No API keys appear anywhere in the code. Locally, az login provides credentials. In production, managed identity takes over automatically — same code, no secrets to rotate. Server-side conversation state The Responses API stores conversation history server-side. Your application passes a conversation ID; Foundry maintains the thread. This eliminates the common bug of truncating history due to local list management and makes multi-process or multi-instance deployments straightforward. Client-side vs server-side tool execution The lab makes the distinction explicit. Function tools execute in your process — you control the code, the external call, and the error handling. Built-in tools (WebSearch, CodeInterpreter, FileSearch) execute inside Foundry — you get results without managing execution infrastructure. MCP tools (Demo 6, 7) fall between these: they execute in a separately deployed server, with the protocol mediating the call. Progressive tool introduction Each demo's create_agent.py registers the agent once. The chat.py file handles the conversation loop. These two responsibilities are always separate, making it easy to update agent definitions without modifying conversation logic, and vice versa. Security Considerations When building agents for production, keep the following in mind: Never commit .env files. The .gitignore excludes them, but verify this before pushing. Use Azure Key Vault or environment variable injection in CI/CD pipelines. Use managed identity in production. DefaultAzureCredential automatically picks up managed identity when deployed to Azure, eliminating the need for any stored credentials. Apply human-in-the-loop for side-effecting tools. Demo 6 demonstrates this pattern for MCP tool calls. Any agent that can modify external state (create issues, send emails, write files) should surface proposed actions for confirmation. Validate tool outputs before use. Treat data returned by external tools (weather APIs, search results, document retrieval) as untrusted input. Prompt injection through tool results is a real attack surface; grounding instructions in your system prompt reduce but do not eliminate this risk. Scope Toolbox permissions narrowly. When using a Toolbox (Demo 7), use allowed_tools to restrict which tools the agent can call, rather than granting access to all tools in a Toolbox version. Key Takeaways Start with the minimum. A prompt agent with no tools requires fewer than 30 lines of code using the Foundry SDK. Add tools only when the use case demands them. Use model-router unless you have a specific reason not to. The empirical data in the lab shows the router selects appropriate models across all task types — factual, creative, tool-calling, RAG, and code generation. Understand the client/server tool boundary. Function tools give you control; built-in tools give you simplicity. MCP and Toolbox give you governance and interoperability. Choose based on where you need control and where you need scale. Conversation state belongs on the server. Do not maintain conversation history in application memory if you can avoid it. The Responses API conversation object is designed for this. The hosted-demo pattern is for when you need to own the inference path. For most use cases, a declarative prompt agent is sufficient and far simpler to operate. Next Steps Explore the repo: github.com/microsoft-foundry/Foundry-Agent-Lab Microsoft Foundry SDK documentation: learn.microsoft.com/azure/ai-studio/ Responses API quickstart: Prompt agent quickstart Model Router conceptual documentation: Model Router for Microsoft Foundry Model Context Protocol: modelcontextprotocol.io Azure Identity SDK (DefaultAzureCredential): azure-identity Python SDK The Foundry Agent Lab is open source under the MIT licence. Contributions, bug reports, and feature requests are welcome through GitHub Issues. See CONTRIBUTING.md for guidelines.OIDC vs SPN: Securing Azure Deployments with GitHub Actions & Terraform
From Secrets to Trust: Modernizing CI/CD Authentication When building infrastructure pipelines on Microsoft Azure using GitHub Actions and Terraform, one design choice quietly determines your entire security posture: How does your pipeline authenticate to Azure? For years, the answer was simple: Use a Service Principal (SPN) Store a client secret in GitHub Authenticate using credentials It works—but it doesn’t scale securely. This article walks through a real, production-ready implementation comparing: SPN (Client Secret – legacy pattern) OIDC (Federated Identity – modern standard) Backed by a working repo: WorkFlowBasedDeployment Architecture Overview This repository implements a workflow-driven Terraform deployment model with modular Azure infrastructure. Repository Structure .github/workflows/ deploy-infrastructure.yml # OIDC deployment deploy-infrastructure-spn.yml # SPN deployment destroy-infrastructure.yml # OIDC destroy destroy-infrastructure-spn.yml # SPN destroy Deployment/ main.tf providers.tf variables.tf terraform.tfvars modules/ Azure Resources Provisioned Resource Module Resource Group Virtual Network + NSGs vnet rg-network Storage Account sa rg-data Container Apps containerapps rg-compute AI Foundry aifoundry rg-data AI Search aisearch rg-data Azure Container Registry acr rg-compute Key Vault azkeyvault rg-data Monitoring azmonitor rg-compute Private Endpoints private_endpoints rg-network Authentication Models Service Principal (SPN) – The Traditional Way How it works Create App Registration Generate client secret Store it in GitHubTerraform authenticates using environment variables env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} The problem Risk Impact Long-lived secrets Can be leaked Manual rotation Operational burden Repo compromise Full environment exposure This model is still supported—but increasingly considered legacy for secure pipelines. OIDC (OpenID Connect) – The Modern Approach How it works GitHub Actions generates a short-lived identity token Microsoft Entra ID validates it Azure issues a temporary access token Terraform executes using that token No secrets. No storage. No rotation. Authentication Models Compared OIDC Flow (Mental Model) Think of OIDC like this: GitHub → Identity Provider Azure → Trust Authority Workflow → Temporary Identity OIDC Implementation (From the Repo) Workflow Configuration permissions: id-token: write contents: read env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} ARM_USE_OIDC: true Azure Login - name: Azure Login (OIDC) uses: azure/login@v2 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} Backend (Terraform State with OIDC) terraform init \ -backend-config="use_oidc=true" Even your state storage is secretless Azure Setup for OIDC Create App Registration No client secret required Configure Federated Credential Example: Issuer: https://token.actions.githubusercontent.com Subject: repo:<org>/<repo>:ref:refs/heads/master You can restrict by: Branch Environment Repository Assign RBAC: Grant roles like: Contributor Or scoped resource-level access CI/CD Workflow Design Both SPN and OIDC pipelines follow a 2-stage pattern: Plan Stage terraform fmt terraform validate terraform plan Upload plan artifact Apply Stage Triggered only on main Downloads plan Runs apply -auto-approve Protected via environment approvals This ensures safe, auditable deployments OIDC vs SPN — Real Comparison Feature SPN OIDC Secrets Stored in GitHub None Token lifetime Long-lived Short-lived Rotation Manual Not required Security Medium High Setup Simple Slightly complex Recommended No Yes Common Pitfalls (Real-World Lessons) Missing id-token permission Without this, OIDC fails silently. Federated credential mismatch Wrong branch Incorrect repo name Case sensitivity issues Azure rejects the token completely. RBAC delay Role assignments can take time → causes confusing failures. Backend misconfiguration Forgetting use_oidc=true breaks Terraform state auth. Debugging Tips Enable debug logs in GitHub Actions Check Sign-in logs in Microsoft Entra ID Validate federated credential subject format Always isolate: Identity issue vs Permission issue Migration Strategy (SPN → OIDC) A safe transition looks like this: Keep SPN as fallback Add OIDC alongside Test in DEV environment Remove client secret Revoke old credentials No downtime, no risk. Where This Fits in Modern Azure Architecture This pattern integrates naturally with: Azure Container Apps AI/ML workloads (AI Foundry, Search) Multi-environment deployments Zero-trust enterprise architectures Authentication becomes identity-driven, not secret-driven When NOT to Use OIDC Legacy CI/CD systems without OIDC support Organisations with strict identity federation constraints Cross-tenant scenarios with limited trust setup Note: These cases are becoming increasingly rare in modern cloud setups. Security Perspective Threat SPN Risk OIDC Risk Secret leak High None Credential reuse High Low Token replay Possible Limited Repo compromise Full access Scoped Final Takeaway This repository demonstrates a key shift in modern DevOps: Secrets were a workaround for identity. OIDC replaces that workaround with trust. By combining: GitHub Actions OIDC federation Azure RBAC You get: Secure pipelines Scalable deployments Zero secret management In enterprise environments, moving to OIDC can eliminate secret rotation pipelines entirely, reducing operational overhead and significantly lowering breach risk. Reference Implementation GitHub Repository: WorkFlowBasedDeployment Closing Thought OIDC doesn’t just improve authentication, it fundamentally changes how trust is established in cloud systems. In a world moving toward zero-trust architectures, identity is the new perimeter and OIDC is how you enforce it.Announcing Public Preview of Argo CD extension in AKS Azure Portal Experience
We are excited to announce the public preview of Argo CD in the Azure Portal for Azure Kubernetes Service. As GitOps becomes the standard for deploying and operating applications at scale, customers need a way to adopt GitOps with simpler onboarding, secure defaults, and integrated workflows. With Argo CD now available directly in the Portal, teams can enable and manage GitOps without the complexity of manual setup. Bringing GitOps into the AKS experience Argo CD is widely used across Kubernetes environments, but setup often requires manual configuration across identity, networking, and registry integrations. With the Azure Portal experience, customers can: Enable Argo CD directly from the AKS cluster Configure identity, access, ingress, and registry integration in a guided flow Manage and monitor GitOps workflows through Argo CD UI This reduces onboarding friction and helps you reach your first successful GitOps deployment faster. Trusted identity and secure access The Argo CD experience integrates with Microsoft Entra ID to provide a secure, enterprise-ready foundation: Secure authentication using Workload Identity federation to Azure Container Registry (ACR) and Azure DevOps, removing long-lived credentials and hard-coded secrets Single Sign-On (SSO) using existing Azure identities Enterprise-grade hardening and security This preview includes built-in improvements to strengthen security posture: Images built on Azure Linux for reduced CVEs and improved baseline security Optional automatic patch updates to stay current while maintaining control over change management Parity with upstream Argo CD Argo CD in AKS remains aligned with the upstream open-source project, supporting: High availability (HA) configurations for production workloads Hub-and-spoke architectures for multi-cluster GitOps Application and ApplicationSet for scalable deployment across fleets Getting Started We invite you to explore the Argo CD experience in the Azure Portal and share feedback. To get started, go to your AKS cluster in the Azure Portal, navigate to the GitOps experience, and select Enable Argo CD. Follow the guided setup to configure identity, access, ingress, and registry integration with secure defaults. Once enabled, you can monitor your deployment and view application health and sync status from the Argo CD UI linked in the GitOps blade. For customers who prefer automation and scripting, the Argo CD extension is also available via Azure CLI public preview. NOTE: You can choose between Flux and Argo CD as your GitOps solution based on your needs. The Argo CD option is available during the initial GitOps setup experience, while existing Flux users will continue to see their current configuration.284Views0likes0CommentsHow to Visualize Your Azure AI Workloads Usage for Observability
This article assumes you already have an Azure Foundry project and resource deployed in Microsoft Foundry. The options referenced here are documented in detail in the linked articles; this post serves as a consolidated step by step guide bringing them all together and explaining where each option is most useful. A Summary: Need Best Option Quick day-over-day visual, minimal setup Grafana Dashboard (Option 3) Custom growth % calculations App Insights + KQL in Log Analytics (Option 4) Shareable, interactive report Azure Workbooks (Option 5) Per-user/per-agent granularity APIM + App Insights (Option 6) Quick one-off chart, export to Excel Microsoft Foundry Monitor tab or App Insights Metrics Explorer (Option 1 and 2) Option 1. Within the Microsoft Foundry Portal (Quickest, No Setup) If you have models deployed in Microsoft Foundry and would like to monitor its usage, go to the New Foundry Portal → Build → Models → Monitor tab. View metrics such as: Estimated cost Total token usage Input vs. output tokens Number of requests This is the simplest way to monitor both model and agent usage. For PAYG plans: You can also view your total allocated quota (and figure out which Tier you are on) using the Quota Management Screen (New Foundry Portal → Operate → Quota tab). This screen shows how much your total allocated quota is, per model in a given subscription + region + Deployment Type (Global, Data Zones or Regional). For eg., in the image below, for gpt-4o, I am allocated 7M total TPM in my subscription. I am only using 150K TPM of the allocated 7M TPM amount. Which means, my requests will get throttled if I exceed the 150K TPM limit. To avoid throttling, I would need to increase my shared allocation limit. NOTE: you are charged for usage, so if you allow more capacity, you use more, so you pay more. Option 2: Azure Monitor Metrics Explorer This is already built into the Azure Portal and gives you time-series charts out of the box. Go to Azure Portal → your Azure OpenAI / Foundry resource → Monitoring → Metrics Select a metric like AzureOpenAIRequests or TokenTransaction Set Aggregation to Sum (total) or Max and Time granularity to 1 day Split by ModelDeploymentName to see per-model trends Adjust the time range (e.g., last 30 days) — you'll see day-over-day bars/lines Tip: You can pin these charts to an Azure Dashboard for a persistent view, or click Share → Download to Excel to get the raw data for your own analysis. Option 3: Azure Managed Grafana (Best Pre-Built Dashboard) This is the best option for a polished, real-time, day-over-day dashboard with no custom code. There's a pre-built AI Foundry dashboard ready to import. [grafana.com], [Create a M...ed Grafana] How to set it up: Create an Azure Managed Grafana workspace (if you don't have one) In Grafana, go to Dashboards → New → Import → enter dashboard ID 24039 (for Foundry) Select your Azure Monitor data source and point it to your Foundry resource Tip: You can also import this directly from the Azure Portal: Monitor → Dashboards with Grafana → AI Foundry. That's it — the dashboard gives you (per model deployment): Token trends over time (inference, prompt, completion — day over day) Request trends over time (AzureOpenAIRequests as a time series) Latency trends (bonus) NOTE: Default time range is 7 days — adjust to 30/60/90 days for growth trends Option 4: Application Insights + KQL Queries (Most Flexible, Custom Reports) If you want fully custom day-over-day growth calculations (e.g., % change day-to-day), this is the way. [azurefeeds.com] Setup: Ensure your Foundry project is connected to an Application Insights resource (Foundry → Settings → Connected Resources). Open up App Insights resource → Logs → New Query or choose a sample query. In the images below, we simply ran 'requests' and set the time range to 24 hours. There is also a Kusto Query Language (KQL) mode or Simple mode on the right-hand side: Simple mode will let you run out of the box samples. KQL mode will open up a query window for you to enter custom queries. Below are the results in grid view. Same view but showing a chart: Export options: Another way to get the above graphs are via Log Analytics. Simply enable Diagnostic Settings on your Azure OpenAI resource → send to a Log Analytics workspace. Open Log Analytics → Logs and try our your sample queries. Sample KQL for day-over-day token usage (adjust to your needs): AzureMetrics | where MetricName in ("TokenTransaction", "ProcessedPromptTokens", "GeneratedTokens") | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by bin(TimeGenerated, 1d), MetricName | order by TimeGenerated asc | render timechart Result: Sample KQL for day-over-day growth % (adjust to your needs): AzureMetrics | where MetricName == "TokenTransaction" | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by Day = bin(TimeGenerated, 1d) | sort by Day asc | extend PrevDay = prev(DailyTokens) | extend GrowthPct = round((DailyTokens - PrevDay) / PrevDay * 100, 2) | project Day, DailyTokens, GrowthPct Option 5: Azure Monitor Workbooks (Custom Dashboards, Shareable) Workbooks let you build interactive, parameterized dashboards that combine metrics and KQL logs. What's more, you can select resources from multiple subscriptions and visualize them all in one place using Workbooks! Go to Azure Portal → Monitor → Workbooks → New Add a Metrics query panel → select your Log Analytics or App Insights or Foundry resource -> Enter the same query you used in Option 4. Do a test run and view the graphs (this can be viewed as charts or a list (grid view)): 4. Save and share with your team. Option 6: APIM + Application Insights (Granular Per-Caller/Per-Agent Tracking) 1. If your app routes requests through Azure API Management, you can use the azure-openai-emit-token-metric policy to send per-request token metrics to Application Insights with custom dimensions (User ID, Subscription ID, Agent, etc.). [Azure API...osoft Docs] This is ideal for scenarios like: "Which agent consumed the most tokens last week?" "What's the token usage per API consumer/team?" NOTE: Microsoft Foundry resources do not track usage by users. So, fronting your Foundry resource with an APIM could be a way to track users provided you pass the username/id in the request context. How you implement this is upto your app design. Ref: AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHub Bonus: Check out all other APIM + AI related policies here: AI-Gateway/labs/semantic-caching at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-rate-limiting at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHubTurn Your App Service Web App Into a Self-Healing Agent: LLMOps Best Practices for Production
A user submits a prompt. The agent burns through 50,000 tokens looping on a malformed tool response. Another user trips a model rate limit and the agent silently fails. A bad prompt update goes out at 4 PM Friday and degrades success rate to 60%. Your APM dashboard shows green the entire time because none of that is a 500. This post walks through the LLMOps stack we built into a working reference sample on Azure App Service: the SLIs that matter for agents, a budget circuit breaker, prompt-repair retries, and a fully automated slot-swap rollback when things go sideways. Every code snippet is from the deployable sample at the end of the post. 📦 Sample repo: seligj95/app-service-self-healing-agent-python — azd up and you've got the whole stack live in your subscription in under 10 minutes. Why agent ops ≠ web-app SRE Your web app's reliability model assumes a request maps to bounded work — a SQL query, a cache hit, a templated response. You alert on Http5xx, p95 latency, and dependency failures. Done. An agent breaks that model in four ways: Cost is unbounded per request. An agent that loops on a flaky tool can spend $5 on one user prompt. The HTTP response is still 200. Failure can be silent. A model can hallucinate confident JSON, a tool can return malformed args, and the agent dutifully returns a wrong answer to the user. Zero exceptions logged. Latency is non-deterministic. A "simple" prompt that normally finishes in 2 seconds can blow out to 30s when the model picks an expensive plan. p95 latency tells you nothing. Quality regresses on prompt changes, not code changes. A prompt tweak that ships in seconds can crater tool-call accuracy by 30%. Your CI/CD pipeline didn't catch it because there were no failing tests. Web-app SLOs (uptime, latency, error rate) are necessary but not sufficient. Agents need agent-shaped SLOs. Define your agent SLOs first Before instrumenting anything, write down what "healthy" means. Here are the four SLIs we chose for the sample. None of them are Http5xx. SLI What it measures Why it matters Task success rate % of /chat requests that the agent self-classifies as completed Catches silent failures the HTTP layer misses Cost per task $ spent (input + output tokens × model rate) per /chat The unbounded-loop problem in one number Tool success rate % of tool invocations that didn't raise Tool layer is where most agent failures live Repair retries Times we re-prompted the model after a schema-validation failure Leading indicator of prompt drift In our reference middleware these come out as agent.task.success , agent.cost.usd , agent.tool.success , and agent.repair.retry — eleven custom metrics in total. We emit them via OpenTelemetry so they land in App Insights customMetrics and the included KQL workbook visualizes them as SLO tiles. Observability stack on App Service App Service makes the observability story unusually easy because you get App Insights wired up automatically by azd — no agent install, no DaemonSet, no sidecar. The only thing you bring is the SDK init for your custom metrics: # llmops_middleware/sli.py from azure.monitor.opentelemetry import configure_azure_monitor from opentelemetry import metrics def configure_azure_monitor_if_available() -> bool: if not os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING"): return False configure_azure_monitor() return True meter = metrics.get_meter("agent") tokens_in = meter.create_counter("agent.tokens.in") cost_usd = meter.create_counter("agent.cost.usd") task_latency = meter.create_histogram("agent.task.latency") tool_success = meter.create_counter("agent.tool.success") # ... We compute cost from a per-model rate card so the metric is in real dollars, not abstract tokens: COST_PER_1K_TOKENS = { "gpt-4o": {"in": 0.0025, "out": 0.01}, "gpt-4o-mini": {"in": 0.00015, "out": 0.0006}, } def record_cost(model: str, tokens_in_count: int, tokens_out_count: int, tenant: str) -> float: rate = COST_PER_1K_TOKENS[model] cost = (tokens_in_count * rate["in"] + tokens_out_count * rate["out"]) / 1000 cost_usd.add(cost, {"model": model, "tenant": tenant}) return cost Once those flow, the KQL queries write themselves: // Top cost-burning tenants in the last hour customMetrics | where timestamp > ago(1h) | where name == "agent.cost.usd" | extend tenant = tostring(customDimensions["tenant"]) | summarize spend_usd = sum(valueSum) by tenant | top 10 by spend_usd desc The sample ships a 6-tile workbook ( observability/workbook.json ) deployed via Bicep. It renders SLO compliance, cost burn-down, tool failure breakdown, latency percentiles, budget breaches, and healing signals out of the box. The deployed workbook in App Insights. The SLO panel dips during a chaos run and recovers as the agent self-heals — exactly the signal you want on a glass-pane dashboard. Cost guardrails with a budget circuit breaker Custom metrics tell you about cost after you spent it. To prevent runaways, you need a circuit breaker that bites before the model call happens. The middleware in llmops_middleware/budget.py keeps a per-tenant counter in memory (per month) and returns a decision: class BudgetDecision(Enum): ALLOW = "allow" # under budget DOWNSHIFT = "downshift" # ≥80% — switch to cheaper model BLOCK = "block" # ≥100% — refuse the request def evaluate(tenant: str) -> BudgetDecision: spent = _spend.get((tenant, _current_period()), 0.0) if spent >= BUDGET_USD_PER_TENANT: return BudgetDecision.BLOCK if spent >= BUDGET_USD_PER_TENANT * 0.80: return BudgetDecision.DOWNSHIFT return BudgetDecision.ALLOW The agent loop reads that decision and downshifts from gpt-4o to gpt-4o-mini — a 16× cost reduction ($0.0025 / 1K input tokens vs $0.00015) — when a tenant crosses 80% of their monthly budget. The user keeps getting answers; the bill stops climbing. def _pick_model(tenant: str) -> str: decision = budget.evaluate(tenant) if decision == BudgetDecision.DOWNSHIFT: sli.model_downshift.add(1, {"tenant": tenant}) return DOWNSHIFT_MODEL return PRIMARY_MODEL For the demo we keep state in memory; production should swap the dict for Redis (atomic INCRBY ) or Cosmos with optimistic concurrency. The interface in budget.py is intentionally tiny so this is a 10-line change. Self-healing patterns There are three patterns in the sample, each addressing a different failure class. 1. Retry with prompt-repair The most common agent failure isn't a tool exception — it's the model returning malformed JSON that fails schema validation on tool args. The fix is to feed the validation error back into the model and ask it to repair the call: # llmops_middleware/repair.py async def retry_with_repair(call_fn, args, *, max_attempts=2): for attempt in range(max_attempts): try: return await call_fn(args) except (ValidationError, RepairableError) as exc: sli.repair_retry.add(1, {"attempt": str(attempt)}) args = await _ask_model_to_repair(args, str(exc)) raise This single pattern recovers 50–70% of "the agent returned garbage" cases without escalating. 2. Tool fallback chains When a primary tool times out or fails open, try a cheaper or simpler one: async def tool_fallback_chain(primary, *fallbacks, args): for fn in (primary, *fallbacks): try: return await fn(args) except ToolUnavailable: sli.tool_success.add(1, {"tool": fn.__name__, "status": "fallback"}) raise NoToolAvailable() Lookup-style tools especially benefit: web search → cached snapshot → static knowledge base. 3. Slot-swap auto-rollback Here's the killer feature App Service brings that's a slog on K8s: deployment slots. You always have a known-good previous version warmed up and one ARM API call away from production traffic. We wire that up to fire automatically when our SLI breaches. The chain is: Metric alert on Http5xx > 5 in 5 minutes (the platform metric, free) Action Group that POSTs to a Logic App webhook (SAS-signed callback URL) Logic App that calls POST /sites/{name}/slots/staging/slotsswap via its managed identity (granted Website Contributor on the target web app) The whole healer is one trigger + two actions: receive the alert webhook, call ARM slotsswap, return a status payload to the caller. The two actions in Bicep: SwapSlots: { type: 'Http' inputs: { method: 'POST' uri: '${environment().resourceManager}@{parameters(\'targetSiteId\')}/slots/staging/slotsswap?api-version=2024-04-01' body: { targetSlot: 'production' } authentication: { type: 'ManagedServiceIdentity' audience: environment().resourceManager } } } No code to deploy, no secrets to manage, no second runtime to babysit. From alert-fire to swapped-slot is about 4 minutes in our tests — under the SLA most agent products have for "user-visible degraded mode." Why not a Function App? We started there. The Logic App is 60 lines of Bicep and zero application code. For a one-action workflow like "swap a slot," the Function adds packaging, deployment, and a runtime to monitor for no benefit. Chaos testing for agents You can't trust a self-healing system you haven't broken. The sample ships a chaos CLI and an in-process injection point so you can practice failures on demand. In-process: llmops_middleware/chaos.py exposes four modes ( off , throttle , malformed , outage ) togglable via POST /admin/chaos . When set, tool calls roll a die and raise the matching exception with the configured probability: class ChaosController: def maybe_inject(self) -> None: if random.random() > self.probability: return if self.mode == "outage": raise ChaosOutage("simulated tool outage") if self.mode == "throttle": raise ChaosThrottled("simulated 429") if self.mode == "malformed": raise ChaosMalformed("simulated bad tool output") External: chaos/inject.py is a small async load driver that sets /admin/chaos then drives /chat at a target RPS, tallying response codes: python chaos/inject.py \ --base-url https://my-agent.azurewebsites.net \ --mode outage --probability 1.0 --rps 10 --duration 300 Running that for 5 minutes against the deployed sample reliably: Drives customMetrics(name="agent.task.failure") over 50/min Trips the Http5xx > 5 metric alert (~90 seconds after threshold breach) Fires the Logic App run (succeeded in 1.2 seconds in our test) Flips the slot — /health instance ID changes The repo's observability/queries.kql has the canonical KQL for each of these signals, and observability/workbook.json is the deployable workbook that visualizes them. The reference middleware Everything in this post is in seligj95/app-service-self-healing-agent-python. The Python package llmops_middleware/ is the part you'd vendor into your own agent — sli.py , budget.py , repair.py , chaos.py . The agent loop and the Bicep are demo-quality but production-shaped. Run it yourself: git clone https://github.com/seligj95/app-service-self-healing-agent-python cd app-service-self-healing-agent-python azd auth login azd up You'll have an agent + AOAI + workbook + healer running in about 8 minutes. Then run the chaos script and watch the slot flip. The KQL workbook Deployable workbook JSON, dropped into the resource group by Bicep. Six panels: SLO tile — % of tasks where agent.task.success was emitted (grouped by tenant) Cost burn-down — running spend per tenant against the monthly budget Top failing tools — failure count by tool, broken down by error class Latency p50/p95/p99 — agent.task.latency histogram Budget breaches — count and tenant list Healing signals — agent.repair.retry + agent.model.downshift + agent.chaos.injected over time It's observability/workbook.json — loadTextContent -ed into infra/shared/monitoring.bicep so you get it deployed automatically. Why App Service for LLMOps After building this, the appeal of App Service for agents is clearer than I expected going in: Slots are an unfair advantage. A pre-warmed previous version, one ARM call from production. K8s blue/green needs you to build it. Managed identity to Azure OpenAI removes the entire key-rotation problem. The sample sets disableLocalAuth: true on the AOAI account — there literally is no key. App Insights is auto-wired so your custom metrics land in customMetrics and your KQL queries work day one. Bicep + azd lets you ship a full LLMOps stack in one repo: app, infra, healing, observability, chaos. If you're standing up a new agent and you don't already have a Kubernetes platform you love, App Service is a strong default. Wrap-up If you take three things from this post: Define agent SLOs in your own terms — task success, cost per task, tool reliability — not just web-app SLOs. Put a circuit breaker between the user and the model. A budget breaker that downshifts to a cheaper model is the highest-ROI middleware you can ship. Make rollback boring. Slot swap + a one-action Logic App + a metric alert is a self-healing system you can build in an afternoon and trust at 3 AM. The sample has all of it wired up. We're considering baking these into App Service — tell us what you'd want The middleware in this sample (SLIs + telemetry, cost guardrails, policy/audit hooks) is exactly the kind of thing we're evaluating as first-class App Service platform features — opt-in sidecars or built-in capabilities so you don't have to vendor a middleware package into every agent you ship. Concretely, we're tracking ideas like: Agent Observatory — a sidecar that intercepts SDK calls (Semantic Kernel, LangChain, Crew AI, AutoGen) and captures full reasoning traces with zero code changes AI Cost Guardian — platform-level quotas and spend caps across Azure OpenAI, Anthropic, and other model providers, with real-time enforcement Policy Guard — governance, PII masking, model-approval lists, and an immutable audit log for regulated workloads If any of those would land for your team — or if you're solving these problems differently and want to push back on the shape — we want to hear it. Drop a comment on this post: the roadmap is genuinely shaped by feedback at this stage.171Views0likes0CommentsPlatform Improvements for Python AI Apps on Azure App Service
Overview Azure App Service (Linux) is a fully managed PaaS offering that supports a broad range of languages, including Python, Node.js, .NET, PHP, and Java. Developers can push source code or deploy a pre-built artifact; the platform handles the rest, including dependency installation, application containerization, and running the application at cloud scale. More customers are building intelligent applications using Azure AI Foundry and other AI services, and Python has become a language of choice for these workloads. The performance and reliability of the Python deployment pipeline directly shape the developer's experience on the platform, so we looked across the deployment path for opportunities to reduce latency and improve reliability. The first set of changes has reduced Python deployment latency on Azure App Service Linux by approximately 30%. This is the first step in a broader effort to make the platform better suited for AI application development, but the gains resulting from this effort will benefit all apps on the platform. Let's look at the details. Where Deployment Time Was Going Python web application deployments on Azure App Service Linux rely on Oryx, the platform's open-source build system, to produce runnable artifacts during remote builds. Platform telemetry showed that around 70% of Python app deployments use remote builds, and the majority of those resolve dependencies via requirements.txt using pip install. To understand where time was going, we profiled a stress workload: a 7.5 GB PyTorch application. Most production builds are smaller, but stress-testing a dependency-heavy application made the pipeline bottlenecks clear. When a Python app is deployed via remote build, the build container in Kudu (the App Service deployment service) runs Oryx to: Extract the uploaded source code. Create a Python virtual environment. Install dependencies via pip install; 4.35 min (~34% of build time). Copy files to a staging directory; 0.98 min (~8%). Compress via tar + gzip into an archive; 7.53 min (~58%). Write the archive to /home (Azure Storage SMB mount). The app container then extracts this archive to the local disk on every cold start. Why the Archive-Based Approach? The /home directory is backed by an Azure Storage SMB mount, where small-file I/O is comparatively expensive. Python dependencies are file-heavy: virtual environments commonly contain tens of thousands of files, and dependency-heavy ML applications can exceed 200,000 files. Writing those files individually over SMB would be prohibitively slow. Instead, the pipeline builds on the container's local filesystem, writes a single compressed archive over SMB, and the app container extracts it locally on startup for efficient module loading. Key insight: Compression was the single largest phase at 58% of build time, longer than installing the packages themselves. What We Changed Zstandard Compression (Replacing gzip) Standard gzip compression is single-threaded. In our benchmark, compression accounted for 58% of total build time, making it the dominant bottleneck. Because the archive is also decompressed during container startup, decompression time affects runtime startup latency as well. We evaluated three compression algorithms: gzip, LZ4, and Zstandard (zstd). The following results are averaged across multiple deployments of a 7.5 GB Python application with PyTorch and additional ML packages: Metric gzip LZ4 zstd Compression time 7.53 min 1.20 min 1.18 min Decompression time 2.80 min 1.18 min 1.07 min Archive size 4.0 GB 5.0 GB 4.8 GB Both zstd and LZ4 were more than 6× faster than gzip for compression and more than 2× faster for decompression. We selected zstd for the following reasons: Comparable speed to LZ4, with smaller archive sizes (4.8 GB vs. 5.0 GB). Mature ecosystem: zstd is based on RFC 8878 published in 2021 and ships with many common Linux distributions. Native tar support: tar –I zstd works out of the box; no extra packages required. Result: Compression time dropped from 7.53 min → 1.18 min (6.4× faster). Decompression improved from 2.80 min → 1.07 min (2.6× faster), directly reducing cold-start latency. Faster Package Installation with uv pip is implemented in Python and has historically optimized compatibility over maximum parallelism. In dependency-heavy workloads, package download, resolution, and installation can become a major part of deployment time. In our 7.5 GB PyTorch benchmark, package installation accounted for ~34% of total build time (4.35 min out of 12.86 min). We introduced uv, a Python package manager written in Rust, as the primary installer for compatible requirements.txt deployments. Its uv pip install interface works with standard pip workflows. Fallback strategy: Compatibility remains the priority. When uv cannot handle a deployment, the platform retries with pip, preserving the behavior customers already depend on. Cache behavior: Package caches remain local to the build container. When the same app is deployed again before the kudu (build) container is recycled, both pip and uv can reuse cached packages and avoid repeated downloads. Result: Package installation time dropped from 4.35 min → 1.50 min (3× faster). Reducing File Copy Overhead A file copy showed up in two places. First, before compression, the build process copied the entire build directory (application code plus Python packages) to a staging location. This existed historically as a safety measure; creating a clean snapshot before tar reads the file tree. But the cost was steep for the large number of files inherent in Python dependencies. The fix was straightforward: create the tar archive directly from the build directory, skipping the intermediate copy entirely. Second, for pre-built deployment scenarios, we replaced the legacy Kudu sync path with Linux-native rsync. That gave us a better optimized tool for large Linux file trees and reduced the overhead of moving files into the final deployment location. Because this path is used beyond Python, the improvement benefits pre-built apps across the broader App Service Linux ecosystem. Result: Eliminated the 0.98-minute staging copy (8% of build time), reduced temporary disk usage, and improved the remaining file sync path. Pre-Built Python Wheels Cache We added a complementary optimization: a read-only cache of pre-built wheels for commonly used Python packages, selected using platform telemetry. The cache is mounted into the Kudu build container at runtime for Python workloads, allowing the installer to use local wheel artifacts before downloading packages externally. When a matching wheel is available, the installer uses it directly, avoiding a network fetch for that package. Cache misses fall back to the upstream registry (e.g., PyPI) as usual. The cache is managed by the platform and kept up to date, so supported Python builds can use it without any app change. Combined Results Controlled Benchmark (PyTorch 7.5 GB, P1mv3 App Service Tier) The following benchmark was measured on the P1mv3 App Service tier. Values in the "After" column reflect the optimized pipeline with zstd compression, uv package installation, direct tar creation, and the pre-built wheels cache enabled together. Phase Before After Improvement Package installation 4.35 min 1.50 min ~3× faster File copy 0.98 min 0 min Eliminated Compression 7.53 min 1.18 min ~6× faster Total build time 12.86 min ~2.68 min ~79% reduction Production Fleet (All Python Linux Web Apps) Production telemetry across Python deployments shows the impact of these changes: deployment latency decreased by approximately 30% after the rollout. The controlled benchmark shows a larger improvement (~79%) because it exercises a dependency-heavy workload where package installation, file copy, and compression dominate total build time. Typical production apps are smaller and spend less time proportionally in those phases. Beyond Faster Builds: Reliability and Runtime Performance Faster builds only help when deployment requests reliably reach a worker that is ready to build. We updated the primary deployment clients Azure CLI, GitHub Actions, and Azure DevOps Pipelines to warm up Kudu before initiating deployments. Clients now issue a lightweight health-check request to the Kudu endpoint, helping ensure the deployment container is running and ready before the deployment begins. Clients also preserve affinity to the warmed-up worker using the ARR affinity cookie returned by the first request. This increases the chance that the deployment uses a worker with Kudu already running and local package caches already available from recent deployments. Together, these client-side changes reduced deployment failures from transient infrastructure issues and helped the pipeline optimizations reach the build phase reliably. Result: Deployment failures caused by cold-start errors (502, 503, 499) dropped by ~30%. We also improved the default runtime configuration for Python apps using the platform-provided Gunicorn startup path. Previously, the platform defaulted to a single worker, leaving most CPU cores idle. Now, it follows Gunicorn's recommended worker formula, fully utilizing available cores on multi-core SKUs and delivering higher request throughput out of the box. workers = (2 × NUM_CORES) + 1 Key Takeaways Measure before optimizing: Platform telemetry showed that remote builds and requirements.txt based installs were the dominant Python deployment paths, which helped us focus on changes that would benefit the most customers. Compression was the biggest bottleneck: In the dependency-heavy benchmark, archive compression took longer than package installation. Replacing gzip with zstd reduced both build time and cold-start extraction time. File count matters: Python virtual environments can contain tens of thousands of files, and AI workloads can contain many more. Reducing unnecessary file copies and using Linux-native file sync helped lower overhead. Compatibility needs a fallback path: Introducing uv improved the common path, while falling back to pip preserved compatibility for apps that depend on existing Python packaging behavior. Deployment reliability is part of performance: Faster builds only help if deployment requests consistently reach a ready worker. Warm-up and worker affinity made the optimized path more reliable for customers. Beyond deployment: Runtime defaults, such as Gunicorn worker configuration, also affect how production apps perform once deployment is complete. Together, these changes made Python deployments faster and more reliable while preserving compatibility through safe fallbacks. We will continue improving the platform to make Azure App Service faster, more reliable, and better suited for AI application development.298Views1like0CommentsFrom Prompt to Production: Open in VS Code for Terraform in Azure Copilot
We’re excited to introduce a new step in the Terraform on Azure experience: Open in VS Code, now available directly from Azure Copilot in the Azure Portal. This capability helps you move seamlessly from AI‑generated Terraform code to real Azure deployments - within a connected, guided workflow designed for enterprise scenarios. Why This Matters Infrastructure as Code with Terraform is powerful, but moving from generated configuration to a deployed environment typically involves multiple tools and handoffs. Teams need to understand Terraform state, work with remote backends, and integrate their code into version‑controlled CI/CD pipelines - often backed by Terraform Cloud or Azure‑native backends in enterprise environments. Open in VS Code brings these steps together. It bridges the gap between AI‑assisted authoring in the Azure Portal and the real‑world workflows required to validate, manage state, and deploy infrastructure with confidence. Continue Your Workflow in VS Code With Azure Copilot, you can describe your infrastructure in natural language and generate Terraform configurations in seconds. For example: “Create an Azure Container App using Terraform with a managed environment, Log Analytics enabled, and a system‑assigned managed identity to securely pull images from Azure Container Registry.” Copilot generates the Terraform configuration for you. From there, you can select Open full view to enter a full‑screen Terraform editor, and then choose Open in VS Code to launch the configuration in an Azure‑hosted VS Code environment. There’s no need to download files or set up a local development environment. VS Code for the web opens with Azure authentication already configured, along with commonly used extensions, so you can immediately focus on refining, validating, and preparing your infrastructure for deployment. Built‑in Guidance for Real Deployments Beyond editing, the VS Code experience includes built‑in, step‑by‑step guidance to help you deploy your Terraform configuration into your own Azure environment - whether you’re experimenting or preparing for production. Because Terraform relies on state management, the workflow starts by helping you choose and configure a backend. Backend Options Option 1: Azure Storage Account as a remote backend A natural fit for Azure‑native and enterprise environments. The experience guides you through creating or selecting a storage account and configuring Terraform to store state securely in Azure. Option 2: HCP Terraform (Terraform Cloud) as a remote backend Ideal for teams already using Terraform Cloud. The guided flow helps you authenticate, connect to an existing organization and workspace, and generate the required backend configuration directly into your Terraform files. Option 3: Temporary workspace for quick validation Designed for learning and experimentation. You can run terraform plan and terraform apply directly in the Azure workspace with temporary state, without committing to a long‑term backend - ideal for quick validation, but not intended for production use. Each option includes an end‑to‑end walkthrough, so you can complete backend setup and run Terraform commands without leaving the VS Code environment or searching through external documentation. Connecting Code, State, and Deployment This experience connects three essential parts of the Terraform workflow: AI‑assisted code generation in Azure Portal Copilot Interactive editing and guided execution in VS Code for the web Flexible backend options for managing Terraform state Together, these pieces make it easier to move from idea to infrastructure in a structured, supported way—whether you’re new to Terraform or managing production workloads with established CI/CD pipelines. Available Now - and What’s Next The Open in VS Code experience for Terraform is now public preview in Azure Portal Copilot. We’re continuing to invest in this workflow, including clearer deployment guidance, future integration with GitHub Actions and other CI/CD pipelines, and deeper enhancements to the full‑screen Terraform editor experience. If you haven’t tried it yet, generate a Terraform configuration with Azure Copilot and open it in VS Code to go from prompt to production end to end in one connected workflow.884Views0likes0Comments