serverless
277 TopicsThe Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide: Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation. Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors. Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them. Memory as a filesystem, not a vector store Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In SRE Context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface. The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carry just enough context to orient the agent, with links to deeper files when needed. The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically. One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it. The sandbox as epistemic boundary The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed. No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free. Bet 2: Context Layering Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action. Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration. Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions. Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed. Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope. Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start. Bet 3: Frugal Context Management Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%. Over the last month, we have successfully used our agent across a wide range of scenarios: Analyzed our user churn rate and built dashboards we now review weekly. Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase. Ran security analysis and found vulnerabilities in the read path. Helped fill out parts of its own Responsible AI review, with strict human review. Handles customer-reported issues and LiveSite alerts end to end. Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users. What We Learned We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space. Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.10KViews5likes0CommentsAnnouncing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent— your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.8KViews1like1CommentA Practical Path Forward for Heroku Customers with Azure
On February 6, 2026, Heroku announced it is moving to a sustaining engineering model focused on stability, security, reliability, and ongoing support. Many customers are now reassessing how their application platforms will support today’s workloads and future innovation. Microsoft is committed to helping customers migrate and modernize applications from platforms like Heroku to Azure.124Views0likes0CommentsWhat It Takes to Give SRE Agent a Useful Starting Point
In our latest posts, The Agent that investigates itself and Azure SRE Agent Now Builds Expertise Like Your Best Engineer Introducing Deep Context, we wrote about a moment that changed how we think about agent systems. Azure SRE Agent investigated a regression in its own prompt cache, traced the drop to a specific PR, and proposed fixes. What mattered was not just the model. What mattered was the starting point. The agent had code, logs, deployment history, and a workspace it could use to discover the next piece of context. That lesson forced an uncomfortable question about onboarding. If a customer finishes setup and the agent still knows nothing about their app, we have not really onboarded them. We have only created a resource. So for the March 10 GA release, we rebuilt onboarding around a more practical bar: can a new agent become useful on day one? To test that, we used the new flow the way we expect customers to use it. We connected a real sample app, wired up live Azure Monitor alerts, attached code and logs, uploaded a knowledge file, and then pushed the agent through actual work. We asked it to inspect the app, explain a 401 path from the source, debug its own log access, and triage GitHub issues in the repo. This post walks through that experience. We connected everything we could because we wanted to see what the agent does when it has a real starting point, not a partial one. If your setup is shorter, the SRE Agent still works. It just knows less. The cold start we were trying to fix The worst version of an agent experience is familiar by now. You ask a concrete question about your system and get back a smart-sounding answer that is only loosely attached to reality. The model knows what a Kubernetes probe is. It knows what a 500 looks like. It may even know common Kusto table names. But it does not know your deployment, your repo, your auth flow, or the naming mistakes your team made six months ago and still lives with. We saw the same pattern again and again inside our own work. When the agent had real context, it could do deep investigations. When it started cold, it filled the gaps with general knowledge and good guesses. The new onboarding is our attempt to close that gap up front. Instead of treating code, logs, incidents, and knowledge as optional extras, the flow is built around connecting the things the agent needs to reason well. Walking through the new onboarding Starting March 10, you can create and configure an SRE Agent at sre.azure.com. Here is what that looked like for us. Step 1: Create the agent You choose a subscription, resource group, name, and region. Azure provisions the runtime, managed identity, Application Insights, and Log Analytics workspace. In our run, the whole thing took about two minutes. That first step matters more than it may look. We are not just spinning up a chatbot. We are creating the execution environment where the agent can actually work: run commands, inspect files, query services, and keep track of what it learns. Step 2: Start adding context Once provisioning finishes, you land on the setup page. The page is organized around the sources that make the agent useful: code, logs, incidents, Azure resources, and knowledge files. Data source Why it matters Code Lets the agent read the system it is supposed to investigate. Logs Gives it real tables, schemas, and data instead of guesses. Incidents Connects the agent to the place where operational pain actually shows up. Azure resources Gives it the right scope so it starts in the right subscription and resource group. Knowledge files Adds the team-specific context that never shows up cleanly in telemetry. The page is blunt in a way we like. If you have not connected anything yet, it tells you the agent does not know enough about your app to answer useful questions. That is the right framing. The job of onboarding is to fix that. Step 3: Connect logs We started with Azure Data Explorer. The wizard supports Azure Kusto, Datadog, Elasticsearch, Dynatrace, New Relic, Splunk, and Hawkeye. After choosing Kusto, it generated the MCP connector settings for us. We supplied the cluster details, tested the connection, and let it discover the tools. This step removes a whole class of bad agent behavior. The model no longer has to invent table names or hope the cluster it wants is the cluster that exists. It knows what it can query because the connection is explicit. Step 4: Connect the incident platform For incidents, we chose Azure Monitor. This part is simple by design. If incidents are where the agent proves its value, connecting them should feel like the most natural part of setup, not a side quest. PagerDuty and ServiceNow work too, but for this walkthrough we kept it on Azure Monitor so we could wire real alerts to a real app. Step 5: Connect code Then we connected the code repo. We used microsoft-foundry/foundry-agent-webapp, a React and ASP.NET Core sample app running on Azure Container Apps. This is still the highest-leverage source we give the agent. Once the repo is connected, the agent can stop treating the app as an abstract web service. It can read the auth flow. It can inspect how health probes are configured. It can compare logs against the exact code paths that produced them. It can even look at the commit that was live when an incident happened. That changes the quality of the investigation immediately. Step 6: Scope the Azure resources Next we told the agent which resources it was responsible for. We scoped it to the resource group that contained the sample Container App. The wizard then set the roles the agent needed to observe and investigate the environment. That sounds like a small step, but it fixes another common failure mode. Agents do better when they start from the right part of the world. Subscription and resource-group scope give them that boundary. Step 7: Upload knowledge Last, we uploaded a Markdown knowledge file we wrote for the sample app. The file covered the app architecture, API endpoints, auth flow, likely failure modes, and the files we would expect an engineer to open first during debugging. We like Markdown here because it stays honest. It is easy for a human to read, easy for the agent to navigate, and easy to update as the system changes. All sources configured Once everything was connected, the setup panel turned green. At that point the agent had a repo, logs, incidents, Azure resources, and a knowledge file. That is the moment where onboarding stops being a checklist and starts being operational setup. The chat experience makes the setup visible When you open a new thread, the configuration panel stays at the top of the chat. If you expand it, you can see exactly what is connected and what is not. We built this because people should not have to guess what the agent knows. If code is connected and logs are not, that should be obvious. If incidents are wired up but knowledge files are missing, that should be obvious too. The panel makes the agent's working context visible in the same place where you ask it to think. It also makes partial setup less punishing. You do not have to finish every step before the agent becomes useful. But you can see, very clearly, what extra context would make the next answer better. What changed once the agent had context The easiest way to evaluate the onboarding is to look at the first questions we asked after setup. We started with a simple one: What do you know about the Container App in the rg-big-refactor resource group? The agent used Azure CLI to inspect the app, its revisions, and the system logs, then came back with a concise summary: image version, resource sizing, ingress, scale-to-zero behavior, and probe failures during cold start. It also correctly called out that the readiness probe noise was expected and not the root of a real outage. That answer was useful because it was grounded in the actual resource, not in generic advice about Container Apps. Then we asked a harder question: Based on the connected repo, what authentication flow does this app use? If a user reports 401s, what should we check first? The agent opened authConfig.ts, Program.cs, useAuth.ts, postprovision.ps1, and entra-app.bicep, then traced the auth path end to end. The checklist it produced was exactly the kind of thing we hoped onboarding would unlock: client ID alignment, identifier URI issues, redirect URI mismatches, audience validation, missing scopes, token expiry handling, and the single-tenant assumption in the backend. It even pointed to the place in Program.cs where extra logging could be enabled. Without the repo, this would have been a boilerplate answer about JWTs. With the repo, it read like advice from someone who had already been paged for this app before. We did not stop at setup. We wired real monitoring. A polished demo can make any agent look capable, so we pushed farther. We set up live Azure Monitor alerts for the sample web app instead of leaving the incident side as dummy data. We created three alerts: HTTP 5xx errors (Sev 1), for more than 3 server errors in 5 minutes Container restarts (Sev 2), to catch crash loops and OOMs High response latency (Sev 2), when average response time goes above 10 seconds The high-latency alert fired almost immediately. The app was scaling from zero, and the cold start was slow enough to trip the threshold. That was perfect. It gave us a real incident to put through the system instead of a fictional one. Incident response plans From the Builder menu, we created a response plan targeted at incidents with foundry-webapp in the title and severity 1 or 2. The incident that had just fired showed up in the learning flow. We used the actual codebase and deployment details to write the default plan: which files to inspect for failures, how to reason about health probes, and how to tell the difference between a cold start and a real crash. That felt like an important moment in the product. The response plan was not generic incident theater. It was anchored in the system we had just onboarded. One of the most useful demos was the agent debugging itself The sharpest proof point came when we tried to query the Log Analytics workspace from the agent. We expected it to query tables and summarize what it found. Instead, it hit insufficient_scope. That could have been a dead end. Instead, the agent turned the failure into the investigation. It identified the missing permissions, noticed there were two managed identities in play, told us which RBAC roles were required, and gave us the exact commands to apply them. After we fixed the access, it retried and ran a series of KQL queries against the workspace. That is where it found the next problem: Container Apps platform logs were present, but AppRequests, AppExceptions, and the rest of the App Insights-style tables were still empty. That was not a connector bug. It was a real observability gap in the sample app. The backend had OpenTelemetry packages, but the exporter configuration was not actually sending the telemetry we expected. The agent did not just tell us that data was missing. It explained which data was present, which data was absent, and why that difference mattered. That is the sort of thing we wanted this onboarding to set up: not just answering the first question, but exposing the next real thing that needs fixing. We also asked it to triage the repo backlog Once the repo was connected, it was natural to see how well the agent could read open issues against the code. We pointed it at the three open GitHub issues in the sample repo and asked it to triage them. It opened the relevant files, compared the code to the issue descriptions, and came back with a clear breakdown: Issue #21, @fluentui-copilot is not opensource? Partially valid, low severity. The package is public and MIT licensed. The real concern is package maturity, not licensing. Issue #20, SDK fails to deserialize agent tool definitions Confirmed, medium severity. The agent traced the problem to metadata handling in AgentFrameworkService.cs and suggested a safe fallback path. Issue #19, Create Preview experience from AI Foundry is incomplete Confirmed, medium severity. The agent found the gap between the environment variables people are told to paste and the variables the app actually expects. What stood out to us was not just that the output was correct. It was that the agent was careful. It did not overclaim. It separated a documentation concern from two real product bugs. Then it asked whether we wanted it to start implementing the fixes. That is the posture we want from an engineering agent: useful, specific, and a little humble. What the onboarding is really doing After working through the whole flow, we do not think of onboarding as a wizard anymore. We think of it as the process of giving the agent a fair shot. Each connection removes one reason for the model to bluff: Code keeps it from guessing how the system works. Logs keep it from guessing what data exists. Incidents keep it close to operational reality. Azure resource scope keeps it from wandering. Knowledge files keep team-specific context from getting lost. This is the same lesson we learned building the product itself. The agent does better when it can discover context progressively inside a world that is real and well-scoped. Good onboarding is how you create that world. Closing The main thing we learned from this work is simple: onboarding is not done when the resource exists. It is done when the agent can help with a real problem. In one setup we were able to connect a real app, fire a real alert, create a real response plan, debug a real RBAC problem, inspect real logs, and triage real GitHub issues. That is a much better standard than "the wizard completed successfully." If you try SRE Agent after GA, start there. Connect the things that make your system legible, then ask a question that would actually matter during a bad day. The answer will tell you very quickly whether the agent has a real starting point. Create your SRE Agent -> Azure SRE Agent is generally available starting March 10, 2026.448Views1like0CommentsWhat's new in Azure SRE Agent in the GA release
Azure SRE Agent is now generally available (read the GA announcement). . After months in preview with teams across Microsoft and early customers, here's what's shipping at GA. We use SRE Agent in our team We built SRE Agent to solve our own operational problems first. It investigates our regressions, triages errors daily, and turns investigations into reusable knowledge. Every capability in this release was shaped from those learnings. → The Agent That Investigates Itself What's new at GA Redesigned onboarding — useful on day one Can a new agent become useful the same day you set it up? That's the bar we designed around. Connect code, logs, incidents, Azure resources, and knowledge files in a single guided flow. → What It Takes to Give an SRE Agent a Useful Starting Point Deep Context — your agent builds expertise on your environment Continuous access to your logs, code, and knowledge. Persistent memory across investigations. Background intelligence that runs when nobody is asking questions. Your agent already knows your routes, error handlers, and deployment configs because it's been exploring your environment continuously. It remembers what worked last time and surfaces operational insights nobody asked for. → Meet the Best Engineer That Learns Continuously Why SRE Agent - Capabilities that move the needle Automated investigation — proactive and reactive Set up scheduled tasks to run investigations on a cadence — catch issues before they become incidents. When an incident does fire, your agent picks it up automatically through integrations with platforms like ICM, PagerDuty, and ServiceNow. Faster root cause analysis → lower MTTR Your agent is code and context aware and learns continuously. It connects runtime errors to the code that caused them and gets faster with every investigation. Automate workflows across any ecosystem → reduce toil Connect to any system via MCP connectors. Eliminate the context-switching of working across multiple platforms, orchestrate workflows across Azure, monitoring, ticketing, and more from a single place. Integrate with any HTTP API → bring your own tools Write custom Python tools that call any endpoint. Extend your agent to interact with internal APIs, third-party services, or any system your team relies on. Customize your agent → skills and plugins Add your own skills to teach domain-specific knowledge, or browse the Plugin Marketplace to install pre-built capabilities with a single click. Get started Create your agent Documentation Get started guide Pricing Feedback & issues Samples Videos This is just the start — more capabilities are coming soon. Try it out and let us know what you think.1.8KViews0likes0CommentsAgent Hooks: Production-Grade Governance for Azure SRE Agent
Introduction Azure SRE Agent helps engineering teams automate incident response, diagnostics, and remediation tasks. But when you're giving an agent access to production systems—your databases, your Kubernetes clusters, your cloud resources—you need more than just automation. You need governance. Today, we're diving deep into Agent Hooks, the built-in governance framework in Azure SRE Agent that lets you enforce quality standards, prevent dangerous operations, and maintain audit trails without writing custom middleware or proxies. Agent Hooks work by intercepting your SRE Agent at critical execution points—before it responds to users (Stop hooks) or after it executes tools (PostToolUse hooks). You define the rules once in your custom agent configuration, and the SRE Agent runtime enforces them automatically across every conversation thread. In this post, we'll show you how to configure Agent Hooks for a real production scenario: diagnosing and remediating PostgreSQL connection pool exhaustion while maintaining enterprise controls. The Challenge: Autonomous Remediation with Guardrails You're managing a production application backed by Azure PostgreSQL Flexible Server. Your on-call team frequently deals with connection pool exhaustion issues that cause latency spikes. You want your SRE Agent to diagnose and resolve these incidents autonomously, but you need to ensure: Quality Control: The agent provides thorough, evidence-based analysis instead of superficial guesses Safety: The agent can't accidentally execute dangerous commands, but can still perform necessary remediation Compliance: Every agent action is logged for security audits and post-mortems Without Agent Hooks, you'd need to build custom middleware, write validation logic around the SRE Agent API, or settle for manual approval workflows. With Agent Hooks, you configure these controls once in your custom agent definition and the SRE Agent platform enforces them automatically. The Scenario: PostgreSQL Connection Pool Exhaustion For our demo, we'll use a real production application (octopets-prod-web) experiencing connection pool exhaustion. When this happens: P95 latency spikes from ~120ms to 800ms+ Active connections reach the pool limit New requests get queued or fail The correct remediation is to restart the PostgreSQL Flexible Server to flush stale connections—but we want our agent to do this safely and with proper oversight. Demo Setup: Three Hooks, Three Purposes We'll configure three hooks that work together to create a robust governance framework: Hook #1: Quality Gate (Stop Hook) Ensures the agent provides structured, evidence-based responses before presenting them to users. Hook #2: Safety Guardrails (PostToolUse Hook) Blocks dangerous commands while allowing safe operations through an explicit allowlist. Hook #3: Audit Trail (Global Hook) Logs every tool execution across all agents for compliance and debugging. Step-by-Step Implementation Creating the Custom Agent First, we create a specialized subagent in the Azure SRE Agent platform called sre_analyst_agent designed for PostgreSQL diagnostics. In the Agent Canvas, we configure the agent instructions: You are an SRE agent responsible for diagnosing and remediating production issues for an application backed by an Azure PostgreSQL Flexible Server. When investigating a problem: - Use available tools to query Azure Monitor metrics, PostgreSQL logs, and connection statistics - Look for patterns: latency spikes, connection counts, error rates, CPU/memory pressure - Quantify findings with actual numbers where possible (e.g., P95 latency in ms, active connection count, error rate %) When presenting your diagnosis, structure your response with these exact sections: ## Root Cause A precise explanation of what is causing the issue. ## Evidence Specific metrics and observations that support your root cause. Include actual numbers: latency values in ms, connection counts, error rates, timestamps. ## Recommended Actions Numbered list of remediation steps ordered by priority. Be specific — include actual resource names and exact commands. When executing a fix: - Always verify the current state before acting - Confirm the fix worked by re-checking the same metrics after the action - Report before and after numbers to show impact This explicit guidance ensures the agent knows the correct remediation path. Configuring Hook #1: Quality Gate In the Agent Canvas' Hooks tab, we add our first agent-level hook—a Stop hook that fires before the SRE Agent presents its response. This hook uses the SRE Agent's own LLM to evaluate response quality: Event Type: Stop Hook Type: Prompt Activation: Always Hook Prompt: You are a quality gate for an SRE agent that investigates database and app performance issues. Review the agent's response below: $ARGUMENTS Evaluate whether the response meets ALL of the following criteria: 1. Has a "## Root Cause" section with a specific, clear explanation (not vague — must say specifically what failed, e.g., "connection pool exhaustion due to long-running queries holding connections" not just "database issue") 2. Has a "## Evidence" section that includes at least one concrete metric or data point with an actual number (e.g., "P95 latency spiked to 847ms", "active connections: 497/500", "error rate: 23% over last 15 minutes") 3. Has a "## Recommended Actions" section with numbered, specific steps (must include actual resource names or commands, not just "restart the database") If ALL three criteria are met with substantive content, respond: {"ok": true} If ANY criterion is missing, vague, or uses placeholder text, respond: {"ok": false, "reason": "Your response needs more depth before it reaches the user. Specifically: ## Root Cause must name the exact failure mechanism, ## Evidence must include real metric values with numbers (latency in ms, connection counts, error rates), ## Recommended Actions must reference actual resource names and specific commands. Go back and verify your findings."} This hook acts as an automated quality gate built directly into the SRE Agent runtime, catching superficial responses before they reach your on-call engineers. Configuring Hook #2: Safety Guardrails Our second agent-level hook is a PostToolUse hook that fires after the SRE Agent executes Bash or Python tools. This implements an allowlist pattern to control what commands can actually run in production: Event Type: PostToolUse Hook Type: Command (Python) Matcher: Bash|ExecuteShellCommand|ExecutePythonCode Activation: Always Hook Script: #!/usr/bin/env python3 import sys, json, re context = json.load(sys.stdin) tool_input = context.get('tool_input', {}) command = '' if isinstance(tool_input, dict): command = tool_input.get('command', '') or tool_input.get('code', '') # Safe allowlist — check these FIRST before any blocking logic # These are explicitly approved remediation actions for PostgreSQL issues safe_allowlist = [ r'az\s+postgres\s+flexible-server\s+restart', ] for safe_pattern in safe_allowlist: if re.search(safe_pattern, command, re.IGNORECASE): print(json.dumps({ 'decision': 'allow', 'hookSpecificOutput': { 'additionalContext': '[SAFETY] ✅ PostgreSQL server restart approved — recognized as a safe remediation action for connection pool exhaustion.' } })) sys.exit(0) # Destructive commands to block dangerous = [ (r'\baz\s+postgres\s+flexible-server\s+delete\b', 'az postgres flexible-server delete (permanent server deletion)'), (r'\baz\s+\S+\s+delete\b', 'az delete (Azure resource deletion)'), (r'\brm\s+-rf\b', 'rm -rf (recursive force delete)'), (r'\bsudo\b', 'sudo (privilege escalation)'), (r'\bdrop\s+(table|database)\b', 'DROP TABLE/DATABASE (irreversible data loss)'), (r'\btruncate\s+table\b', 'TRUNCATE TABLE (irreversible data wipe)'), (r'\bdelete\s+from\b(?!.*\bwhere\b)', 'DELETE FROM without WHERE clause (wipes entire table)'), ] for pattern, label in dangerous: if re.search(pattern, command, re.IGNORECASE): print(json.dumps({ 'decision': 'block', 'reason': f'🛑 BLOCKED: {label} is not permitted. Use safe, non-destructive alternatives. For PostgreSQL connection issues, prefer server restart or connection pool configuration changes.' })) sys.exit(0) print(json.dumps({'decision': 'allow'})) This ensures only pre-approved PostgreSQL operations can execute, preventing accidental data deletion or configuration changes. Now that we've configured both agent-level hooks, here's what our custom agent looks like in the canvas: - Overview ofsre_analyst_agent with hooks. Agent Canvas showing the sre_analyst_agent configuration with two agent-level hooks attached Configuring Hook #3: Audit Trail Finally, we create a Global hook using the Hooks management page in the Azure SRE Agent Portal. Global hooks apply across all custom agents in your organization, providing centralized governance: obal Hooks Management Page - Creating the sre_audit_trail global hook. The Global Hooks management page showing the sre_audit_trail hook configuration with event type, activation mode, matcher pattern, and Python script editor Event Type: PostToolUse Hook Type: Command (Python) Matcher: * (all tools) Activation: On-demand Hook Script: #!/usr/bin/env python3 import sys, json context = json.load(sys.stdin) tool_name = context.get('tool_name', 'unknown') agent_name = context.get('agent_name', 'unknown') succeeded = context.get('tool_succeeded', False) turn = context.get('current_turn', '?') audit = f'[AUDIT] Turn {turn} | Agent: {agent_name} | Tool: {tool_name} | Success: {succeeded}' print(audit, file=sys.stderr) print(json.dumps({ 'decision': 'allow', 'hookSpecificOutput': { 'additionalContext': audit } })) By setting this as "on-demand," your SRE engineers can toggle this hook on/off per conversation thread from the chat interface—enabling detailed audit logging during incident investigations without overwhelming logs during routine queries. Seeing Agent Hooks in Action Now let's see how these hooks work together when our SRE Agent investigates a real production incident. Activating Audit Trail Before starting our investigation, we toggle on the audit trail hook from the chat interface: - Managing hooks for this thread with sre_audit_trail activated the "Manage hooks for this thread" menu showing the sre_audit_trail global hook toggled on for this conversation This gives us visibility into every tool the agent executes during the investigation. Starting the Investigation We prompt our SRE Agent: "Can you check the octopets-prod-web application and diagnose any performance issues?" The SRE Agent begins gathering metrics from Azure Monitor, and we immediately see our audit trail hook logging each tool execution: This real-time visibility is invaluable for understanding what your SRE Agent is doing and debugging issues when things don't go as planned. Quality Gate Rejection The SRE Agent completes its initial analysis and attempts to respond. But our Stop hook intercepts it—the response doesn't meet our quality standards: - Stop hook forcing agent to provide more detailed analysisStop hook rejection message: "Your response needs more depth and specificity..." forcing the agent to re-analyze with more evidence The hook rejects the response and forces the SRE Agent to retry—gathering more evidence, querying additional metrics, and providing specific numbers. This self-correction happens automatically within the SRE Agent runtime, with no manual intervention required. Structured Final Response After re-verification, the SRE Agent presents a properly structured analysis that passes our quality gate: with Root Cause, Evidence, and Recommended Actions. Agent response showing the required structure: Root Cause section with connection pool exhaustion diagnosis, Evidence section with specific metric numbers, and Recommended Actions with the exact restart command Root Cause: Connection pool exhaustion Evidence: Specific metrics (83 active connections, P95 latency 847ms) Recommended Actions: Restart command with actual resource names This is the level of rigor we expect from production-ready agents. Safety Allowlist in Action The SRE Agent determines it needs to restart the PostgreSQL server to remediate the connection pool exhaustion. Our PostToolUse hook intercepts the command execution and validates it against our allowlist: - PostgreSQL metrics query and restart command output. Code execution output showing the PostgreSQL metrics query results and the az postgres flexible-server restart command being executed successfully Because the az postgres flexible-server restart command matches our safety allowlist pattern, the hook allows it to proceed. If the SRE Agent had attempted any unapproved operation (like DROP DATABASE or firewall rule changes), the safety hook would have blocked it immediately. The Results After the SRE Agent restarts the PostgreSQL server: P95 latency drops from 847ms back to ~120ms Active connections reset to healthy levels Application performance returns to normal But more importantly, we achieved autonomous remediation with enterprise governance: ✅ Quality assurance: Every response met our evidence standards (enforced by Stop hooks) ✅ Safety controls: Only pre-approved operations executed (enforced by PostToolUse hooks) ✅ Complete audit trail: Every tool call logged for compliance (enforced by Global hooks) ✅ Zero manual interventions: The SRE Agent self-corrected when quality standards weren't met This is the power of Agent Hooks—governance that doesn't get in the way of automation. Key Takeaways Agent Hooks bring production-grade governance to Azure SRE Agent: Layered Governance: Combine agent-level hooks for custom agent-specific controls with global hooks for organization-wide policies Fail-Safe by Default: Use allowlist patterns in PostToolUse hooks rather than denylists—explicitly permit safe operations instead of trying to block every dangerous one Self-Correcting SRE Agents: Stop hooks with quality gates create feedback loops that improve response quality without human intervention Audit Without Overhead: On-demand global hooks let your engineers toggle detailed logging only during incident investigations No Custom Middleware: All governance logic lives in your custom agent configuration—no need to build validation proxies or wrapper services Getting Started Agent Hooks are available now in the Azure SRE Agent platform. You can configure them entirely through the UI—no API calls or tokens needed: Agent-Level Hooks: Navigate to the Agent Canvas → Hooks tab and add hooks directly to your custom agent Global Hooks: Use the Hooks management page to create organization-wide policies Thread-Level Control: Toggle on-demand hooks from the chat interface using the "Manage hooks" menu Learn More Agent Hooks Documentation YAML Schema Reference Subagent Builder Guide Ready to build safer, smarter agents? Start experimenting with Agent Hooks today at sre.azure.com.239Views0likes0CommentsAzure SRE Agent Now Builds Expertise Like Your Best Engineer Introducing Deep Context
What if SRE Agent already knew your system before the next incident? Your most experienced SRE didn't become an expert overnight. Day one: reading runbooks, studying architecture diagrams, asking a lot of questions. Month three: knowing which services are fragile, which config changes cascade, which log patterns mean real trouble. Year two: diagnosing a production issue at 2 AM from a single alert because they'd built deep, living context about your systems. That learning process, absorbing documentation, reading code, handling incidents, building intuition from every interaction is what makes an expert. Azure SRE Agent could do the same thing From pulling context to living in it Azure SRE Agent already connects to Azure Monitor, PagerDuty, and ServiceNow. It queries Kusto logs, checks resource health, reads your code, and delivers root cause analysis often resolving incidents without waking anyone up. Thousands of incidents handled. Thousands of engineering hours saved. Deep Context takes this to the next level. Instead of accessing context on demand, your agent now lives in it — continuously reading your code, knowledge building persistent memory from every interaction, and evolving its understanding of your systems in the background. Three things makes Deep Context work: Continuous access. Source code, terminal, Python runtime, and Azure environment are available whenever the agent needs them. Connected repos are cloned into the agent's workspace automatically. The agent knows your code structure from the first message. Persistent memory. Insights from previous investigations, architecture understanding, team context — it all persists across sessions. The next time the agent picks up an alert, it already knows what happened last time. Background intelligence. Even when you're not chatting, background services continuously learn. After every conversation, the agent extracts what worked, what failed, what the root cause was. It aggregates these across all past investigations to build evolving operational insights. The agent recognizes patterns you haven't noticed yet. One example: connected to Kusto, background scanning auto-discovers every table, documents schemas, and builds reusable query templates. But this learning applies broadly — every conversation, every incident, every data source makes the agent sharper. Expertise that compound with every incident New on-call engineer SRE Agent with Deep Context Alert fires Opens runbook, looks up which service this maps to Already knows the service, its dependencies, and failure patterns from prior incidents Investigation Reads logs, searches code, asks teammates Goes straight to the relevant code path, correlates with logs and persistent insights from similar incidents After 100 incidents Becomes the team expert — irreplaceable institutional knowledge Same institutional knowledge — always available, never forgets, scales across your entire organization A human expert takes months to build this depth. An agent with Deep Context builds it in days and the knowledge compounds with every interaction. You shape what your agent learns. Deep Context learns automatically but the best results come when your team actively guides what the agent retains. Type #remember in chat to save important facts your agent should always know environment details, escalation paths, team preferences. For example: "#remember our Redis cache uses Premium tier with 6GB" or "#remember database failover takes approximately 15 minutes." These are recalled automatically during future investigations. Turn investigations into knowledge. After a good investigation, ask your agent to turn the resolution into a runbook: "Create a troubleshooting guide from the steps we just followed and save it to Knowledge settings." The agent generates a structured document, uploads it, and indexes it — so the next time a similar issue occurs, the agent finds and follows that guide automatically. The agent captures insights from every conversation on its own. Your guidance tells it which ones matter most. This is exactly how Microsoft’s own SRE team gets the best results: “Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn’t fail that class of problem again.” Read the full story in The Agent That Investigates Itself. See it in action: an Azure Monitor alert, end to end An HTTP 5xx spike fires on your container app. Your agent is in autonomous mode. It acknowledges the alert, checks resource health, reads logs, and delivers a diagnosis — that's what it already does well. Deep Context makes this dramatically better. Two things change everything: The agent already knows your environment.It'salready read your code, runbooks, and built context from previous investigations. Your route handlers, database layer, deployment configs, operational procedures, it knows all of it. So, when these alert fires, it doesn't start from scratch. It goes straight to the relevant code path, correlates a recent connection pooling commit with the deployment timeline, and confirms the root cause. The agent remembers.It's seen this pattern before a similar incident last week that was investigated but never permanently fixed. It recognizes the recurrence from persistent memory, skips rediscovery, confirms the issue is still in the code, and this time fixes it. Because it's in autonomous mode, the agent edits the source code, restarts the container, pushes the fix to a new branch, creates a PR, opens a GitHub Issue, and verifies service health, all before you wake up. The agent delivers a complete remediation summary including alert, root cause with code references, fix applied, PR created, without a single message from you. Code access turns diagnosis into action. Persistent memory turns recurring problems into solved problems. Give your agent your code — here's why it matters If you're on an IT operations, SRE, or DevOps team, you might think: "Code access? That's for developers." We'd encourage you to rethink that. Your infrastructure-as-code, deployment configs, Helm charts, Terraform files, pipeline definitions — that's all code. And it's exactly the context your agent needs to go from good to extraordinary. When your agent can read your actual configuration and infrastructure code, investigations transform. Instead of generic troubleshooting, you get root cause analysis that points to the exact file, the exact line, the exact config change. It correlates a deployment failure with a specific commit. It reads your Helm values and spots the misconfiguration that caused the pod crash loop. "Will the agent modify our production code?" No. The agent works in a secure sandbox — a copy of your repository, not your production environment. When it identifies a fix, it creates a pull request on a new branch. Your code review process, your CI/CD pipeline, your approval gates — all untouched. The agent proposes. Your team decides. Whether you're a developer, an SRE, or an IT operator managing infrastructure you didn't write — connecting your code is the single highest-impact thing you can do to make your agent smarter. The compound effects Deep Context amplifies every other SRE Agent capability: Deep Context + Incident management → Alerts fire, the agent correlates logs with actual code. Root cause references specific files and line numbers. Deep Context + Scheduled tasks → Automated code analysis, compliance checks, and drift detection — inspecting your actual infrastructure code, not just metrics. Deep Context + MCP connectors → Datadog, Splunk, PagerDuty data combined with source code context. The full picture in one conversation. Deep Context + Knowledge files → Upload runbooks, architecture docs, postmortems — in any format. The agent cross-references your team's knowledge with live code, logs, and infrastructure state. Logs tell the agent what happened. Code tells it why. Your knowledge files tell it what to do about it. Get started Deep Context is available today as part of Azure SRE Agent GA. New agents have it enabled by default. For a step-by-step walkthrough connecting your code, logs, incidents, and knowledge files, see What It Takes to Give an SRE Agent a Useful Starting Point Resources SRE Agent GA Announcement blog - https://aka.ms/sreagent/ga SRE Agent GA What’s new post - https://aka.ms/sreagent/blog/whatsnewGA SRE Agent Documentation – https://aka.ms/sreagent/newdocs SRE Agent Overview - https://aka.ms/sreagent/newdocsoverview382Views0likes0CommentsMigrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
What is ACI/Virtual Nodes? Azure Container Instances (ACI) is a fully-managed serverless container platform which gives you the ability to run containers on-demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM‑backed node pools. From a developer’s perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI’s serverless infrastructure, enabling fast scale‑out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short‑lived workloads where speed and cost efficiency matter more than long‑running capacity planning. Introducing the next generation of Virtual Nodes on ACI The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add‑on and delivering a more Kubernetes‑native, flexible, and scalable experience when bursting workloads from AKS to ACI. In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy), to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm. More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide. Please note that all code samples within this guide are examples only, and are provided without warranty/support. Background Virtual Nodes on ACI is rebuilt from the ground-up, and includes several fixes and enhancements, for instance: Added support/features VNet peering, outbound traffic to the internet with network security groups Init containers Host aliases Arguments for exec in ACI Persistent Volumes and Persistent Volume Claims Container hooks Confidential containers (see supported regions list here) ACI standby pools Support for image pulling via Private Link and Managed Identity (MSI) Planned future enhancements Kubernetes network policies Support for IPv6 Windows containers Port Forwarding Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on. Requirements & limitations Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster’s VMs Each Virtual Nodes node supports up to 200 pods DaemonSets are not supported Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking) Virtual Nodes on ACI is incompatible with API server authorized IP ranges for AKS (because of the subnet delegation to ACI) Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, and therefore the steps below show an example of removing Virtual Nodes managed add-on and its resources and then installing the Virtual Nodes on ACI Helm chart. In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet, however if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Prerequisites A recent version of the Azure CLI An Azure subscription with sufficient ACI quota for your selected region Helm Deployment steps Initialise environment variables location=northeurope rg=rg-virtualnode-demo vnetName=vnet-virtualnode-demo clusterName=aks-virtualnode-demo aksSubnetName=subnet-aks vnSubnetName=subnet-vn Scale-down any running Virtual Nodes workloads (example below): kubectl delete deploy <deploymentName> -n <namespace> Drain and cordon the legacy Virtual Nodes node: kubectl drain virtual-node-aci-linux Disable the Virtual Nodes managed add-on (legacy): az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node Export a backup of the original subnet configuration: az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json Delete the original subnet (subnets cannot be renamed and therefore must be re-created): az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName Create the new Virtual Nodes on ACI subnet (replicate the configuration of the original subnet but with the specific name value of cg): vnSubnetId=$(az network vnet subnet create \ --resource-group $rg \ --vnet-name $vnetName \ --name cg \ --address-prefixes 10.241.0.0/16 \ --delegations Microsoft.ContainerInstance/containerGroups --query id -o tsv) Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet: nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv) nodeRgId=$(az group show -n $nodeRg --query id -o tsv) agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv) agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv) az role assignment create \ --assignee-object-id "$agentPoolIdentityObjectId" \ --assignee-principal-type ServicePrincipal \ --role "Contributor" \ --scope "$nodeRgId" az role assignment create \ --assignee-object-id "$agentPoolIdentityObjectId" \ --assignee-principal-type ServicePrincipal \ --role "Network Contributor" \ --scope "$vnSubnetId" Download the cluster's kubeconfig file: az aks get-credentials -n $clusterName -g $rg Clone the virtualnodesOnAzureContainerInstances GitHub repo: git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git Install the Virtual Nodes on ACI Helm chart: helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode Confirm the Virtual Nodes node shows within the cluster and is in a Ready state (virtualnode-n): $ kubectl get node NAME STATUS ROLES AGE VERSION aks-nodepool1-35702456-vmss000000 Ready <none> 4h13m v1.33.6 aks-nodepool1-35702456-vmss000001 Ready <none> 4h13m v1.33.6 virtualnode-0 Ready <none> 162m v1.33.7 Delete the previous Virtual Nodes node from the cluster: kubectl delete node virtual-node-aci-linux Test and confirm pod scheduling on Virtual Node: apiVersion: v1 kind: Pod metadata: annotations: name: demo-pod spec: containers: - command: - /bin/bash - -c - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done' image: mcr.microsoft.com/azure-cli name: hello-world-counter resources: limits: cpu: 2250m memory: 2256Mi requests: cpu: 100m memory: 128Mi nodeSelector: virtualization: virtualnode2 tolerations: - effect: NoSchedule key: virtual-kubelet.io/provider operator: Exists If the pod successfully starts on the Virtual Node, you should see similar to the below: $ kubectl get pod -o wide demo-pod NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES demo-pod 1/1 Running 0 95s 10.241.0.4 vnode2-virtualnode-0 <none> <none> Modify your deployments to run on Virtual Nodes on ACI For Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes: nodeSelector: kubernetes.io/role: agent kubernetes.io/os: linux type: virtual-kubelet tolerations: - key: virtual-kubelet.io/provider operator: Exists - key: azure.com/aci effect: NoSchedule For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different: nodeSelector: virtualization: virtualnode2 tolerations: - effect: NoSchedule key: virtual-kubelet.io/provider operator: Exists Troubleshooting Check the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace: $ kubectl get pod -n vn2 NAME READY STATUS RESTARTS AGE virtual-node-admission-controller-54cb7568f5-b7hnr 1/1 Running 1 (5h21m ago) 5h21m virtualnode-0 6/6 Running 6 (4h48m ago) 4h51m If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate). If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group): kubectl logs -n vn2 virtualnode-0 -c proxycri Further troubleshooting guidance is available within the official documentation. Support If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here369Views1like0CommentsMicrosoft Agent Framework, Microsoft Foundry, MCP, Aspire を使った実践的な AI アプリを構築するサンプルが登場
AI エージェントを作ること自体は、以前よりも簡単になってきました。しかし、それらを実際の本番運用のアプリケーションの一部としてデプロイすること (複数のサービス、永続的な状態管理、本番向けのインフラを含めた形で運用すること) になると、途端に複雑になります。 .NET コミュニティの開発者からも、ローカル環境でもクラウドでも動作する、クラウドネイティブな実運用レベルのサンプルが見たいという声が多く寄せられていました。 その声に応え、私たちはオープンソースのサンプルアプリ『Interview Coach (面接コーチ)』を作りました。模擬就職面接を行う AI チャット web アプリです。 このサンプルでは、本番運用を想定したサービスにおいて、以下の技術がどのように組み合わさるのかを示しています: Microsoft Agent Framework Microsoft Foundry Model Context Protocol (MCP) Aspire このアプリは、実際に動作する 面接シミュレーター です。AI コーチがユーザーに対して行動面や技術面の質問を行い、最後に面接パフォーマンスのまとめをフィードバックとして提供します。 この記事では、このアプリで使用している設計パターンと、それらがどのような課題を解決するのかを紹介します。 こちらから Interview Coach デモアプリ を試すことができます。 なぜ Microsoft Agent Framework なのか? もしこれまで .NET で AI エージェントを開発してきたなら、おそらく Semantic Kernel や AutoGen、あるいはその両方を使ったことがあるでしょう。 Microsoft Agent Framework は、それらの次のステップにあたるフレームワークです。 このフレームワークは、同じチームによって開発されており、両プロジェクトをうまく統合して、1つのフレームワークにまとめたものです。 具体的には、 AutoGen のエージェント抽象化 Semantic Kernel のエンタープライズ機能 (状態管理、型安全性、ミドルウェア、テレメトリーなど) を統合し、さらにマルチエージェントのオーケストレーションのためのグラフベースのワークフローを追加しています。 .NET 開発者にとってのメリットは次のとおりです: フレームワークが1つに統合: Semantic Kernel と AutoGen のどちらを使うか悩む必要がありません。 馴染みのある開発パターン: エージェントは dependency injection、IChatClient、そして ASP.NET アプリと同じホスティングモデルを利用します。 本番運用を前提とした設計: OpenTelemetry、ミドルウェアパイプライン、Aspire との統合が最初から用意されています。 マルチエージェントのオーケストレーション: 逐次ワークフロー、並列実行、handoff パターン、グループチャットなどをサポートします。 Interview Coach は、これらの機能を単なる Hello World ではなく、実際のアプリケーションとしてまとめたサンプルです。 なぜ Microsoft Foundry なのか? AI エージェントには、単にモデルがあれば良いわけではありません。インフラも必要です。 Microsoft Foundry は、AI アプリケーションを構築・管理するための Azure のプラットフォームであり、Microsoft Agent Framework の推奨バックエンドでもあります。 Foundry を使うと、次のような機能を1つのポータルで利用できます: モデルアクセス: OpenAI、Meta、Mistral などのモデルを1つのエンドポイントから利用できるカタログ Content safety (安全性): モデレーションや個人情報(PII)検出が組み込まれており、エージェントが問題のある出力をしないように制御。 コスト最適化ルーティング: リクエストがタスクに最適なモデルへ自動的にルーティングされる 評価とファインチューニング: エージェントの品質を測定し、継続的に改善できる エンタープライズ向けガバナンス: Entra ID や Microsoft Defender による認証、アクセス制御、コンプライアンス Interview Coach では、Foundry がエージェントを動かすモデルエンドポイントを提供しています。 エージェントコードは IChatClient インターフェースを利用しているため、Foundry はあくまで設定の選択肢の 1 つですが、最初から豊富なツールが揃っている点で最も便利な選択肢です。 Interview Coach は何をするアプリなのか? Interview Coach は、模擬就職面接を行う対話型 AI です。 ユーザーが 履歴書(resume) と 応募先の職務内容(job description)を入力すると、そこから先はエージェントが面接プロセスを進めていきます。 情報収集(Intake): 履歴書と応募先の職務内容を収集します。 行動面接(Behavioral interview): あなたの経験に合わせて、 STAR メソッド (過去の行動を構造的に説明するための回答フレームワークで、Situation, Task, Action, Result の頭文字から来ている) に基づいた質問を行います。 技術面接(Technical interview): 応募する職種に応じた技術的な質問を行います。 まとめ(Summary): 面接のパフォーマンス (成績) を評価し、具体的なフィードバックを含むレビューを生成します。 ユーザーは、このシステムと Blazor の Web UI を通して対話します。 AI の回答は リアルタイムでストリーミング表示されます。 余談: Behavioral Interview とは Behavioral Interview(行動面接/行動事例面接)とは、応募者の「過去の具体的な行動」を深掘りし、その人の行動特性、スキル、考え方が企業の求める人材像と適合しているかを判断する面接手法です。 単なる知識や志望動機ではなく、「ストレスを感じた時どう対処したか」など過去の事実に基づき、将来のパフォーマンスを予測します。 アーキテクチャ概要 このアプリケーションは、複数のサービスに分割されており、すべて Aspire によってオーケストレーションされています: LLM Provider: Microsoft Foundry(推奨)を利用し、さまざまなモデルへアクセスします。 WebUI: 面接の対話を行うための Blazor ベースのチャットインターフェースです。 Agent: 面接のロジックを担うコンポーネントで、Microsoft Agent Framework 上で構築されています。 MarkItDown MCP Server: Microsoft の MarkItDown (なんでも Markdown にしてくれる Python ライブラリ) を利用し、履歴書(PDF や DOCX)を Markdown 形式に変換して解析します。 InterviewData MCP Server: .NET で実装された MCP サーバーで、面接セッションのデータを SQLite に保存します。 Aspire は、サービスディスカバリ、ヘルスチェック、テレメトリーを管理します。 各コンポーネントは 独立したプロセスとして実行され、1 つのコマンドでアプリケーション全体を起動できます。 パターン 1 : マルチエージェントによるハンドオフ このサンプルが特に興味深いのは、 ハンドオフ (handoff) パターン を採用している点です。 1 つのエージェントがすべてを処理するのではなく、面接のプロセスを 5 つの専門エージェントに分割しています: Agent 役割 Tools Triage (トリアージ) メッセージを適切な担当エージェントへ振り分ける なし(ルーティングのみ) Receptionist (受付) セッションを作成し、履歴書と職務内容を収集 MarkItDown + InterviewData Behavioral Interviewer (行動面接官) STAR メソッドを用いた行動面接を実施 InterviewData Technical Interviewer (技術面接官) 職種に応じた技術面接の質問を行う InterviewData Summarizer (サマリー生成) 面接の最終的なサマリーを生成 InterviewData ハンドオフパターンでは、あるエージェントが会話の制御を次のエージェントに完全に引き渡します。 引き継いだエージェントが、その後の会話をすべて担当します。 これは 「agent-as-tools」パターンとは異なります。 (agent-as-tools では、メインのエージェントが他のエージェントを補助ツールとして呼び出しますが、会話の制御自体はメインエージェントが保持します。) 以下は、このハンドオフワークフローの構成例です: var workflow = AgentWorkflowBuilder .CreateHandoffBuilderWith(triageAgent) .WithHandoffs(triageAgent, [receptionistAgent, behaviouralAgent, technicalAgent, summariserAgent]) .WithHandoffs(receptionistAgent, [behaviouralAgent, triageAgent]) .WithHandoffs(behaviouralAgent, [technicalAgent, triageAgent]) .WithHandoffs(technicalAgent, [summariserAgent, triageAgent]) .WithHandoff(summariserAgent, triageAgent) .Build(); 通常の処理フロー(happy path)は次の順序で進みます。 Receptionist → Behavioral → Technical → Summarizer それぞれの専門エージェントが、次のエージェントへ直接ハンドオフします。 もし想定外の状況が発生した場合は、エージェントは Triage エージェントへ戻り、適切なルーティングを再度行います。 なお、このサンプルには 単一エージェントモードも用意されており、よりシンプルな構成でのデプロイも可能です。 これにより、単一エージェントとマルチエージェントのアプローチを比較することができます。 パターン2: ツール統合のための MCP このプロジェクトでは、ツールはエージェントの内部に実装されていません。 それぞれが 独立した MCP(Model Context Protocol)サーバーとして提供されています。 例えば、MarkItDown サーバーはこのプロジェクトだけでなく、まったく別のエージェントプロジェクトでも再利用できます。また、ツール開発チームはエージェント開発チームとは独立してツールをリリースすることが可能です。 MCP は言語非依存(language-agnostic)であることも特徴です。 そのため、このサンプルでは MarkItDown が Python サーバーとして動作し、エージェントは .NET で実装されています。 エージェントは起動時に MCP クライアントを通じてツールを検出し、必要なエージェントにそれらを渡します。 var receptionistAgent = new ChatClientAgent( chatClient: chatClient, name: "receptionist", instructions: "You are the Receptionist. Set up sessions and collect documents...", tools: [.. markitdownTools, .. interviewDataTools]); 各エージェントには、必要なツールだけが割り当てられます: Triage エージェント:ツールなし(ルーティングのみを担当) インタビュアーエージェント:セッションデータへのアクセス Receptionist エージェント:ドキュメント解析 + セッションアクセス これは 最小権限の原則(principle of least privilege) に基づいた設計です。 パターン3: Aspire によるオーケストレーション Aspire は、アプリケーション全体をまとめて管理する役割を担います。 アプリホストはサービスのトポロジー(構成)を定義し、 どのサービスが存在するのか それぞれがどのように依存しているのか どの設定を受け取るのか を管理します。 これにより、次のような機能が利用できます: Service discovery. サービスは固定の URL ではなく、サービス名で互いを見つけることができます。 Health checks. Aspire ダッシュボードで、各コンポーネントの状態を確認できます。 Distributed tracing. 共通のサービス設定を通じて OpenTelemetry が組み込まれます。 One-command startup. aspire run --file ./apphost.cs を実行するだけで、すべてのサービスが起動します。 デプロイ時には、azd up を実行することで、アプリケーション全体が Azure Container Apps にデプロイされます。 始めてみよう 事前準備 .NET 10 SDK 以降 Azure サブスクリプション Microsoft Foundry project Docker Desktop またはその他のコンテナランタイム ローカルで実行する git clone https://github.com/Azure-Samples/interview-coach-agent-framework.git cd interview-coach-agent-framework # Configure credentials dotnet user-secrets --file ./apphost.cs set MicrosoftFoundry:Project:Endpoint "<your-endpoint>" dotnet user-secrets --file ./apphost.cs set MicrosoftFoundry:Project:ApiKey "<your-key>" # Start all services aspire run --file ./apphost.cs Aspire Dashboard を開き、すべてのサービスの状態が Running になるまで待ちます。 その後、WebUI のエンドポイントをクリックすると、模擬面接を開始できます。 以下は、ハンドオフパターンがどのように動作するかを DevUI 上で可視化したものです。 このチャット UI を使って、面接候補者としてエージェントと対話することができます。 Azure にデプロイする azd auth login azd up Tこれだけで完了です。 残りの処理は Aspire と azd が自動で実行します。 デプロイとテストが完了したら、次のコマンドを実行することで、作成されたすべてのリソースを安全に削除できます。 azd down --force --purge このサンプルから学べること Interview Coach を実際に試すことで、次のような内容を理解できます: Microsoft Foundry を モデルバックエンドとして利用する方法 Microsoft Agent Framework を使った 単一エージェントおよびマルチエージェントシステムの構築 ハンドオフによるオーケストレーションを用いて、ワークフローを専門エージェントに分割する方法 エージェントコードとは独立した MCP ツールサーバーの作成と利用 Aspire を使った 複数サービスからなるアプリケーションのオーケストレーション 一貫性のある構造化された振る舞いを生み出すプロンプト設計 azd up を使った アプリケーション全体のデプロイ方法 試してみよう 完全なソースコードは GitHub で公開されています: Azure-Samples/interview-coach-agent-framework Microsoft Agent Framework を初めて使う場合は、まず次の資料から始めることをおすすめします。 framework documentation Hello World sample. その後、このサンプルに戻ってくると、これらの要素がより大きなプロジェクトの中でどのように組み合わさるのかが理解できるでしょう。 もしこれらのパターンを使って何か作った場合は、ぜひ Issue を作成して 教えてください。 次は? (What's Next?) 我々は、現在、次のような さらなる統合シナリオにも取り組んでいます: Microsoft Foundry Agent Service GitHub Copilot A2A などなど。 これらの機能がリリースされ次第、このサンプルも随時アップデートしていく予定です。 Resources Microsoft Agent Framework ドキュメント Introducing Microsoft Agent Framework preview Microsoft Agent Framework Reaches Release Candidate Microsoft Foundry ドキュメント Microsoft Foundry Agent Service Microsoft Foundry Portal Microsoft.Extensions.AI Model Context Protocol specification Aspire ドキュメント ASP.NET BlazorEven simpler to Safely Execute AI-generated Code with Azure Container Apps Dynamic Sessions
AI agents are writing code. The question is: where does that code run? If it runs in your process, a single hallucinated import os; os.remove('/') can ruin your day. Azure Container Apps dynamic sessions solve this with on-demand sandboxed environments — Hyper-V isolated, fully managed, and ready in milliseconds. Thanks to your feedback, Dynamic Sessions are now easier to use with AI via MCP. Agents can quickly start a session interpreter and safely run code – all using a built-in MCP endpoint. Additionally - new starter samples show how to invoke dynamic sessions from Microsoft Agent Framework with code interpreter and with a custom container for even more versatility. What Are Dynamic Sessions? A session pool maintains a reservoir of pre-warmed, isolated sandboxes. When your app needs one, it’s allocated instantly via REST API. When idle, it’s destroyed automatically after provided session cool down period. What you get: Strong isolation - Each session runs in its own Hyper-V sandbox — enterprise-grade security Millisecond startup -Pre-warmed pool eliminates cold starts Fully managed - No infra to maintain — automatic lifecycle, cleanup, scaling Simple access - Single HTTP endpoint, session identified by a unique ID Scalable - Hundreds to thousands of concurrent sessions Two Session Types 1. Code Interpreter — Run Untrusted Code Safely Code interpreter sessions accept inline code, run it in a Hyper-V sandbox, and return the output. Sessions support network egress and persistent file systems within the session lifetime. Three runtimes are available: Python — Ships with popular libraries pre-installed (NumPy, pandas, matplotlib, etc.). Ideal for AI-generated data analysis, math computation, and chart generation. Node.js — Comes with common npm packages. Great for server-side JavaScript execution, data transformation, and scripting. Shell — A full Linux shell environment where agents can run arbitrary commands, install packages, start processes, manage files, and chain multi-step workflows. Unlike Python/Node.js interpreters, shell sessions expose a complete OS — ideal for agent-driven DevOps, build/test environments, CLI tool execution, and multi-process pipelines. 2. Custom Containers — Bring Your Own Runtime Custom container sessions let you run your own container image in the same isolated, on-demand model. Define your image, and Container Apps handles the pooling, scaling, and lifecycle. Typical use cases are hosting proprietary runtimes, custom code interpreters, and specialized tool chains. This sample (Azure Samples) dives deeper into Customer Containers with Microsoft agent Framework orchestration. MCP Support for Dynamic Sessions Dynamic sessions also support Model Context Protocol (MCP) on both shell and Python session types. This turns a session pool into a remote MCP server that AI agents can connect to — enabling tool execution, file system access, and shell commands in a secure, ephemeral environment. With an MCP-enabled shell session, an Azure Foundry agent can spin up a Flask app, run system commands, or install packages — all in an isolated container that vanishes when done. The MCP server is enabled with a single property on the session pool (isMCPServerEnabled: true), and the resulting endpoint + API key can be plugged directly into Azure Foundry as a connected tool. For a step-by-step walkthrough, see How to add an MCP tool to your Azure Foundry agent using dynamic sessions. Deep Dive: Building an AI Travel Agent with Code Interpreter Sessions Let’s walk through a sample implementation — a travel planning agent that uses dynamic sessions for both static code execution (weather research) and LLM-generated code execution (charting). Full source: github.com/jkalis-MS/AIAgent-ACA-DynamicSession Architecture Travel Agent Architecture Component Purpose Microsoft Agent Framework Agent runtime with middleware, telemetry, and DevUI Azure OpenAI (GPT-4o) LLM for conversation and code generation ACA Session Pools Sandboxed Python code interpreter Azure Container Apps Hosts the agent in a container Application Insights Observability for agent spans The agent implements with two variants switchable in the Agent Framework DevUI — tools in ACA Dynamic Session (sandbox) and tools running locally (no isolation) — making the security value immediately visible. Scenario A: Static Code in a Sandbox — Weather Research The agent sends pre-written Python code to the session pool to fetch live weather data. The code runs with network egress enabled, calls the Open-Meteo API, and returns formatted results — all without touching the host process. import requests from azure.identity import DefaultAzureCredential credential = DefaultAzureCredential() token = credential.get_token("https://dynamicsessions.io/.default") response = requests.post( f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=weather-session-1", headers={"Authorization": f"Bearer {token.token}"}, json={"properties": { "codeInputType": "inline", "executionType": "synchronous", "code": weather_code, # Python that calls Open-Meteo API }}, ) result = response.json()["properties"]["stdout"] Scenario B: LLM-Generated Code in a Sandbox — Dynamic Charting This is where it gets interesting. The user asks “plot a chart comparing Miami and Tokyo weather.” The agent: Fetches weather data Asks Azure OpenAI to generate matplotlib code using a tightly-scoped system prompt Safety-checks the generated code for forbidden imports (subprocess, os.system, etc.) Wraps the code with data injection and sends it to the sandbox Downloads the resulting PNG from the sandbox’s /mnt/data/ directory from openai import AzureOpenAI # 1. LLM generates chart code client = AzureOpenAI(azure_endpoint=endpoint, api_key=key, api_version="2024-12-01-preview") generated_code = client.chat.completions.create( model="gpt-4o", messages=[{"role": "system", "content": CODE_GEN_PROMPT}, {"role": "user", "content": f"Weather data: {weather_json}"}], temperature=0.2, ).choices[0].message.content # 2. Execute in sandbox requests.post( f"{pool_endpoint}/code/execute?api-version=2024-02-02-preview&identifier=chart-session-1", headers={"Authorization": f"Bearer {token.token}"}, json={"properties": { "codeInputType": "inline", "executionType": "synchronous", "code": f"import json, matplotlib\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\nweather_data = json.loads('{weather_json}')\n{generated_code}", }}, ) # 3. Download the chart img = requests.get( f"{pool_endpoint}/files/content/chart.png?api-version=2024-02-02-preview&identifier=chart-session-1", headers={"Authorization": f"Bearer {token.token}"}, ).content The result — a dark-themed dual-subplot chart comparing maximal and minimal temperature forecast chart example rendered by the Chart Weather tool in Dynamic Session: Authentication The agent uses DefaultAzureCredential locally and ManagedIdentityCredential when deployed. Tokens are cached and refreshed automatically: from azure.identity import DefaultAzureCredential token = DefaultAzureCredential().get_token("https://dynamicsessions.io/.default") auth_header = f"Bearer {token.token}" # Uses ManagedIdentityCredential automatically when deployed to Container Apps Observability The agent uses Application Insights for end-to-end tracing. The Microsoft Agent Framework exposes OpenTelemetry spans for invoke_agent, chat, and execute tool — wired to Azure Monitor with custom exporters: from azure.monitor.opentelemetry import configure_azure_monitor from agent_framework.observability import create_resource, enable_instrumentation # Configure Azure Monitor first configure_azure_monitor( connection_string="InstrumentationKey=...", resource=create_resource(), # Uses OTEL_SERVICE_NAME, etc. enable_live_metrics=True, ) # Then activate Agent Framework's telemetry code paths, optional if ENABLE_INSTRUMENTATION and/or ENABLE_SENSITIVE_DATA are set in env vars enable_instrumentation(enable_sensitive_data=False) This gives you traces for every agent invocation, tool execution (including sandbox timing), and LLM call — visible in the Application Insights transaction search and end-to-end transaction view in the new Agents blade in Application Insights. You can also open a detailed dashboard by clicking Explore in Grafana. Session pools emit their own metrics and logs for monitoring sandbox utilization and performance. Combined with the agent-level Application Insights traces, you can get full visibility from the user prompt → agent → LLM → sandbox execution → response — across both your application and the infrastructure running untrusted code. Deploy with One Command The project includes full Bicep infrastructure-as-code. A single azd up provisions Azure OpenAI, Container Apps, Session Pool (with egress enabled), Container Registry, Application Insights, and all role assignments. azd auth login azd up Next Steps Dynamic sessions documentation – Microsoft Learn MCP + Shell sessions tutorial - How to add an MCP tool to your Foundry agent Custom container sessions sample - github.com/Azure-Samples/dynamic-sessions-custom-container AI Agent + Dynamic Sessions - github.com/jkalis-MS/AIAgent-ACA-DynamicSession441Views0likes0Comments