We rebuilt onboarding around a simple question: after setup, can SRE Agent help with a real app on the same day? This post shows the full setup so you can see what good looks like, but not all steps are required upfront.
In our latest posts, "The Agent that investigates itself" and "Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context", we wrote about a moment that changed how we think about agent systems. Azure SRE Agent investigated a regression in its own prompt cache, traced the drop to a specific PR, and proposed fixes. What mattered was not just the model. What mattered was the starting point. The agent had code, logs, deployment history, and a workspace it could use to discover the next piece of context.
That lesson forced an uncomfortable question about onboarding.
If a customer finishes setup and the agent still knows nothing about their app, we have not really onboarded them. We have only created a resource.
So for the March 10 GA release, we rebuilt onboarding around a more practical bar: can a new agent become useful on day one? To test that, we used the new flow the way we expect customers to use it. We connected a real sample app, wired up live Azure Monitor alerts, attached code and logs, uploaded a knowledge file, and then pushed the agent through actual work. We asked it to inspect the app, explain a 401 path from the source, debug its own log access, and triage GitHub issues in the repo.
This post walks through that experience. We connected everything we could because we wanted to see what the agent does when it has a real starting point, not a partial one. If your setup is shorter, the SRE Agent still works. It just knows less.
The cold start we were trying to fix
The worst version of an agent experience is familiar by now. You ask a concrete question about your system and get back a smart-sounding answer that is only loosely attached to reality. The model knows what a Kubernetes probe is. It knows what a 500 looks like. It may even know common Kusto table names. But it does not know your deployment, your repo, your auth flow, or the naming mistakes your team made six months ago and still lives with.
We saw the same pattern again and again inside our own work. When the agent had real context, it could do deep investigations. When it started cold, it filled the gaps with general knowledge and good guesses.
The new onboarding is our attempt to close that gap up front. Instead of treating code, logs, incidents, and knowledge as optional extras, the flow is built around connecting the things the agent needs to reason well.
Walking through the new onboarding
Starting March 10, you can create and configure an SRE Agent at sre.azure.com. Here is what that looked like for us.
Step 1: Create the agent
You choose a subscription, resource group, name, and region. Azure provisions the runtime, managed identity, Application Insights, and Log Analytics workspace. In our run, the whole thing took about two minutes.
That first step matters more than it might seem. We are not just spinning up a chatbot. We are creating the execution environment where the agent can actually work: run commands, inspect files, query services, and keep track of what it learns.
Step 2: Start adding context
Once provisioning finishes, you land on the setup page.
The page is organized around the sources that make the agent useful: code, logs, incidents, Azure resources, and knowledge files.
| Data source | Why it matters |
|---|---|
| Code | Lets the agent read the system it is supposed to investigate. |
| Logs | Gives it real tables, schemas, and data instead of guesses. |
| Incidents | Connects the agent to the place where operational pain actually shows up. |
| Azure resources | Gives it the right scope so it starts in the right subscription and resource group. |
| Knowledge files | Adds the team-specific context that never shows up cleanly in telemetry. |
The page is blunt in a way we like. If you have not connected anything yet, it tells you the agent does not know enough about your app to answer useful questions. That is the right framing. The job of onboarding is to fix that.
Step 3: Connect logs
We started with Azure Data Explorer.
The wizard supports Azure Data Explorer (Kusto), Datadog, Elasticsearch, Dynatrace, New Relic, Splunk, and Hawkeye. After choosing Kusto, it generated the MCP connector settings for us. We supplied the cluster details, tested the connection, and let it discover the tools.
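The wizard writes these settings for you, but it helps to know what kind of artifact it produces. Conceptually, an MCP connector configuration looks something like the sketch below. Every field name and value here is illustrative, not the wizard's actual schema:

```json
{
  "name": "kusto-logs",
  "type": "mcp",
  "settings": {
    "clusterUri": "https://<cluster>.<region>.kusto.windows.net",
    "database": "<database>",
    "auth": "managed-identity"
  }
}
```

The important property is that the connection is explicit: the agent discovers which tools and tables exist from the connector, rather than guessing.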
This step removes a whole class of bad agent behavior. The model no longer has to invent table names or hope the cluster it wants is the cluster that exists. It knows what it can query because the connection is explicit.
Step 4: Connect the incident platform
For incidents, we chose Azure Monitor.
This part is simple by design. If incidents are where the agent proves its value, connecting them should feel like the most natural part of setup, not a side quest. PagerDuty and ServiceNow work too, but for this walkthrough we kept it on Azure Monitor so we could wire real alerts to a real app.
Step 5: Connect code
Then we connected the code repo.
We used microsoft-foundry/foundry-agent-webapp, a React and ASP.NET Core sample app running on Azure Container Apps.
This is still the highest-leverage source we give the agent. Once the repo is connected, the agent can stop treating the app as an abstract web service. It can read the auth flow. It can inspect how health probes are configured. It can compare logs against the exact code paths that produced them. It can even look at the commit that was live when an incident happened.
That changes the quality of the investigation immediately.
Step 6: Scope the Azure resources
Next we told the agent which resources it was responsible for.
We scoped it to the resource group that contained the sample Container App. The wizard then set the roles the agent needed to observe and investigate the environment.
That sounds like a small step, but it fixes another common failure mode. Agents do better when they start from the right part of the world. Subscription and resource-group scope give them that boundary.
Step 7: Upload knowledge
Last, we uploaded a Markdown knowledge file we wrote for the sample app.
The file covered the app architecture, API endpoints, auth flow, likely failure modes, and the files we would expect an engineer to open first during debugging. We like Markdown here because it stays honest. It is easy for a human to read, easy for the agent to navigate, and easy to update as the system changes.
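For reference, our file followed a simple skeleton along these lines. The section names are ours and the endpoint paths are placeholders; the file names and auth details come from the sample app itself:

```markdown
# foundry-agent-webapp: operational notes

## Architecture
React SPA + ASP.NET Core API, deployed as one Azure Container App; scales to zero.

## API endpoints
- `GET /api/health`: probe target (placeholder path)
- `POST /api/agent/chat`: main agent entry point (placeholder path)

## Auth flow
Entra ID, single tenant. The SPA acquires a token; the backend validates audience and scopes.

## Likely failure modes
- 401s: check client ID, redirect URI, audience, and scopes first.
- Slow first request: cold start from scale-to-zero, not an outage.

## Start here when debugging
authConfig.ts, Program.cs, useAuth.ts
```

A page like this takes twenty minutes to write and saves the agent from rediscovering tribal knowledge on every thread.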
All sources configured
Once everything was connected, the setup panel turned green.
At that point the agent had a repo, logs, incidents, Azure resources, and a knowledge file. That is the moment where onboarding stops being a checklist and starts being operational setup.
The chat experience makes the setup visible
When you open a new thread, the configuration panel stays at the top of the chat.
If you expand it, you can see exactly what is connected and what is not.
We built this because people should not have to guess what the agent knows. If code is connected and logs are not, that should be obvious. If incidents are wired up but knowledge files are missing, that should be obvious too. The panel makes the agent's working context visible in the same place where you ask it to think.
It also makes partial setup less punishing. You do not have to finish every step before the agent becomes useful. But you can see, very clearly, what extra context would make the next answer better.
What changed once the agent had context
The easiest way to evaluate the onboarding is to look at the first questions we asked after setup.
We started with a simple one: What do you know about the Container App in the rg-big-refactor resource group?
The agent used Azure CLI to inspect the app, its revisions, and the system logs, then came back with a concise summary: image version, resource sizing, ingress, scale-to-zero behavior, and probe failures during cold start. It also correctly called out that the readiness probe noise was expected and not the root of a real outage.
That answer was useful because it was grounded in the actual resource, not in generic advice about Container Apps.
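The inspection it ran maps onto a handful of Azure CLI calls. Roughly the following, with the app name as a placeholder for the real one:

```shell
# Inspect configuration: image, ingress, and scale settings
az containerapp show --name <app-name> --resource-group rg-big-refactor \
  --query "{image: properties.template.containers[0].image, ingress: properties.configuration.ingress.external, minReplicas: properties.template.scale.minReplicas}"

# List revisions to see what is currently live
az containerapp revision list --name <app-name> --resource-group rg-big-refactor -o table

# Tail the platform's system logs, where probe failures show up
az containerapp logs show --name <app-name> --resource-group rg-big-refactor --type system --tail 50
```

Nothing exotic: the difference is that the agent chose these calls itself and interpreted the results against the probe configuration it had read in the repo.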
Then we asked a harder question: Based on the connected repo, what authentication flow does this app use? If a user reports 401s, what should we check first?
The agent opened authConfig.ts, Program.cs, useAuth.ts, postprovision.ps1, and entra-app.bicep, then traced the auth path end to end.
The checklist it produced was exactly the kind of thing we hoped onboarding would unlock: client ID alignment, identifier URI issues, redirect URI mismatches, audience validation, missing scopes, token expiry handling, and the single-tenant assumption in the backend. It even pointed to the place in Program.cs where extra logging could be enabled.
Without the repo, this would have been a boilerplate answer about JWTs. With the repo, it read like advice from someone who had already been paged for this app before.
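Most of that checklist distills into mechanical checks on the rejected token. As an illustration only: the claim names below follow Entra ID conventions (`aud`, `exp`, `scp`, `tid`), but the function and expected values are ours, not code from the sample app:

```python
import time

def triage_401(claims: dict, expected: dict) -> list[str]:
    """Run the mechanical parts of a 401 checklist against decoded token claims."""
    findings = []
    # Audience validation: the token must be issued for this API.
    if claims.get("aud") != expected["audience"]:
        findings.append(f"audience mismatch: got {claims.get('aud')!r}")
    # Token expiry: exp is a Unix timestamp.
    if claims.get("exp", 0) < time.time():
        findings.append("token expired")
    # Scopes: scp is a space-separated string in Entra ID tokens.
    granted = set(claims.get("scp", "").split())
    missing = set(expected["scopes"]) - granted
    if missing:
        findings.append(f"missing scopes: {sorted(missing)}")
    # Single-tenant backends reject tokens from any other tenant.
    if claims.get("tid") != expected["tenant_id"]:
        findings.append("tenant mismatch (backend assumes single tenant)")
    return findings
```

Feeding it a token with the wrong audience and a missing scope surfaces both problems at once, which is roughly how the agent grouped its advice.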
We did not stop at setup. We wired real monitoring.
A polished demo can make any agent look capable, so we pushed farther. We set up live Azure Monitor alerts for the sample web app instead of leaving the incident side as dummy data.
We created three alerts:
- HTTP 5xx errors (Sev 1), for more than 3 server errors in 5 minutes
- Container restarts (Sev 2), to catch crash loops and OOMs
- High response latency (Sev 2), when average response time goes above 10 seconds
The high-latency alert fired almost immediately. The app was scaling from zero, and the cold start was slow enough to trip the threshold.
That was perfect.
It gave us a real incident to put through the system instead of a fictional one.
Incident response plans
From the Builder menu, we created a response plan targeted at incidents with foundry-webapp in the title and severity 1 or 2.
The incident that had just fired showed up in the learning flow. We used the actual codebase and deployment details to write the default plan: which files to inspect for failures, how to reason about health probes, and how to tell the difference between a cold start and a real crash.
That felt like an important moment in the product. The response plan was not generic incident theater. It was anchored in the system we had just onboarded.
One of the most useful demos was the agent debugging itself
The sharpest proof point came when we tried to query the Log Analytics workspace from the agent.
We expected it to query tables and summarize what it found. Instead, it hit an `insufficient_scope` error.
That could have been a dead end. Instead, the agent turned the failure into the investigation.
It identified the missing permissions, noticed there were two managed identities in play, told us which RBAC roles were required, and gave us the exact commands to apply them.
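The fix it proposed amounted to a role assignment along these lines. All IDs are placeholders; `Log Analytics Reader` is the built-in read role for workspaces:

```shell
# Grant the agent's managed identity read access to the Log Analytics workspace.
az role assignment create \
  --assignee <agent-principal-id> \
  --role "Log Analytics Reader" \
  --scope /subscriptions/<sub-id>/resourceGroups/rg-big-refactor/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>
```

The notable part was not the command itself but that the agent distinguished the two managed identities and told us which one actually needed the role.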
After we fixed the access, it retried and ran a series of KQL queries against the workspace. That is where it found the next problem: Container Apps platform logs were present, but AppRequests, AppExceptions, and the rest of the App Insights-style tables were still empty.
That was not a connector bug. It was a real observability gap in the sample app. The backend had OpenTelemetry packages, but the exporter configuration was not actually sending the telemetry we expected. The agent did not just tell us that data was missing. It explained which data was present, which data was absent, and why that difference mattered.
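The gap shows up clearly with queries along these lines. The table names are the standard Container Apps and workspace-based Application Insights tables, though your workspace may differ:

```kusto
// Container Apps platform logs: present in our run
ContainerAppConsoleLogs_CL
| summarize count() by bin(TimeGenerated, 15m)

// App Insights-style tables: empty until the OpenTelemetry exporter works
AppRequests
| summarize requests = count() by bin(TimeGenerated, 15m)

AppExceptions
| take 10
```

When the first query returns rows and the second two return nothing, the problem is in the app's telemetry pipeline, not in the workspace connection.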
That is the sort of thing we wanted this onboarding to set up: not just answering the first question, but exposing the next real thing that needs fixing.
We also asked it to triage the repo backlog
Once the repo was connected, it was natural to see how well the agent could read open issues against the code.
We pointed it at the three open GitHub issues in the sample repo and asked it to triage them.
It opened the relevant files, compared the code to the issue descriptions, and came back with a clear breakdown:
- Issue #21, "@fluentui-copilot is not opensource?": partially valid, low severity. The package is public and MIT licensed. The real concern is package maturity, not licensing.
- Issue #20, "SDK fails to deserialize agent tool definitions": confirmed, medium severity. The agent traced the problem to metadata handling in AgentFrameworkService.cs and suggested a safe fallback path.
- Issue #19, "Create Preview experience from AI Foundry is incomplete": confirmed, medium severity. The agent found the gap between the environment variables people are told to paste and the variables the app actually expects.
What stood out to us was not just that the output was correct. It was that the agent was careful. It did not overclaim. It separated a documentation concern from two real product bugs. Then it asked whether we wanted it to start implementing the fixes.
That is the posture we want from an engineering agent: useful, specific, and a little humble.
What the onboarding is really doing
After working through the whole flow, we do not think of onboarding as a wizard anymore. We think of it as the process of giving the agent a fair shot.
Each connection removes one reason for the model to bluff:
- Code keeps it from guessing how the system works.
- Logs keep it from guessing what data exists.
- Incidents keep it close to operational reality.
- Azure resource scope keeps it from wandering.
- Knowledge files keep team-specific context from getting lost.
This is the same lesson we learned building the product itself. The agent does better when it can discover context progressively inside a world that is real and well-scoped. Good onboarding is how you create that world.
Closing
The main thing we learned from this work is simple: onboarding is not done when the resource exists. It is done when the agent can help with a real problem.
In a single setup session, we connected a real app, fired a real alert, created a real response plan, debugged a real RBAC problem, inspected real logs, and triaged real GitHub issues. That is a much better standard than "the wizard completed successfully."
If you try SRE Agent after GA, start there. Connect the things that make your system legible, then ask a question that would actually matter during a bad day. The answer will tell you very quickly whether the agent has a real starting point.
Azure SRE Agent is generally available starting March 10, 2026.