The Most Common Challenge in Modern Cloud Apps
There's a category of bugs that drive engineers crazy: multi-layer infrastructure issues.
Your app deploys successfully. Every Azure resource shows "Succeeded." But the app fails at runtime with a vague error like "Login failed for user ''."
Where do you even start?
You're checking the Web App, the SQL Server, the VNet, the private endpoint, the DNS zone, the identity configuration... and each one looks fine in isolation. The problem is how they connect, and that's invisible in the portal.
Networking issues are especially brutal. The error says "Login failed" but the actual causes could be DNS, firewall, identity, or all three. The symptom and the root causes are in completely different resources. Without deep Azure networking knowledge, you're just clicking around hoping something jumps out.
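To make the DNS part of that concrete, here is a minimal sketch of one check I find useful: does the SQL hostname resolve to a private address from where the app runs? The function names are my own, and the sketch assumes you run it from inside the VNet (for example from the Web App's console), since resolving from your laptop tells you nothing about the app's view.

```python
from __future__ import annotations

import ipaddress
import socket


def classify_address(ip: str) -> str:
    """Return 'private' if the address is in a private (non-global) range, else 'public'."""
    return "private" if ipaddress.ip_address(ip).is_private else "public"


def resolve_ipv4(hostname: str, port: int = 1433) -> list[str]:
    """Resolve a hostname to its distinct IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def dns_verdict(addresses: list[str]) -> str:
    """Summarize a resolution result as 'unresolved', 'private', or 'public'."""
    if not addresses:
        return "unresolved"
    # A single public address is enough to suspect the private DNS zone
    # is not being consulted from this network.
    if all(classify_address(ip) == "private" for ip in addresses):
        return "private"
    return "public"
```

Calling `dns_verdict(resolve_ipv4("sql-tsdvdfdwo77hc.database.windows.net"))` from the app's network should return "private" when the private endpoint and its DNS zone are wired up correctly. A "public" result is a hint, not proof, that the private DNS zone is not linked to the VNet.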
Now imagine you vibe-coded the infrastructure. You used AI to generate the Bicep, deployed it, and moved on. When it breaks, you're debugging code you didn't write and configuring resources you don't fully understand.
This is where I wanted AI to help: not just to build, but to debug.
Enter SRE Agent + Coding Agent
Here's what I used:
| Layer | Tool | Purpose |
|---|---|---|
| Build | VS Code Copilot Agent Mode + Claude Opus | Generate code, Bicep, deploy |
| Debug | Azure SRE Agent | Diagnose infrastructure issues and create developer issue with suggested fixes in source code (app code and IaC) |
| Fix | GitHub Coding Agent | Create PRs with code and IaC fix from GitHub issue created by SRE Agent |
Copilot builds. SRE Agent debugs. Coding Agent fixes.
What I Built
I used VS Code Copilot in Agent Mode with Claude Opus to create a .NET 8 Web App connected to Azure SQL via private endpoint:
- Private networking (no public exposure)
- Entra-only authentication
- Managed identity (no secrets)
Deployed with azd up. All green.
Then I tested the health endpoint:
$ curl https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql
{"status":"unhealthy","error":"Login failed for user ''.","errorType":"SqlException"}
Deployment succeeded. App failed. One error.
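If you want to run that health check from a script rather than by eye, a small probe like the following would do. This is a sketch under assumptions: the helper names are hypothetical, and I'm assuming the endpoint may return a non-2xx status with the same JSON body, which is common for health endpoints but not something the output above confirms.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request


def fetch_health(url: str, timeout: float = 10.0) -> str:
    """Fetch the health endpoint body, including the body of an HTTP error response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Assumption: an unhealthy app may answer with e.g. 503 plus a JSON body.
        return err.read().decode("utf-8")


def summarize_health(body: str) -> str:
    """Condense a /health/sql JSON response into a single line."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unparseable response"
    if not isinstance(payload, dict):
        return "unparseable response"
    status = payload.get("status", "unknown")
    if status == "healthy":
        return "healthy"
    error_type = payload.get("errorType", "UnknownError")
    error = payload.get("error", "no error message")
    return f"{status}: {error_type}: {error}"
```

Note that the summary still only says "Login failed". The probe tells you that something is wrong, not which layer is responsible.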
How I Fixed It: Step by Step
Step 1: Create SRE Agent with Azure Access
I created an SRE Agent with read access to my Azure subscription. You can scope it to specific resource groups. The agent builds a knowledge graph of your resources and their dependencies, which you can explore in its Resource Mapping view.
Step 2: Connect GitHub to SRE Agent Using the GitHub MCP Server
I connected the GitHub MCP server so the agent could read my repository and create issues.
Step 3: Create a Sub-Agent to Analyze Source Code
I created a sub-agent for analyzing source code using the GitHub MCP tools. This lets SRE Agent understand not just the Azure resources, but also the Bicep and source code files that created them. I gave it this instruction:
"you are expert in analyzing source code (bicep and app code) from github repos"
Step 4: Invoke Sub-Agent to Analyze the Error
In the SRE Agent chat, I invoked the sub-agent to diagnose the error I received from my app's health endpoint. It correlated the runtime error with the infrastructure configuration.
Step 5: Watch the SRE Agent Think and Reason
SRE Agent analyzed the error by tracing the code in Program.cs, the Bicep configurations, and the relationships between Azure resources: Web App, SQL Server, VNet, private endpoint, DNS zone, and managed identity. Its reasoning process worked through each layer, eliminating possibilities one by one until it identified the root causes.
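I don't know how SRE Agent implements this internally, but the layer-by-layer elimination can be pictured as an ordered checklist. The sketch below is purely illustrative: the `Check` type, the layer names, and the hard-coded results are my own stand-ins, set to mirror what the agent ended up reporting in my case.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Check(NamedTuple):
    layer: str                   # short name of the layer being tested
    description: str             # what a passing result means
    passed: Callable[[], bool]   # returns True when the layer looks healthy


def failing_layers(checks: list[Check]) -> list[str]:
    """Run every check in order and return the layers that failed.

    All checks run even after the first failure, because one symptom
    can have more than one independent root cause.
    """
    return [check.layer for check in checks if not check.passed()]


# Hard-coded stand-ins for real probes (hypothetical values for illustration).
EXAMPLE_CHECKS = [
    Check("vnet-integration", "Web App is integrated with the VNet", lambda: True),
    Check("private-endpoint", "SQL Server has an approved private endpoint", lambda: True),
    Check("dns", "Private DNS zone is linked to the VNet", lambda: False),
    Check("identity", "Web App has a managed identity", lambda: True),
    Check("sql-user", "Managed identity exists as a database user", lambda: False),
]

print(failing_layers(EXAMPLE_CHECKS))
```

The point of running every check instead of stopping at the first failure is visible here: fixing only the DNS link would still have left the app unable to log in.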
Step 6: Agent Creates GitHub Issue
Based on its analysis, SRE Agent summarized the root causes and suggested fixes in a GitHub issue:
Root Causes:
- Private DNS Zone missing VNet link
- Managed identity not created as SQL user
Suggested Fixes:
- Add virtualNetworkLinks resource to Bicep
- Add SQL setup script to create user with db_datareader and db_datawriter roles
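To show roughly what those two fixes look like, here is a sketch that builds them as data. It is not the code from the PR. The first function returns the resource in ARM JSON shape (the format Bicep compiles to), and the API version, link name, and VNet ID are assumptions for illustration. The second renders the T-SQL, assuming the database user is named after the Web App, which is typical for a system-assigned identity.

```python
from __future__ import annotations


def vnet_link_resource(zone_name: str, link_name: str, vnet_id: str) -> dict:
    """Describe a private DNS zone VNet link in ARM template shape."""
    return {
        "type": "Microsoft.Network/privateDnsZones/virtualNetworkLinks",
        "apiVersion": "2020-06-01",           # assumed API version
        "name": f"{zone_name}/{link_name}",   # child resource: <zone>/<link>
        "location": "global",                 # private DNS resources are global
        "properties": {
            # Auto-registration is for VM records; a private endpoint does not need it.
            "registrationEnabled": False,
            "virtualNetwork": {"id": vnet_id},
        },
    }


def sql_user_script(identity_name: str) -> str:
    """Render T-SQL that maps an Entra identity to a database user with read/write roles."""
    # Bracket-quote the name; a literal ']' inside an identifier is escaped as ']]'.
    quoted = "[" + identity_name.replace("]", "]]") + "]"
    return "\n".join(
        [
            f"CREATE USER {quoted} FROM EXTERNAL PROVIDER;",
            f"ALTER ROLE db_datareader ADD MEMBER {quoted};",
            f"ALTER ROLE db_datawriter ADD MEMBER {quoted};",
        ]
    )


# Hypothetical link name; the identity name is assumed to match the Web App name.
link = vnet_link_resource(
    "privatelink.database.windows.net",
    "link-to-app-vnet",
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>",
)
print(link["name"])
print(sql_user_script("app-tsdvdfdwo77hc"))
```

The script has to be run against the target database (tododb here) by an Entra admin of the SQL server, because only an Entra-authenticated principal can create users FROM EXTERNAL PROVIDER.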
Step 7: Merge the PR from Coding Agent
I assigned the GitHub issue to Coding Agent, which then created a PR with the fixes. I reviewed the changes, they made sense, and I merged the PR.
I redeployed with azd up, ran the SQL script, and checked the health endpoint again:
$ curl -s https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql | jq .
{
  "status": "healthy",
  "database": "tododb",
  "server": "tcp:sql-tsdvdfdwo77hc.database.windows.net,1433",
  "message": "Successfully connected to SQL Server"
}
From error to fix in minutes, without manually debugging a single Azure resource.
Why This Matters
If you're a developer building and deploying apps to Azure, SRE Agent changes how you work:
You don't need to be a networking expert. SRE Agent understands the relationships between Azure resources: private endpoints, DNS zones, VNet links, managed identities. It connects dots you didn't know existed.
You don't need to guess. Instead of clicking through the portal hoping something looks wrong, the agent systematically eliminates possibilities the way a senior engineer would.
You don't break your workflow. SRE Agent suggests fixes in your Bicep and source code, not portal changes. Everything stays version-controlled and deployed through pipelines. No hot fixes at 2 AM.
You close the loop. AI helps you build fast. Now AI helps you debug fast too.
Try It Yourself
Do you vibe code your app, your infrastructure, or both? How do you debug when things break?
Here's a challenge: Vibe code a todo app with a Web App, VNet, private endpoint, and SQL database. "Forget" to link the DNS zone to the VNet. Deploy it. Watch it fail. Then point SRE Agent at it and see how it identifies the root cause, creates a GitHub issue with the fix, and hands it off to Coding Agent for a PR.
Share your experience. I'd love to hear how it goes.