azure container apps
205 Topics

Unifying Scattered Observability Data from Dynatrace + Azure for Self-Healing with SRE Agent
What if your deployments could fix themselves?

The Deployment Remediation Challenge

Modern operations teams face a recurring nightmare:
• A deployment ships at 9 AM
• Errors spike at 9:15 AM
• By the time you correlate logs, identify the bad revision, and execute a rollback, it's 10:30 AM
• Your users felt 75 minutes of degraded experience

The data to detect and fix this existed the entire time, but it was scattered across clouds and platforms:
• Error logs and traces → Dynatrace (third-party observability cloud)
• Deployment history and revisions → Azure Container Apps API
• Resource health and metrics → Azure Monitor
• Rollback commands → Azure CLI

Your observability data lives in one cloud. Your deployment data lives in another. Stitching together log analysis from Dynatrace with deployment correlation from Azure, and then executing remediation, required a human to manually bridge these silos.

What if an AI agent could unify data from third-party observability platforms with Azure deployment history and act on it automatically, every week, before users even notice?

Enter SRE Agent + Model Context Protocol (MCP) + Subagents

Azure SRE Agent doesn't just work with Azure. Using the Model Context Protocol (MCP), you can connect external observability platforms like Dynatrace directly to your agent. Combined with subagents for specialized expertise and scheduled tasks for automation, you can build an automated deployment remediation system.
Here's what I built and configured for my Azure Container Apps environment inside SRE Agent:

Component | Purpose
Dynatrace MCP Connector | Connect to Dynatrace's MCP gateway for log queries via DQL
'Dynatrace' Subagent | Log analysis specialist that executes DQL queries and identifies root causes
'Remediation' Subagent | Deployment remediation specialist that correlates errors with deployments and executes rollbacks
Scheduled Task | Weekly Monday 9:30 AM health check for the 'octopets-prod-api' Container App

Subagent workflow: in SRE Agent Builder, 'OctopetsScheduledTask' triggers 'RemediationSubagent' (12 tools), which hands off to 'DynatraceSubagent' (3 MCP tools) for log analysis.

How I Set It Up: Step by Step

Step 1: Connect Dynatrace via MCP

SRE Agent supports the Model Context Protocol (MCP) for connecting external data sources. Dynatrace exposes an MCP gateway that provides access to its APIs as first-class tools.

Connection configuration:

{
  "name": "dynatrace-mcp-connector",
  "dataConnectorType": "Mcp",
  "dataSource": "Endpoint=https://<your-tenant>.live.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp;AuthType=BearerToken;BearerToken=<your-api-token>"
}

Once connected, SRE Agent automatically discovers Dynatrace tools.

💡 Tip: When creating your Dynatrace API token, grant the `entities.read`, `events.read`, and `metrics.read` scopes for comprehensive access.

Step 2: Build Specialized Subagents

Generic agents are good. Specialized agents are better. I created two subagents that work together in a coordinated workflow: one for Dynatrace log analysis, the other for deployment remediation.

DynatraceSubagent

This subagent is the log analysis specialist. It uses the Dynatrace MCP tools to execute DQL queries and identify root causes.
Key capabilities:
• Executes DQL queries via MCP tools (`create-dql`, `execute-dql`, `explain-dql`)
• Fetches 5xx error counts, request volumes, and spike detection
• Returns consolidated analysis with root cause, affected services, and error patterns

👉 View full DynatraceSubagent configuration here

RemediationSubagent

This is the deployment remediation specialist. It correlates Dynatrace log analysis with Azure Container Apps deployment history, generates correlation charts, and executes rollbacks when confidence is high.

Key capabilities:
• Retrieves Container Apps revision history (`GetDeploymentTimes`, `ListRevisions`)
• Generates correlation charts (`PlotTimeSeriesData`, `PlotBarChart`, `PlotAreaChartWithCorrelation`)
• Computes confidence score (0-100%) for deployment causation
• Executes rollback and traffic shift when confidence > 70%

👉 View full RemediationSubagent configuration here

The power of specialization: Each agent focuses on its domain. DynatraceSubagent handles log analysis; RemediationSubagent handles deployment correlation and rollback. When the workflow runs, RemediationSubagent hands off to DynatraceSubagent (bi-directional handoff) for analysis, gets the findings back, and continues with remediation. Simple delegation, not a single monolithic agent trying to do everything.

Step 3: Create the Weekly Scheduled Task

Now the automation. I configured a scheduled task that runs every Monday at 9:30 AM to check whether deployments in the last 4 hours caused any issues, and to remediate automatically if needed.

Scheduled task configuration:

Setting | Value
Task Name | OctopetsScheduledTask
Frequency | Weekly
Day of Week | Monday
Time | 9:30 AM
Response Subagent | RemediationSubagent

[Screenshot: configuring the OctopetsScheduledTask in the SRE Agent portal]

The key insight: the scheduled task is just a coordinator. It immediately hands off to the RemediationSubagent, which orchestrates the entire workflow including handoffs to DynatraceSubagent.
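The gating rules described above (look back 4 hours, roll back only when confidence exceeds 70%) can be sketched in a few lines of Python. This is a minimal illustration, not the agent's implementation: the `Revision` type, its `healthy` flag, and all function names are hypothetical, and how SRE Agent actually computes its confidence score is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence

ROLLBACK_THRESHOLD = 70.0      # percent; rollback happens only when confidence > 70%
LOOKBACK = timedelta(hours=4)  # the scheduled task inspects the last 4 hours


@dataclass(frozen=True)
class Revision:
    """Hypothetical stand-in for a Container Apps revision record."""
    name: str
    created: datetime
    healthy: bool  # assumed flag: did this revision serve without error spikes?


def recent_revisions(revisions: Sequence[Revision], now: datetime) -> List[Revision]:
    """Return revisions created inside the lookback window, newest first."""
    recent = [r for r in revisions if timedelta(0) <= now - r.created <= LOOKBACK]
    return sorted(recent, key=lambda r: r.created, reverse=True)


def rollback_target(revisions: Sequence[Revision], suspect: Revision) -> Optional[Revision]:
    """Pick the newest healthy revision that predates the suspect one."""
    older = [r for r in revisions if r.healthy and r.created < suspect.created]
    return max(older, key=lambda r: r.created, default=None)


def decide_rollback(confidence: float,
                    revisions: Sequence[Revision],
                    suspect: Revision) -> Optional[Revision]:
    """Return the revision to shift traffic to, or None if no rollback should run."""
    if confidence <= ROLLBACK_THRESHOLD:
        return None  # not confident enough that the deployment caused the errors
    return rollback_target(revisions, suspect)
```

Keeping the threshold check separate from target selection makes it easy to log why a rollback was skipped: either the confidence was too low, or no older healthy revision existed.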
Step 4: See It In Action

Here's what happens when the scheduled task runs:
[Screenshot: the scheduled task triggering and initiating Dynatrace analysis for octopets-prod-api]

The DynatraceSubagent analyzes the logs and identifies the root cause:
[Screenshot: executing DQL queries and returning consolidated log analysis]

The RemediationSubagent then generates correlation charts.

Finally, with a 95% confidence score, SRE Agent executes the rollback autonomously:
[Screenshot: executing rollback and traffic shift autonomously]

The agent detected the bad deployment, generated visual evidence, and automatically shifted 100% traffic to the last known working revision, all without human intervention.

Why This Matters

Before | After
Manually check Dynatrace after incidents | Automated DQL queries via MCP
Stitch together logs + deployments manually | Subagents correlate data automatically
Rollback requires human decision + execution | Confidence-based auto-remediation
75+ minutes from deployment to rollback | Under 5 minutes with autonomous workflow
Reactive incident response | Proactive weekly health checks

Try It Yourself

1. Connect your observability tool via MCP (Dynatrace, Datadog, Prometheus: any tool with an MCP gateway)
2. Build a log analysis subagent that knows how to query your observability data
3. Build a remediation subagent that can correlate logs with deployments and execute fixes
4. Wire them together with handoffs so the subagents can delegate log analysis
5. Create a scheduled task to trigger the workflow automatically

Learn More
• Azure SRE Agent documentation
• Model Context Protocol (MCP) integration guide
• Building subagents for specialized workflows
• Scheduled tasks and automation
• SRE Agent Community
• Azure SRE Agent pricing
• SRE Agent Blogs

How SRE Agent Pulls Logs from Grafana and Creates Jira Tickets Without Native Integrations
Your tools. Your workflows. SRE Agent adapts.

SRE Agent natively integrates with PagerDuty, ServiceNow, and Azure Monitor. But your team might use Jira for incident tracking. Grafana for dashboards. Loki for logs. Prometheus for metrics. These aren't natively supported. That doesn't matter.

SRE Agent supports MCP, the Model Context Protocol. Any MCP-compatible server extends the agent's capabilities. Connect your Grafana instance. Connect your Jira. The agent queries logs, correlates errors, and creates tickets with root cause analysis across tools that were never designed to talk to each other.

The Scenario

I built a grocery store app that simulates a realistic SRE scenario: an external supplier API starts rate limiting your requests. Customers see "Unable to check inventory" errors. The on-call engineer gets paged.

The goal: SRE Agent should diagnose the issue by querying Loki logs through Grafana, identify the root cause, and create a Jira ticket with findings and recommendations.

The app runs on Azure Container Apps with Loki for logs and Azure Managed Grafana for visualization.

👉 Deploy it yourself: github.com/dm-chelupati/grocery-sre-demo

How I Set Up SRE Agent: Step by Step

Step 1: Create SRE Agent

I created an SRE Agent and gave it Reader access to my subscription.

Step 2: Connect to Grafana and Jira via MCP

Neither MCP server had a remotely hosted option, and their stdio setup didn't match what SRE Agent supports. So I hosted them myself as Azure Container Apps:
• Grafana MCP Server: connects to my Azure Managed Grafana instance
• Atlassian MCP Server: connects to my Jira Cloud instance

Now I have two endpoints SRE Agent can reach:
https://ca-mcp-grafana.<env>.azurecontainerapps.io/mcp
https://ca-mcp-jira.<env>.azurecontainerapps.io/mcp

I added both to SRE Agent's MCP configuration as remotely hosted servers.
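Before registering a self-hosted endpoint, it can be useful to confirm that it answers an MCP `initialize` call. The sketch below only builds that HTTP request with Python's standard library. It follows my reading of the MCP Streamable HTTP transport (a JSON-RPC 2.0 body, plus an `Accept` header listing both `application/json` and `text/event-stream`). The protocol version string, the client name, and the function name are assumptions, and a real server may also require authentication headers that are not shown.

```python
import json
import urllib.request

# Assumed protocol revision; use whichever version your MCP server supports.
MCP_PROTOCOL_VERSION = "2025-03-26"


def build_initialize_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 'initialize' POST for an MCP endpoint."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": MCP_PROTOCOL_VERSION,
            "capabilities": {},
            # Hypothetical client identity, used only for this smoke test.
            "clientInfo": {"name": "mcp-smoke-test", "version": "0.1.0"},
        },
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The client advertises that it accepts a plain JSON reply or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=10)` and inspect the status code and `Content-Type` of the response; a JSON or event-stream reply suggests the server is reachable over HTTP at that path.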
Step 3: Create Sub-Agent with Tools and Instructions

I created a sub-agent specifically for incident diagnosis with these tools enabled:
• Grafana MCP (for querying Loki logs)
• Atlassian MCP (for creating Jira tickets)

Instructions were simple:

You are expert in diagnosing applications running on Azure services. You need to use the Grafana tools to get the logs, metrics or traces and create a summary of your findings inside Jira as a ticket. use your knowledge base file loki-queries.md to learn about app configuration with loki and Query the loki for logs in Grafana.

Step 4: Invoke Sub-Agent and Watch It Work

I went to the SRE Agent chat and asked:

@JiraGrafanaexpert: My container app ca-api-3syj3i2fat5dm in resource group rg-groceryapp is experiencing rate limit errors from a supplier API when checking product inventory.

The agent:
1. Queried Loki via Grafana MCP: {app="grocery-api"} |= "error"
2. Found 429 rate limit errors spiking: 55+ requests hitting supplier API limits
3. Identified root cause: SUPPLIER_RATE_LIMIT_429 from FreshFoods Wholesale API
4. Created a Jira ticket

One prompt. Logs queried. Root cause identified. Ticket created with remediation steps.

Making It Better: The Knowledge File

SRE Agent can explore and discover how your apps are wired, but you can speed that up. When querying observability data sources, the agent needs to learn the schema, available labels, table structures, and query syntax. For Loki, that means understanding LogQL, knowing which labels your apps use, and what JSON fields appear in logs.

SRE Agent can figure things out, but with context, it gets there faster, just like humans. I created a knowledge file that gives the agent a head start. With this context, the agent knows exactly which labels to query, what fields to extract from JSON logs, and which query patterns to use.

👉 See my full knowledge file

How MCP Makes This Possible

SRE Agent supports two ways to connect MCP servers:
• stdio: runs locally via command.
  This works for MCP servers that can be invoked via npx, node, or uvx. For example: npx -y @modelcontextprotocol/server-github.
• Remotely hosted: HTTP endpoint with streamable transport, such as https://mcp-server.example.com/sse or /mcp

The catch: Not every MCP server fits these options out of the box. Some servers only support stdio but not the npx/node/uvx formats SRE Agent expects. Others don't offer a hosted endpoint at all.

The solution: host them yourself. Deploy the MCP server as a container with an HTTP endpoint. That's what I did with Grafana MCP Server and Atlassian MCP Server: I deployed both as Azure Container Apps exposing /mcp endpoints.

Why This Matters

Enterprise tooling is fragmented across Azure and non-Azure ecosystems. Some teams use Azure Monitor, others use Datadog. Incident tracking might be ServiceNow in one org and Jira in another. Logs live in Loki, Splunk, Elasticsearch, and sometimes all three.

SRE Agent meets you where you are. Azure-native tools work out of the box. Everything else connects via MCP. Your observability stack stays the same. Your ticketing system stays the same. The agent becomes the orchestration layer that ties them together.

One agent. Any tool. Intelligent workflows across your entire ecosystem.

Try It Yourself

1. Create an SRE Agent
2. Deploy MCP servers for your tools (Grafana, Atlassian)
3. Create a sub-agent with the MCP tools connected
4. Add a knowledge file with your app context
5. Ask it to diagnose an issue

Watch logs become tickets. Errors become action items. Context becomes intelligence.

Learn More
• Azure SRE Agent documentation
• Azure SRE Agent blogs
• Grocery SRE Demo repo
• MCP specification

Azure SRE Agent is currently in preview.

Exciting Updates Coming to Conversational Diagnostics (Public Preview)
Last year, at Ignite 2023, we unveiled Conversational Diagnostics (Preview), a revolutionary tool integrated with AI-powered capabilities to enhance problem-solving for Windows Web Apps. This year, we're thrilled to share what's new and forthcoming for Conversational Diagnostics (Preview). Get ready to experience a broader range of functionalities and expanded support across various Azure products, making your troubleshooting journey even more seamless and intuitive.

Advanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community, I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance. The article walks through a practical deployment pattern:
• Integrate your Azure Container Apps environment with a Virtual Network.
• Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services.
• Verify actual traffic paths using firewall logs to confirm that routing policies are effective.

This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet.

If you're working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful. Read the full article on my blog.

Deploy Dynatrace OneAgent on your Container Apps
TOC
• Introduction
• Setup
• References

1. Introduction

Dynatrace OneAgent is an advanced monitoring tool that automatically collects performance data across your entire IT environment. It provides deep visibility into applications, infrastructure, and cloud services, enabling real-time observability. OneAgent supports multiple platforms, including containers, VMs, and serverless architectures, ensuring seamless monitoring with minimal configuration. It captures detailed metrics, traces, and logs, helping teams diagnose performance issues, optimize resources, and enhance user experiences. With AI-driven insights, OneAgent proactively detects anomalies and automates root cause analysis, making it an essential component for modern DevOps, SRE, and cloud-native monitoring strategies.

2. Setup

1. After registering your account, go to the control panel and search for Deploy OneAgent.

2. Obtain your Environment ID and create a PaaS token. Be sure to save them for later use.

3. In your local environment's console, log in to the Dynatrace registry.

docker login -u XXX XXX.live.dynatrace.com
# XXX is your Environment ID
# Enter the PaaS token when prompted for a password

4. Create a Dockerfile and an sshd_config file.

Dockerfile:

FROM mcr.microsoft.com/devcontainers/javascript-node:20

# Change XXX into your Environment ID
COPY --from=XXX.live.dynatrace.com/linux/oneagent-codemodules:all / /
ENV LD_PRELOAD=/opt/dynatrace/oneagent/agent/lib64/liboneagentproc.so

# SSH
RUN apt-get update \
    && apt-get install -y --no-install-recommends dialog openssh-server tzdata screen lrzsz htop cron \
    && echo "root:Docker!" | chpasswd \
    && mkdir -p /run/sshd \
    && chmod 700 /root/.ssh/ \
    && chmod 600 /root/.ssh/id_rsa
COPY ./sshd_config /etc/ssh/

# OTHER
EXPOSE 2222
CMD ["/usr/sbin/sshd", "-D", "-o", "ListenAddress=0.0.0.0"]

sshd_config:

Port 2222
ListenAddress 0.0.0.0
LoginGraceTime 180
X11Forwarding yes
Ciphers aes128-cbc,3des-cbc,aes256-cbc,aes128-ctr,aes192-ctr,aes256-ctr
MACs hmac-sha2-256,hmac-sha2-512,hmac-sha1,hmac-sha1-96
StrictModes yes
SyslogFacility DAEMON
PasswordAuthentication yes
PermitEmptyPasswords no
PermitRootLogin yes
Subsystem sftp internal-sftp
AllowTcpForwarding yes

5. Build the container and push it to Azure Container Registry (ACR).

# YYY is your ACR name
docker build -t oneagent:202503201710 . --no-cache
# You can choose your own image name and tag
docker tag oneagent:202503201710 YYY.azurecr.io/oneagent:202503201710
docker push YYY.azurecr.io/oneagent:202503201710

6. Create an Azure Container App (ACA), set Ingress to port 3000, allow all inbound traffic, and specify the ACR image you just created.

7. Once the container starts, open a console and run the following commands to create a temporary HTTP server simulating a Node.js app.

mkdir app && cd app
echo 'console.log("Node.js app started...")' > index.js
npm init -y
npm install express
cat <<EOF > server.js
const express = require('express');
const app = express();
app.get('/', (req, res) => res.send('hello'));
app.listen(3000, () => console.log('Server running on port 3000'));
EOF
# Press Ctrl+C to stop the next command, then run it again; repeat 3 times
node server.js

8. You should now see the results on the ACA homepage.

9. Go back to the Dynatrace control panel, search for Host Classic, and you should see the collected data.

3. References
• Integrate OneAgent on Azure App Service for Linux and containers (Dynatrace Docs)

Find the Alerts You Didn't Know You Were Missing with Azure SRE Agent
I had 6 alert rules. CPU. Memory. Pod restarts. Container errors. OOMKilled. Job failures. I thought I was covered.

Then my app went down. I kept refreshing the Azure portal, waiting for an alert. Nothing.

That's when it hit me: my alerts were working perfectly. They just weren't designed for this failure mode. Sound familiar?

The Problem Every Developer Knows

If you're a developer or DevOps engineer, you've been here: a customer reports an issue, you scramble to check your monitoring, and then you realize you don't have the right alerts set up. By the time you find out, it's already too late.

You set up what seems like reasonable alerting and assume you're covered. But real-world failures are sneaky. They slip through the cracks of your carefully planned thresholds.

My Setup: AKS with Redis

I love to vibe code apps using GitHub Copilot Agent mode with Claude Opus 4.5. It's fast, it understands context, and it lets me focus on building rather than boilerplate. For this project, I built a simple journal entry app:
• AKS cluster hosting the web API
• Azure Cache for Redis storing journal data
• Azure Monitor alerts for CPU, memory, pod restarts, container errors, OOMKilled, and job failures

Seemed solid. What could go wrong?

The Scenario: Redis Password Rotation

Here's something that happens constantly in enterprise environments: the security team rotates passwords. It's best practice. It's in the compliance checklist. And it breaks things when apps don't pick up the new credentials.

I simulated exactly this. The pods came back up. But they couldn't connect to Redis (as expected). The readiness probes started failing. The LoadBalancer had no healthy backends. The endpoint timed out. And not a single alert fired.

Using SRE Agent to Find the Alert Gaps

Instead of manually auditing every alert rule and trying to figure out what I missed, I turned to Azure SRE Agent. I asked it a simple question: "My endpoint is timing out.
What alerts do I have, and why didn't any of them fire?"

Within minutes, it had diagnosed the problem. Here's what it found:

My Existing Alerts | Why They Didn't Fire
High CPU/Memory | No resource pressure, just auth failures
Pod Restarts | Pods weren't restarting, just unhealthy
Container Errors | App logs weren't being written
OOMKilled | No memory issues
Job Failures | No K8s jobs involved

The gaps SRE Agent identified:
❌ No synthetic URL availability test
❌ No readiness/liveness probe failure alerts
❌ No "pods not ready" alerts scoped to my namespace
❌ No Redis connection error detection
❌ No ingress 5xx/timeout spike alerts
❌ No per-pod resource alerts (only node-level)

SRE Agent didn't just tell me what was wrong; it created a GitHub issue with:
• KQL queries to detect each failure type
• Bicep code snippets for new alert rules
• Remediation suggestions for the app code
• Exact file paths in my repo to update

Check it out: GitHub Issue

How I Built It: Step by Step

Let me walk you through exactly how I set this up inside SRE Agent.

Step 1: Create an SRE Agent

I created a new SRE Agent in the Azure portal. Since this workflow analyzes alerts across my subscription (not just one resource group), I didn't configure any specific resource groups. Instead, I gave the agent's managed identity Reader permissions on my entire subscription. This lets it discover resources, list alert rules, and query Log Analytics across all my resource groups.

Step 2: Connect GitHub to SRE Agent via MCP

I added a GitHub MCP server to give the agent access to my source code repository. MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it. I use GitHub for both source code and tracking dev tickets, but you can connect to wherever your code lives (GitLab, Azure DevOps) or your ticketing system (Jira, ServiceNow, PagerDuty).
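The first gap on the list, a synthetic URL availability test, is simple to picture. As a rough sketch (not what SRE Agent generated, and not a replacement for a managed availability test in Azure Monitor), the Python below probes a URL with a timeout and flags an outage after a run of consecutive failures. The function names and the threshold of 3 are illustrative choices, not values from the article.

```python
import urllib.error
import urllib.request
from typing import Sequence


def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError:
        return False  # the server answered, but with a 4xx/5xx status
    except OSError:
        return False  # covers URLError, refused connections, and timeouts


def should_alert(results: Sequence[bool], consecutive_failures: int = 3) -> bool:
    """Alert only when the most recent N probe results are all failures."""
    if len(results) < consecutive_failures:
        return False  # not enough history yet to call it an outage
    return not any(results[-consecutive_failures:])
```

In the scenario above the pods were running but the endpoint timed out. An outside-in check like this fails in exactly that situation, even while CPU, memory, and restart alerts stay quiet. Requiring several consecutive failures is one simple way to avoid paging on a single dropped request.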
Step 3: Create a Subagent inside SRE Agent for managing Azure Monitor Alerts

I created a focused subagent with a specific job and only the tools it needs:

Azure Monitor Alerts Expert

Prompt: "You are expert in managing operations related to azure monitor alerts on azure resources including discovering alert rules configured on azure resources, creating new alert rules (with user approval and authorization only), processing the alerts fired on azure resources and identifying gaps in the alert rules. You can get the resource details from azure monitor alert if triggered via alert. If not, you need to ask user for the specific resource to perform analysis on. You can use az cli tool to diagnose logs, check the app health metrics. You must use the app code and infra code (bicep files) files you have access to in the github repo <insert your repo> to further understand the possible diagnoses and suggest remediations. Once analysis is done, you must create a github issue with details of analysis and suggested remediation to the source code files in the same repo."

Tools enabled:
• az cli: list resources, alert rules, action groups
• Log Analytics workspace querying: run KQL queries for diagnostics
• GitHub MCP: search repositories, read file contents, create issues

Step 4: Ask the Subagent About Alert Gaps

I gave the agent context and asked a simple question:

"@AzureAlertExpert: My API endpoint http://132.196.167.102/api/journals/john is timing out. What alerts do I have configured in rg-aks-journal, and why didn't any of them fire?"

The agent did the analysis autonomously and summarized its findings, with suggestions to add new alert rules, in a GitHub issue.

[Screenshot: the agentic workflow to perform Azure Monitor alert operations]

Why This Matters

Faster response times. Issues get diagnosed in minutes, not hours of manual investigation.

Consistent analysis. No more "I thought we had an alert for that" moments. The agent systematically checks what's covered and what's not.
Proactive coverage. You don't have to wait for an incident to find gaps. Ask the agent to review your alerts before something breaks.

The Bottom Line

Your alerts have gaps. You just don't know it until something slips through.

I had 6 alert rules and still missed a basic failure. My pods weren't restarting, they were just unhealthy. My CPU wasn't spiking, the app was just returning errors. None of my alerts were designed for this.

You don't need to audit every alert rule manually. Give SRE Agent your environment, describe the failure, and let it tell you what's missing.

Stop discovering alert gaps from customer complaints. Start finding them before they matter.

A Few Tips
• Give the agent Reader access at subscription level so it can discover all resources
• Use a focused subagent prompt; don't try to do everything in one agent
• Test your MCP connections before running workflows

What Alert Gaps Have Burned You?

What's the alert you wish you had set up before an incident? Credential rotation? Certificate expiry? DNS failures? Let us know in the comments.

From Vibe Coding to Working App: How SRE Agent Completes the Developer Loop
The Most Common Challenge in Modern Cloud Apps

There's a category of bugs that drive engineers crazy: multi-layer infrastructure issues. Your app deploys successfully. Every Azure resource shows "Succeeded." But the app fails at runtime with a vague error like Login failed for user ''.

Where do you even start? You're checking the Web App, the SQL Server, the VNet, the private endpoint, the DNS zone, the identity configuration... and each one looks fine in isolation. The problem is how they connect, and that's invisible in the portal.

Networking issues are especially brutal. The error says "Login failed" but the actual causes could be DNS, firewall, identity, or all three. The symptom and the root causes are in completely different resources. Without deep Azure networking knowledge, you're just clicking around hoping something jumps out.

Now imagine you vibe coded the infrastructure. You used AI to generate the Bicep, deployed it, and moved on. When it breaks, you're debugging code you didn't write and configuring resources you don't fully understand.

This is where I wanted AI to help: not just to build, but to debug.

Enter SRE Agent + Coding Agent

Here's what I used:

Layer | Tool | Purpose
Build | VS Code Copilot Agent Mode + Claude Opus | Generate code, Bicep, deploy
Debug | Azure SRE Agent | Diagnose infrastructure issues and create a developer issue with suggested fixes in source code (app code and IaC)
Fix | GitHub Coding Agent | Create PRs with code and IaC fixes from the GitHub issue created by SRE Agent

Copilot builds. SRE Agent debugs. Coding Agent fixes.

What I Built

I used VS Code Copilot in Agent Mode with Claude Opus to create a .NET 8 Web App connected to Azure SQL via private endpoint:
• Private networking (no public exposure)
• Entra-only authentication
• Managed identity (no secrets)

Deployed with azd up. All green.
Then I tested the health endpoint:

$ curl https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql
{"status":"unhealthy","error":"Login failed for user ''.","errorType":"SqlException"}

Deployment succeeded. App failed. One error.

How I Fixed It: Step by Step

Step 1: Create SRE Agent with Azure Access

I created an SRE Agent with read access to my Azure subscription. You can scope it to specific resource groups. The agent builds a knowledge graph of your resources and their dependencies, visible in the Resource Mapping view.

Step 2: Connect GitHub to SRE Agent using GitHub MCP server

I connected the GitHub MCP server so the agent could read my repository and create issues.

Step 3: Create Sub Agent to analyze source code

I created a sub-agent for analyzing source code using GitHub MCP tools. This lets SRE Agent understand not just Azure resources, but also the Bicep and source code files that created them.

"you are expert in analyzing source code (bicep and app code) from github repos"

Step 4: Invoke Sub-Agent to Analyze the Error

In the SRE Agent chat, I invoked the sub-agent to diagnose the error I received from my app endpoint. It correlated the runtime error with the infrastructure configuration.

Step 5: Watch the SRE Agent Think and Reason

SRE Agent analyzed the error by tracing code in Program.cs, Bicep configurations, and Azure resource relationships: Web App, SQL Server, VNet, private endpoint, DNS zone, and managed identity. Its reasoning process worked through each layer, eliminating possibilities one by one until it identified the root causes.
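One of the layers listed above, DNS, is easy to check by hand. The sketch below is an illustration under stated assumptions rather than the agent's method: it resolves a hostname and reports whether every answer is a private address. Run from inside the VNet, a SQL server reached through a private endpoint would normally be expected to resolve to a private IP; a public IP hints that private DNS resolution is not in effect for that network. The function names are hypothetical.

```python
import ipaddress
import socket
from typing import List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def all_private(addresses: List[str]) -> bool:
    """True only if there is at least one address and ipaddress classifies all as private."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

For example, calling `all_private(resolve_ipv4(host))` with the SQL server's `*.database.windows.net` name from the Web App's console gives a quick yes/no answer before digging into the DNS zone configuration. Note that `is_private` also treats loopback and link-local ranges as private, so this is a coarse check.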
Step 6: Agent Creates GitHub Issue

Based on its analysis, SRE Agent summarized the root causes and suggested fixes in a GitHub issue:

Root Causes:
• Private DNS Zone missing VNet link
• Managed identity not created as SQL user

Suggested Fixes:
• Add virtualNetworkLinks resource to Bicep
• Add SQL setup script to create user with db_datareader and db_datawriter roles

Step 7: Merge the PR from Coding Agent

I assigned the GitHub issue to Coding Agent, which then created a PR with the fixes. I reviewed the fix. It made sense, and I merged it.

Redeployed with azd up, ran the SQL script:

curl -s https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql | jq .
{
  "status": "healthy",
  "database": "tododb",
  "server": "tcp:sql-tsdvdfdwo77hc.database.windows.net,1433",
  "message": "Successfully connected to SQL Server"
}

🎉 From error to fix in minutes without manually debugging a single Azure resource.

Why This Matters

If you're a developer building and deploying apps to Azure, SRE Agent changes how you work:

You don't need to be a networking expert. SRE Agent understands the relationships between Azure resources: private endpoints, DNS zones, VNet links, managed identities. It connects dots you didn't know existed.

You don't need to guess. Instead of clicking through the portal hoping something looks wrong, the agent systematically eliminates possibilities like a senior engineer would.

You don't break your workflow. SRE Agent suggests fixes in your Bicep and source code, not portal changes. Everything stays version controlled. Deployed through pipelines. No hot fixes at 2 AM.

You close the loop. AI helps you build fast. Now AI helps you debug fast too.

Try It Yourself

Do you vibe code your app, your infrastructure, or both? How do you debug when things break?

Here's a challenge: Vibe code a todo app with a Web App, VNet, private endpoint, and SQL database. "Forget" to link the DNS zone to the VNet. Deploy it. Watch it fail.
Then point SRE Agent at it and see how it identifies the root cause, creates a GitHub issue with the fix, and hands it off to Coding Agent for a PR.

Share your experience. I'd love to hear how it goes.

Learn More
• Azure SRE Agent documentation
• Azure SRE Agent blogs
• Azure SRE Agent community
• Azure SRE Agent home page
• Azure SRE Agent pricing

Extend SRE Agent with MCP: Build an Agentic Workflow to Triage Customer Issues
Your inbox is full. GitHub issues piling up. "App not working." "How do I configure alerts?" "Please add dark mode." You open each one, figure out what it is, ask for more info, add labels, route to the right team. An hour later, you're still sorting issues. Sound familiar?

The Triage Tax

Every L1 support engineer, PM, and on-call developer who's handled customer issues knows this pain. When tickets come in, you're not solving problems, you're sorting them. Read the issue. Is it a bug or a question? Check the docs. Does this feature exist? Ask for more info. Wait two days. Re-triage. Add labels. Route to engineering.

It's tedious. It requires judgment: you need to understand the product, know what info is needed, and check documentation. And honestly? It's work that nobody volunteers for but someone has to do.

In large organizations, it gets even more complex. The issue doesn't just need to be triaged, it needs to be routed to the right engineering team. Is this an auth issue? Frontend? Backend? Infrastructure? A wrong routing decision means delays, re-assignments, and frustrated customers.

What if an AI agent could do this for you?

Enter Azure SRE Agent + MCP

Here's what I built: I gave SRE Agent access to my GitHub and PagerDuty accounts via MCP, uploaded my triage rubric as a markdown file, and set it to run twice a day. No more reading every ticket manually. No more asking the same "please provide more info" questions. No more morning triage sessions.

What My Setup Looks Like

My app's customer issues come in through GitHub. My team uses PagerDuty to track bugs and incidents. So I connected both via MCP to the SRE Agent. I also uploaded my triage logic as a .md file covering how to classify issues, what info is required for each category, which labels to use, and which team handles what. And since I didn't want to run this workflow manually, I set up a scheduled task to trigger it twice a day. Now it just runs. I verify its work if I want to.
What the Agent Does
Fetches all open, unlabeled GitHub issues
Reads each issue and classifies it (bug, doc question, feature request)
Checks if required info is present
Posts a comment asking for details if needed, or acknowledges the issue
Adds appropriate labels
Creates a PagerDuty incident for bugs ready for engineering
Moves to the next issue

How I Built It: Step by Step
Let me walk you through exactly how I set this up inside SRE Agent.

Step 1: Create an SRE Agent
I created a new SRE Agent in the Azure portal. Since this workflow triages GitHub issues and not Azure resources, I didn't need to configure any Azure resource groups or subscriptions. Just an agent.

Step 2: Connect MCP Servers
I added two MCP servers to give the agent access to my tools:
GitHub MCP – Fetch issues, post comments, add labels
PagerDuty MCP – Create incidents for bugs that need the dev team's attention
MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it.

Step 3: Create Subagents
I created two focused subagents, each with a specific job and only the tools it needs:
GitHub Issue Triager
"You are expert in triaging GitHub issues, classifying them into categories such as user needs to supply additional information, bug, documentation question, or feature request. Use the knowledge base to search for the right document that helps you with performing this triaging. Perform all actions autonomously without waiting for user input. Hand off to Incident Creator for the issues you classified as bugs."
Tools: GitHub MCP (issues, labels, comments)
Incident Creator
"You are expert in managing incidents in PagerDuty, listing services, incidents, creating incidents with all details. Once done, hand off back to GitHub Issue Triager."
Tools: PagerDuty MCP (services, incidents)
The handoff between them creates a workflow. They collaborate without human involvement.
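The per-issue flow and the handoff rule described above can be sketched as a small decision function. This is only an illustration of the logic, not how SRE Agent implements it: the `TriagedIssue` type, the `plan_actions` function, the `needs-info` label, and the action strings are all hypothetical names, and the classification itself (which the agent does from the rubric) is assumed to have already happened.

```python
from dataclasses import dataclass, field


@dataclass
class TriagedIssue:
    """Hypothetical result of the triager reading one GitHub issue."""
    number: int
    category: str  # assumed values: "bug", "doc-question", "feature-request"
    missing_info: list[str] = field(default_factory=list)


def plan_actions(issue: TriagedIssue) -> list[str]:
    """Return the ordered actions for one issue: comment, labels, optional handoff."""
    actions = []

    # Ask for details when required info is missing, otherwise acknowledge.
    if issue.missing_info:
        actions.append("comment: please provide " + ", ".join(issue.missing_info))
    else:
        actions.append("comment: acknowledged")

    # Always label by category; flag incomplete issues with an extra label.
    actions.append(f"label: {issue.category}")
    if issue.missing_info:
        actions.append("label: needs-info")

    # Only complete bug reports are handed off for incident creation.
    if issue.category == "bug" and not issue.missing_info:
        actions.append("handoff: Incident Creator")

    return actions


if __name__ == "__main__":
    print(plan_actions(TriagedIssue(101, "bug")))
    print(plan_actions(TriagedIssue(102, "bug", ["steps to reproduce"])))
```

Keeping the handoff condition in one place makes it easy to review which issues reach PagerDuty and which stay in GitHub waiting for more information.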
Step 4: Add Your Knowledge
I uploaded my triage logic as a .md file to the agent's knowledge base. This is my rubric - my mental model for how to triage issues:
How do I classify bugs vs. doc questions vs. feature requests?
What info is required for each category?
What labels do I use?
When should an incident be created?
Which team handles which type of issue?
I wrote it down the way I'd explain it to a new teammate. The agent searches and follows it.

Step 5: Add a Scheduled Task
I didn't want to trigger this workflow manually every time. SRE Agent supports scheduled tasks: workflows that run automatically on a cadence. I set up a trigger to run twice a day: morning and evening. Now the workflow is fully automated.
Here is the end-to-end automated agentic workflow to triage customer tickets.

Why MCP Matters
Every team uses different tools. Maybe your customer issues live in Zendesk, incidents go to ServiceNow, and you use Jira or Azure DevOps. SRE Agent doesn't lock you in. With MCP, you connect to whatever tools you already use. The agent orchestrates across them. That's the extensibility model: your tools, your workflow, orchestrated by the agent.

The Result
Before: 2 hours every morning sorting tickets.
After: By the time anyone logs in, issues are labeled, missing-info requests are posted, urgent bugs have incidents, and feature requests are acknowledged.
Your team can finally focus on the complex stuff, not sorting tickets.

Why This Matters
Faster response times. Issues get acknowledged in minutes, not days.
Consistent classification. No "this should have been a P1" moments. No tickets bouncing between teams.
Happier customers. They get a response immediately, even if it's just "we're looking into it."
Focus on what matters. Your team should be solving problems, not sorting them.

The Bottom Line
Triage isn't the job, it's the tax on the job. It quietly eats the hours your team could spend building, debugging, and shipping.
You don't need to build a custom triage bot. You don't need to wire up webhooks and write glue code. You give the SRE Agent your tools, your logic, and a schedule, and it handles the sorting. Use GitHub? Connect GitHub. Use Zendesk? Connect Zendesk. PagerDuty, ServiceNow, Jira - whatever your team runs on, the agent meets you there. Stop sorting tickets. Start shipping.

A Few Tips
Test MCP endpoints before configuring them in the SRE Agent
Give each subagent only the tools it needs; don't enable everything
Start read-only until you trust the classification, then enable comments

Do You Still Want to Triage Issues Manually?
What tools does your team use to track customer-reported issues and incidents? Let us know in the comments, we'd love to hear how you'd use this workflow with your stack. Is triage your most toilsome workflow, or is there something even worse eating your team's time?

Capacity Planning with Azure Container Apps Workload Profiles
Overview
Azure Container Apps (ACA) simplifies container orchestration, but capacity planning often confuses developers. Questions like “How do replicas consume node resources?”, “When does ACA add nodes?”, and “How should I model limits and requests?” come up repeatedly. This guide pairs official ACA guidance with practical examples to demystify workload profiles, autoscaling, and resource modelling.

1. Understanding Workload Profiles in Azure Container Apps
ACA offers three profile types:
Consumption
  Scales to zero
  Platform decides node size
  Billing per replica execution time
Dedicated
  Choose VM SKU (e.g., D4 → 4 vCPU, 16 GiB RAM)
  Billing per node
Flex (Preview)
  Combines dedicated isolation with consumption-like billing
Each profile defines node-level resources. For example: D4 → 4 vCPU, 16 GiB RAM per node.

2. How Replicas Consume Node Resources
ACA runs on managed Kubernetes.
Node = VM with fixed resources
Replica = Pod scheduled on a node
Replicas share node resources; ACA packs replicas until node capacity is full.
Example
Node: D4 (4 vCPU, 16 GiB RAM)
Replica requests: 1 vCPU, 2 GiB
5 replicas → Needs 5 vCPU, 10 GiB
ACA places 4 replicas on Node 1 and adds Node 2 for replica 5.

3. When ACA Adds Nodes
ACA adds nodes when:
Pending replicas cannot fit on existing nodes
Resource requests exceed available capacity
ACA uses Kubernetes scheduling principles. Nodes scale out when pods are not schedulable due to CPU/memory constraints.

4. Practical Sizing Strategy
Identify peak load → translate to CPU/memory per replica
Choose workload profile SKU (e.g., D4)
Calculate packing: node capacity ÷ replica request = max replicas per node
Add buffer (e.g., 20%) for headroom
Configure autoscaling:
  Min replicas for HA.
  Max replicas for burst.
  Min/Max nodes for cost control.

5. Common Misconceptions
Myth: “Replicas have dedicated CPU/RAM per container automatically.”
Reality: Not exactly. They consume from the node pool based on your configured CPU & memory.
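To make the way replicas consume node capacity concrete, the packing arithmetic from the sizing strategy can be sketched in a few lines. This is a simplified model under stated assumptions: it ignores any CPU or memory the platform reserves on each node, it applies the headroom buffer to the replica count (the guide does not say where the buffer goes), and the function names `replicas_per_node` and `nodes_needed` are made up for illustration.

```python
import math


def replicas_per_node(node_cpu: float, node_mem_gib: float,
                      req_cpu: float, req_mem_gib: float) -> int:
    """Max replicas that fit on one node; the scarcer resource sets the limit."""
    by_cpu = math.floor(node_cpu / req_cpu)
    by_mem = math.floor(node_mem_gib / req_mem_gib)
    return min(by_cpu, by_mem)


def nodes_needed(replicas: int, per_node: int, headroom_pct: int = 0) -> int:
    """Nodes required for `replicas`, padded by an integer headroom percentage."""
    padded = math.ceil(replicas * (100 + headroom_pct) / 100)
    return math.ceil(padded / per_node)


if __name__ == "__main__":
    # D4 node (4 vCPU, 16 GiB) with replicas requesting 1 vCPU and 2 GiB each.
    per_node = replicas_per_node(4, 16, 1, 2)
    print(per_node)                                   # 4 (CPU-bound: min(4, 8))
    print(nodes_needed(5, per_node))                  # 2
    print(nodes_needed(5, per_node, headroom_pct=20)) # 2 (6 replicas over 4 per node)
```

In this example CPU is the binding resource: memory alone would allow 8 replicas per node, so half of each node's RAM stays unused unless the requests or the SKU are changed.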
Multiple replicas compete for the same node until capacity is exhausted.
Myth: “ACA node scaling is CPU-time based.”
Reality: ACA node scaling is driven by non-schedulable replicas (replicas that cannot be placed because of their configured resources). Triggers for replica scaling are KEDA rules (HTTP, queue, CPU/memory %, etc.), but node scaling follows from replica placement pressure.

6. Key Takeaways
Model per-node packing before setting replica counts
Plan for zero-downtime upgrades (double replicas temporarily)
Monitor autoscaling behavior; defaults may not fit every workload

7. Relevant Links
Azure Container Apps plan types | Microsoft Learn
Compute and billing structures in Azure Container Apps | Microsoft Learn
Post questions | Provide product feedback

Stop Running Runbooks at 3 am: Let Azure SRE Agent Do Your On-Call Grunt Work
Your pager goes off. It's 2:47am. Production is throwing 500 errors. You know the drill - SSH into this, query that, check these metrics, correlate those logs. Twenty minutes later, you're still piecing together what went wrong. Sound familiar?

The On-Call Reality Nobody Talks About
Every SRE, DevOps engineer, and developer who's carried a pager knows this pain. When incidents hit, you're not solving problems - you're executing runbooks. Copy-paste this query. Check that dashboard. Run these az commands. Connect the dots between five different tools.
It's tedious. It's error-prone at 3am. And honestly? It's work that doesn't require human creativity but requires human time.
What if an AI agent could do this for you?

Enter Azure SRE Agent + Runbook Automation
Here's what I built: I gave SRE Agent a simple markdown runbook containing the same diagnostic steps I'd run manually during an incident. The agent executes those steps, collects evidence, and sends me an email with everything I need to take action. No more bouncing between terminals. No more forgetting a step because it's 3am and your brain is foggy.

What My Runbook Contains
Just the basics any on-call would run:
az monitor metrics – CPU, memory, request rates
Log Analytics queries – Error patterns, exception details, dependency failures
App Insights data – Failed requests, stack traces, correlation IDs
az containerapp logs – Revision logs, app configuration
That's it. Plain markdown with KQL queries and CLI commands. Nothing fancy.

What the Agent Does
Reads the runbook from its knowledge base
Executes each diagnostic step
Collects results and evidence
Sends me an email with analysis and findings
I wake up to an email that says: "CPU spiked to 92% at 2:45am, triggering connection pool exhaustion. Top exception: SqlException (1,832 occurrences). Errors correlate with traffic spike. Recommend scaling to 5 replicas."
All the evidence. All the queries used. All the timestamps. Ready for me to act.
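For a sense of what such a runbook's diagnostic steps might look like, here is a rough sketch that only builds the commands and a query as data, without running anything. The author's actual runbook is not shown in the post, so everything here is an assumption: the helper names, the metric names (`UsageNanoCores`, `WorkingSetBytes`, `Requests`), the placeholder resource ID, and the KQL against the `AppExceptions` table are illustrative and should be checked against your own resources.

```python
import shlex


def metrics_cmd(resource_id: str, metric: str, interval: str = "PT1M") -> list[str]:
    """Build an `az monitor metrics list` call for one metric (not executed here)."""
    return ["az", "monitor", "metrics", "list",
            "--resource", resource_id,
            "--metric", metric,
            "--interval", interval,
            "--aggregation", "Average"]


def logs_cmd(app: str, resource_group: str, tail: int = 100) -> list[str]:
    """Build an `az containerapp logs show` call for recent console logs."""
    return ["az", "containerapp", "logs", "show",
            "--name", app,
            "--resource-group", resource_group,
            "--tail", str(tail)]


# Assumed KQL for a workspace-based Application Insights resource.
TOP_EXCEPTIONS_KQL = """
AppExceptions
| where TimeGenerated > ago(30m)
| summarize occurrences = count() by ExceptionType
| top 5 by occurrences desc
""".strip()


if __name__ == "__main__":
    resource_id = "<container-app-resource-id>"  # placeholder, fill in your own
    for metric in ("UsageNanoCores", "WorkingSetBytes", "Requests"):
        print(shlex.join(metrics_cmd(resource_id, metric)))
    print(shlex.join(logs_cmd("my-app", "my-rg")))
    print(TOP_EXCEPTIONS_KQL)
```

The printed lines are the kind of content that would go into the markdown runbook, one step per heading, so that every diagnostic step the agent runs is something a human on-call could also run by hand.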
How to Set This Up (6 Steps)
Here's how you can build this yourself:

Step 1: Create SRE Agent
Create a new SRE Agent in the Azure portal. No Azure resource groups to configure. If your apps run on Azure, the agent pulls context from the incident itself. If your apps run elsewhere, you don't need Azure resource configuration at all.

Step 2: Grant Reader Permission (Optional)
If your runbooks execute against Azure resources, assign the Reader role to the SRE Agent's managed identity on your subscription. This allows the agent to run az commands and query metrics. Skip this if your runbooks target non-Azure apps.

Step 3: Add Your Runbook to SRE Agent's Knowledge Base
You already have runbooks: they're in your wiki, Confluence, or team docs. Just add them as .md files to the agent's knowledge base. To learn about other ways to link your runbooks to the agent, read this.

Step 4: Connect Outlook
Connect the agent to your Outlook so it can send you the analysis email with findings.

Step 5: Create a Subagent
Create a subagent with simple instructions like:
"You are an expert in triaging and diagnosing incidents. When triggered, search the knowledge base for the relevant runbook, execute the diagnostic steps, collect evidence, and send an email summary with your findings."
Assign the tools the agent needs:
RunAzCliReadCommands – for az monitor, az containerapp commands
QueryLogAnalyticsByWorkspaceId – for KQL queries against Log Analytics
QueryAppInsightsByResourceId – for App Insights data
SearchMemory – to find the right runbook
SendOutlookEmail – to deliver the analysis

Step 6: Set Up Incident Trigger
Connect your incident management tool (PagerDuty, ServiceNow, or Azure Monitor alerts) and set up the incident trigger to the subagent. When an incident fires, the agent kicks off automatically.
That's it. Your agentic workflow now looks like this:

This Works for Any App, Not Just Azure
Here's the thing: SRE Agent is platform agnostic.
It's executing your runbooks, whatever they contain. On-prem databases? Add your diagnostic SQL. Custom monitoring stack? Add those API calls. The agent doesn't care where your app runs. It cares about following your runbook and getting you answers.

Why This Matters
Lower MTTR. By the time you're awake and coherent, the analysis is done.
Consistent execution. No missed steps. No "I forgot to check the dependencies" at 4am.
Evidence for postmortems. Every query, every result, timestamped and documented.
Focus on what matters. Your brain should be deciding what to do, not gathering data.

The Bottom Line
On-call runbook execution is the most common, most tedious, and most automatable part of incident response. It's grunt work that pulls engineers away from the creative problem-solving they were hired for. SRE Agent offloads that work from your plate. You write the runbook once, and the agent executes it every time, faster and more consistently than any human at 3am.
Stop running runbooks. Start reviewing results.
Try it yourself: Create a markdown runbook with your diagnostic queries and commands, add it to your SRE Agent's knowledge base, and let the agent handle your next incident. Your 3am self will thank you.