azure container apps
From "Maybe Next Quarter" to "Running Before Lunch" on Container Apps - Modernizing Legacy .NET App
In early 2025, we wanted to modernize Jon Galloway's MVC Music Store, a classic ASP.NET MVC 5 app running on .NET Framework 4.8 with Entity Framework 6. The goal was straightforward: address vulnerabilities, enable managed identity, and deploy to Azure Container Apps and Azure SQL. No more plaintext connection strings. No more passwords in config files.

We hit a wall immediately. Entity Framework on .NET Framework did not support Azure.Identity or DefaultAzureCredential. We could not just add a NuGet package and call it done. We'd need EF Core, which means modern .NET, and that means rewriting the data layer, the identity system, the startup pipeline, and the views. The engineering team estimated one week of dedicated developer work. As a product manager without extensive .NET modernization experience, I wasn't able to complete it quickly on my own, so the project was placed in the backlog.

This was before GitHub Copilot "Agent" mode. GitHub Copilot app modernization (a specialized agent with skills for modernization) existed but only offered assessment: it could tell you what needed to change, but it couldn't make the end-to-end changes for you.

Fast-forward one year. The full modernization agent is available. I sat down with the same app and the same goal. A few hours later, it was running on .NET 10 on Azure Container Apps with managed identity, Key Vault integration, and zero plaintext credentials. Thank you, GitHub Copilot app modernization! And while we were at it, GitHub Copilot helped modernize the experience as well, built more tests, and generated more synthetic data for testing.

Why Azure Container Apps?

Azure Container Apps is an ideal deployment target for this modernized MVC Music Store application because it provides a serverless, fully managed container hosting environment. It abstracts away infrastructure management while natively supporting the key security and operational features this project required.
It pairs naturally with infrastructure-as-code deployments, and its per-second billing on a consumption plan keeps costs minimal for a lightweight web app like this, eliminating the overhead of managing Kubernetes clusters while still giving you the container portability that modern .NET apps benefit from. That is why I asked Copilot to modernize to Azure Container Apps. Here's how it went.

Phase 1: Assessment

GitHub Copilot app modernization started by analyzing the codebase and producing a detailed assessment:

- Framework gap analysis: .NET Framework 4.8 → .NET 10, identifying every breaking change
- Dependency inventory: Entity Framework 6 (not EF Core), MVC 5 references, System.Web dependencies
- Security findings: plaintext SQL connection strings in Web.config, no managed identity support
- API surface changes: Global.asax → Program.cs minimal hosting, System.Web.Mvc → Microsoft.AspNetCore.Mvc

The assessment is not a generic checklist. It reads your code (your controllers, your DbContext, your views) and maps a concrete modernization path. For this app, the key finding was clear: EF 6 on .NET Framework cannot support DefaultAzureCredential. The entire data layer needs to move to EF Core on modern .NET to unlock passwordless authentication.

Phase 2: Code & Dependency Modernization

This is where last year's experience ended and this year's began.
The agent performed the actual modernization:

Project structure:
- .csproj converted from legacy XML format to SDK-style targeting net10.0
- Global.asax replaced with Program.cs using minimal hosting
- packages.config → NuGet PackageReference entries

Data layer (the hard part):
- Entity Framework 6 → EF Core with Microsoft.EntityFrameworkCore.SqlServer
- DbContext rewritten with OnModelCreating fluent configuration
- System.Data.Entity → Microsoft.EntityFrameworkCore namespace throughout
- EF Core migrations generated from scratch
- Database seeding moved to a proper DbSeeder pattern with MigrateAsync()

Identity:
- ASP.NET Membership → ASP.NET Core Identity with ApplicationUser, ApplicationDbContext
- Cookie authentication configured through ConfigureApplicationCookie

Security (the whole trigger for this modernization):
- Azure.Identity + DefaultAzureCredential integrated in Program.cs
- Azure Key Vault configuration provider added via Azure.Extensions.AspNetCore.Configuration.Secrets
- Connection strings use Authentication=Active Directory Default, with no passwords anywhere
- Application Insights wired through OpenTelemetry

Views:
- Razor views updated from MVC 5 helpers to ASP.NET Core Tag Helpers and conventions
- _Layout.cshtml and all partials migrated

The code changes touched every layer of the application. This is not a find-and-replace. It's a structural rewrite that maintains functional equivalence.

Phase 3: Local Testing

After modernization, the app builds, runs locally, and connects to a local SQL Server (or SQL in a container). EF Core migrations apply cleanly, the seed data loads, and you can browse albums, add to cart, and check out. The identity system works. The Key Vault integration gracefully skips when KeyVaultName isn't configured, meaning local dev and Azure use the same Program.cs with zero code branches.
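The two configuration rules described above (a passwordless SQL connection string, and a Key Vault provider that is skipped when KeyVaultName is unset) can be sketched in a few lines. This is an illustrative Python sketch, not the C# the agent generated: the function names, the server and database names, and the vault name are assumptions made for the example. Only the `Authentication=Active Directory Default` keyword and the `KeyVaultName` setting come from the post.

```python
from typing import Mapping, Optional


def build_sql_connection_string(server: str, database: str) -> str:
    """Return a passwordless connection string (no User ID or Password keys)."""
    return (
        f"Server=tcp:{server},1433;"
        f"Database={database};"
        "Authentication=Active Directory Default;"
        "Encrypt=True;"
    )


def key_vault_uri(settings: Mapping[str, str]) -> Optional[str]:
    """Return the vault URI, or None when KeyVaultName is missing or blank."""
    name = settings.get("KeyVaultName", "").strip()
    if not name:
        return None  # local development: skip the Key Vault provider entirely
    return f"https://{name}.vault.azure.net/"


if __name__ == "__main__":
    # Hypothetical server, database, and vault names, for illustration only.
    print(build_sql_connection_string("example.database.windows.net", "MusicStore"))
    print(key_vault_uri({}))
    print(key_vault_uri({"KeyVaultName": "kv-musicstore"}))
```

Returning None instead of raising keeps one code path for both environments: the caller registers the provider only when a URI comes back, which mirrors the "zero code branches" behavior the post describes.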
Phase 4: azd up and Deployment to Azure

The agent also generates the deployment infrastructure:

- azure.yaml: AZD service definition pointing to the Dockerfile, targeting Azure Container Apps
- Dockerfile: multi-stage build using mcr.microsoft.com/dotnet/sdk:10.0 and aspnet:10.0
- infra/main.bicep: full IaC including:
  - Azure Container Apps with system + user-assigned managed identity
  - Azure SQL Server with Azure AD-only authentication (no SQL auth)
  - Azure Key Vault with RBAC, Secrets Officer role for the managed identity
  - Container Registry with ACR Pull role assignment
  - Application Insights + Log Analytics
  - All connection strings injected as Container App secrets, using Active Directory Default, not passwords

One command: azd up

It provisions everything, builds the container, pushes to ACR, and deploys to Container Apps. The app starts, runs MigrateAsync() on first boot, seeds the database, and serves traffic. Managed identity handles all auth to SQL and Key Vault. No credentials stored anywhere.

What Changed in a Year

| | Early 2025 | Now |
| Assessment | Available | Available |
| Automated code modernization | Semi-manual | ✅ Full modernization agent |
| Infrastructure generation | Semi-manual | ✅ Bicep + AZD generated |
| Time to complete | Weeks | ✅ Hours |

The technology didn't just improve incrementally. The gap between "assessment" and "done" collapsed. A year ago, knowing what to do and being able to do it were very different things. Now they're the same step.

Who This Is For

If you have a .NET Framework app sitting on a backlog because "the modernization is too expensive", revisit that assumption. The process changed. GitHub Copilot app modernization helps you rewrite your data layer, generates your infrastructure, and gets you to azd up. It can help you generate tests to increase your code coverage.
If you have feature requests, or want to further optimize the code for scale, bring your requirements, logs, or profile traces. You can take care of all of that during the modernization process.

MVC Music Store went from .NET Framework 4.8 with Entity Framework 6 and plaintext SQL credentials to .NET 10 on Azure Container Apps with managed identity, Key Vault, and zero secrets in code. In an afternoon. That backlog item might be a lunch break now 😊. Really. Find your legacy apps and try it yourself.

Next steps

- Modernize your .NET or Java apps with GitHub Copilot app modernization: https://aka.ms/ghcp-appmod
- Open your legacy application in Visual Studio or Visual Studio Code to start the process
- Deploy to Azure Container Apps: https://aka.ms/aca/start

MCP-Driven Azure SRE for Databricks
Azure SRE Agent is an AI-powered operations assistant built for incident response and governance. MCP (Model Context Protocol) is the standard interface it uses to connect to external systems and tools. Azure SRE Agent integrates with Azure Databricks through MCP to provide:

- Proactive Compliance - Automated best practice validation
- Reactive Troubleshooting - Root cause analysis and remediation for job failures

This post demonstrates both capabilities with real examples.

Architecture

The Azure SRE Agent orchestrates Ops Skills and Knowledge Base prompts, then calls the Databricks MCP server over HTTPS. The MCP server translates those requests into Databricks REST API calls, returns structured results, and the agent composes findings, evidence, and remediation. End-to-end, this yields a single loop: intent -> MCP tool calls -> Databricks state -> grounded response.

Deployment

The MCP server runs as a containerized FastMCP application on Azure Container Apps, fronted by HTTPS and configured with Databricks workspace connection settings. It exposes a tool catalog that the agent invokes through MCP, while the container handles authentication and REST API calls to Databricks.

👉 For deployment instructions, see the GitHub repository.

Getting Started

- Deploy the MCP Server: Follow the quickstart guide to deploy to Azure Container Apps (~30 min)
- Configure Azure SRE Agent:
  - Create MCP connector with streamable-http transport
  - Upload Knowledge Base from Builder > Knowledge Base using the Best Practices doc: AZURE_DATABRICKS_BEST_PRACTICES.md. Benefit: Gives the agent authoritative compliance criteria and remediation commands.
  - Create Ops Skill from Builder > Subagent Builder > Create skill and drop the Ops Skill doc: DATABRICKS_OPS_RUNBOOK_SKILL.md. Benefit: Adds incident timelines, runbooks, and escalation triggers to responses.
  - Deploy the subagent YAML: Databricks_MCP_Agent.yaml. Benefit: Wires the MCP connector, Knowledge Base, and Ops Skill into one agent for proactive and reactive workflows.
- Integrate with Alerting:
  - Connect PagerDuty/ServiceNow webhooks
  - Enable auto-remediation for common issues

Part 1: Proactive Compliance

Use Case: Best Practice Validation

Prompt: @Databricks_MCP_Agent: Validate the Databricks workspace for best practices compliance and provide a summary, detailed findings, and concrete remediation steps.

What the Agent Does:

- Calls MCP tools to gather current state:
  - list_clusters() - Audit compute configurations
  - list_catalogs() - Check Unity Catalog setup
  - list_jobs() - Review job configurations
  - execute_sql() - Query governance policies
- Cross-references findings with Knowledge Base (best practices document)
- Generates prioritized compliance report

Expected Output:

Benefits:

- Time Savings: 5 minutes vs. 2-3 hours manual review
- Consistency: Same validation criteria across all workspaces
- Actionable: Specific remediation steps with code examples

Part 2: Reactive Incident Response

Example 1: Job Failure - Non-Zero Exit Code

Scenario: Job job_exceptioning_out fails repeatedly due to notebook code errors.

Prompt:

Agent Investigation - Calls MCP Tools:

- get_job() - Retrieves job definition
- list_job_runs() - Gets recent run history (4 failed runs)
- get_run_output() - Analyzes error logs

Root Cause Analysis:

Expected Outcome:

- Root Cause Identified: sys.exit(1) in notebook code
- Evidence Provided: Job ID, run history, code excerpt, settings
- Confidence: HIGH (explicit failing code present)
- Remediation: Fix code + add retry policy
- Resolution Time: 3-5 minutes (vs. 30-45 minutes manual investigation)

Example 2: Job Failure - Task Notebook Exception

Scenario: Job hourly-data-sync fails repeatedly due to an exception in the task notebook.
Prompt:

Agent Investigation - Calls MCP Tools:

- get_job() - Job definition and task configuration
- list_job_runs() - Recent runs show "TERMINATED with TIMEOUT"
- execute_sql() - Queries notebook metadata

Root Cause Analysis:

Expected Outcome:

- Root Cause Identified: Exception at line 7 - null partition detected
- Evidence: Notebook path, code excerpt (lines 5-7), run history (7 consecutive failures)
- Confidence: HIGH (explicit failing code + TIMEOUT/queue issues)
- Remediation: Fix exception handling + add retry policy
- Resolution Time: 5-8 minutes (vs. 45+ minutes manual log analysis)

Key Benefits

Proactive Governance
- ✅ Continuous compliance monitoring
- ✅ Automated best practice validation
- ✅ 95% reduction in manual review time

Reactive Incident Response
- 🚨 Automated root cause analysis
- ⚡ 80-95% reduction in MTTR
- 🧠 Context-aware remediation recommendations
- 📊 Evidence-based troubleshooting

Operational Impact

| Metric | Before | After | Improvement |
| Compliance review time | 2-3 hours | 5 minutes | 95% |
| Job failure investigation | 30-45 min | 3-8 min | 85% |
| On-call alerts requiring intervention | 4-6 per shift | 1-2 per shift | 70% |

Conclusion

Azure SRE Agent transforms Databricks operations by combining proactive governance with reactive troubleshooting. The MCP integration provides:

- Comprehensive visibility into workspace health
- Automated compliance monitoring and validation
- Intelligent incident response with root cause analysis
- Self-healing capabilities for common failures

Result: Teams spend less time firefighting and more time building.

Resources

- 📘 Deployment Guide
- 🤖 Subagent Configuration
- 📋 Best Practices Document
- 🧰 Ops Skill Runbook
- 🔧 Validation Script
- 📖 Azure SRE Agent Documentation
- 📰 Azure SRE Agent Blogs
- 📜 MCP Specification

Questions? Open an issue on GitHub or reach out to the Azure SRE team.

Unifying Scattered Observability Data from Dynatrace + Azure for Self-Healing with SRE Agent
What if your deployments could fix themselves?

The Deployment Remediation Challenge

Modern operations teams face a recurring nightmare:

- A deployment ships at 9 AM
- Errors spike at 9:15 AM
- By the time you correlate logs, identify the bad revision, and execute a rollback, it's 10:30 AM
- Your users felt 75 minutes of degraded experience

The data to detect and fix this existed the entire time, but it was scattered across clouds and platforms:

- Error logs and traces → Dynatrace (third-party observability cloud)
- Deployment history and revisions → Azure Container Apps API
- Resource health and metrics → Azure Monitor
- Rollback commands → Azure CLI

Your observability data lives in one cloud. Your deployment data lives in another. Stitching together log analysis from Dynatrace with deployment correlation from Azure, and then executing remediation, required a human to manually bridge these silos.

What if an AI agent could unify data from third-party observability platforms with Azure deployment history and act on it automatically, every week, before users even notice?

Enter SRE Agent + Model Context Protocol (MCP) + Subagents

Azure SRE Agent doesn't just work with Azure. Using the Model Context Protocol (MCP), you can connect external observability platforms like Dynatrace directly to your agent. Combined with subagents for specialized expertise and scheduled tasks for automation, you can build an automated deployment remediation system.
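The manual correlation step described above, matching an error spike to the revision that shipped just before it, can be sketched as follows. This is a hypothetical illustration: the `Revision` type, the revision names, and the timestamps are made up for the example, and the code does not call the Dynatrace or Azure APIs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Revision:
    name: str
    deployed_at: datetime


def suspect_revision(revisions: Sequence[Revision], spike_at: datetime) -> Optional[Revision]:
    """Return the most recent revision deployed at or before the error spike."""
    earlier = [r for r in revisions if r.deployed_at <= spike_at]
    return max(earlier, key=lambda r: r.deployed_at, default=None)


if __name__ == "__main__":
    # Hypothetical revision history and spike time, mirroring the 9 AM / 9:15 AM timeline.
    revisions = [
        Revision("api--rev1", datetime(2025, 1, 6, 8, 0)),
        Revision("api--rev2", datetime(2025, 1, 6, 9, 0)),
    ]
    spike = datetime(2025, 1, 6, 9, 15)
    culprit = suspect_revision(revisions, spike)
    print(culprit.name if culprit else "no deployment precedes the spike")
```

Picking the latest revision before the spike is only a first guess. A real workflow would also weigh how close the two timestamps are and whether the errors come from the changed service.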
Here's what I built and configured for my Azure Container Apps environment inside SRE Agent:

| Component | Purpose |
| Dynatrace MCP Connector | Connect to Dynatrace's MCP gateway for log queries via DQL |
| 'Dynatrace' Subagent | Log analysis specialist that executes DQL queries and identifies root causes |
| 'Remediation' Subagent | Deployment remediation specialist that correlates errors with deployments and executes rollbacks |
| Scheduled Task | Weekly Monday 9:30 AM health check for the 'octopets-prod-api' Container App |

Subagent workflow: in SRE Agent Builder, 'OctopetsScheduledTask' triggers 'RemediationSubagent' (12 tools), which hands off to 'DynatraceSubagent' (3 MCP tools) for log analysis.

How I Set It Up: Step by Step

Step 1: Connect Dynatrace via MCP

SRE Agent supports the Model Context Protocol (MCP) for connecting external data sources. Dynatrace exposes an MCP gateway that provides access to its APIs as first-class tools.

Connection configuration:

    {
      "name": "dynatrace-mcp-connector",
      "dataConnectorType": "Mcp",
      "dataSource": "Endpoint=https://<your-tenant>.live.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp;AuthType=BearerToken;BearerToken=<your-api-token>"
    }

Once connected, SRE Agent automatically discovers Dynatrace tools.

💡 Tip: When creating your Dynatrace API token, grant the `entities.read`, `events.read`, and `metrics.read` scopes for comprehensive access.

Step 2: Build Specialized Subagents

Generic agents are good. Specialized agents are better. I created two subagents that work together in a coordinated workflow: one for Dynatrace log analysis, the other for deployment remediation.

DynatraceSubagent

This subagent is the log analysis specialist. It uses the Dynatrace MCP tools to execute DQL queries and identify root causes.
Key capabilities:

- Executes DQL queries via MCP tools (`create-dql`, `execute-dql`, `explain-dql`)
- Fetches 5xx error counts, request volumes, and spike detection
- Returns consolidated analysis with root cause, affected services, and error patterns

👉 View full DynatraceSubagent configuration here

RemediationSubagent

This is the deployment remediation specialist. It correlates Dynatrace log analysis with Azure Container Apps deployment history, generates correlation charts, and executes rollbacks when confidence is high.

Key capabilities:

- Retrieves Container Apps revision history (`GetDeploymentTimes`, `ListRevisions`)
- Generates correlation charts (`PlotTimeSeriesData`, `PlotBarChart`, `PlotAreaChartWithCorrelation`)
- Computes confidence score (0-100%) for deployment causation
- Executes rollback and traffic shift when confidence > 70%

👉 View full RemediationSubagent configuration here

The power of specialization: Each agent focuses on its domain. DynatraceSubagent handles log analysis, and RemediationSubagent handles deployment correlation and rollback. When the workflow runs, RemediationSubagent hands off to DynatraceSubagent (bi-directional handoff) for analysis, gets the findings back, and continues with remediation. Simple delegation, not a single monolithic agent trying to do everything.

Step 3: Create the Weekly Scheduled Task

Now the automation. I configured a scheduled task that runs every Monday at 9:30 AM to check whether deployments in the last 4 hours caused any issues, and to remediate automatically if needed.

Scheduled task configuration:

| Setting | Value |
| Task Name | OctopetsScheduledTask |
| Frequency | Weekly |
| Day of Week | Monday |
| Time | 9:30 AM |
| Response Subagent | RemediationSubagent |

Figure: Configuring the OctopetsScheduledTask in the SRE Agent portal

The key insight: the scheduled task is just a coordinator. It immediately hands off to the RemediationSubagent, which orchestrates the entire workflow including handoffs to DynatraceSubagent.
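The rollback gate described above combines two rules from the post: only deployments from the last 4 hours are considered, and a rollback runs only when the confidence score exceeds 70%. A minimal sketch of that decision follows. It is an assumption about how such a gate could look, not the subagent's actual logic; the function name and inputs are hypothetical, and how the confidence score itself is computed is not specified in the post.

```python
from datetime import datetime, timedelta

CONFIDENCE_THRESHOLD = 70.0    # percent; rollback only above this value
LOOKBACK = timedelta(hours=4)  # deployments older than this are ignored


def should_roll_back(confidence: float, deployed_at: datetime, now: datetime) -> bool:
    """Roll back only for a recent deployment with confidence above the threshold."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError("confidence must be between 0 and 100")
    age = now - deployed_at
    recent = timedelta(0) <= age <= LOOKBACK
    return recent and confidence > CONFIDENCE_THRESHOLD


if __name__ == "__main__":
    now = datetime(2025, 1, 6, 9, 30)  # hypothetical Monday 9:30 AM run
    print(should_roll_back(95.0, datetime(2025, 1, 6, 9, 0), now))
    print(should_roll_back(70.0, datetime(2025, 1, 6, 9, 0), now))
    print(should_roll_back(95.0, datetime(2025, 1, 6, 3, 0), now))
```

Keeping the threshold strict (greater than, not greater than or equal) matches the "confidence > 70%" wording, so a score of exactly 70 leaves the decision to a human.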
Step 4: See It In Action

Here's what happens when the scheduled task runs:

- The scheduled task triggers and initiates Dynatrace analysis for octopets-prod-api.
- The DynatraceSubagent analyzes the logs and identifies the root cause, executing DQL queries and returning consolidated log analysis.
- The RemediationSubagent then generates correlation charts.
- Finally, with a 95% confidence score, SRE Agent executes the rollback and traffic shift autonomously.

The agent detected the bad deployment, generated visual evidence, and automatically shifted 100% of traffic to the last known working revision, all without human intervention.

Why This Matters

| Before | After |
| Manually check Dynatrace after incidents | Automated DQL queries via MCP |
| Stitch together logs + deployments manually | Subagents correlate data automatically |
| Rollback requires human decision + execution | Confidence-based auto-remediation |
| 75+ minutes from deployment to rollback | Under 5 minutes with autonomous workflow |
| Reactive incident response | Proactive weekly health checks |

Try It Yourself

- Connect your observability tool via MCP (Dynatrace, Datadog, Prometheus, or any tool with an MCP gateway)
- Build a log analysis subagent that knows how to query your observability data
- Build a remediation subagent that can correlate logs with deployments and execute fixes
- Wire them together with handoffs so the subagents can delegate log analysis
- Create a scheduled task to trigger the workflow automatically

Learn More

- Azure SRE Agent documentation
- Model Context Protocol (MCP) integration guide
- Building subagents for specialized workflows
- Scheduled tasks and automation
- SRE Agent Community
- Azure SRE Agent pricing
- SRE Agent Blogs

How SRE Agent Pulls Logs from Grafana and Creates Jira Tickets Without Native Integrations
Your tools. Your workflows. SRE Agent adapts.

SRE Agent natively integrates with PagerDuty, ServiceNow, and Azure Monitor. But your team might use Jira for incident tracking. Grafana for dashboards. Loki for logs. Prometheus for metrics. These aren't natively supported. That doesn't matter.

SRE Agent supports MCP, the Model Context Protocol. Any MCP-compatible server extends the agent's capabilities. Connect your Grafana instance. Connect your Jira. The agent queries logs, correlates errors, and creates tickets with root cause analysis across tools that were never designed to talk to each other.

The Scenario

I built a grocery store app that simulates a realistic SRE scenario: an external supplier API starts rate limiting your requests. Customers see "Unable to check inventory" errors. The on-call engineer gets paged.

The goal: SRE Agent should diagnose the issue by querying Loki logs through Grafana, identify the root cause, and create a Jira ticket with findings and recommendations.

The app runs on Azure Container Apps with Loki for logs and Azure Managed Grafana for visualization.

👉 Deploy it yourself: github.com/dm-chelupati/grocery-sre-demo

How I Set Up SRE Agent: Step by Step

Step 1: Create SRE Agent

I created an SRE Agent and gave it Reader access to my subscription.

Step 2: Connect to Grafana and Jira via MCP

Neither MCP server had a remotely hosted option, and their stdio setup didn't match what SRE Agent supports. So I hosted them myself as Azure Container Apps:

- Grafana MCP Server: connects to my Azure Managed Grafana instance
- Atlassian MCP Server: connects to my Jira Cloud instance

Now I have two endpoints SRE Agent can reach:

- https://ca-mcp-grafana.<env>.azurecontainerapps.io/mcp
- https://ca-mcp-jira.<env>.azurecontainerapps.io/mcp

I added both to SRE Agent's MCP configuration as remotely hosted servers.
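Before registering a self-hosted endpoint like the ones above, it can help to check that it answers an MCP handshake. The sketch below builds (but does not send) a JSON-RPC `initialize` request of the kind the MCP specification defines for HTTP transports. It assumes the server implements the streamable HTTP transport. The protocol version string, the client name, and the example hostname are assumptions for illustration, and a real deployment may also require an authorization header.

```python
import json
import urllib.request


def build_initialize_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC `initialize` POST for an MCP endpoint."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed; use the version your server supports
            "capabilities": {},
            "clientInfo": {"name": "smoke-test", "version": "0.1.0"},  # hypothetical client
        },
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical hostname standing in for the <env> placeholder in the post.
    req = build_initialize_request("https://ca-mcp-grafana.example.azurecontainerapps.io/mcp")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

The request is left unsent on purpose, so the sketch runs without network access. Sending it against your own endpoint and getting a JSON-RPC result back is a quick sign that the container is reachable and speaking MCP.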
Step 3: Create Sub-Agent with Tools and Instructions

I created a sub-agent specifically for incident diagnosis with these tools enabled:

- Grafana MCP (for querying Loki logs)
- Atlassian MCP (for creating Jira tickets)

Instructions were simple:

"You are expert in diagnosing applications running on Azure services. You need to use the Grafana tools to get the logs, metrics or traces and create a summary of your findings inside Jira as a ticket. use your knowledge base file loki-queries.md to learn about app configuration with loki and Query the loki for logs in Grafana."

Step 4: Invoke Sub-Agent and Watch It Work

I went to the SRE Agent chat and asked:

"@JiraGrafanaexpert: My container app ca-api-3syj3i2fat5dm in resource group rg-groceryapp is experiencing rate limit errors from a supplier API when checking product inventory."

The agent:

- Queried Loki via Grafana MCP: {app="grocery-api"} |= "error"
- Found 429 rate limit errors spiking, with 55+ requests hitting supplier API limits
- Identified root cause: SUPPLIER_RATE_LIMIT_429 from FreshFoods Wholesale API
- Created a Jira ticket

One prompt. Logs queried. Root cause identified. Ticket created with remediation steps.

Making It Better: The Knowledge File

SRE Agent can explore and discover how your apps are wired, but you can speed that up. When querying observability data sources, the agent needs to learn the schema, available labels, table structures, and query syntax. For Loki, that means understanding LogQL, knowing which labels your apps use, and what JSON fields appear in logs. SRE Agent can figure things out, but with context, it gets there faster, just like humans.

I created a knowledge file that gives the agent a head start. With this context, the agent knows exactly which labels to query, what fields to extract from JSON logs, and which query patterns to use.

👉 See my full knowledge file

How MCP Makes This Possible

SRE Agent supports two ways to connect MCP servers:

- stdio: runs locally via command.
  This works for MCP servers that can be invoked via npx, node, or uvx. For example: npx -y @modelcontextprotocol/server-github.
- Remotely hosted: HTTP endpoint with streamable transport: https://mcp-server.example.com/sse or /mcp

The catch: Not every MCP server fits these options out of the box. Some servers only support stdio but not the npx/node/uvx formats SRE Agent expects. Others don't offer a hosted endpoint at all.

The solution: host them yourself. Deploy the MCP server as a container with an HTTP endpoint. That's what I did with Grafana MCP Server and Atlassian MCP Server: I deployed both as Azure Container Apps exposing /mcp endpoints.

Why This Matters

Enterprise tooling is fragmented across Azure and non-Azure ecosystems. Some teams use Azure Monitor, others use Datadog. Incident tracking might be ServiceNow in one org and Jira in another. Logs live in Loki, Splunk, Elasticsearch, and sometimes all three.

SRE Agent meets you where you are. Azure-native tools work out of the box. Everything else connects via MCP. Your observability stack stays the same. Your ticketing system stays the same. The agent becomes the orchestration layer that ties them together.

One agent. Any tool. Intelligent workflows across your entire ecosystem.

Try It Yourself

- Create an SRE Agent
- Deploy MCP servers for your tools (Grafana, Atlassian)
- Create a sub-agent with the MCP tools connected
- Add a knowledge file with your app context
- Ask it to diagnose an issue

Watch logs become tickets. Errors become action items. Context becomes intelligence.

Learn More

- Azure SRE Agent documentation
- Azure SRE Agent blogs
- Grocery SRE Demo repo
- MCP specification

Azure SRE Agent is currently in preview.

Exciting Updates Coming to Conversational Diagnostics (Public Preview)
Last year, at Ignite 2023, we unveiled Conversational Diagnostics (Preview), a revolutionary tool integrated with AI-powered capabilities to enhance problem-solving for Windows Web Apps. This year, we're thrilled to share what's new and forthcoming for Conversational Diagnostics (Preview). Get ready to experience a broader range of functionalities and expanded support across various Azure products, making your troubleshooting journey even more seamless and intuitive.

Advanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community, I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance. The article walks through a practical deployment pattern:

• Integrate your Azure Container Apps environment with a Virtual Network.
• Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services.
• Verify actual traffic paths using firewall logs to confirm that routing policies are effective.

This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet. If you're working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful.

Read the full article on my blog

Deploy Dynatrace OneAgent on your Container Apps
TOC
Introduction
Setup
References

1. Introduction

Dynatrace OneAgent is an advanced monitoring tool that automatically collects performance data across your entire IT environment. It provides deep visibility into applications, infrastructure, and cloud services, enabling real-time observability. OneAgent supports multiple platforms, including containers, VMs, and serverless architectures, ensuring seamless monitoring with minimal configuration. It captures detailed metrics, traces, and logs, helping teams diagnose performance issues, optimize resources, and enhance user experiences. With AI-driven insights, OneAgent proactively detects anomalies and automates root cause analysis, making it an essential component for modern DevOps, SRE, and cloud-native monitoring strategies.

2. Setup

1. After registering your account, go to the control panel and search for Deploy OneAgent.

2. Obtain your Environment ID and create a PaaS token. Be sure to save them for later use.

3. In your local environment's console, log in to the Dynatrace registry.

    # XXX is your Environment ID
    # Enter the PaaS token at the password prompt
    docker login -u XXX XXX.live.dynatrace.com

4. Create a Dockerfile and an sshd_config file.

Dockerfile:

    FROM mcr.microsoft.com/devcontainers/javascript-node:20

    # Change XXX into your Environment ID
    COPY --from=XXX.live.dynatrace.com/linux/oneagent-codemodules:all / /
    ENV LD_PRELOAD=/opt/dynatrace/oneagent/agent/lib64/liboneagentproc.so

    # SSH
    RUN apt-get update \
        && apt-get install -y --no-install-recommends dialog openssh-server tzdata screen lrzsz htop cron \
        && echo "root:Docker!" | chpasswd \
        && mkdir -p /run/sshd \
        && chmod 700 /root/.ssh/ \
        && chmod 600 /root/.ssh/id_rsa
    COPY ./sshd_config /etc/ssh/

    # OTHER
    EXPOSE 2222
    CMD ["/usr/sbin/sshd", "-D", "-o", "ListenAddress=0.0.0.0"]

sshd_config:

    Port 2222
    ListenAddress 0.0.0.0
    LoginGraceTime 180
    X11Forwarding yes
    Ciphers aes128-cbc,3des-cbc,aes256-cbc,aes128-ctr,aes192-ctr,aes256-ctr
    MACs hmac-sha2-256,hmac-sha2-512,hmac-sha1,hmac-sha1-96
    StrictModes yes
    SyslogFacility DAEMON
    PasswordAuthentication yes
    PermitEmptyPasswords no
    PermitRootLogin yes
    Subsystem sftp internal-sftp
    AllowTcpForwarding yes

5. Build the container and push it to Azure Container Registry (ACR).

    # YYY is your ACR name; you can choose your own image name and tag
    docker build -t oneagent:202503201710 . --no-cache
    docker tag oneagent:202503201710 YYY.azurecr.io/oneagent:202503201710
    docker push YYY.azurecr.io/oneagent:202503201710

6. Create an Azure Container App (ACA), set Ingress to port 3000, allow all inbound traffic, and specify the ACR image you just created.

7. Once the container starts, open a console and run the following commands to create a temporary HTTP server simulating a Node.js app.

    mkdir app && cd app
    echo 'console.log("Node.js app started...")' > index.js
    npm init -y
    npm install express
    cat <<EOF > server.js
    const express = require('express');
    const app = express();
    app.get('/', (req, res) => res.send('hello'));
    app.listen(3000, () => console.log('Server running on port 3000'));
    EOF
    # Press Ctrl + C to terminate the next command, then run it again; repeat 3 times
    node server.js

8. You should now see the results on the ACA homepage.

9. Go back to the Dynatrace control panel, search for Host Classic, and you should see the collected data.

3. References

Integrate OneAgent on Azure App Service for Linux and containers (Dynatrace Docs)

Find the Alerts You Didn't Know You Were Missing with Azure SRE Agent
I had 6 alert rules. CPU. Memory. Pod restarts. Container errors. OOMKilled. Job failures. I thought I was covered. Then my app went down. I kept refreshing the Azure portal, waiting for an alert. Nothing. That's when it hit me: my alerts were working perfectly. They just weren't designed for this failure mode. Sound familiar? The Problem Every Developer Knows If you're a developer or DevOps engineer, you've been here: a customer reports an issue, you scramble to check your monitoring, and then you realize you don't have the right alerts set up. By the time you find out, it's already too late. You set up what seems like reasonable alerting and assume you're covered. But real-world failures are sneaky. They slip through the cracks of your carefully planned thresholds. My Setup: AKS with Redis I love to vibe code apps using GitHub Copilot Agent mode with Claude Opus 4.5. It's fast, it understands context, and it lets me focus on building rather than boilerplate. For this project, I built a simple journal entry app: AKS cluster hosting the web API Azure Cache for Redis storing journal data Azure Monitor alerts for CPU, memory, pod restarts, container errors, OOMKilled, and job failures Seemed solid. What could go wrong? The Scenario: Redis Password Rotation Here's something that happens constantly in enterprise environments: the security team rotates passwords. It's best practice. It's in the compliance checklist. And it breaks things when apps don't pick up the new credentials. I simulated exactly this. The pods came back up. But they couldn't connect to Redis (as expected). The readiness probes started failing. The LoadBalancer had no healthy backends. The endpoint timed out. And not a single alert fired. Using SRE Agent to Find the Alert Gaps Instead of manually auditing every alert rule and trying to figure out what I missed, I turned to Azure SRE Agent. I asked it a simple question: "My endpoint is timing out. 
What alerts do I have, and why didn't any of them fire?"

Within minutes, it had diagnosed the problem. Here's what it found:

| My Existing Alerts | Why They Didn't Fire |
| --- | --- |
| High CPU/Memory | No resource pressure, just auth failures |
| Pod Restarts | Pods weren't restarting, just unhealthy |
| Container Errors | App logs weren't being written |
| OOMKilled | No memory issues |
| Job Failures | No K8s jobs involved |

The gaps SRE Agent identified:

❌ No synthetic URL availability test
❌ No readiness/liveness probe failure alerts
❌ No "pods not ready" alerts scoped to my namespace
❌ No Redis connection error detection
❌ No ingress 5xx/timeout spike alerts
❌ No per-pod resource alerts (only node-level)

SRE Agent didn't just tell me what was wrong; it created a GitHub issue with:

- KQL queries to detect each failure type
- Bicep code snippets for new alert rules
- Remediation suggestions for the app code
- Exact file paths in my repo to update

Check it out: GitHub Issue

How I Built It: Step by Step

Let me walk you through exactly how I set this up inside SRE Agent.

Step 1: Create an SRE Agent

I created a new SRE Agent in the Azure portal. Since this workflow analyzes alerts across my subscription (not just one resource group), I didn't configure any specific resource groups. Instead, I gave the agent's managed identity Reader permissions on my entire subscription. This lets it discover resources, list alert rules, and query Log Analytics across all my resource groups.

Step 2: Connect GitHub to SRE Agent via MCP

I added a GitHub MCP server to give the agent access to my source code repository. MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it. I use GitHub for both source code and tracking dev tickets, but you can connect to wherever your code lives (GitLab, Azure DevOps) or your ticketing system (Jira, ServiceNow, PagerDuty).
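Before moving on to the subagent, it may help to make the alert-gap table from earlier concrete. The sketch below is only an illustration, not how SRE Agent or Azure Monitor evaluate rules: the signal names, the snapshot values, and the 80% thresholds are all assumptions made up for this example. It models the six existing rules as simple predicates over one incident snapshot and shows that none of them fire, while two of the proposed rules do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of signals during the Redis credential incident.
# Keys and values are illustrative; they are not real Azure Monitor metric names.
incident: Dict[str, float] = {
    "cpu_percent": 12.0,
    "memory_percent": 35.0,
    "pod_restarts": 0,
    "container_error_logs": 0,
    "oom_killed": 0,
    "job_failures": 0,
    "pods_not_ready": 3,
    "endpoint_available": 0,
}


@dataclass(frozen=True)
class AlertRule:
    """A named rule with a predicate that decides whether it fires."""
    name: str
    fires: Callable[[Dict[str, float]], bool]


# The six rules from the post; the 80% thresholds are assumed for illustration.
existing_rules = [
    AlertRule("High CPU", lambda s: s["cpu_percent"] > 80),
    AlertRule("High memory", lambda s: s["memory_percent"] > 80),
    AlertRule("Pod restarts", lambda s: s["pod_restarts"] > 0),
    AlertRule("Container errors", lambda s: s["container_error_logs"] > 0),
    AlertRule("OOMKilled", lambda s: s["oom_killed"] > 0),
    AlertRule("Job failures", lambda s: s["job_failures"] > 0),
]

# Two of the gaps the agent identified, expressed the same way.
proposed_rules = [
    AlertRule("Pods not ready", lambda s: s["pods_not_ready"] > 0),
    AlertRule("Endpoint unavailable", lambda s: s["endpoint_available"] == 0),
]


def fired(rules: List[AlertRule], signals: Dict[str, float]) -> List[str]:
    """Return the names of the rules that fire for the given signals."""
    return [rule.name for rule in rules if rule.fires(signals)]


if __name__ == "__main__":
    print("existing rules that fired:", fired(existing_rules, incident))
    print("proposed rules that fired:", fired(proposed_rules, incident))
```

The point of the sketch is that every existing rule watches a symptom this failure never produced, so coverage has to be judged against failure modes rather than by counting rules.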
Step 3: Create a Subagent inside SRE Agent for managing Azure Monitor Alerts

I created a focused subagent with a specific job and only the tools it needs:

Azure Monitor Alerts Expert

Prompt: "You are expert in managing operations related to azure monitor alerts on azure resources including discovering alert rules configured on azure resources, creating new alert rules (with user approval and authorization only), processing the alerts fired on azure resources and identifying gaps in the alert rules. You can get the resource details from azure monitor alert if triggered via alert. If not, you need to ask user for the specific resource to perform analysis on. You can use az cli tool to diagnose logs, check the app health metrics. You must use the app code and infra code (bicep files) files you have access to in the github repo <insert your repo> to further understand the possible diagnoses and suggest remediations. Once analysis is done, you must create a github issue with details of analysis and suggested remediation to the source code files in the same repo."

Tools enabled:

- az cli: List resources, alert rules, action groups
- Log Analytics workspace querying: Run KQL queries for diagnostics
- GitHub MCP: Search repositories, read file contents, create issues

Step 4: Ask the Subagent About Alert Gaps

I gave the agent context and asked a simple question: "@AzureAlertExpert: My API endpoint http://132.196.167.102/api/journals/john is timing out. What alerts do I have configured in rg-aks-journal, and why didn't any of them fire?"

The agent did the analysis autonomously and summarized its findings, with suggestions for new alert rules, in a GitHub issue.

Here's the agentic workflow to perform Azure Monitor alert operations.

Why This Matters

Faster response times. Issues get diagnosed in minutes, not hours of manual investigation.

Consistent analysis. No more "I thought we had an alert for that" moments. The agent systematically checks what's covered and what's not.
Proactive coverage. You don't have to wait for an incident to find gaps. Ask the agent to review your alerts before something breaks.

The Bottom Line

Your alerts have gaps. You just don't know it until something slips through. I had 6 alert rules and still missed a basic failure. My pods weren't restarting; they were just unhealthy. My CPU wasn't spiking; the app was just returning errors. None of my alerts were designed for this.

You don't need to audit every alert rule manually. Give SRE Agent your environment, describe the failure, and let it tell you what's missing. Stop discovering alert gaps from customer complaints. Start finding them before they matter.

A Few Tips

- Give the agent Reader access at subscription level so it can discover all resources
- Use a focused subagent prompt; don't try to do everything in one agent
- Test your MCP connections before running workflows

What Alert Gaps Have Burned You?

What's the alert you wish you had set up before an incident? Credential rotation? Certificate expiry? DNS failures? Let us know in the comments.

From Vibe Coding to Working App: How SRE Agent Completes the Developer Loop
The Most Common Challenge in Modern Cloud Apps

There's a category of bugs that drives engineers crazy: multi-layer infrastructure issues. Your app deploys successfully. Every Azure resource shows "Succeeded." But the app fails at runtime with a vague error like Login failed for user ''.

Where do you even start? You're checking the Web App, the SQL Server, the VNet, the private endpoint, the DNS zone, the identity configuration... and each one looks fine in isolation. The problem is how they connect, and that's invisible in the portal.

Networking issues are especially brutal. The error says "Login failed" but the actual causes could be DNS, firewall, identity, or all three. The symptom and the root causes are in completely different resources. Without deep Azure networking knowledge, you're just clicking around hoping something jumps out.

Now imagine you vibe coded the infrastructure. You used AI to generate the Bicep, deployed it, and moved on. When it breaks, you're debugging code you didn't write and configuring resources you don't fully understand. This is where I wanted AI to help, not just to build but to debug.

Enter SRE Agent + Coding Agent

Here's what I used:

| Layer | Tool | Purpose |
| --- | --- | --- |
| Build | VS Code Copilot Agent Mode + Claude Opus | Generate code, Bicep, deploy |
| Debug | Azure SRE Agent | Diagnose infrastructure issues and create a developer issue with suggested fixes in source code (app code and IaC) |
| Fix | GitHub Coding Agent | Create PRs with code and IaC fixes from the GitHub issue created by SRE Agent |

Copilot builds. SRE Agent debugs. Coding Agent fixes.

What I Built

I used VS Code Copilot in Agent Mode with Claude Opus to create a .NET 8 Web App connected to Azure SQL via private endpoint:

- Private networking (no public exposure)
- Entra-only authentication
- Managed identity (no secrets)

Deployed with azd up. All green.
Then I tested the health endpoint:

$ curl https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql
{"status":"unhealthy","error":"Login failed for user ''.","errorType":"SqlException"}

Deployment succeeded. App failed. One error.

How I Fixed It: Step by Step

Step 1: Create SRE Agent with Azure Access

I created an SRE Agent with read access to my Azure subscription. You can scope it to specific resource groups. The agent builds a knowledge graph of your resources and their dependencies, visible in the Resource Mapping view.

Step 2: Connect GitHub to SRE Agent using GitHub MCP server

I connected the GitHub MCP server so the agent could read my repository and create issues.

Step 3: Create Sub Agent to analyze source code

I created a sub-agent for analyzing source code using GitHub MCP tools. This lets SRE Agent understand not just Azure resources, but also the Bicep and source code files that created them. The prompt I gave it:

"you are expert in analyzing source code (bicep and app code) from github repos"

Step 4: Invoke Sub-Agent to Analyze the Error

In the SRE Agent chat, I invoked the sub-agent to diagnose the error I received from my app endpoint. It correlated the runtime error with the infrastructure configuration.

Step 5: Watch the SRE Agent Think and Reason

SRE Agent analyzed the error by tracing code in Program.cs, Bicep configurations, and Azure resource relationships: Web App, SQL Server, VNet, private endpoint, DNS zone, and managed identity. Its reasoning process worked through each layer, eliminating possibilities one by one until it identified the root causes.
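DNS is one of the layers that has to be ruled out in a setup like this. As a rough illustration of a single layer check (this is not how SRE Agent works internally), the sketch below tests whether a hostname resolves only to private IPv4 addresses, which is what you would expect when a private endpoint and its DNS zone are wired up correctly. The function names are my own, the check only means something when run from inside the app's network, and it covers IPv4 only.

```python
import ipaddress
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def resolves_privately(
    hostname: str,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> bool:
    """Return True only if the name resolves and every address is private."""
    addresses = resolver(hostname)
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )


if __name__ == "__main__":
    # Stub resolvers stand in for real DNS so the example runs anywhere.
    print(resolves_privately("db.example", lambda _: ["10.0.1.4"]))
    print(resolves_privately("db.example", lambda _: ["20.42.0.10"]))
```

Passing the resolver in as a parameter keeps the check testable without network access. In a real environment you would call `resolves_privately` with your SQL server's hostname from a machine or container inside the VNet.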
Step 6: Agent Creates GitHub Issue

Based on its analysis, SRE Agent summarized the root causes and suggested fixes in a GitHub issue:

Root Causes:

- Private DNS Zone missing VNet link
- Managed identity not created as SQL user

Suggested Fixes:

- Add virtualNetworkLinks resource to Bicep
- Add SQL setup script to create user with db_datareader and db_datawriter roles

Step 7: Merge the PR from Coding Agent

Assign the GitHub issue to Coding Agent, which then creates a PR with the fixes. I just reviewed the fix. It made sense, and I merged it. I redeployed with azd up, ran the SQL script, and checked the health endpoint again:

curl -s https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql | jq .
{
  "status": "healthy",
  "database": "tododb",
  "server": "tcp:sql-tsdvdfdwo77hc.database.windows.net,1433",
  "message": "Successfully connected to SQL Server"
}

🎉 From error to fix in minutes without manually debugging a single Azure resource.

Why This Matters

If you're a developer building and deploying apps to Azure, SRE Agent changes how you work:

You don't need to be a networking expert. SRE Agent understands the relationships between Azure resources: private endpoints, DNS zones, VNet links, managed identities. It connects dots you didn't know existed.

You don't need to guess. Instead of clicking through the portal hoping something looks wrong, the agent systematically eliminates possibilities like a senior engineer would.

You don't break your workflow. SRE Agent suggests fixes in your Bicep and source code, not portal changes. Everything stays version controlled. Deployed through pipelines. No hot fixes at 2 AM.

You close the loop. AI helps you build fast. Now AI helps you debug fast too.

Try It Yourself

Do you vibe code your app, your infrastructure, or both? How do you debug when things break?

Here's a challenge: Vibe code a todo app with a Web App, VNet, private endpoint, and SQL database. "Forget" to link the DNS zone to the VNet. Deploy it. Watch it fail.
Then point SRE Agent at it and see how it identifies the root cause, creates a GitHub issue with the fix, and hands it off to Coding Agent for a PR.

Share your experience. I'd love to hear how it goes.

Learn More

- Azure SRE Agent documentation
- Azure SRE Agent blogs
- Azure SRE Agent community
- Azure SRE Agent home page
- Azure SRE Agent pricing