serverless
185 TopicsAnnouncing Database connections for Azure Static Web Apps with Data API builder
With database connections for Azure Static Web Apps, you can access your Azure database contents directly from your static site, without having to write any backend code. Build your applications faster, while ensuring security and serverless scalability. Free during public preview.8.5KViews3likes1CommentUnifying Scattered Observability Data from Dynatrace + Azure for Self-Healing with SRE Agent
What if your deployments could fix themselves? The Deployment Remediation Challenge Modern operations teams face a recurring nightmare: A deployment ships at 9 AM Errors spike at 9:15 AM By the time you correlate logs, identify the bad revision, and execute a rollback—it's 10:30 AM Your users felt 75 minutes of degraded experience The data to detect and fix this existed the entire time—but it was scattered across clouds and platforms: Error logs and traces → Dynatrace (third-party observability cloud) Deployment history and revisions → Azure Container Apps API Resource health and metrics → Azure Monitor Rollback commands → Azure CLI Your observability data lives in one cloud. Your deployment data lives in another. Stitching together log analysis from Dynatrace with deployment correlation from Azure—and then executing remediation—required a human to manually bridge these silos. What if an AI agent could unify data from third-party observability platforms with Azure deployment history and act on it automatically—every week, before users even notice? Enter SRE Agent + Model Context Protocol (MCP) + Subagents Azure SRE Agent doesn't just work with Azure. Using the Model Context Protocol (MCP), you can connect external observability platforms like Dynatrace directly to your agent. Combined with subagents for specialized expertise and scheduled tasks for automation, you can build an automated deployment remediation system. Here's what I built/configured for my Azure Container Apps environment inside SRE Agent: Component Purpose Dynatrace MCP Connector Connect to Dynatrace's MCP gateway for log queries via DQL 'Dynatrace' Subagent Log analysis specialist that executes DQL queries and identifies root causes 'Remediation' Subagent Deployment remediation specialist that correlates errors with deployments and executes rollbacks Scheduled Task Weekly Monday 9 AM health check for the 'octopets-prod-api' Container App Subagent workflow: The subagent workflow in SRE Agent Builder: 'OctopetsScheduledTask' triggers 'RemediationSubagent' (12 tools), which hands off to 'DynatraceSubagent' (3 MCP tools) for log analysis. How I Set It Up: Step by Step Step 1: Connect Dynatrace via MCP SRE Agent supports the Model Context Protocol (MCP) for connecting external data sources. Dynatrace exposes an MCP gateway that provides access to its APIs as first-class tools. Connection configuration: { "name": "dynatrace-mcp-connector", "dataConnectorType": "Mcp", "dataSource": "Endpoint=https://<your-tenant>.live.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp;AuthType=BearerToken;BearerToken=<your-api-token>" } Once connected, SRE Agent automatically discovers Dynatrace tools. 💡 Tip: When creating your Dynatrace API token, grant the `entities.read`, `events.read`, and `metrics.read` scopes for comprehensive access. Step 2: Build Specialized Subagents Generic agents are good. Specialized agents are better. I created two subagents that work together in a coordinated workflow—one for Dynatrace log analysis, the other for deployment remediation. DynatraceSubagent This subagent is the log analysis specialist. It uses the Dynatrace MCP tools to execute DQL queries and identify root causes. Key capabilities: Executes DQL queries via MCP tools (`create-dql`, `execute-dql`, `explain-dql`) Fetches 5xx error counts, request volumes, and spike detection Returns consolidated analysis with root cause, affected services, and error patterns 👉 View full DynatraceSubagent configuration here RemediationSubagent This is the deployment remediation specialist. It correlates Dynatrace log analysis with Azure Container Apps deployment history, generates correlation charts, and executes rollbacks when confidence is high. Key capabilities: Retrieves Container Apps revision history (`GetDeploymentTimes`, `ListRevisions`) Generates correlation charts (`PlotTimeSeriesData`, `PlotBarChart`, `PlotAreaChartWithCorrelation`) Computes confidence score (0-100%) for deployment causation Executes rollback and traffic shift when confidence > 70% 👉 View full RemediationSubagent configuration here The power of specialization: Each agent focuses on its domain—DynatraceSubagent handles log analysis, RemediationSubagent handles deployment correlation and rollback. When the workflow runs, RemediationSubagent hands off to DynatraceSubagent (bi-directional handoff) for analysis, gets the findings back, and continues with remediation. Simple delegation, not a single monolithic agent trying to do everything. Step 3: Create the Weekly Scheduled Task Now the automation. I configured a scheduled task that runs every Monday at 9:30 AM to check whether deployments in the last 4 hours caused any issues—and automatically remediate if needed. Scheduled task configuration: Setting Value Task Name OctopetsScheduledTask Frequency Weekly Day of Week Monday Time 9:30 AM Response Subagent RemediationSubagent Scheduled Task Configuration Configuring the OctopetsScheduledTask in the SRE Agent portal The key insight: the scheduled task is just a coordinator. It immediately hands off to the RemediationSubagent, which orchestrates the entire workflow including handoffs to DynatraceSubagent. Step 4: See It In Action Here's what happens when the scheduled task runs: The scheduled task triggering and initiating Dynatrace analysis for octopets-prod-api The DynatraceSubagent analyzes the logs and identifies the root cause: executing DQL queries and returning consolidated log analysis The RemediationSubagent then generates correlation charts: Finally, with a 95% confidence score, SRE agent executes the rollback autonomously: executing rollback and traffic shift autonomously. The agent detected the bad deployment, generated visual evidence, and automatically shifted 100% traffic to the last known working revision—all without human intervention. Why This Matters Before After Manually check Dynatrace after incidents Automated DQL queries via MCP Stitch together logs + deployments manually Subagents correlate data automatically Rollback requires human decision + execution Confidence-based auto-remediation 75+ minutes from deployment to rollback Under 5 Minutes with autonomous workflow Reactive incident response Proactive weekly health checks Try It Yourself Connect your observability tool via MCP (Dynatrace, Datadog, Prometheus—any tool with an MCP gateway) Build a log analysis subagent that knows how to query your observability data Build a remediation subagent that can correlate logs with deployments and execute fixes Wire them together with handoffs so the subagents can delegate log analysis Create a scheduled task to trigger the workflow automatically Learn More Azure SRE Agent documentation Model Context Protocol (MCP) integration guide Building subagents for specialized workflows Scheduled tasks and automation SRE Agent Community Azure SRE Agent pricing SRE Agent Blogs430Views0likes0CommentsHow SRE Agent Pulls Logs from Grafana and Creates Jira Tickets Without Native Integrations
Your tools. Your workflows. SRE Agent adapts. SRE Agent natively integrates with PagerDuty, ServiceNow, and Azure Monitor. But your team might use Jira for incident tracking. Grafana for dashboards. Loki for logs. Prometheus for metrics. These aren't natively supported. That doesn't matter. SRE Agent supports MCP, the Model Context Protocol. Any MCP-compatible server extends the agent's capabilities. Connect your Grafana instance. Connect your Jira. The agent queries logs, correlates errors, and creates tickets with root cause analysis across tools that were never designed to talk to each other. The Scenario I built a grocery store app that simulates a realistic SRE scenario: an external supplier API starts rate limiting your requests. Customers see "Unable to check inventory" errors. The on-call engineer gets paged. The goal: SRE Agent should diagnose the issue by querying Loki logs through Grafana, identify the root cause, and create a Jira ticket with findings and recommendations. The app runs on Azure Container Apps with Loki for logs and Azure Managed Grafana for visualization. 👉 Deploy it yourself: github.com/dm-chelupati/grocery-sre-demo How I Set Up SRE Agent: Step by Step Step 1: Create SRE Agent I created an SRE Agent and gave it Reader access to my subscription Step 2: Connect to Grafana and Jira via MCP Neither MCP server had a remotely hosted option, and their stdio setup didn't match what SRE Agent supports. So I hosted them myself as Azure Container Apps: Grafana MCP Server — connects to my Azure Managed Grafana instance Atlassian MCP Server — connects to my Jira Cloud instance Now I have two endpoints SRE Agent can reach: https://ca-mcp-grafana.<env>.azurecontainerapps.io/mcp https://ca-mcp-jira.<env>.azurecontainerapps.io/mcp I added both to SRE Agent's MCP configuration as remotely hosted servers. Step 3: Create Sub-Agent with Tools and Instructions I created a sub-agent specifically for incident diagnosis with these tools enabled: Grafana MCP (for querying Loki logs) Atlassian MCP (for creating Jira tickets) Instructions were simple: You are expert in diagnosing applications running on Azure services. You need to use the Grafana tools to get the logs, metrics or traces and create a summary of your findings inside Jira as a ticket. use your knowledge base file loki-queries.md to learn about app configuration with loki and Query the loki for logs in Grafana. Step 4: Invoke Sub-Agent and Watch It Work I went to the SRE Agent chat and asked: @JiraGrafanaexpert: My container app ca-api-3syj3i2fat5dm in resource group rg-groceryapp is experiencing rate limit errors from a supplier API when checking product inventory. The agent: Queried Loki via Grafana MCP: {app="grocery-api"} |= "error" Found 429 rate limit errors spiking — 55+ requests hitting supplier API limits Identified root cause: SUPPLIER_RATE_LIMIT_429 from FreshFoods Wholesale API Created a Jira ticket: One prompt. Logs queried. Root cause identified. Ticket created with remediation steps. Making It Better: The Knowledge File SRE Agent can explore and discover how your apps are wired but you can speed that up. When querying observability data sources, the agent needs to learn the schema, available labels, table structures, and query syntax. For Loki, that means understanding LogQL, knowing which labels your apps use, and what JSON fields appear in logs. SRE Agent can figure things out, but with context, it gets there faster — just like humans. I created a knowledge file that gives the agent a head start: With this context, the agent knows exactly which labels to query, what fields to extract from JSON logs, and which query patterns to use 👉 See my full knowledge file How MCP Makes This Possible SRE Agent supports two ways to connect MCP servers: stdio — runs locally via command. This works for MCP servers that can be invoked via npx, node, or uvx. For example: npx -y @modelcontextprotocol/server-github. Remotely hosted — HTTP endpoint with streamable transport: https://mcp-server.example.com/sse or /mcp The catch: Not every MCP server fits these options out of the box. Some servers only support stdio but not the npx/node/uvx formats SRE Agent expects. Others don't offer a hosted endpoint at all. The solution: host them yourself. Deploy the MCP server as a container with an HTTP endpoint. That's what I did with Grafana MCP Server and Atlassian MCP Server, deployed both as Azure Container Apps exposing /mcp endpoints. Why This Matters Enterprise tooling is fragmented across Azure and non-Azure ecosystems. Some teams use Azure Monitor, others use Datadog. Incident tracking might be ServiceNow in one org and Jira in another. Logs live in Loki, Splunk, Elasticsearch and sometimes all three. SRE Agent meets you where you are. Azure-native tools work out of the box. Everything else connects via MCP. Your observability stack stays the same. Your ticketing system stays the same. The agent becomes the orchestration layer that ties them together. One agent. Any tool. Intelligent workflows across your entire ecosystem. Try It Yourself Create an SRE Agent Deploy MCP servers for your tools (Grafana, Atlassian) Create a sub-agent with the MCP tools connected Add a knowledge file with your app context Ask it to diagnose an issue Watch logs become tickets. Errors become action items. Context becomes intelligence. Learn More Azure SRE Agent documentation Azure SRE Agent blogs Grocery SRE Demo repo MCP specification Azure SRE Agent is currently in preview.1KViews0likes0CommentsExciting Updates Coming to Conversational Diagnostics (Public Preview)
Last year, at Ignite 2023, we unveiled Conversational Diagnostics (Preview), a revolutionary tool integrated with AI-powered capabilities to enhance problem-solving for Windows Web Apps. This year, we're thrilled to share what’s new and forthcoming for Conversational Diagnostics (Preview). Get ready to experience a broader range of functionalities and expanded support across various Azure Products, making your troubleshooting journey even more seamless and intuitive.381Views0likes0CommentsModernizing Spring Boot Applications with GitHub Copilot App Modernization
Upgrading Spring Boot applications from 2.x to the latest 3.x releases introduces significant changes across the framework, dependencies, and Jakarta namespace. These updates improve long-term support, performance, and compatibility with modern Java platforms, but the migration can surface breaking API changes and dependency mismatches. GitHub Copilot app modernization helps streamline this transition by analyzing your project, generating an upgrade plan, and applying targeted updates. Supported Upgrade Path GitHub Copilot app modernization supports upgrading Spring Boot applications to Spring Boot 3.5, including: Updating Spring Framework libraries to 6.x Migrating from javax to jakarta Aligning dependency versions with Boot 3.x Updating plugins and starter configurations Adjusting build files for the required JDK level Validating dependency updates and surfacing CVE issues These capabilities complement the Microsoft Learn quickstart for upgrading Java projects using GitHub Copilot app modernization. How GitHub Copilot app modernization helps When you open a Spring Boot 2.x project in Visual Studio Code or IntelliJ IDEA and initiate an upgrade, GitHub Copilot app modernization performs: Project Analysis Detects your current Spring Boot version Identifies incompatible starters, libraries, and plugins Flags javax.* imports requiring Jakarta migration Evaluates your build configuration and JDK requirements Upgrade Plan Generation The tool produces an actionable plan that outlines: New Spring Boot parent version Updated Spring Framework and related modules Required namespace changes from javax.* to jakarta.* Build plugin updates JDK configuration alignment for Boot 3 You can review and adjust the plan before applying changes. Automated Transformations GitHub Copilot app modernization applies targeted changes such as: Updating spring-boot-starter-parent to 3.5.x Migrating imports to jakarta.* Updating dependencies and BOM versions Rewriting removed or deprecated APIs Aligning test dependencies (e.g., JUnit 5) Build / Fix Iteration The agent automatically: Builds the project Captures failures Suggests fixes Applies updates Rebuilds until the project compiles successfully This loop continues until all actionable issues are addressed. Security & Behavior Checks As part of the upgrade, the tool can: Validate CVEs introduced by dependency version changes Surface potential behavior changes Recommend optional fixes Expected Output After running the upgrade for a Spring Boot 2.x project, you should expect: An updated Spring Boot parent in Maven or Gradle Spring Framework 6.x and Jakarta-aligned modules Updated starter dependencies and plugin versions Rewritten imports from javax.* to jakarta.* Updated testing stack A summary file detailing: Versions updated Code edits applied Dependencies changed CVE results Remaining manual review items Developer Responsibilities GitHub Copilot app modernization accelerates technical migration tasks, but final validation still requires developer review, including: Running the full test suite Reviewing custom filters, security configuration, and web components Re-validating integration points Confirming application behavior across runtime environments The tool handles mechanical upgrade work so you can focus on correctness, quality, and functional validation. Learn more For setup, prerequisites, and the broader Java upgrade workflow, refer to the official Microsoft Learn guide: Quickstart: Upgrade a Java Project with GitHub Copilot App Modernization Install GitHub Copilot app modernization for VS Code and IntelliJ IDEA263Views0likes0CommentsProactive Cloud Ops with SRE Agent: Scheduled Checks for Cloud Optimization
The Cloud Optimization Challenge Your cloud environment is always changing: New features ship weekly Traffic patterns shift seasonally Costs creep up quietly Security best practices evolve Teams spin up resources and forget them It's Monday morning. You open the Azure portal. Everything looks... fine. But "fine" isn't great. That VM has been at 8% CPU for weeks. A Key Vault secret expires in 12 days. Nothing's broken. But security is drifting, costs are creeping, and capacity gaps are growing silently. The question isn't "is something broken?" it's "could this be better?" Four Pillars of Cloud Optimization Pillar What Teams Want The Challenge Security Stay compliant, reduce risk Config drift, legacy settings, expiring creds Cost Spend efficiently, justify budget Hard to spot waste across 100s of resources Performance Meet SLOs, handle growth Know when to scale before demand hits Availability Maximize uptime, build resilience Hidden dependencies, single points of failure Most teams check these sometimes. SRE Agent checks them continuously. Enter SRE Agent + Scheduled tasks SRE Agent can pull data from Azure Monitor, resource configurations, metrics, logs, traces, errors, cost data and analyze it on a schedule. If you use tools outside Azure (Datadog, PagerDuty, Splunk), you can connect those via MCP servers so the agent sees your full observability stack. My setup uses Azure-native sources. Here's how I wired it up. How I Set It Up: Step by Step Step 1: Create SRE Agent with Subscription Access I created an SRE Agent without attaching it to any specific resource group. Instead, I gave it Reader access at the subscription level. This lets the agent scan across all my resource groups for optimization opportunities. No resource group configuration needed. The agent builds a knowledge graph of everything VMs, storage accounts, Key Vaults, NSGs, web apps across the subscription. Step 2: Create and Upload My Organization Practices I created an org-practices.md file that defines what "good" looks like for my team: I uploaded this to SRE Agent's knowledge base. Now the agent knows our bar, not just Azure defaults. 👉 See my full org-practices.md Source repos for this demo: security-demoapp - App with intentional security misconfigurations costoptimizationapp - App with cost optimization opportunities Step 3: Connect to Teams Channel I connected SRE Agent to my team's Teams channel so findings land where we already work. Critical findings get immediate notifications. Warnings go into a daily digest. No more logging into separate dashboards. The insights come to us. Step 4: Connect Resource Groups to GitHub Repos Add the two resource groups to the SRE Agent and link the apps to their corresponding GitHub repos: Resource Group GitHub Repository rg-security-opt-demo security-demoapp rg-cost-opt-sreademo costoptimizationapp This enables the agent to create GitHub issues for findings linking violations directly to the repo responsible for that infrastructure. Step 5: Test with Prompts Before setting up automation, I tested the agent with manual prompts to make sure it was finding the right issues. The agent ran the checks, compared against my org-practices.md, and identified the issues. Security Check: Scan resource group "rg-security-opt-demo" for any violations of our security practices defined in org-practices.md in your knowledge base. list violations with severity and remediation steps. Make sure to check against all critical requirements and send message in teams channel with your findings and create an issue in the github repo https://github.com/dm-chelupati/security-demoapp.git Cost Check: Scan resource group "rg-cost-opt-sreademo" for any violations of our costpractices defined in org-practices.md in your knowledge base. list violations with severity and remediation steps. Make sure to check against all critical requirements and send message in teams channel with your findings and create an issue in the github repo https://github.com/dm-chelupati/costoptimizationapp.git Step 6: Check Output via GitHub Issues After running prompts, I checked GitHub. The agent had created issues. Each issue has the root cause, impact, and fix ready for the team to action or for Coding Agent to pick up and create a PR. 👉 See the actual issues created: Security findings issue Cost findings issue Step 7: Set Up Scheduled Triggers This is where it gets powerful. I configured recurring schedules: Weekly Security Check (Wednesdays 8 AM): Create a scheduled trigger that performs security practices checks against the org practices in knowledge base org-practices.md, creates github issue and send teams message on a weekly basis Wednesdays at 8 am UTC Weekly Cost Review (Mondays 8 AM): Create a scheduled trigger that performs cost practices checks against the org practices in knowledge base org-practices.md, creates github issue and send teams message on a weekly basis on Mondays at 8 am UTC Now optimization runs automatically. Every week, fresh findings land in GitHub Issues and Teams. Why Context Makes the SRE Agent Powerful Think about hiring a new SRE. They're excellent at their craft—they know Kubernetes, networking, Azure inside out. But on day one, they can't solve problems in your environment yet. Why? They don't have context: What are your SLOs? What's "acceptable" latency for your app? When do you rotate secrets? Monthly? Quarterly? Before each release? Which resources are production-critical vs. dev experiments? What's your tagging policy? Who owns what? How do you deploy? GitOps? Pipelines? Manual approvals? A great engineer becomes your great engineer once they learn how your team operates. SRE Agent works the same way. Out of the box, it knows Azure resource types, networking, best practices. But it doesn't know your bar. Is 20% CPU utilization acceptable or wasteful? Should secrets expire in 30 days or 90? Are public endpoints ever okay, or never? The more context you give the agent, your SLOs, your runbooks, your policies, the more it reasons like a team member who understands your environment, not just Azure in general. That's why Step 2 matters so much. When I uploaded our standards, the agent stopped checking generic Azure best practices and started checking our best practices. Bring your existing knowledge: You don't have to start from scratch. If your team's documentation already lives in Atlassian Confluence, SharePoint, or other tools, you can connect those via MCP servers. The agent pulls context from where your team already works, no need to duplicate content. Why This Matters Before this setup, optimization was a quarterly thing. Now it happens automatically: Before After Check security when audit requests it Daily automated posture check Find waste when finance complains Weekly savings report in Teams Discover capacity issues during incidents Scheduled headroom analysis Expire credentials and debug at 2 AM 30-day warning with exact secret names Optimization isn't a project anymore. It's a practice. Try It Yourself Create an SRE Agent with access to your subscription Upload your team's standards (security policies, cost thresholds, tagging rules) Set up a scheduled trigger, start with a daily security check Watch the first report land in Teams See what you've been missing while everything looked "fine." Learn More Azure SRE Agent documentation Azure SRE Agent blogs Azure SRE Agent community Azure SRE Agent home page Azure SRE Agent pricing Azure SRE Agent is currently in preview. Get Started575Views1like0CommentsSecure Unique Default Hostnames Now GA for Functions and Logic Apps
We are pleased to announce that Secure Unique Default Hostnames are now generally available (GA) for Azure Functions and Logic Apps (Standard). This expands the security model previously available for Web Apps to the entire App Service ecosystem and provides customers with stronger, more secure, and standardized hostname behavior across all workloads. Why This Feature Matters Historically, App Service resources have used default hostname format such as: <SiteName>.azurewebsites.net While straightforward, this pattern introduced potential security risks, particularly in scenarios where DNS records were left behind after deleting a resource. In those situations, a different user could create a new resource with the same name and unintentionally receive traffic or bindings associated with the old DNS configuration, creating opportunities for issues such as subdomain takeover. Secure Unique Default Hostnames address this by assigning a unique, randomized, region‑scoped hostname to each resource, for example: <SiteName>-<Hash>.<Region>.azurewebsites.net This change ensures that: No other customer can recreate the same default hostname. Apps inherently avoid risks associated with dangling DNS entries. Customers gain a more secure, reliable baseline behavior across App Service. Adopting this model now helps organizations prepare for the long‑term direction of the platform while improving security posture today. What’s New: GA Support for Functions and Logic Apps With this release, both Azure Functions and Logic Apps (Standard) fully support the Secure Unique Default Hostname capability. This brings these services in line with Web Apps and ensures customers across all App Service workloads benefit from the same secure and consistent default hostname model. Azure CLI Support The Azure CLI for Web Apps and Function Apps now includes support for the “--domain-name-scope” parameter. This allows customers to explicitly specify the scope used when generating a unique default hostname during resource creation. Examples: az webapp create --domain-name-scope {NoReuse, ResourceGroupReuse, SubscriptionReuse, TenantReuse} az functionapp create --domain-name-scope {NoReuse, ResourceGroupReuse, SubscriptionReuse, TenantReuse} Including this parameter ensures that deployments consistently use the intended hostname scope and helps teams prepare their automation and provisioning workflows for the secure unique default hostname model. Why Customers Should Adopt This Now While existing resources will continue to function normally, customers are strongly encouraged to adopt Secure Unique Default Hostnames for all new deployments. Early adoption provides several important benefits: A significantly stronger security posture. Protection against dangling DNS and subdomain takeover scenarios. Consistent default hostname behavior as the platform evolves. Alignment with recommended deployment practices and modern DNS hygiene. This feature represents the current best practice for hostname management on App Service and adopting it now helps ensure that new deployments follow the most secure and consistent model available. Recommended Next Steps Enable Secure Unique Default Hostnames for all new Web Apps, Function Apps, and Logic Apps. Update automation and CLI scripts to include the --domain-name-scope parameter when creating new resources. Begin updating internal documentation and operational processes to reflect the new hostname pattern. Additional Resources For readers who want to explore the technical background and earlier announcements in more detail, the following articles offer deeper coverage of unique default hostnames: Public Preview: Creating Web App with a Unique Default Hostname This article introduces the foundational concepts behind unique default hostnames. It explains why the feature was created, how the hostname format works, and provides examples and guidance for enabling the feature when creating new resources. Secure Unique Default Hostnames: GA on App Service Web Apps and Public Preview on Functions This article provides the initial GA announcement for Web Apps and early preview details for Functions. It covers the security benefits, recommended usage patterns, and guidance on how to handle existing resources that were created without unique default hostnames. Conclusion Secure Unique Default Hostnames now provide a more secure and consistent default hostname model across Web Apps, Function Apps, and Logic Apps. This enhancement reduces DNS‑related risks and strengthens application security, and organizations are encouraged to adopt this feature as the standard for new deployments.545Views0likes0CommentsFind the Alerts You Didn't Know You Were Missing with Azure SRE Agent
I had 6 alert rules. CPU. Memory. Pod restarts. Container errors. OOMKilled. Job failures. I thought I was covered. Then my app went down. I kept refreshing the Azure portal, waiting for an alert. Nothing. That's when it hit me: my alerts were working perfectly. They just weren't designed for this failure mode. Sound familiar? The Problem Every Developer Knows If you're a developer or DevOps engineer, you've been here: a customer reports an issue, you scramble to check your monitoring, and then you realize you don't have the right alerts set up. By the time you find out, it's already too late. You set up what seems like reasonable alerting and assume you're covered. But real-world failures are sneaky. They slip through the cracks of your carefully planned thresholds. My Setup: AKS with Redis I love to vibe code apps using GitHub Copilot Agent mode with Claude Opus 4.5. It's fast, it understands context, and it lets me focus on building rather than boilerplate. For this project, I built a simple journal entry app: AKS cluster hosting the web API Azure Cache for Redis storing journal data Azure Monitor alerts for CPU, memory, pod restarts, container errors, OOMKilled, and job failures Seemed solid. What could go wrong? The Scenario: Redis Password Rotation Here's something that happens constantly in enterprise environments: the security team rotates passwords. It's best practice. It's in the compliance checklist. And it breaks things when apps don't pick up the new credentials. I simulated exactly this. The pods came back up. But they couldn't connect to Redis (as expected). The readiness probes started failing. The LoadBalancer had no healthy backends. The endpoint timed out. And not a single alert fired. Using SRE Agent to Find the Alert Gaps Instead of manually auditing every alert rule and trying to figure out what I missed, I turned to Azure SRE Agent. I asked it a simple question: "My endpoint is timing out. What alerts do I have, and why didn't any of them fire?" Within minutes, it had diagnosed the problem. Here's what it found: My Existing Alerts Why They Didn't Fire High CPU/Memory No resource pressure,just auth failures Pod Restarts Pods weren't restarting, just unhealthy Container Errors App logs weren't being written OOMKilled No memory issues Job Failures No K8s jobs involved The gaps SRE Agent identified: ❌ No synthetic URL availability test ❌ No readiness/liveness probe failure alerts ❌ No "pods not ready" alerts scoped to my namespace ❌ No Redis connection error detection ❌ No ingress 5xx/timeout spike alerts ❌ No per-pod resource alerts (only node-level) SRE Agent didn't just tell me what was wrong, it created a GitHub issue with : KQL queries to detect each failure type Bicep code snippets for new alert rules Remediation suggestions for the app code Exact file paths in my repo to update Check it out: GitHub Issue How I Built It: Step by Step Let me walk you through exactly how I set this up inside SRE Agent. Step 1: Create an SRE Agent I created a new SRE Agent in the Azure portal. Since this workflow analyzes alerts across my subscription (not just one resource group), I didn't configure any specific resource groups. Instead, I gave the agent's managed identity Reader permissions on my entire subscription. This lets it discover resources, list alert rules, and query Log Analytics across all my resource groups. Step 2: Connect GitHub to SRE Agent via MCP I added a GitHub MCP server to give the agent access to my source code repository.MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it. I use GitHub for both source code and tracking dev tickets, but you can connect to wherever your code lives (GitLab, Azure DevOps) or your ticketing system (Jira, ServiceNow, PagerDuty). Step 3: Create a Subagent inside SRE Agent for managing Azure Monitor Alerts I created a focused subagent with a specific job and only the tools it needs: Azure Monitor Alerts Expert Prompt: " You are expert in managing operations related to azure monitor alerts on azure resources including discovering alert rules configured on azure resources, creating new alert rules (with user approval and authorization only), processing the alerts fired on azure resources and identifying gaps in the alert rules. You can get the resource details from azure monitor alert if triggered via alert. If not, you need to ask user for the specific resource to perform analysis on. You can use az cli tool to diagnose logs, check the app health metrics. You must use the app code and infra code (bicep files) files you have access to in the github repo <insert your repo> to further understand the possible diagnoses and suggest remediations. Once analysis is done, you must create a github issue with details of analysis and suggested remediation to the source code files in the same repo." Tools enabled: az cli – List resources, alert rules, action groups Log Analytics workspace querying – Run KQL queries for diagnostics GitHub MCP – Search repositories, read file contents, create issues Step 4: Ask the Subagent About Alert Gaps I gave the agent context and asked a simple question: "@AzureAlertExpert: My API endpoint http://132.196.167.102/api/journals/john is timing out. What alerts do I have configured in rg-aks-journal, and why didn't any of them fire? The agent did the analysis autonomously and summarized findings with suggestions to add new alert rules in a GitHub issue. Here's the agentic workflow to perform azure monitor alert operations Why This Matters Faster response times. Issues get diagnosed in minutes, not hours of manual investigation. Consistent analysis. No more "I thought we had an alert for that" moments. The agent systematically checks what's covered and what's not. Proactive coverage. You don't have to wait for an incident to find gaps. Ask the agent to review your alerts before something breaks. The Bottom Line Your alerts have gaps. You just don't know it until something slips through. I had 6 alert rules and still missed a basic failure. My pods weren't restarting, they were just unhealthy. My CPU wasn't spiking, the app was just returning errors. None of my alerts were designed for this. You don't need to audit every alert rule manually. Give SRE Agent your environment, describe the failure, and let it tell you what's missing. Stop discovering alert gaps from customer complaints. Start finding them before they matter. A Few Tips Give the agent Reader access at subscription level so it can discover all resources Use a focused subagent prompt, don't try to do everything in one agent Test your MCP connections before running workflows What Alert Gaps Have Burned You? What's the alert you wish you had set up before an incident? Credential rotation? Certificate expiry? DNS failures? Let us know in the comments.438Views1like0CommentsFrom Vibe Coding to Working App: How SRE Agent Completes the Developer Loop
The Most Common Challenge in Modern Cloud Apps There's a category of bugs that drive engineers crazy: multi-layer infrastructure issues. Your app deploys successfully. Every Azure resource shows "Succeeded." But the app fails at runtime with a vague error like Login failed for user ''. Where do you even start? You're checking the Web App, the SQL Server, the VNet, the private endpoint, the DNS zone, the identity configuration... and each one looks fine in isolation. The problem is how they connect and that's invisible in the portal. Networking issues are especially brutal. The error says "Login failed" but the actual causes could be DNS, firewall, identity, or all three. The symptom and the root causes are in completely different resources. Without deep Azure networking knowledge, you're just clicking around hoping something jumps out. Now imagine you vibe coded the infrastructure. You used AI to generate the Bicep, deployed it, and moved on. When it breaks, you're debugging code you didn't write, configuring resources you don't fully understand. This is where I wanted AI to help not just to build, but to debug. Enter SRE Agent + Coding Agent Here's what I used: Layer Tool Purpose Build VS Code Copilot Agent Mode + Claude Opus Generate code, Bicep, deploy Debug Azure SRE Agent Diagnose infrastructure issues and create developer issue with suggested fixes in source code (app code and IaC) Fix GitHub Coding Agent Create PRs with code and IaC fix from Github issue created by SRE Agent Copilot builds. SRE Agent debugs. Coding Agent fixes. What I Built I used VS Code Copilot in Agent Mode with Claude Opus to create a .NET 8 Web App connected to Azure SQL via private endpoint: Private networking (no public exposure) Entra-only authentication Managed identity (no secrets) Deployed with azd up. All green. Then I tested the health endpoint: $ curl https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql {"status":"unhealthy","error":"Login failed for user ''.","errorType":"SqlException"} Deployment succeeded. App failed. One error. How I Fixed It: Step by Step Step 1: Create SRE Agent with Azure Access I created an SRE Agent with read access to my Azure subscription. You can scope it to specific resource groups. The agent builds a knowledge graph of your resources and their dependencies visible in the Resource Mapping view below. Step 2: Connect GitHub to SRE Agent using GitHub MCP server I connected the GitHub MCP server so the agent could read my repository and create issues. Step 3: Create Sub Agent to analyze source code I created a sub-agent for analyzing source code using GitHub mcp tools. this lets SRE Agent understand not just Azure resources, but also the Bicep and source code files that created them. "you are expert in analyzing source code (bicep and app code) from github repos" Step 4: Invoke Sub-Agent to Analyze the Error In the SRE Agent chat, I invoked the sub-agent to diagnose the error I received from my app end point. It correlated the runtime error with the infrastructure configuration Step 5: Watch the SRE Agent Think and Reason SRE Agent analyzed the error by tracing code in Program.cs, Bicep configurations, and Azure resource relationships Web App, SQL Server, VNet, private endpoint, DNS zone, and managed identity. Its reasoning process worked through each layer, eliminating possibilities one by one until it identified the root causes. Step 6: Agent Creates GitHub Issue Based on its analysis, SRE Agent summarized the root causes and suggested fixes in a GitHub issue: Root Causes: Private DNS Zone missing VNet link Managed identity not created as SQL user Suggested Fixes: Add virtualNetworkLinks resource to Bicep Add SQL setup script to create user with db_datareader and db_datawriter roles Step 7: Merge the PR from Coding Agent Assign the Github issue to Coding Agent which then creates a PR with the fixes. I just reviewed the fix. It made sense and I merged it. Redeployed with azd up, ran the SQL script: curl -s https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql | jq . { "status": "healthy", "database": "tododb", "server": "tcp:sql-tsdvdfdwo77hc.database.windows.net,1433", "message": "Successfully connected to SQL Server" } 🎉 From error to fix in minutes without manually debugging a single Azure resource. Why This Matters If you're a developer building and deploying apps to Azure, SRE Agent changes how you work: You don't need to be a networking expert. SRE Agent understands the relationships between Azure resources private endpoints, DNS zones, VNet links, managed identities. It connects dots you didn't know existed. You don't need to guess. Instead of clicking through the portal hoping something looks wrong, the agent systematically eliminates possibilities like a senior engineer would. You don't break your workflow. SRE Agent suggests fixes in your Bicep and source code not portal changes. Everything stays version controlled. Deployed through pipelines. No hot fixes at 2 AM. You close the loop. AI helps you build fast. Now AI helps you debug fast too. Try It Yourself Do you vibe code your app, your infrastructure, or both? How do you debug when things break? Here's a challenge: Vibe code a todo app with a Web App, VNet, private endpoint, and SQL database. "Forget" to link the DNS zone to the VNet. Deploy it. Watch it fail. Then point SRE Agent at it and see how it identifies the root cause, creates a GitHub issue with the fix, and hands it off to Coding Agent for a PR. Share your experience. I'd love to hear how it goes. Learn More Azure SRE Agent documentation Azure SRE Agent blogs Azure SRE Agent community Azure SRE Agent home page Azure SRE Agent pricing919Views3likes0Comments