best practices
111 Topics

Unifying Scattered Observability Data from Dynatrace + Azure for Self-Healing with SRE Agent
What if your deployments could fix themselves? The Deployment Remediation Challenge Modern operations teams face a recurring nightmare: A deployment ships at 9 AM Errors spike at 9:15 AM By the time you correlate logs, identify the bad revision, and execute a rollback—it's 10:30 AM Your users felt 75 minutes of degraded experience The data to detect and fix this existed the entire time—but it was scattered across clouds and platforms: Error logs and traces → Dynatrace (third-party observability cloud) Deployment history and revisions → Azure Container Apps API Resource health and metrics → Azure Monitor Rollback commands → Azure CLI Your observability data lives in one cloud. Your deployment data lives in another. Stitching together log analysis from Dynatrace with deployment correlation from Azure—and then executing remediation—required a human to manually bridge these silos. What if an AI agent could unify data from third-party observability platforms with Azure deployment history and act on it automatically—every week, before users even notice? Enter SRE Agent + Model Context Protocol (MCP) + Subagents Azure SRE Agent doesn't just work with Azure. Using the Model Context Protocol (MCP), you can connect external observability platforms like Dynatrace directly to your agent. Combined with subagents for specialized expertise and scheduled tasks for automation, you can build an automated deployment remediation system. 
Here's what I built and configured for my Azure Container Apps environment inside SRE Agent:

| Component | Purpose |
| --- | --- |
| Dynatrace MCP Connector | Connects to Dynatrace's MCP gateway for log queries via DQL |
| 'Dynatrace' Subagent | Log analysis specialist that executes DQL queries and identifies root causes |
| 'Remediation' Subagent | Deployment remediation specialist that correlates errors with deployments and executes rollbacks |
| Scheduled Task | Weekly Monday 9:30 AM health check for the 'octopets-prod-api' Container App |

Subagent workflow in SRE Agent Builder: 'OctopetsScheduledTask' triggers 'RemediationSubagent' (12 tools), which hands off to 'DynatraceSubagent' (3 MCP tools) for log analysis.

How I Set It Up: Step by Step

Step 1: Connect Dynatrace via MCP

SRE Agent supports the Model Context Protocol (MCP) for connecting external data sources. Dynatrace exposes an MCP gateway that provides access to its APIs as first-class tools.

Connection configuration:

    {
      "name": "dynatrace-mcp-connector",
      "dataConnectorType": "Mcp",
      "dataSource": "Endpoint=https://<your-tenant>.live.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp;AuthType=BearerToken;BearerToken=<your-api-token>"
    }

Once connected, SRE Agent automatically discovers the Dynatrace tools.

💡 Tip: When creating your Dynatrace API token, grant the `entities.read`, `events.read`, and `metrics.read` scopes for comprehensive access.

Step 2: Build Specialized Subagents

Generic agents are good. Specialized agents are better. I created two subagents that work together in a coordinated workflow: one for Dynatrace log analysis, the other for deployment remediation.

DynatraceSubagent

This subagent is the log analysis specialist. It uses the Dynatrace MCP tools to execute DQL queries and identify root causes.
Key capabilities: Executes DQL queries via MCP tools (`create-dql`, `execute-dql`, `explain-dql`) Fetches 5xx error counts, request volumes, and spike detection Returns consolidated analysis with root cause, affected services, and error patterns 👉 View full DynatraceSubagent configuration here RemediationSubagent This is the deployment remediation specialist. It correlates Dynatrace log analysis with Azure Container Apps deployment history, generates correlation charts, and executes rollbacks when confidence is high. Key capabilities: Retrieves Container Apps revision history (`GetDeploymentTimes`, `ListRevisions`) Generates correlation charts (`PlotTimeSeriesData`, `PlotBarChart`, `PlotAreaChartWithCorrelation`) Computes confidence score (0-100%) for deployment causation Executes rollback and traffic shift when confidence > 70% 👉 View full RemediationSubagent configuration here The power of specialization: Each agent focuses on its domain—DynatraceSubagent handles log analysis, RemediationSubagent handles deployment correlation and rollback. When the workflow runs, RemediationSubagent hands off to DynatraceSubagent (bi-directional handoff) for analysis, gets the findings back, and continues with remediation. Simple delegation, not a single monolithic agent trying to do everything. Step 3: Create the Weekly Scheduled Task Now the automation. I configured a scheduled task that runs every Monday at 9:30 AM to check whether deployments in the last 4 hours caused any issues—and automatically remediate if needed. Scheduled task configuration: Setting Value Task Name OctopetsScheduledTask Frequency Weekly Day of Week Monday Time 9:30 AM Response Subagent RemediationSubagent Scheduled Task Configuration Configuring the OctopetsScheduledTask in the SRE Agent portal The key insight: the scheduled task is just a coordinator. It immediately hands off to the RemediationSubagent, which orchestrates the entire workflow including handoffs to DynatraceSubagent. 
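The confidence gate described for RemediationSubagent in Step 2 (score from 0 to 100%, rollback only above 70%) can be illustrated with a minimal Python sketch. The post does not say how the agent actually computes its score, so the `causation_confidence` heuristic below is purely an assumption made for illustration, as are the `Deployment` class, the `report_only` fallback name, and the sample timestamps. Only the 4-hour lookback and the "> 70%" threshold come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    """Hypothetical record of one Container Apps revision."""
    revision: str
    deployed_at: datetime


def causation_confidence(deployment, error_times, window=timedelta(hours=4)):
    """Assumed heuristic (not the agent's real method): the percentage of
    errors near the deployment that occurred at or after it, from 0 to 100."""
    start = deployment.deployed_at - window
    end = deployment.deployed_at + window
    nearby = [t for t in error_times if start <= t <= end]
    if not nearby:
        return 0.0
    after = sum(1 for t in nearby if t >= deployment.deployed_at)
    return 100.0 * after / len(nearby)


def decide_action(confidence, threshold=70.0):
    """Roll back only when confidence is strictly above the threshold."""
    return "rollback" if confidence > threshold else "report_only"


# Made-up data: one error before the deployment, nineteen after it.
deploy = Deployment("rev-2", datetime(2025, 1, 6, 9, 0))
errors = [datetime(2025, 1, 6, 8, 30)] + [
    datetime(2025, 1, 6, 9, 15) + timedelta(minutes=i) for i in range(19)
]
score = causation_confidence(deploy, errors)
print(score, decide_action(score))
```

Keeping the decision in a small, separate function makes the threshold easy to audit and to raise if autonomous rollbacks turn out to be too eager for a given environment.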
Step 4: See It In Action

Here's what happens when the scheduled task runs:

1. The scheduled task triggers and initiates Dynatrace analysis for octopets-prod-api.
2. The DynatraceSubagent analyzes the logs and identifies the root cause, executing DQL queries and returning a consolidated log analysis.
3. The RemediationSubagent then generates correlation charts.
4. Finally, with a 95% confidence score, SRE Agent executes the rollback and traffic shift autonomously.

The agent detected the bad deployment, generated visual evidence, and automatically shifted 100% of traffic to the last known working revision, all without human intervention.

Why This Matters

| Before | After |
| --- | --- |
| Manually check Dynatrace after incidents | Automated DQL queries via MCP |
| Stitch together logs + deployments manually | Subagents correlate data automatically |
| Rollback requires human decision + execution | Confidence-based auto-remediation |
| 75+ minutes from deployment to rollback | Under 5 minutes with autonomous workflow |
| Reactive incident response | Proactive weekly health checks |

Try It Yourself

1. Connect your observability tool via MCP (Dynatrace, Datadog, New Relic, Prometheus, or any tool with an MCP gateway)
2. Build a log analysis subagent that knows how to query your observability data
3. Build a remediation subagent that can correlate logs with deployments and execute fixes
4. Wire them together with handoffs so the subagents can delegate log analysis
5. Create a scheduled task to trigger the workflow automatically

Learn More

- Azure SRE Agent documentation
- Model Context Protocol (MCP) integration guide
- Building subagents for specialized workflows
- Scheduled tasks and automation
- SRE Agent Community
- Azure SRE Agent pricing
- SRE Agent Blogs

Industry-Wide Certificate Changes Impacting Azure App Service Certificates
Executive Summary

In early 2026, industry-wide changes mandated by browser applications and the CA/B Forum will affect both how TLS certificates are issued and how long they remain valid. The CA/B Forum is a vendor body that establishes standards for securing websites and online communications through SSL/TLS certificates. Azure App Service is aligning with these standards for both App Service Managed Certificates (ASMC, free, DigiCert-issued) and App Service Certificates (ASC, paid, GoDaddy-issued). Most customers will experience no disruption. Action is required only if you pin certificates or use them for client authentication (mTLS).

Update: February 17, 2026

We’ve published new Microsoft Learn documentation, Industry-wide certificate changes impacting Azure App Service, which provides more detailed guidance on these compliance-driven changes. The documentation also includes information not previously covered in this blog, such as updates to domain validation reuse, along with an expanding FAQ section. The Microsoft Learn documentation is now the most complete and up-to-date overview of these changes. Going forward, any new details or clarifications will be published there, and we recommend bookmarking the documentation for the latest guidance.

Who Should Read This?

- App Service administrators
- Security and compliance teams
- Anyone responsible for certificate management or application security

Quick Reference: What’s Changing & What To Do

| Topic | ASMC (Managed, free) | ASC (GoDaddy, paid) | Required Action |
| --- | --- | --- | --- |
| New Cert Chain | New chain (no action unless pinned) | New chain (no action unless pinned) | Remove certificate pinning |
| Client Auth EKU | Not supported (no action unless cert is used for mTLS) | Not supported (no action unless cert is used for mTLS) | Transition from mTLS |
| Validity | No change (already compliant) | Two overlapping certs issued for the full year | None (automated) |

If you do not pin certificates or use them for mTLS, no action is required.
Timeline of Key Dates

| Date | Change | Action Required |
| --- | --- | --- |
| Mid-Jan 2026 and after | ASMC migrates to new chain; ASMC stops supporting client auth EKU | Remove certificate pinning if used; transition to alternative authentication if the certificate is used for mTLS |
| Mar 2026 and after | ASC validity shortened; ASC migrates to new chain; ASC stops supporting client auth EKU | Remove certificate pinning if used; transition to alternative authentication if the certificate is used for mTLS |

Actions Checklist

For All Users

- Review your use of App Service certificates.
- If you do not pin these certificates and do not use them for mTLS, no action is required.

If You Pin Certificates (ASMC or ASC)

- Remove all certificate or chain pinning before their respective key change dates to avoid service disruption.
- See Best Practices: Certificate Pinning.

If You Use Certificates for Client Authentication (mTLS)

- Switch to an alternative authentication method before their respective key change dates to avoid service disruption, as client authentication EKU will no longer be supported for these certificates.
- See Sunsetting the client authentication EKU from DigiCert public TLS certificates.
- See Set Up TLS Mutual Authentication - Azure App Service.

Details & Rationale

Why Are These Changes Happening?

These updates are required by major browser programs (e.g., Chrome) and apply to all public CAs. They are designed to enhance security and compliance across the industry. Azure App Service is automating updates to minimize customer impact.

What’s Changing?

New Certificate Chain

Certificates will be issued from a new chain to maintain browser trust. Impact: Remove any certificate pinning to avoid disruption.

Removal of Client Authentication EKU

Newly issued certificates will not support client authentication EKU. This change aligns with Google Chrome’s root program requirements to enhance security. Impact: If you use these certificates for mTLS, transition to an alternate authentication method.
Shortening of Certificate Validity

Certificate validity is now limited to a maximum of 200 days. Impact: ASMC is already compliant; ASC will automatically issue two overlapping certificates to cover one year. No billing impact.

Frequently Asked Questions (FAQs)

Will I lose coverage due to shorter validity?
No. For App Service Certificate, App Service will issue two certificates to span the full year you purchased.

Is this unique to DigiCert and GoDaddy?
No. This is an industry-wide change.

Do these changes impact certificates from other CAs?
Yes. These changes apply industry-wide. We recommend you reach out to your certificates’ CA for more information.

Do I need to act today?
If you do not pin these certificates or use them for mTLS, no action is required.

Glossary

- ASMC: App Service Managed Certificate (free, DigiCert-issued)
- ASC: App Service Certificate (paid, GoDaddy-issued)
- EKU: Extended Key Usage
- mTLS: Mutual TLS (client certificate authentication)
- CA/B Forum: Certification Authority/Browser Forum

Additional Resources

- Changes to the Managed TLS Feature
- Set Up TLS Mutual Authentication
- Azure App Service Best Practices – Certificate pinning
- DigiCert Root and Intermediate CA Certificate Updates 2023
- Sunsetting the client authentication EKU from DigiCert public TLS certificates

Feedback & Support

If you have questions or need help, please visit our official support channels or the Microsoft Q&A, where our team and the community can assist you.

MCP-Driven Azure SRE for Databricks
Azure SRE Agent is an AI-powered operations assistant built for incident response and governance. MCP (Model Context Protocol) is the standard interface it uses to connect to external systems and tools. Azure SRE Agent integrates with Azure Databricks through the Model Context Protocol (MCP) to provide: Proactive Compliance - Automated best practice validation Reactive Troubleshooting - Root cause analysis and remediation for job failures This post demonstrates both capabilities with real examples. Architecture The Azure SRE Agent orchestrates Ops Skills and Knowledge Base prompts, then calls the Databricks MCP server over HTTPS. The MCP server translates those requests into Databricks REST API calls, returns structured results, and the agent composes findings, evidence, and remediation. End-to-end, this yields a single loop: intent -> MCP tool calls -> Databricks state -> grounded response. Deployment The MCP server runs as a containerized FastMCP application on Azure Container Apps, fronted by HTTPS and configured with Databricks workspace connection settings. It exposes a tool catalog that the agent invokes through MCP, while the container handles authentication and REST API calls to Databricks. 👉 For deployment instructions, see the GitHub repository. Getting Started Deploy the MCP Server: Follow the quickstart guide to deploy to Azure Container Apps (~30 min) Configure Azure SRE Agent: Create MCP connector with streamable-http transport Upload Knowledge Base from Builder > Knowledge Base using the Best Practices doc: AZURE_DATABRICKS_BEST_PRACTICES.md Benefit: Gives the agent authoritative compliance criteria and remediation commands. Create Ops Skill from Builder > Subagent Builder > Create skill and drop the Ops Skill doc: DATABRICKS_OPS_RUNBOOK_SKILL.md Benefit: Adds incident timelines, runbooks, and escalation triggers to responses. 
Deploy the subagent YAML: Databricks_MCP_Agent.yaml Benefit: Wires the MCP connector, Knowledge Base, and Ops Skill into one agent for proactive and reactive workflows. Integrate with Alerting: Connect PagerDuty/ServiceNow webhooks Enable auto-remediation for common issues Part 1: Proactive Compliance Use Case: Best Practice Validation Prompt: @Databricks_MCP_Agent: Validate the Databricks workspace for best practices compliance and provide a summary, detailed findings, and concrete remediation steps. What the Agent Does: Calls MCP tools to gather current state: list_clusters() - Audit compute configurations list_catalogs() - Check Unity Catalog setup list_jobs() - Review job configurations execute_sql() - Query governance policies Cross-references findings with Knowledge Base (best practices document) Generates prioritized compliance report Expected Output: Benefits: Time Savings: 5 minutes vs. 2-3 hours manual review Consistency: Same validation criteria across all workspaces Actionable: Specific remediation steps with code examples Part 2: Reactive Incident Response Example 1: Job Failure - Non-Zero Exit Code Scenario: Job job_exceptioning_out fails repeatedly due to notebook code errors. Prompt: Agent Investigation - Calls MCP Tools: get_job() - Retrieves job definition list_job_runs() - Gets recent run history (4 failed runs) get_run_output() - Analyzes error logs Root Cause Analysis: Expected Outcome: Root Cause Identified: sys.exit(1) in notebook code Evidence Provided: Job ID, run history, code excerpt, settings Confidence: HIGH (explicit failing code present) Remediation: Fix code + add retry policy Resolution Time: 3-5 minutes (vs. 30-45 minutes manual investigation) Example 2: Job Failure - Task Notebook Exception Scenario: Job hourly-data-sync fails repeatedly due to exception in task notebook. 
Prompt:

Agent Investigation - Calls MCP Tools:

- get_job() - Job definition and task configuration
- list_job_runs() - Recent runs show "TERMINATED with TIMEOUT"
- execute_sql() - Queries notebook metadata

Root Cause Analysis:

Expected Outcome:

- Root Cause Identified: Exception at line 7 - null partition detected
- Evidence: Notebook path, code excerpt (lines 5-7), run history (7 consecutive failures)
- Confidence: HIGH (explicit failing code + TIMEOUT/queue issues)
- Remediation: Fix exception handling + add retry policy
- Resolution Time: 5-8 minutes (vs. 45+ minutes manual log analysis)

Key Benefits

Proactive Governance

- ✅ Continuous compliance monitoring
- ✅ Automated best practice validation
- ✅ 95% reduction in manual review time

Reactive Incident Response

- 🚨 Automated root cause analysis
- ⚡ 80-95% reduction in MTTR
- 🧠 Context-aware remediation recommendations
- 📊 Evidence-based troubleshooting

Operational Impact

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Compliance review time | 2-3 hours | 5 minutes | 95% |
| Job failure investigation | 30-45 min | 3-8 min | 85% |
| On-call alerts requiring intervention | 4-6 per shift | 1-2 per shift | 70% |

Conclusion

Azure SRE Agent transforms Databricks operations by combining proactive governance with reactive troubleshooting. The MCP integration provides:

- Comprehensive visibility into workspace health
- Automated compliance monitoring and validation
- Intelligent incident response with root cause analysis
- Self-healing capabilities for common failures

Result: Teams spend less time firefighting and more time building.

Resources

- 📘 Deployment Guide
- 🤖 Subagent Configuration
- 📋 Best Practices Document
- 🧰 Ops Skill Runbook
- 🔧 Validation Script
- 📖 Azure SRE Agent Documentation
- 📰 Azure SRE Agent Blogs
- 📜 MCP Specification

Questions? Open an issue on GitHub or reach out to the Azure SRE team.

Azure WAF Compliance with MCP-Driven SRE Agent
Azure governance at scale is complex. Security teams manually review many resource types across multiple subscriptions. Finance can't track costs without tags. Compliance teams spend days cross-referencing WAF standards against actual infrastructure. And critical security gaps - like RDP open to 0.0.0.0/0 or customer-managed encryption disabled - slip through until discovered in audits. This workflow is broken. Enter the Azure SRE Agent: an AI-powered compliance engine that discovers all resources, assesses them against all 5 WAF pillars AND your organization's specific standards in minutes, then generates exact remediation commands with quantified impact. How it works: The agent leverages these capabilities to transform Azure governance: Autonomous Resource Discovery via MCP - Azure MCP (Model Context Protocol) server exposes Azure Resource Graph and ARM APIs as discoverable tools. The agent automatically inventories all resources across subscriptions with metadata (types, locations, tags, security settings) in seconds. Multi-Pillar WAF Assessment - For each discovered resource, the agent validates against all 5 WAF pillars (Reliability, Security, Cost Optimization, Operational Excellence, Performance) using Azure MCP tools, generating an assessment summary with pass/fail/partial/unknown counts. Org Best Practices Cross-Check - The agent references your organization's compliance standards (stored in knowledge base as org-practices.md) to escalate WAF findings into actionable org policies. A Security policy violation becomes a critical finding. A cost optimization recommendation becomes a warning. Automated Remediation Codegen - For every finding, the agent generates exact Azure CLI commands, Terraform snippets, and Portal steps with impact quantification (risk reduction, cost savings, compliance improvement). 
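The assessment summary and the org cross-check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the finding shape (`resource`, `pillar`, `status`), the resource names, and the `SEVERITY_BY_PILLAR` mapping are hypothetical, and the mapping simply mirrors the two examples in the text (Security becomes critical, Cost Optimization becomes a warning). A real org-practices.md would drive a richer set of rules.

```python
from collections import Counter

# Assumed mapping that follows the two examples in the text; any other
# pillar is treated as informational in this sketch.
SEVERITY_BY_PILLAR = {"Security": "critical", "Cost Optimization": "warning"}


def summarize(findings):
    """Count findings per status, always reporting all four buckets."""
    counts = Counter(f["status"] for f in findings)
    return {s: counts.get(s, 0) for s in ("pass", "fail", "partial", "unknown")}


def escalate(findings):
    """Attach an org severity to every finding that did not fully pass."""
    escalated = []
    for finding in findings:
        if finding["status"] in ("fail", "partial"):
            severity = SEVERITY_BY_PILLAR.get(finding["pillar"], "info")
            escalated.append({**finding, "severity": severity})
    return escalated


# Hypothetical findings, one per resource.
findings = [
    {"resource": "nsg-demo", "pillar": "Security", "status": "fail"},
    {"resource": "disk-demo", "pillar": "Cost Optimization", "status": "partial"},
    {"resource": "vm-demo", "pillar": "Performance", "status": "fail"},
    {"resource": "kv-demo", "pillar": "Security", "status": "pass"},
]
print(summarize(findings))
for item in escalate(findings):
    print(item["resource"], item["severity"])
```

Separating the neutral WAF result (status) from the org-specific judgment (severity) keeps the same assessment reusable across teams with different policies.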
Architecture

The Azure SRE Agent orchestrates WAF compliance checks by calling Azure MCP tools to discover and assess resources, then cross-references findings with organizational best practices from the knowledge base. End-to-end: subscription scope → resource discovery → 5-pillar assessment → org compliance → remediation commands.

Deployment

The Azure MCP server runs locally via Node.js in Stdio mode, communicating with the SRE Agent through standard input/output. Configured with Managed Identity credentials, it exposes Azure ARM and governance APIs as MCP tools. The agent uses this catalog plus the org practices knowledge base to validate compliance autonomously.

Scheduling Options:

- Ad-hoc: Run on-demand via agent prompt
- Scheduled: Use Azure SRE Agent's built-in task scheduler (daily/weekly/monthly compliance scans)
- Event-driven: Trigger on resource group changes or policy violations

Getting Started

Configure Azure MCP Connector:

- Connection Type: Stdio (local process)
- Command: npx
- Arguments (in order): -y (auto-accept), @azure/mcp (package name), server (run as server), start (start command), --mode (mode flag), all (enable all resources)
- Environment Variables: AZURE_CLIENT_ID (your managed identity client ID), AZURE_TOKEN_CREDENTIALS (managedidentitycredential)
- Managed Identity: Select your managed identity from the dropdown

Configure Azure SRE Agent:

- Upload Knowledge Base from Builder > Knowledge Base using the org practices doc: org-practices.md. Benefit: Agent validates Azure resources against both WAF standards AND your company-specific requirements.
- Deploy the agent YAML: Azure_WAF_Compliance_Agent.yaml, attach the Azure MCP connector, and associate the agent with the KB article.
Benefit: Autonomous resource discovery + 5-pillar assessment + org compliance in one execution Run Your First Validation: Prompt the agent with your subscription or resource group: Integrate with Operations: Schedule daily compliance scans via Azure SRE Agent's scheduled tasks Trigger assessments on resource deployments Export findings to Azure DevOps or ServiceNow for ticketing Autonomous Compliance Workflow Phase 1: Resource Discovery Agent Task: The agent automatically discovers all resources in the target resource group, including VMs, storage accounts, NSGs, Key Vaults, App Services, and more - no manual resource IDs needed. What the Agent Does: Calls Azure MCP tools in parallel: list_virtual_machines() - Inventory compute list_storage_accounts() - Inventory storage list_network_security_groups() - Inventory network security list_key_vaults() - Inventory secrets management list_web_apps() - Inventory App Services Additional resource type queries as needed Builds comprehensive resource inventory with metadata (types, locations, tags, configurations) Expected Output: Phase 2: Multi-Pillar WAF Assessment Agent Task: For each discovered resource, the agent validates against all 5 Azure Well-Architected Framework pillars using Azure MCP tools. 
What the Agent Does: Reliability Pillar: VM availability zones, backup configuration, health probes Storage redundancy (LRS vs GRS vs ZRS) App Service health checks and auto-healing Security Pillar: NSG rules: Check for 0.0.0.0/0 sources on SSH/RDP Key Vault: Secret expiration dates, soft delete, purge protection Storage: Public blob access, shared key access, HTTPS enforcement, TLS version App Service: HTTPS-only, managed identity, VNet integration Cost Optimization Pillar: VM right-sizing: CPU/memory utilization analysis Orphaned resources: Unattached disks, unused public IPs Resource tagging: cost-center, owner, environment tags Storage tiering opportunities Operational Excellence Pillar: Monitoring: Azure Monitor diagnostics enabled Logging: Log Analytics workspace integration IaC: Resource tags indicating Terraform/Bicep management Performance Pillar: VM series appropriateness for workload Storage account performance tier App Service plan scaling configuration Expected Output: Phase 3: Org Best Practices Cross-Check Agent Task: Cross-reference all WAF findings with organizational requirements from org-practices.md knowledge base. Escalate violations of company-specific standards. What the Agent Does: Reads org-practices.md from knowledge base For each WAF finding, checks if it intersects with org requirement Maps severity based on org standards: 🔴 Critical: NSG open to 0.0.0.0/0, Key Vault secrets expiring < 30 days, storage public access enabled 🟡 Warning: Missing required tags (cost-center, owner), idle VMs, orphaned disks 🔵 Info: Recommendations without org mandate Provides specific evidence and references to org-practices.md sections Expected Output: Phase 4: Remediation with Exact Commands Agent Task: Generate copy-paste CLI commands, Terraform snippets, and Portal steps for every gap. Quantify impact (cost savings, risk reduction). 
What the Agent Does: For each finding, generates precise fix commands Provides Azure CLI, PowerShell, or Terraform as appropriate Quantifies impact: cost reduction, security risk level, compliance improvement Expected Output: Remediation Roadmap Immediate (Deploy Today - Critical Risk): Enable TLS 1.2 minimum on SQL Server prod-sql-server-01 (2 min) Fix NSG "nsg-aks-nodes" - Remove RDP from 0.0.0.0/0 (5 min) Enable SQL Server Auditing (10 min) Enable RBAC on AKS Cluster (recreate cluster, ~20 min downtime) Enable Customer-Managed Encryption on diagstorage01 (15 min) Short-Term (Next 3 Days - High Risk): Install Azure Monitor VM extensions on ubuntu-prod-01 (10 min) Enable HTTPS-only on Function App data-processor-func (1 min) Enable Purge Protection on Key Vault kv-prod-01 (1 min) Deploy Azure Bastion for secure VM access (20 min) Enable AKS Calico network policies (10 min) Short-Term (1 Week - Medium Risk): Delete orphaned disk disk-unattached-03 ($48/month savings) Configure storage lifecycle policy on logsstorage01 ($120/month savings) Add monitoring to Function App data-processor-func Add required tags to VM ubuntu-prod-01 and others Add NSG to unprotected subnets in vnet-prod Medium-Term (2-4 Weeks - Low Risk): Implement automated secret rotation for Key Vault Configure Application Insights for all Function Apps Set up managed identity on win-dev-02 and other VMs Implement Azure Policy enforcement for ongoing compliance Long-Term (Ongoing - Preventive): Schedule weekly compliance scans via Azure Logic Apps Integrate findings with Azure DevOps for backlog tracking Implement GitOps for IaC enforcement (Terraform, Bicep) Train teams on security best practices and compliance requirements Real-World Results The assessment workflow discovered 8 critical findings and 11 warnings across SQL, AKS, Storage, VMs, and NSGs. Each came with remediation commands and prioritized timelines. Total time: 6 minutes. Manual review would take 4-6 hours. 
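Several of the critical findings above come from the Security-pillar NSG check (SSH/RDP reachable from 0.0.0.0/0). A minimal Python sketch of that check follows. It is an assumption-laden illustration, not the agent's implementation: the rule dictionaries are simplified stand-ins whose field names loosely mirror ARM security rule properties, the rule names are made up, and treating `*` and the `Internet` service tag as equivalent to 0.0.0.0/0 is my own addition beyond what the post states.

```python
# Sources treated as "open to the world" in this sketch (assumption:
# "*" and the "Internet" tag are included alongside 0.0.0.0/0).
OPEN_SOURCES = {"0.0.0.0/0", "*", "Internet"}
ADMIN_PORTS = {22: "SSH", 3389: "RDP"}


def port_in_spec(port, spec):
    """Return True if a port matches '*', a single port, or a 'lo-hi' range."""
    if spec == "*":
        return True
    if "-" in spec:
        low, high = spec.split("-", 1)
        return int(low) <= port <= int(high)
    return int(spec) == port


def open_admin_rules(rules):
    """List (rule name, protocol label) pairs for inbound allow rules
    that expose SSH or RDP to an open source."""
    hits = []
    for rule in rules:
        if rule["direction"] != "Inbound" or rule["access"] != "Allow":
            continue
        if rule["sourceAddressPrefix"] not in OPEN_SOURCES:
            continue
        for port, label in ADMIN_PORTS.items():
            if port_in_spec(port, rule["destinationPortRange"]):
                hits.append((rule["name"], label))
    return hits


# Hypothetical rules for illustration only.
rules = [
    {"name": "allow-rdp-any", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "0.0.0.0/0", "destinationPortRange": "3389"},
    {"name": "allow-ssh-corp", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "10.0.0.0/8", "destinationPortRange": "22"},
    {"name": "deny-all-in", "direction": "Inbound", "access": "Deny",
     "sourceAddressPrefix": "*", "destinationPortRange": "*"},
]
print(open_admin_rules(rules))
```

A production check would also need to handle rule priority, multiple prefixes per rule, and comma-separated port lists, which this sketch deliberately leaves out.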
Key Benefits Autonomous Discovery & Assessment ✅ Zero manual resource inventory - agent discovers everything ✅ Multi-pillar WAF validation in one execution ✅ Org-specific compliance enforcement via knowledge base Risk & Cost Visibility 🔴 Immediate identification of critical security gaps (NSGs, secrets, public storage) 💰 Quantified cost savings from orphaned resources and idle VMs 📊 Evidence-based findings with direct references to org policies Actionable Remediation 🛠️ Copy-paste CLI commands and Terraform for every gap ⚡ Prioritized roadmap (immediate/short/medium/long-term) 📈 Impact quantification (risk reduction, cost savings, compliance %) Operational Impact Metric Before After Improvement WAF compliance review 4-6 hours manual 5-8 minutes autonomous 95% Critical security gap discovery 2-3 days Real-time Immediate Org policy violation tracking Manual spreadsheet Automated report 100% Orphaned resource cleanup Quarterly review Weekly automated scan 4x frequency Scheduling Compliance Scans Azure SRE Agent includes built-in task scheduling. From the Scheduled tasks menu, create a new task specifying: Frequency: Daily, Weekly, or Monthly Time: When to run scans Scope: Target resource group or subscription Autonomy: Autonomous (auto-remediate) or Review (approval required) The agent runs on schedule, discovers resources, assesses WAF compliance, and executes or flags remediation based on your settings. Conclusion Azure SRE Agent transforms Azure governance by combining autonomous resource discovery, multi-pillar WAF assessment, and organization-specific compliance enforcement. 
The MCP integration provides:

- Continuous compliance monitoring across all 5 WAF pillars
- Org best practices enforcement via knowledge base integration
- Automated remediation with exact CLI commands and impact quantification
- Flexible scheduling for ad-hoc, scheduled, or event-driven scans

Result: Security teams maintain compliance effortlessly, finance tracks costs accurately, and platform teams remediate gaps with confidence.

Resources

- 🤖 Agent Configuration YAML
- 📋 Org Best Practices
- 📖 Azure SRE Agent Documentation
- 📰 Azure SRE Agent Blogs
- 📜 MCP Specification
- 🏗️ Azure Well-Architected Framework

Questions? Open an issue on GitHub or reach out to the Azure SRE team.

From Local MCP Server to Hosted Web Agent: App Service Observability, Part 2
In Part 1, we introduced the App Service Observability MCP Server — a proof-of-concept that lets GitHub Copilot (and other AI assistants) query your App Service logs, analyze errors, and help debug issues through natural language. That version runs locally alongside your IDE, and it's great for individual developers who want to investigate their apps without leaving VS Code. A local MCP server is powerful, but it's personal. Your teammate has to clone the repo, configure their IDE, and run it themselves. What if your on-call engineer could just open a browser and start asking questions? What if your whole team had a shared observability assistant — no setup required? In this post, we'll show how we took the same set of MCP tools and wrapped them in a hosted web application — deployed to Azure App Service with a chat UI and a built-in Azure OpenAI agent. We'll cover what changed, what stayed the same, and why this pattern opens the door to far more than just a web app. Quick Recap: The Local MCP Server If you haven't read Part 1, here's the short version: We built an MCP (Model Context Protocol) server that exposes ~15 observability tools for App Service — things like querying Log Analytics, fetching Kudu container logs, analyzing HTTP errors, correlating deployments with failures, and checking logging configurations. You point your AI assistant (GitHub Copilot, Claude, etc.) at the server, and it calls those tools on your behalf to answer questions about your apps. That version: Runs locally on your machine via node Uses stdio transport (your IDE spawns the process) Relies on your Azure credentials ( az login ) — the AI operates with your exact permissions Requires no additional Azure resources It works. It's fast. And for a developer investigating their own apps, it's the simplest path. This is still a perfectly valid way to use the project — nothing about the hosted version replaces it. 
The Problem: Sharing Is Hard The local MCP server has a limitation: it's tied to one developer's machine and IDE. In practice, this means: On-call engineers need to clone the repo and configure their environment before they can use it Team leads can't point someone at a URL and say "go investigate" Non-IDE users (PMs, support engineers) are left out entirely Consistent configuration (which subscription, which resource group) has to be managed per-person We wanted to keep the same tools and the same observability capabilities, but make them accessible to anyone with a browser. The Solution: Host It on App Service The answer turned out to be straightforward: deploy the MCP server itself to Azure App Service, give it a web frontend, and bring its own AI agent along for the ride. Here's what the hosted version adds on top of the local MCP server: Local MCP Server Hosted Web Agent How it works Runs locally, your IDE's AI calls the tools Deployed to Azure App Service with its own AI agent Interface VS Code, Claude Desktop, or any MCP client Browser-based chat UI Agent Your existing AI assistant (Copilot, Claude, etc.) Built-in Azure OpenAI (GPT-5-mini) Azure resources needed None beyond az login App Service, Azure OpenAI, VNet Best for Individual developers in their IDE Teams who want a shared, centralized tool Authentication Your local az login credentials Managed identity + Easy Auth (Entra ID) Deploy npm install && npm run build azd up The key insight: the MCP tools are identical. Both versions use the exact same set of observability tools — the only difference is who's calling them (your IDE's AI vs. the built-in Azure OpenAI agent) and where the server runs (your laptop vs. App Service). 
What We Built

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                                 Web Browser                                 │
│     React Chat UI - resource selectors, tool steps, markdown responses      │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │ HTTP (REST API)
                                   ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                       Azure App Service (Node.js 20)                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                            Express Server                            │   │
│  │  ├── /api/chat → Agent loop (OpenAI → tool calls → respond)          │   │
│  │  ├── /api/set-context → Set target app for investigation             │   │
│  │  ├── /api/resource-groups, /api/apps → Resource discovery            │   │
│  │  ├── /mcp → MCP protocol endpoint (Streamable HTTP)                  │   │
│  │  └── / → Static SPA (React chat UI)                                  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                         VNet Integration (snet-app)                         │
└─────────────────────────────────────────────────────────────────────────────┘
       │                       │                         │
       ▼                       ▼                         ▼
┌──────────────┐     ┌───────────────────┐     ┌────────────────────┐
│ Azure OpenAI │     │ Log Analytics /   │     │ ARM API / Kudu     │
│ (GPT-5-mini) │     │ KQL Queries       │     │ (app metadata,     │
│ Private EP   │     └───────────────────┘     │  container logs)   │
└──────────────┘                               └────────────────────┘

The Express server does double duty: it serves the React chat UI as static files and exposes the MCP endpoint for remote IDE connections. The agent loop is simple: when a user sends a message, the server calls Azure OpenAI, which may request tool calls; the server executes those tools, and the loop continues until the AI has a final answer.

Demo

The following screenshots show how this app can be used. The first screenshot shows what happens when you ask about a functioning app. You can see the agent made 5 tool calls and was able to give a thorough summary of the current app's status, recent deployments, as well as provide some recommendations for how to improve observability of the app itself.
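The agent loop described in the architecture section, along with the per-tool steps the chat UI displays, can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the type names, the `runAgentLoop` function, the turn limit, and the scripted stand-in model and tool are all assumptions made so the example runs without Azure OpenAI.

```typescript
// Hypothetical shapes: the post does not show the real server's types.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply =
  | { kind: "tool_calls"; calls: ToolCall[] }
  | { kind: "final"; text: string };
type ChatMessage = { role: "user" | "tool"; content: string };
type ChatModel = (messages: ChatMessage[]) => Promise<ModelReply>;
type ToolFn = (args: Record<string, unknown>) => Promise<string>;
type ToolStep = { name: string; args: Record<string, unknown>; durationMs: number };

// Call the model, run any tools it asks for, feed the results back,
// and repeat until it produces a final answer (or the turn limit is hit).
async function runAgentLoop(
  model: ChatModel,
  tools: Map<string, ToolFn>,
  userMessage: string,
  maxTurns = 8,
): Promise<{ answer: string; steps: ToolStep[] }> {
  const messages: ChatMessage[] = [{ role: "user", content: userMessage }];
  const steps: ToolStep[] = [];

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await model(messages);
    if (reply.kind === "final") {
      return { answer: reply.text, steps };
    }
    for (const call of reply.calls) {
      const tool = tools.get(call.name);
      const startedAt = Date.now();
      const output = tool ? await tool(call.args) : `Unknown tool: ${call.name}`;
      // Record what was called, with which arguments, and how long it took.
      steps.push({ name: call.name, args: call.args, durationMs: Date.now() - startedAt });
      messages.push({ role: "tool", content: `${call.name} -> ${output}` });
    }
  }
  return { answer: "Stopped: turn limit reached without a final answer.", steps };
}

// Stand-ins so the sketch runs offline: one fake tool and a scripted model.
const demoTools = new Map<string, ToolFn>([
  ["get_recent_deployments", async () => "1 deployment in the last 24h"],
]);

const scriptedModel: ChatModel = async (messages) =>
  messages.some((m) => m.role === "tool")
    ? { kind: "final", text: "The app is healthy; 1 recent deployment." }
    : { kind: "tool_calls", calls: [{ name: "get_recent_deployments", args: { hours: 24 } }] };

runAgentLoop(scriptedModel, demoTools, "Tell me about this app").then((result) => {
  console.log(result.answer, result.steps.map((s) => s.name));
});
```

Keeping the recorded steps separate from the final answer is what would let a UI show a collapsible tool panel next to the response; the turn limit is a defensive choice of this sketch, not something the post specifies.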
I expanded the tools section so you could see exactly what the agent was doing behind the scenes and get a sense of how it was thinking. At this point, you can proceed to ask more questions about your app if there are other pieces of information you want to pull from your logs.

I then injected a fault into this app by initiating a deployment pointing to a config file that didn't actually exist. The goal here was to prove that the agent could correlate an application issue to a specific deployment event, something that currently involves manual effort and deep investigation into logs and source code. Having an agent that can do this for you in a matter of seconds saves time and effort that could be directed to more important activities, and makes it more likely that you find the issue the first time.

A few minutes after initiating the bad deployment, I saw that my app was no longer responding. Rather than going to the logs and investigating myself, I asked the agent "I'm getting an application error now, what happened?" I obviously know what happened and what the source of the error was, but let's see if the agent can pick that up.

The agent was able to see that something was wrong and then point me in the direction to address the issue. It ran a number of tool calls following the investigation steps called out in the skills file and was successfully able to identify the source of the error.

And lastly, I wanted to confirm the error was associated with the recent deployment, something that our agent should be able to do because we built in the tools it needs to correlate these kinds of events with errors. I asked it directly and here was the response, exactly what I expected to see.
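The kind of deployment-to-error correlation the demo relies on can be illustrated with a small sketch. This is an assumption about one simple way to do it (compare error counts in a fixed window before and after each deployment), not the logic of the project's actual tools; the record shapes, the function names, the 30-minute window, and the sample data are all hypothetical.

```typescript
// Hypothetical record shapes; the real tools read this data from Azure APIs and logs.
type Deployment = { id: string; completedAt: Date };
type LoggedError = { timestamp: Date; message: string };
type Correlation = { deploymentId: string; errorsBefore: number; errorsAfter: number };

// Count errors with fromMs <= timestamp < toMs.
function countErrorsBetween(errors: LoggedError[], fromMs: number, toMs: number): number {
  return errors.filter((e) => {
    const t = e.timestamp.getTime();
    return t >= fromMs && t < toMs;
  }).length;
}

// Walk deployments newest-first and return the first one that was followed
// by more errors than preceded it within the comparison window.
function findSuspectDeployment(
  deployments: Deployment[],
  errors: LoggedError[],
  windowMinutes = 30,
): Correlation | null {
  const windowMs = windowMinutes * 60 * 1000;
  const newestFirst = deployments
    .slice()
    .sort((a, b) => b.completedAt.getTime() - a.completedAt.getTime());

  for (const deployment of newestFirst) {
    const at = deployment.completedAt.getTime();
    const errorsBefore = countErrorsBetween(errors, at - windowMs, at);
    const errorsAfter = countErrorsBetween(errors, at, at + windowMs);
    if (errorsAfter > errorsBefore) {
      return { deploymentId: deployment.id, errorsBefore, errorsAfter };
    }
  }
  return null;
}

// Made-up sample data: errors only start after the second deployment.
const sampleDeployments: Deployment[] = [
  { id: "deploy-1", completedAt: new Date("2026-01-05T08:00:00Z") },
  { id: "deploy-2", completedAt: new Date("2026-01-05T09:00:00Z") },
];
const sampleErrors: LoggedError[] = [
  { timestamp: new Date("2026-01-05T09:05:00Z"), message: "Config file not found" },
  { timestamp: new Date("2026-01-05T09:12:00Z"), message: "Config file not found" },
];

console.log(findSuspectDeployment(sampleDeployments, sampleErrors));
```

A before/after count is only a heuristic: it shows that errors followed a deployment, not that the deployment caused them, which is why the agent in the demo also reads the error messages themselves.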
Infrastructure (one command) Everything is defined in Bicep and deployed with the Azure Developer CLI: azd up This provisions: App Service Plan (P0v3) with App Service (Node.js 20 LTS, VNet-integrated) Azure OpenAI (GPT-5-mini, Global Standard) with a private endpoint and private DNS zone VNet (10.0.0.0/16) with dedicated subnets for the app and private endpoints Managed Identity with RBAC roles: Reader, Website Contributor, Log Analytics Reader, Cognitive Services OpenAI User No API keys anywhere. The App Service authenticates to Azure OpenAI over a private network using its managed identity. The Chat UI The web interface is designed to get out of the way and let you focus on investigating: Resource group and app dropdowns — Browse your subscription, pick the app you want to investigate Tool step visibility — A collapsible panel shows exactly which tools the agent called, what arguments it used, and how long each took Session management — Start fresh conversations, with confirmation dialogs when switching context mid-investigation Markdown responses — The agent's answers are rendered with full formatting, code blocks, and tables When you first open the app, it auto-discovers your subscription and populates the resource group dropdown. Select an app, hit "Tell me about this app," and the agent starts investigating. Security Since this app has subscription-wide read access to your App Services and Log Analytics workspaces, you should definitely enable authentication. After deploying, configure Easy Auth in the Azure Portal: Go to your App Service → Authentication Click Add identity provider → select Microsoft Entra ID Set Unauthenticated requests to "HTTP 401 Unauthorized" This ensures only authorized members of your organization can access the tool. The connection to Azure OpenAI is secured via a private endpoint — traffic never traverses the public internet. The app authenticates using its managed identity with the Cognitive Services OpenAI User role. 
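Once Easy Auth is enabled, App Service forwards the signed-in user's identity to the app in request headers, including `X-MS-CLIENT-PRINCIPAL` (a Base64-encoded JSON document of claims). The post does not say whether the hosted agent reads these headers, so the following is only a sketch of how a server could decode that header, for example to log who started an investigation. The function names and the sample payload are illustrative assumptions.

```typescript
// Shape of the decoded X-MS-CLIENT-PRINCIPAL payload forwarded by App Service authentication.
type ClientPrincipal = {
  auth_typ: string;
  name_typ: string;
  role_typ: string;
  claims: { typ: string; val: string }[];
};

// Decode the Base64 JSON header; return null if it is missing or malformed.
function decodeClientPrincipal(headerValue: string | undefined): ClientPrincipal | null {
  if (!headerValue) return null;
  try {
    const binary = atob(headerValue);
    // Convert the binary string to bytes so non-ASCII names decode as UTF-8.
    const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
    const parsed = JSON.parse(new TextDecoder().decode(bytes)) as ClientPrincipal;
    return Array.isArray(parsed.claims) ? parsed : null;
  } catch {
    return null;
  }
}

// Look up the claim whose type matches the principal's declared name claim type.
function principalDisplayName(principal: ClientPrincipal): string | null {
  const claim = principal.claims.find((c) => c.typ === principal.name_typ);
  return claim ? claim.val : null;
}

// Made-up payload for illustration; real claim types and values depend on your tenant.
const samplePrincipal: ClientPrincipal = {
  auth_typ: "aad",
  name_typ: "name",
  role_typ: "roles",
  claims: [{ typ: "name", val: "On-call Engineer" }],
};
const sampleHeader = btoa(JSON.stringify(samplePrincipal));

const decoded = decodeClientPrincipal(sampleHeader);
console.log(decoded ? principalDisplayName(decoded) : "anonymous");
```

These headers are only trustworthy when requests can reach the app solely through App Service authentication, which is why the "HTTP 401 Unauthorized" setting for unauthenticated requests matters.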
What Stayed the Same This is the part worth emphasizing: the core tools didn't change at all. Whether you're using the local MCP server or the hosted web agent, you get the same 15 tools. The Agent Skill (SKILL.md) from Part 1 also carries over. The hosted agent has the same domain expertise for App Service debugging baked into its system prompt — the same debugging workflows, common error patterns, KQL templates, and SKU reference that make the local version effective. The Bigger Picture: It's Not Just a Web App Here's what makes this interesting beyond our specific implementation: the pattern is the point. We took a set of domain-specific tools (App Service observability), wrapped them in a standard protocol (MCP), and showed two ways to use them: Local MCP server → Your IDE's AI calls the tools Hosted web agent → A deployed app with its own AI calls the same tools But those are just two examples. The same tools could power: A Microsoft Teams bot — Your on-call channel gets an observability assistant that anyone can mention A Slack integration — Same idea, different platform A CLI agent — A terminal-based chat for engineers who live in the command line An automated monitor — An agent that periodically checks your apps and files alerts An Azure Portal extension — Observability chat embedded directly in the portal experience A mobile app — Check on your apps from your phone during an incident The MCP tools are the foundation. The agent and interface are just the delivery mechanism. Build whatever surface makes sense for your team. This is one of the core ideas behind MCP: write the tools once, use them everywhere. The protocol standardizes how AI assistants discover and call tools, so you're not locked into any single client or agent. 
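The "write the tools once, use them everywhere" idea can be shown with a deliberately small sketch. It does not use the MCP SDK or reproduce the project's tools; the registry, the `check_logging_config` tool and its canned reply, and both front ends are hypothetical, chosen only to show two different surfaces calling the same tool definition.

```typescript
type ToolArgs = Record<string, string>;
type ToolDefinition = {
  name: string;
  description: string;
  run: (args: ToolArgs) => string;
};

// A single registry holds every tool definition.
const toolRegistry = new Map<string, ToolDefinition>();

function registerTool(tool: ToolDefinition): void {
  toolRegistry.set(tool.name, tool);
}

// Every surface ends up here, so tool behaviour is identical regardless of the caller.
function callTool(toolName: string, args: ToolArgs): string {
  const tool = toolRegistry.get(toolName);
  if (!tool) throw new Error(`Unknown tool: ${toolName}`);
  return tool.run(args);
}

// One tool, defined once (the name and the canned reply are made up for illustration).
registerTool({
  name: "check_logging_config",
  description: "Report whether application logging is enabled for an app",
  run: (args) => `Logging is enabled for ${args["appName"] || "(no app selected)"}`,
});

// Surface 1: an agent lists the tools it can use, then sends structured calls to callTool.
function describeTools(): string[] {
  return Array.from(toolRegistry.values()).map((t) => `${t.name}: ${t.description}`);
}

// Surface 2: a CLI-style front end parses "tool key=value ..." into the same call.
function runCommandLine(line: string): string {
  const [toolName = "", ...pairs] = line.trim().split(/\s+/);
  const args: ToolArgs = {};
  for (const pair of pairs) {
    const [key = "", value = ""] = pair.split("=");
    args[key] = value;
  }
  return callTool(toolName, args);
}

console.log(describeTools());
console.log(callTool("check_logging_config", { appName: "my-app" }));
console.log(runCommandLine("check_logging_config appName=my-app"));
```

In the real project the protocol plays the role of `callTool` and `describeTools`: clients discover and invoke tools through MCP rather than through a shared in-process function.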
Try It Yourself

Both versions are open-source:

- Local MCP server (Part 1): github.com/seligj95/app-service-observability-agent
- Hosted web agent (Part 2): github.com/seligj95/app-service-observability-agent-hosted

To deploy the hosted version:

git clone https://github.com/seligj95/app-service-observability-agent-hosted.git
cd app-service-observability-agent-hosted
azd up

To run the local version, see the Getting Started section in Part 1.

What's Next?

This is still a proof-of-concept, and we're continuing to explore how AI-powered observability can become a first-class part of the App Service platform. Some things we're thinking about:

- More tools: Resource health, autoscale history, certificate expiration, network diagnostics
- Multi-app investigations: Correlate issues across multiple apps in a resource group
- Proactive monitoring: Agents that watch your apps and alert you before users notice
- Deeper integration: What if every App Service came with a built-in observability endpoint?

We'd love your feedback. Try it out, open an issue, or submit a PR if you have ideas for additional tools or debugging patterns. And if you build something interesting on top of these MCP tools (a Teams bot, a CLI agent, anything), we'd love to hear about it.

An AI led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub.
This is due to the inevitable move towards fully agentic, end-to-end SDLCs. We may not yet be at a point where software engineers are managing fleets of agents creating the billion-dollar AI abstraction layer, but (as I will evidence in this article) we are certainly on the precipice of such a world. Before we dive into the reality of agentic development today, let me examine two very different modules from university and their relevance in an AI-first development environment. Manual Requirements Translation. At university I dedicated two whole years to a unit called “Systems Design”. This was one of my favourite units, primarily focused on requirements translation. Often, I would receive a scenario between “The Proprietor” and “The Proprietor’s wife”, who seemed to be in a never-ending cycle of new product ideas. These tasks would be analysed, broken down, manually refined, and then mapped to some kind of early-stage application architecture (potentially some pseudo-code and a UML diagram or two). The big intellectual effort in this exercise was taking human intention and turning it into something tangible to build from (BA’s). Today, by the time I have opened Notepad and started to decipher requirements, an agent can already have created a comprehensive list, a service blueprint, and a code scaffold to start the process (*cough* spec-kit *cough*). Manual debugging. Need I say any more? Old-school debugging with print()’s and breakpoints is dead. I spent countless hours learning to debug in a classroom and then later with my own software, stepping through execution line by line, reading through logs, and understanding what to look for; where correlation did and didn’t mean causation. 
I think back to my year at IBM as a fresh-faced intern in a cloud engineering team, where around 50% of my time was debugging different issues until it was sufficiently “narrowed down”, and then reading countless Stack Overflow posts figuring out the actual change I would need to make to a PowerShell script or Jenkins pipeline. Already in Azure, with the emergence of SRE agents, that debug process looks entirely different. The debug process for software even more so… #terminallastcommand WHY IS THIS NOT RUNNING? #terminallastcommand Review these logs and surface errors relating to XYZ. As I said: breakpoints are dead, for now at least. Caveat – Is this a good thing? One more deviation from the main core of the article if you would be so kind (if you are not as kind skip to the implementation walkthrough below). Is this actually a good thing? Is a software engineering degree now worthless? What if I love printf()? I don’t know is my answer today, at the start of 2026. Two things worry me: one theoretical and one very real. To start with the theoretical: today AI takes a significant amount of the “donkey work” away from developers. How does this impact cognitive load at both ends of the spectrum? The list that “donkey work” encapsulates is certainly growing. As a result, on one end of the spectrum humans are left with the complicated parts yet to be within an agent’s remit. This could have quite an impact on our ability to perform tasks. If we are constantly dealing with the complex and advanced, when do we have time to re-root ourselves in the foundations? Will we see an increase in developer burnout? How do technical people perform without the mundane or routine tasks? I often hear people who have been in the industry for years discuss how simple infrastructure, computing, development, etc. were 20 years ago, almost with a longing to return to a world where today’s zero trust, globally replicated architectures are a twinkle in an architect’s eye. 
Is constantly working on only the most complex problems a good thing? At the other end of the spectrum, what if the performance of AI tooling and agents outperforms our wildest expectations? Suddenly, AI tools and agents are picking up more and more of today’s complicated and advanced tasks. Will developers, architects, and organisations lose some ability to innovate? Fundamentally, we are not talking about artificial general intelligence when we say AI; we are talking about incredibly complex predictive models that can augment the existing ideas they are built upon but are not, in themselves, innovators. Put simply, in the words of Scott Hanselman: “Spicy auto-complete”. Does increased reliance on these agents in more and more of our business processes remove the opportunity for innovative ideas? For example, if agents were football managers, would we ever have graduated from Neil Warnock and Mick McCarthy football to Pep? Would every agent just augment a ‘lump it long and hope’ approach? We hear about learning loops, but can these learning loops evolve into “innovation loops?” Past the theoretical and the game of 20 questions, the very real concern I have is off the back of some data shared recently on Stack Overflow traffic. We can see in the diagram below that Stack Overflow traffic has dipped significantly since the release of GitHub Copilot in October 2021, and as the product has matured that trend has only accelerated. Data from 12 months ago suggests that Stack Overflow has lost 77% of new questions compared to 2022… Stack Overflow democratises access to problem-solving (I have to be careful not to talk in past tense here), but I will admit I cannot remember the last time I was reviewing Stack Overflow or furiously searching through solutions that are vaguely similar to my own issue. This causes some concern over the data available in the future to train models. Today, models can be grounded in real, tested scenarios built by developers in anger. 
What happens with this question drop when API schemas change, when the technology built for today is old and deprecated, and the dataset is stale and never returning to its peak? How do we mitigate this impact? There is potential for some closed-loop type continuous improvement in the future, but do we think this is a scalable solution? I am unsure. So, back to the question: “Is this a good thing?”. It’s great today; the long-term impacts are yet to be seen. If we think that AGI may never be achieved, or is at least a very distant horizon, then understanding the foundations of your technical discipline is still incredibly important. Developers will not only be the managers of their fleet of agents, but also the janitors mopping up the mess when there is an accident (albeit likely mopping with AI-augmented tooling).

An AI First SDLC Today – The Reality

Enough reflection and nostalgia (I don’t think that’s why you clicked the article), let’s start building something. For the rest of this article I will be building an AI-led, agent-powered software development lifecycle. The example I will be building is an AI-generated weather dashboard. It’s a simple example, but if agents can generate, test, deploy, observe, and evolve this application, it suggests that today, and into the future, the process can likely scale to more complex domains. Let’s start with the entry point: the problem statement that we will build from. “As a user I want to view real time weather data for my city so that I can plan my day.” We will use this as the single input for our AI-led SDLC. This is what we will pass to Spec Kit, and then watch our app and subsequent features get built in front of our eyes. The goal is that we will use:
- Spec Kit to get going and move from textual idea to requirements and scaffold.
- A coding agent to implement our plan.
- A quality agent to assess the output and quality of the code.
- GitHub Actions that not only host the agents (Abstracted) but also handle the build and deployment. - An SRE agent proactively monitoring and opening issues automatically. The end to end flow that we will review through this article is the following: Step 1: Spec-driven development - Spec First, Code Second A big piece of realising an AI-led SDLC today relies on spec-driven development (SDD). One of the best summaries for SDD that I have seen is: “Version control for your thinking”. Instead of huge specs that are stale and buried in a knowledge repository somewhere, SDD looks to make them a first-class citizen within the SDLC. Architectural decisions, business logic, and intent can be captured and versioned as a product evolves; an executable artefact that evolves with the project. In 2025, GitHub released the open-source Spec Kit: a tool that enables the goal of placing a specification at the centre of the engineering process. Specs drive the implementation, checklists, and task breakdowns, steering an agent towards the end goal. This article from GitHub does a great job explaining the basics, so if you’d like to learn more it’s a great place to start (https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/). In short, Spec Kit generates requirements, a plan, and tasks to guide a coding agent through an iterative, structured development process. Through the Spec Kit constitution, organisational standards and tech-stack preferences are adhered to throughout each change. I did notice one (likely intentional) gap in functionality that would cement Spec Kit’s role in an autonomous SDLC. That gap is that the implement stage is designed to run within an IDE or client coding agent. You can now, in the IDE, toggle between task implementation locally or with an agent in the cloud. That is great but again it still requires you to drive through the IDE. 
Thinking about this in the context of an AI-led SDLC (where we are pushing tasks from Spec Kit to a coding agent outside of my own desktop), it was clear that a bridge was needed. As a result, I used Spec Kit to create the Spec-to-issue tool. This allows us to take the tasks and plan generated by Spec Kit, parse the important parts, and automatically create a GitHub issue, with the option to auto-assign the coding agent. From the perspective of an autonomous AI-led SDLC, Spec Kit really is the entry point that triggers the flow. How Spec Kit is surfaced to users will vary depending on the organisation and the context of the users. For the rest of this demo I use Spec Kit to create a weather app calling out to the OpenWeather API, and then add additional features with new specs. With one simple prompt of “/speckit.specify "Application feature/idea/change"” I suddenly had a really clear breakdown of the tasks and plan required to get to my desired end state, while respecting the context and preferences I had previously set in my Spec Kit constitution. I had mentioned a desire for test-driven development, that I required certain coverage, and that all solutions were to be Azure native. The real benefit here, compared to prompting directly into the coding agent, is that breaking one large task down into small, individually measurable components that are clear and methodical considerably improves the coding agent’s ability to perform them. We can see an example below of not just creating a whole application, but another spec to iterate on an existing application and add a feature. We can see the result of the spec creation, the issue in our GitHub repo and, most importantly for the next step, that our coding agent, GitHub Copilot, has been assigned automatically.
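The article does not show how the Spec-to-issue tool works internally, so the following is only a rough sketch of the idea: parse checkbox task lines from a Spec Kit style tasks file and build the payload for a GitHub issue. The task line format (`- [ ] T001 [P] description`), the function names, the sample tasks, and the assignee login are all assumptions; the `title`, `body`, and `assignees` fields match the GitHub REST "create an issue" request body, but the way the coding agent is actually assigned may differ.

```typescript
type SpecTask = { id: string; parallel: boolean; done: boolean; description: string };
type IssuePayload = { title: string; body: string; assignees: string[] };

// Assumed task line format: "- [ ] T001 [P] Description" (the [P] marker is optional).
const TASK_LINE = /^- \[( |x|X)\] (T\d+)\s+(\[P\]\s+)?(.+)$/;

// Extract task lines and ignore headings, blank lines, and prose.
function parseTasks(markdown: string): SpecTask[] {
  const tasks: SpecTask[] = [];
  for (const rawLine of markdown.split("\n")) {
    const match = TASK_LINE.exec(rawLine.trim());
    if (!match) continue;
    const [, mark = " ", id = "", parallelTag, description = ""] = match;
    tasks.push({
      id,
      parallel: Boolean(parallelTag),
      done: mark.toLowerCase() === "x",
      description: description.trim(),
    });
  }
  return tasks;
}

// Build a request body for creating an issue; sending it is left to the caller.
function buildIssuePayload(featureTitle: string, tasks: SpecTask[], assignee?: string): IssuePayload {
  const checklist = tasks
    .map((t) => `- [${t.done ? "x" : " "}] ${t.id} ${t.description}`)
    .join("\n");
  return {
    title: `[Spec] ${featureTitle}`,
    body: `## Tasks\n\n${checklist}`,
    assignees: assignee ? [assignee] : [],
  };
}

// Made-up tasks file content for illustration.
const sampleTasksFile = [
  "## Phase 1: Setup",
  "- [ ] T001 [P] Create project scaffold",
  "- [x] T002 Add OpenWeather API client",
].join("\n");

const parsedTasks = parseTasks(sampleTasksFile);
// "hypothetical-coding-agent" is a placeholder login, not a real account name.
const issue = buildIssuePayload("City weather view", parsedTasks, "hypothetical-coding-agent");
console.log(JSON.stringify(issue, null, 2));
```

Carrying the task checklist into the issue body keeps the spec's breakdown visible to whoever, or whatever, picks the issue up.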
Step 2: GitHub Coding Agent - Iterative, autonomous software creation

Talking of coding agents, GitHub Copilot’s coding agent is an autonomous agent in GitHub that can take a scoped development task and work on it in the background using the repository’s context. It can make code changes and produce concrete outputs like commits and pull requests for a developer to review. The developer stays in control by reviewing, requesting changes, or taking over at any point. This does the heavy lifting in our AI-led SDLC. We have already seen great success with customers who have adopted the coding agent when it comes to carrying out menial tasks to save developers time. These coding agents can work in parallel to human developers and with each other. In our example we see that the coding agent creates a new branch for its changes, and creates a PR which it starts working on as it ticks off the various tasks generated in our spec. One huge positive of the coding agent that sets it apart from other similar solutions is the transparency in decision-making and actions taken. The monitoring and observability built directly into the feature means that the agent’s “thinking” is easily visible: the iterations and steps being taken can be viewed in full sequence in the Agents tab. Furthermore, the action that the agent is running is also transparently available to view in the Actions tab, meaning problems can be assessed very quickly. Once the coding agent is finished, it has run the required tests and, even in the case of a UI change, goes as far as calling the Playwright MCP server and screenshotting the change to showcase in the PR. We are then asked to review the change. In this demo, I also created a GitHub Action that is triggered when a PR review is requested: it creates the required resources in Azure and surfaces the (in this case) Azure Container Apps revision URL, making it even smoother for the human in the loop to evaluate the changes.
Just like any normal PR, if changes are required comments can be left; when they are, the coding agent can pick them up and action what is needed. It’s also worth noting that for any manual intervention here, use of GitHub Codespaces would work very well to make minor changes or perform testing on an agent’s branch. We can even see that the unit tests specified in our spec have been executed by our coding agent. The pattern used here (Spec Kit -> coding agent) overcomes one of the biggest challenges we see with the coding agent. Unlike an IDE-based coding agent, the GitHub.com coding agent is left to its own iterations and implementation without input until the PR review. This can lead to subpar performance, especially compared to IDE agents which have constant input and interruption. The concise and considered breakdown generated from Spec Kit provides the structure and foundation for the agent to execute on; very little is left to interpretation for the coding agent.

Step 3: GitHub Code Quality Review (Human in the loop with agent assistance.)

GitHub Code Quality is a feature (currently in preview) that proactively identifies code quality risks and opportunities for enhancement both in PRs and through repository scans. These are surfaced within a PR and also in repo-level scoreboards. This means that PRs can now extend existing static code analysis: Copilot can action CodeQL, PMD, and ESLint scanning on top of the new, in-context code quality findings and autofixes. Furthermore, we receive a summary of the actual changes made. This can be used to assist the human in the loop in understanding what changes have been made and whether enhancements or improvements are required. Thinking about this in the context of review coverage, one of the challenges in already-lean development teams is finding the time to give proper credence to PRs. Now, with AI-assisted quality scanning, we can be more confident in our overall evaluation and test coverage.
I would expect that use of these tools alongside existing human review processes would increase repository code quality and reduce uncaught errors. The data points support this too. The Qodo 2025 AI Code Quality report showed that usage of AI code reviews increased quality improvements to 81% (from 55%). A similar Atlassian RovoDev study (2026) showed that 38.7% of comments left by AI agents in code reviews led to additional code fixes. LLMs in their current form are never going to achieve 100% accuracy; however, these are still significant gains in one of the most important (and often neglected) parts of the SDLC. With a significant number of software supply chain attacks recently, it is also not a stretch to imagine that many projects could benefit from "independently" (use this term loosely) reviewed and summarised PRs and commits. In the future, this could potentially be handled by a specialist sub-agent during a PR or merge, focused on identifying malicious code that may be hidden within otherwise normal contributions; a case in point is the "near-miss" XZ Utils attack.

Step 4: GitHub Actions for build and deploy - No agents here, just deterministic automation.

This step will be our briefest, as the idea of CI/CD and automation needs no introduction. It is worth noting that while I am sure there are additional opportunities for using agents within a build and deploy pipeline, I have not investigated them. I often speak with customers about deterministic and non-deterministic business process automation, and the importance of distinguishing between the two. Some processes were created to be deterministic because that is all that was available at the time; the number of conditions required to deal with N possible flows just did not scale. However, now those processes can be non-deterministic.
Good examples include IVR decision trees in customer service or hard-coded sales routines to retain a customer regardless of context; these would benefit from less determinism in their execution. However, some processes remain best as deterministic flows: financial transactions, policy engines, document ingestion. While all these flows may be part of an AI solution in the future (possibly as a tool an agent calls, or as part of a larger agent-based orchestration), the processes themselves are deterministic for a reason. Just because we could have dynamic decision-making doesn’t mean we should. Infrastructure deployment and CI/CD pipelines are one good example of this, in my opinion. We could have an agent decide what service best fits our codebase and which region we should deploy to, but do we really want to, and do the benefits outweigh the potential negatives? In this process flow we use a deterministic GitHub Action to deploy our weather application into our “development” environment and then promote it through the environments until we reach production, at which point we want to ensure that the application is running smoothly. We also use an action, as mentioned above, to deploy and surface our agent’s changes. In Azure Container Apps we can do this in a secure sandbox environment called a “Dynamic Session” to ensure strong isolation of what is essentially “untrusted code”. Enterprises often view the building and development of AI applications as something that requires a completely new process to take to production. While certain additional processes are new (evaluation, model deployment, etc.), many of our traditional SDLC principles are just as relevant as ever, CI/CD pipelines being a great example: checked-in code that is predictably deployed alongside the services required to run tests or promote through environments. Whether you are deploying a Java calculator app or a multi-agent customer service bot, CI/CD, even in this new world, is non-negotiable.
We can see that our geolocation feature is running on our Azure Container Apps revision and we can begin to evaluate if we agree with Copilot that all the feature requirements have been met. In this case they have. If they hadn't, we'd just jump into the PR and add a new comment with "@copilot" requesting our changes.

Step 5: SRE Agent - Proactive agentic day two operations.

The SRE agent service on Azure is an operations-focused agent that continuously watches a running service using telemetry such as logs, metrics, and traces. When it detects incidents or reliability risks, it can investigate signals, correlate likely causes, and propose or initiate response actions such as opening issues, creating runbook-guided fixes, or escalating to an on-call engineer. It effectively automates parts of day two operations while keeping humans in control of approval and remediation. It can be run in two different permission models: one with a reader role that can temporarily take user permissions for approved actions when identified. The other model is a privileged level that allows it to autonomously take approved actions on resources and resource types within the resource groups it is monitoring. In our example, our SRE agent could take actions to ensure our container app runs as intended: restarting pods, changing traffic allocations, and alerting for secret expiry. The SRE agent can also perform detailed debugging to save human SREs time, summarising the issue, fixes tried so far, and narrowing down potential root causes to reduce time to resolution, even across the most complex issues. My initial concern with these types of autonomous fixes (be it VPA on Kubernetes or an SRE agent across your infrastructure) is always that they can very quickly mask problems, or become an anti-pattern where you have drift between your IaC and what is actually running in Azure. One of my favourite features of SRE agents is sub-agents.
Sub-agents can be created to handle very specific tasks that the primary SRE agent can leverage. Examples include alerting, report generation, and potentially other third-party integrations or tooling that require a more concise context. In my example, I created a GitHub sub-agent to be called by the primary agent after every issue that is resolved. When called, the GitHub sub-agent creates an issue summarising the origin, context, and resolution. This really brings us full circle. We can then potentially assign this to our coding agent to implement the fix before we proceed with the rest of the cycle; for example, a change where a port is incorrect in some Bicep, or min scale has been adjusted because of latency observed by the SRE agent. These are quick fixes that can be easily implemented by a coding agent, subsequently creating an autonomous feedback loop with human review.

Conclusion:

The journey through this AI-led SDLC demonstrates that it is possible, with today’s tooling, to improve any existing SDLC with AI assistance, evolving from simply using a chat interface in an IDE. By combining Spec Kit, spec-driven development, autonomous coding agents, AI-augmented quality checks, deterministic CI/CD pipelines, and proactive SRE agents, we see an emerging ecosystem where human creativity and oversight guide an increasingly capable fleet of collaborative agents. As with all AI solutions we design today, I remind myself that “this is as bad as it gets”. If the last two years are anything to go by, the rate of change in this space means this article may look very different in 12 months. I imagine Spec-to-issue will no longer be required as a bridge, as native solutions evolve to make this process even smoother. There are also some areas of an AI-led SDLC that are not included in this post, such as reviewing the inner-loop process or the use of existing enterprise patterns and blueprints.
I also did not review use of third-party plugins or tools available through GitHub. These would make for an interesting expansion of the demo. We also did not look at the creation of custom coding agents, which could be hosted in Microsoft Foundry; this is especially pertinent with the recent announcement of Anthropic models now being available to deploy in Foundry. Does today’s tooling mean that developers, QAs, and engineers are no longer required? Absolutely not (and if I am honest, I can’t see that changing any time soon). However, it is clear that in the next 12 months, enterprises that reshape their SDLC (and any other business process) to become one augmented by agents will innovate faster, learn faster, and deliver faster, leaving organisations that resist this shift struggling to keep up.

Beyond the Desktop: The Future of Development with Microsoft Dev Box and GitHub Codespaces
The modern developer platform has already moved past the desktop. We’re no longer defined by what’s installed on our laptops; instead, we look at what tooling we can use to move from idea to production. An organisation’s developer platform strategy is no longer a nice-to-have. It sets the ceiling for what’s possible: an organisation can’t iterate its way to developer nirvana if the foundation itself is brittle. A great developer platform shrinks TTFC (time to first commit), accelerates release velocity, and, maybe most importantly, helps alleviate the everyday frictions that lead to developer burnout.

Very few platforms deliver everything an organisation needs from a developer platform in one product. Modern development spans multiple dimensions: local tooling, cloud infrastructure, compliance, security, cross-platform builds, collaboration, and rapid onboarding. Organisations are then left to either compromise on one or more of these areas or force developers into rigid environments that slow productivity and innovation.

This is where Microsoft Dev Box and GitHub Codespaces come into play. On their own, each addresses critical parts of the modern developer platform.

Microsoft Dev Box provides a full, managed cloud workstation. Dev Box gives developers a consistent, high-performance environment while letting central IT apply strict governance and control. Internally at Microsoft, we estimate that usage of Dev Box by our development teams delivers savings of 156 hours per year per developer purely on local environment setup and upkeep. We have also seen significant gains in other key SPACE metrics, reducing context-switching friction and improving build/test cycles.

Although the benefits of Dev Box are clear in the results demonstrated by our customers, it is not without its challenges. The biggest challenge often faced by Dev Box customers is its lack of native Linux support.
At the time of writing, and for the foreseeable future, Dev Box does not support native Linux developer workstations. While WSL2 provides partial parity, I know from my own engineering projects that it still does not deliver the full experience. This is where GitHub Codespaces comes into the story.

GitHub Codespaces delivers instant, Linux-native environments spun up directly from your repository. It’s lightweight, reproducible, and ephemeral: ideal for rapid iteration, PR testing, and cross-platform development where you need Linux parity or containerised workflows. Unlike Dev Box, Codespaces can run fully in Linux, giving developers access to native tools, scripts, and runtimes without workarounds. It also removes much of the friction around onboarding: a new developer can open a repository and be coding in minutes, with the exact environment defined by the project’s devcontainer.json.

That said, Codespaces isn’t a complete replacement for a full workstation. While it’s perfect for isolated project work or ephemeral testing, it doesn’t provide the persistent, policy-controlled environment that enterprise teams often require for heavier workloads or complex toolchains.

Used together, they fill the gaps that neither can cover alone: Dev Box gives the enterprise-grade foundation, while Codespaces provides the agile, cross-platform sandbox. For organisations, this pairing sets a higher ceiling for developer productivity, delivering a truly hybrid, agile, and well-governed developer platform.

Better Together: Dev Box and GitHub Codespaces in action

Together, Microsoft Dev Box and GitHub Codespaces deliver a hybrid developer platform that combines consistency, speed, and flexibility. Teams can spin up full, policy-compliant Dev Box workstations preloaded with enterprise tooling, IDEs, and local testing infrastructure, while Codespaces provides ephemeral, Linux-native environments tailored to each project.
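As a rough illustration of what "tailored to each project" can look like, the sketch below builds a minimal devcontainer.json from Python. The property names (`name`, `image`, `forwardPorts`, `postCreateCommand`) come from the dev container format; the project name, image tag, port, and install command are assumptions made up for this example, not values from the demo repository.

```python
import json


def build_devcontainer(name, image, forward_ports, post_create_command):
    """Return a minimal devcontainer.json structure as a plain dict."""
    return {
        "name": name,
        "image": image,
        # Ports the Codespace should forward automatically.
        "forwardPorts": list(forward_ports),
        # Command run once after the container has been created.
        "postCreateCommand": post_create_command,
    }


# All values below are hypothetical placeholders for illustration.
config = build_devcontainer(
    name="iot-dashboard-api",
    image="mcr.microsoft.com/devcontainers/python:3.12",  # assumed image tag
    forward_ports=[8000],
    post_create_command="pip install -r requirements.txt",
)

if __name__ == "__main__":
    # Paste the output into .devcontainer/devcontainer.json.
    print(json.dumps(config, indent=2))
```

Keeping this file in the repository is what lets every new Codespace start from the same definition.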
One of my favourite use cases is having local testing setups, like a Docker Swarm cluster, ready to go in either Dev Box or Codespaces. New developers can jump in and start running services or testing microservices immediately, without spending hours on environment setup. Anecdotally, my time to first commit and time to delivering “impact” have been significantly shorter on projects where one or both technologies provide local development services out of the box.

Switching between Dev Boxes and Codespaces is seamless: every environment keeps its own libraries, extensions, and settings intact, so developers can jump between projects without reconfiguring or breaking dependencies. The result is a turnkey, ready-to-code experience that maximises productivity, reduces friction, and lets teams focus entirely on building, testing, and shipping software.

To showcase this value, I thought I would walk through an example scenario that simulates a typical modern developer workflow. Let's look at a day in the life of a developer on this hybrid platform, building an IoT project using Python and React:

- Spin up a ready-to-go workstation (Dev Box) for Windows development and heavy builds.
- Launch a Linux-native Codespace for cross-platform services, ephemeral testing, and PR work.
- Run "local" testing infrastructure, such as a Docker Swarm cluster, database, and message queue, ready to go out of the box.
- Switch seamlessly between environments without losing project-specific configurations, libraries, or extensions.

9:00 AM – Morning Kickoff on Dev Box

I start my day on my Microsoft Dev Box, which gives me a fully configured Windows environment with VS Code, design tools, and Azure integrations. I select my team's project, and the environment is pre-configured for me through the Dev Box catalogue. Fortunately for me, it's already provisioned. I could always self-service another one using the "New Dev Box" button if I wanted to.
I'll connect through the browser, but I could use the desktop app too if I wanted to.

My tasks are:

- Prototype a new dashboard widget for monitoring IoT device temperature.
- Use GUI-based tools to tweak the UI and preview changes live.
- Review my Visio architecture.
- Join my morning stand-up.
- Write documentation notes and plan API interactions for the backend.

In a flash, I have access to my modern work tooling like Teams, this project's files are already preloaded, and all my peripherals work without additional setup. The only downside was that I seemed to be the only person on my stand-up this morning?

Why Dev Box first:

- GUI-heavy tasks are fast and responsive. Dev Box’s environment allows me to use a full desktop.
- Great for early-stage design, planning, and visual work.
- Enterprise apps are ready for me to use out of the box (P.S. it also supports my multi-monitor setup).

I use my Dev Box to make a very complicated change to my IoT dashboard: changing the title from "IoT Dashboard" to "Owain's IoT Dashboard". I preview this change live in a browser. (Time for a coffee after this hard work.) The rest of the dashboard isn't loading as my backend isn't running... yet.

10:30 AM – Switching to Linux Codespaces

Once the UI is ready, I push the code to GitHub and spin up a Linux-native GitHub Codespace for backend development.

Tasks:

- Implement FastAPI endpoints to support the new IoT feature.
- Run the service on my Codespace and debug any errors.

Why Codespaces now:

- Linux-native tools ensure compatibility with the production server.
- Docker and containerised testing run natively, avoiding WSL translation overhead.
- The environment is fully reproducible across any device I log in from.

12:30 PM – Midday Testing & Sync

I toggle between Dev Box and Codespaces to test and validate the integration. I do this in my Dev Box Edge browser viewing my Codespace (I use my Codespace in a browser throughout this demo to highlight the difference in environments.
In reality I would leverage the VS Code "Remote Explorer" extension and its GitHub Codespaces integration to use my Codespace from within my own desktop VS Code, but that is personal preference) and I use the same browser to view my frontend preview.

I update the environment variable for my frontend, which is running locally in my Dev Box, and point it at the port my API is listening on in my Codespace. In this case that means a WebSocket connection and HTTPS calls to port 8000. I can make this port public by changing the port visibility in my Codespace.

https://fluffy-invention-5x5wp656g4xcp6x9-8000.app.github.dev/api/devices
wss://fluffy-invention-5x5wp656g4xcp6x9-8000.app.github.dev/ws

This allows me to:

- Preview the frontend widget on Dev Box, connecting to the backend running in Codespaces.
- Make small frontend adjustments in Dev Box while monitoring backend logs in Codespaces.
- Commit changes to GitHub, keeping both environments in sync and leveraging my CI/CD for deployment to the next environment.

We can see the Dev Box running the local frontend and the Codespace running the API connected to each other, making requests and displaying the data in the frontend!

Hybrid advantage:

- Dev Box handles GUI previews comfortably and allows me to live-test frontend changes.
- Codespaces handles production-aligned backend testing and Linux-native tools.
- Dev Box allows me to view all of my files on one screen, with potentially multiple Codespaces running in the browser or VS Code Desktop.

Thanks to all of those platform efficiencies, I have completed my day's goals within an hour or two, and now I can spend the rest of my day learning about how to enable my developers to inner source using GitHub Copilot and MCP (shameless plug).

The bottom line

There are some additional considerations when architecting a developer platform for an enterprise, such as private networking and security, that are not covered in this post, but these are implementation details in delivering the described developer experience.
Architecting such a platform is a valuable investment to deliver the developer platform foundations we discussed at the top of the article. While in this quickly built demo I was working in a mono repository, in real engineering teams it is likely (I hope) that an application is built from many different repositories. The great thing about Dev Box and Codespaces is that this wouldn’t slow down the rapid development I can achieve when using both. My Dev Box would be specific to the project or development team, preloaded with all the tools I need and potentially some repos too! When I need to, I can quickly switch over to Codespaces, work in a clean, isolated environment, and push my changes. In both cases, any changes I make locally are pushed to GitHub (or ADO) and merged, and my CI/CD ensures that my next step, potentially a staging environment or, who knows, perhaps *whispering* straight into production, is taken care of. Once I’m finished, I delete my Codespace, and potentially my Dev Box if I am done with the project, knowing I can self-service either one of these at any time and be up and running again!

Now, is there overlap in terms of what can be developed in a Codespace vs what can be developed in Microsoft Dev Box? Of course. But as organisations prioritise developer experience to ensure release velocity while maintaining organisational standards and governance, providing developers with a Windows-native and a Linux-native service, both of which are primarily charged on the consumption of compute*, is a no-brainer. There are also gaps that neither fills at the moment: for example, Microsoft Dev Box only provides Windows compute, while GitHub Codespaces only supports VS Code as your chosen IDE. It's not a question of which service to choose for my developers; these two services are better together!

*Changes have been announced to Dev Box pricing. A W365 license is already required today and dev boxes will continue to be managed through Azure.
For more information, please see: Microsoft Dev Box capabilities are coming to Windows 365 - Microsoft Dev Box | Microsoft Learn

Transforming Retry-After Headers in Azure APIM: A Step-by-Step Guide
In this blog post, you'll learn how to customize the Retry-After response header in Azure API Management (APIM) rate-limiting policies, enhancing your API's flexibility and user experience. While it does not delve into the specifics of the rate-limit or rate-limit-by-key policies, it provides a practical guide for altering the Retry-After header. For detailed information on the rate-limit policy, please visit Azure API Management policy reference - rate-limit | Microsoft Learn.

Understanding Rate Limiting: Protecting Your API from Overuse

Rate limiting is a technique used to control how often requests are made to a resource. It helps prevent excessive or abusive use and ensures the resource is available to all users. Rate limiting is often used to protect against denial-of-service (DoS) attacks, which aim to overwhelm a network or server with too many requests, making it unavailable to legitimate users. It can also limit the number of requests from individual users to prevent a single user or group from monopolizing the resource.

Azure API Management Rate Limit Policies

In Azure, access to APIs is controlled using the following API Management policies:

- rate-limit
- rate-limit-by-key

The implementation of these policies is straightforward but somewhat limited and less flexible, in my opinion.

The Default Retry-After Header: What You Need to Know

In the Azure APIM rate-limit policy documentation, it is mentioned that once the client's requests are throttled, the service starts returning a response header containing the time interval (in seconds) after which the client should retry the request. The default name of the header is Retry-After, and this name can be customized. For example:

Retry-After: 60

However, in one use case for a customer, there was a requirement to provide a timestamp instead of a time interval as a header value.
For example:

Retry-After: 2020-05-04T12:23:41.6181792Z

To implement this, the header value needs to change, but this is something that the rate-limit policy does not support.

Customizing the Retry-After Header

The basis for changing the response header value lies in the on-error scope. You can implement a policy like the following:

    <inbound>
        <base />
        <!-- Throttle per client IP and store the retry interval (seconds) in "retryAfter" -->
        <rate-limit-by-key calls="1000"
                           renewal-period="60"
                           counter-key="@(context.Request.IpAddress)"
                           increment-condition="@(context.Response.StatusCode == 200)"
                           remaining-calls-variable-name="remainingCallsPerIP"
                           retry-after-header-name="Retry-After"
                           remaining-calls-header-name="Requests-Remaining"
                           retry-after-variable-name="retryAfter" />
    </inbound>
    <on-error>
        <choose>
            <!-- Only touch the header when the error was raised by rate limiting -->
            <when condition="@(context.LastError.Reason == "RateLimitExceeded")">
                <set-header name="Retry-After" exists-action="override">
                    <!-- Current UTC time plus the retry interval, in round-trip ("o") format -->
                    <value>@(DateTime.UtcNow.AddSeconds(context.Variables.GetValueOrDefault<int>("retryAfter")).ToString("o"))</value>
                </set-header>
            </when>
        </choose>
        <base />
    </on-error>

For the predefined APIM policy errors, please refer to: Error handling in Azure API Management policies | Microsoft Learn

The key point is that whenever the APIM rate limit is reached, an error occurs, which is then captured in the on-error scope. To set or override the response header only in rate-limiting scenarios, you need to filter on the RateLimitExceeded error reason. The retry time is then computed by adding the value of the retryAfter variable (in seconds) to the current UTC timestamp. With this, you have customized the Retry-After header to carry a timestamp instead of a time interval (in seconds).

Conclusion

In conclusion, customizing the Retry-After response header in Azure API Management can significantly enhance the flexibility and user experience of your API services.
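As a rough illustration of the arithmetic the on-error policy performs, and of how a client might cope with either form of the header, here is a minimal Python sketch. It is not part of the APIM policy; the function names `retry_after_timestamp` and `seconds_until_retry` are hypothetical, and the parser assumes the timestamp is UTC and ends in "Z", as in the example value above.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches timestamps such as 2020-05-04T12:23:41.6181792Z (any number of fraction digits).
_ISO_UTC = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def retry_after_timestamp(retry_after_seconds, now=None):
    """Server side: current UTC time plus the retry interval, as an ISO 8601 string."""
    if now is None:
        now = datetime.now(timezone.utc)
    retry_at = now.astimezone(timezone.utc) + timedelta(seconds=retry_after_seconds)
    return retry_at.strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z"


def seconds_until_retry(header_value, now=None):
    """Client side: accept either a number of seconds or a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    match = _ISO_UTC.match(value)
    if match is None:
        raise ValueError(f"Unrecognised Retry-After value: {header_value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    fraction = match.group(2) or ""
    # Python datetimes hold microseconds, so keep at most six fraction digits.
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    retry_at = base + timedelta(microseconds=micros)
    return max(0.0, (retry_at - now).total_seconds())
```

A client could call `seconds_until_retry(response_header)` and sleep for the returned number of seconds before retrying, whichever form the API returns.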
By leveraging the on-error scope and handling the RateLimitExceeded error, you can provide a more informative and user-friendly response to clients when rate limits are exceeded. This approach not only meets specific customer requirements but also demonstrates the adaptability of Azure APIM in handling various scenarios. With these steps, you can ensure that your API remains robust, efficient, and user-centric.

Run Playwright Tests on Cloud Browsers using Playwright Workspaces
This post walks through setting up and running Playwright UI and API tests on Azure Playwright Testing Service (Preview). It covers workspace setup, project configuration, remote browser execution, and viewing test reports and traces using Visual Studio or VS Code.

Introducing the Azure Static Web Apps Skill for GitHub Copilot
From "Built" to "Deployed" in Minutes

You've just finished building something great:

- A marketing landing page for your startup
- A portfolio site showcasing your work
- A documentation site for your open-source project
- An e-commerce storefront built with Next.js
- An internal dashboard for your team
- A blog with your latest content

Now you want to share it with the world. Azure Static Web Apps offers free hosting, global CDN, custom domains, and built-in auth, but figuring out the deployment workflow can slow you down.

The learning curve: "What's the CLI command again? Is it swa-cli or @azure/static-web-apps-cli? Where does staticwebapp.config.json go?"

The golden path: "Hey Copilot, deploy my Vite app to Azure Static Web Apps"

The skill provides a streamlined, tested workflow (the golden path) so you can focus on your app, not the deployment process.

What Are Agent Skills?

Agent Skills are self-contained knowledge bundles that enhance GitHub Copilot's capabilities for specialized tasks. They provide a curated, golden path for common workflows:

Feature | Benefit
Curated Commands | Exact CLI syntax that works today
Guardrails | Knows what NOT to do (e.g., never manually create swa-cli.config.json)
Troubleshooting | Built-in solutions for common errors
Framework Awareness | Knows Vite uses port 5173, React uses 3000

The Traditional Approach: Learning as You Go

Deploying to Azure Static Web Apps is straightforward once you know the workflow, but the learning curve can add time to your first deployment.

Typical First-Time Challenges

- Finding the right tools: Azure has multiple CLI options: the Azure CLI extension (az staticwebapp) for managing resources, and the SWA CLI (@azure/static-web-apps-cli) for local development and deployment. Knowing which to use when takes research.
- Configuration decisions: Where does staticwebapp.config.json go? Should you use swa init or create configs manually? What's the right build output path for your framework?
- Framework-specific setup: Different frameworks have different port numbers, build commands, and output directories. Vite uses port 5173, Create React App uses 3000, Next.js has its own conventions.
- SPA routing: Client-side routing needs navigationFallback configuration, something you might not discover until you see 404 errors in production.

Typical First Deployment: 25-45 minutes

This time includes reading documentation, trying commands, and troubleshooting configuration. Not wrong, just a learning process that the skill can accelerate.

The Golden Path: With the Azure Static Web Apps Skill

The skill provides a streamlined workflow that guides you through each step, whether you're deploying a portfolio, marketing site, or enterprise application:

Step 1: Install and Initialize (30 seconds)

You: "Set up my Vite React app for Azure Static Web Apps"

Copilot (with skill): "I'll help you set up SWA for your Vite app:

1. Install the CLI:
   npm install -D @azure/static-web-apps-cli
2. Initialize (auto-detects Vite framework):
   npx swa init --yes"

Correct package, correct command, framework auto-detected!

Step 2: Local Development (15 seconds)

You: "Run it locally"

Copilot (with skill): "Start the local emulator:
   npx swa start
Open http://localhost:4280 to test."

Dev server proxy automatically configured!

Step 3: Deploy (30 seconds)

You: "Deploy to Azure"

Copilot (with skill): "Deploy to production:
   npx swa login
   npx swa deploy --env production"

Correct commands AND proactive troubleshooting advice!

Total Time: Under 3 minutes

The skill provides the golden path: tested commands, framework awareness, and proactive guidance.
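To make the SPA routing point concrete, here is a minimal sketch that builds the contents of a staticwebapp.config.json with a navigationFallback rule. The `navigationFallback` and `platform.apiRuntime` keys are part of the Static Web Apps configuration format; the helper name `build_swa_config`, the exclude patterns, and the "node:18" runtime value are assumptions chosen for illustration, so check them against your own app before relying on them.

```python
import json


def build_swa_config(api_runtime=None):
    """Return a minimal staticwebapp.config.json structure as a dict."""
    config = {
        "navigationFallback": {
            # Serve the SPA entry point for client-side routes such as /dashboard.
            "rewrite": "/index.html",
            # Static assets should still return a real 404 when missing (example patterns).
            "exclude": ["/images/*.{png,jpg,gif}", "/css/*"],
        }
    }
    if api_runtime is not None:
        # Only needed when the app ships an Azure Functions API.
        config["platform"] = {"apiRuntime": api_runtime}
    return config


if __name__ == "__main__":
    # "node:18" is an assumed runtime value for this example.
    print(json.dumps(build_swa_config("node:18"), indent=2))
```

Without the fallback rule, a deep link to a client-side route is looked up as a file on the host and returns a 404, which is the production surprise described above.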
Side-by-Side Comparison

Task | Learning Curve | Golden Path (with Skill)
Find correct CLI | Research docs | Instant guidance
Create config | Trial and error | swa init --yes
Local dev setup | Framework research | Auto-detected
Deployment | Documentation reading | Guided workflow
Total Time | 25-45 minutes | < 3 minutes
Confidence Level | Learning | Guided

Understanding Azure Static Web Apps Tooling

Azure provides multiple tools for working with Static Web Apps. The skill helps you navigate these options by providing the golden path for each scenario.

Two Complementary Tools

Tool | Purpose | Install Command
Azure CLI extension (az staticwebapp) | Manage Azure resources | Built into Azure CLI
SWA CLI (swa) | Local development, deploying apps | npm install -D @azure/static-web-apps-cli

Both tools serve important purposes, and the skill guides you to the right one for your task.

The Skill Provides the Golden Path

The skill guides you through:

- The right tool for your task
- The recommended workflow: swa init → swa start → swa deploy
- Framework detection (Vite, React, Vue, Next.js, and more)
- Best practices like using swa init for configuration

What the Skill Knows

Best Practices Built In

Best Practice | Why It Matters
Use swa init for configuration | Auto-detects framework settings
Framework-specific ports | Vite uses 5173, React uses 3000, etc.
navigationFallback for SPAs | Prevents 404s on client-side routes
platform.apiRuntime for APIs | Required for Azure Functions integration

Built-in Troubleshooting

Common Issue | Skill's Solution
404 on client routes | Add navigationFallback
API returns 404 | Check apiRuntime and function exports
Build output not found | Verify output_location matches build
CORS errors | APIs under /api/* are same-origin

Installing the Skill

There are two ways to install the skill: using GitHub Copilot CLI (recommended) or manually adding the skill file.
Option 1: GitHub Copilot CLI (Recommended)

The fastest way to get started is using the Copilot CLI plugin system:

    # Add the repo as a plugin marketplace
    /plugin marketplace add microsoft/github-copilot-for-azure

    # Install the Azure plugin (includes Static Web Apps skill)
    /plugin install azure@github-copilot-for-azure

    # Update the plugin when changes are made
    /plugin update azure@github-copilot-for-azure

Option 2: Manual Installation

If you prefer to install the skill manually:

1. Create the Skill Folder

    your-project/
    └── skills/
        └── azure-static-web-apps/
            └── SKILL.md

2. Copy the Skill Content

The skill file contains:

- Overview of Azure Static Web Apps capabilities
- Installation commands with verification
- Configuration guidance for both config files
- Command reference for all SWA CLI commands
- Scenarios for common workflows
- Troubleshooting for common issues

Get the skill: github.com/github/awesome-copilot

Try It Yourself

Prompt | What Copilot Does
"Deploy my React app to SWA" | Full workflow: install → init → build → deploy
"Add authentication to my routes" | Configures staticwebapp.config.json with auth rules
"Set up GitHub Actions for SWA" | Generates complete CI/CD workflow
"Why am I getting 404 on /dashboard" | Identifies missing navigationFallback
"Add an API backend" | Creates Azure Functions with correct runtime config

Key Takeaways

- Time Savings: What takes 25-45 minutes now takes under 3 minutes with guided, accurate commands
- Best Practices: The skill encodes best practices, guiding you through the recommended workflow
- Curated Guidance: Skills contain curated, tested commands for the golden path deployment
- Context-Aware: Framework-specific knowledge means less configuration guessing

Learn More

- Agent Skills Specification: agentskills.io/specification
- Azure Static Web Apps Docs: learn.microsoft.com/azure/static-web-apps
- SWA CLI Reference: azure.github.io/static-web-apps-cli
- More Skills: github.com/github/awesome-copilot

Have you tried Agent Skills?
Share your experience deploying to Azure Static Web Apps in the comments!