azure paas
169 TopicsExtend SRE Agent with MCP: Build an Agentic Workflow to Triage Customer Issues
Your inbox is full. GitHub issues piling up. "App not working." "How do I configure alerts?" "Please add dark mode." You open each one, figure out what it is, ask for more info, add labels, route to the right team. An hour later, you're still sorting issues. Sound familiar? The Triage Tax Every L1 support engineer, PM, and on-call developer who's handled customer issues knows this pain. When tickets come in, you're not solving problems, you're sorting them. Read the issue. Is it a bug or a question? Check the docs. Does this feature exist? Ask for more info. Wait two days. Re-triage. Add labels. Route to engineering. It's tedious. It requires judgment, you need to understand the product, know what info is needed, check documentation. And honestly? It's work that nobody volunteers for but someone has to do. In large organizations, it gets even more complex. The issue doesn't just need to be triaged, it needs to be routed to the right engineering team. Is this an auth issue? Frontend? Backend? Infrastructure? A wrong routing decision means delays, re-assignments, and frustrated customers. What if an AI agent could do this for you? Enter Azure SRE Agent + MCP Here's what I built: I gave SRE Agent access to my GitHub and PagerDuty accounts via MCP, uploaded my triage rubric as a markdown file, and set it to run twice a day. No more reading every ticket manually. No more asking the same "please provide more info" questions. No more morning triage sessions. What My Setup Looks Like My app's customer issues come in through GitHub. My team uses PagerDuty to track bugs and incidents. So I connected both via MCP to the SRE Agent. I also uploaded my triage logic as a .md file on how to classify issues, what info is required for each category, which labels to use, which team handles what. And since I didn't want to run this workflow manually, I set up a scheduled task to trigger it twice a day. Now it just runs. I verify its work if I want to. What the Agent Does Fetches all open, unlabeled GitHub issues Reads each issue and classifies it (bug, doc question, feature request) Checks if required info is present Posts a comment asking for details if needed, or acknowledges the issue Adds appropriate labels Creates a PagerDuty incident for bugs ready for engineering Moves to the next issue How I Built It: Step by Step Let me walk you through exactly how I set this up inside SRE Agent. Step 1: Create an SRE Agent I created a new SRE Agent in the Azure portal. Since this workflow triages GitHub issues and not Azure resources, I didn't need to configure any Azure resource groups or subscriptions. Just an agent. Step 2: Connect MCP Servers I added two MCP servers to give the agent access to my tools: GitHub MCP– Fetch issues, post comments, add labels PagerDuty MCP – Create incidents for bugs that need dev team's attention MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it. Step 3: Create Subagents I created two focused subagents, each with a specific job and only the tools it needs: GitHub Issue Triager "You are expert in triaging GitHub issues, classifying them into categories such as user needs to supply additional information, bug, documentation question, or feature request. Use the knowledge base to search for the right document that helps you with performing this triaging. Perform all actions autonomously without waiting for user input. Hand off to Incident Creator for the issues you classified as bugs." Tools: GitHub MCP (issues, labels, comments) Incident Creator Here "You are expert in managing incidents in PagerDuty, listing services, incidents, creating incidents with all details. Once done, hand off back to GitHub Issue Triager." Tools: PagerDuty MCP (services, incidents) The handoff between them creates a workflow. They collaborate without human involvement. Step 4: Add Your Knowledge I uploaded my triage logic as a .md file to the agent's knowledge base. This is my rubric - my mental model for how to triage issues: How do I classify bugs vs. doc questions vs. feature requests? What info is required for each category? What labels do I use? When should an incident be created? Which team handles which type of issue? I wrote it down the way I'd explain it to a new teammate. The agent searches and follows it. Step 5: Add a Scheduled Task I didn't want to trigger this workflow manually every time. SRE Agent supports scheduled tasks, workflows that run automatically on a cadence. I set up a trigger to run twice a day: morning and evening. Now the workflow is fully automated. Here is the end to end automated agentic workflow to triage customer tickets. Why MCP Matters Every team uses different tools. Maybe your customer issues live in Zendesk, incidents go to ServiceNow and you use Jira or Azure DevOps. SRE Agent doesn't lock you in. With MCP, you connect to whatever tools you already use. The agent orchestrates across them. That's the extensibility model: your tools, your workflow, orchestrated by the agent. The Result Before: 2 hours every morning sorting tickets. After: By the time anyone logs in, issues are labeled, missing-info requests are posted, urgent bugs have incidents, and feature requests are acknowledged. Your team can finally focus on the complex stuff not sorting tickets. Why This Matters Faster response times. Issues get acknowledged in minutes, not days. Consistent classification. No "this should have been a P1" moments. No tickets bouncing between teams. Happier customers. They get a response immediately even if it's just "we're looking into it." Focus on what matters. Your team should be solving problems, not sorting them. The Bottom Line Triage isn't the job, it's the tax on the job. It quietly eats the hours your team could spend building, debugging, and shipping. You don't need to build a custom triage bot. You don't need to wire up webhooks and write glue code. You give the SRE agent your tools, your logic, and a schedule and it handles the sorting. Use GitHub? Connect GitHub. Use Zendesk? Connect Zendesk. PagerDuty, ServiceNow, Jira - whatever your team runs on, the agent meets you there. Stop sorting tickets. Start shipping. A Few Tips Test MCP endpoints before configuring them in the SRE agent Give each subagent only the tools it needs, don't enable everything Start read-only until you trust the classification, then enable comments Do You Still Want to Triage Issues Manually? What tools does your team use to track customer-reported issues and incidents? Let us know in the comments, we'd love to hear how you'd use this workflow with your stack. Is triage your most toilsome workflow or is there something even worse eating your team's time? Let us know in the comments.99Views0likes0CommentsStop Running Runbooks at 3 am: Let Azure SRE Agent Do Your On-Call Grunt Work
Your pager goes off. It's 2:47am. Production is throwing 500 errors. You know the drill - SSH into this, query that, check these metrics, correlate those logs. Twenty minutes later, you're still piecing together what went wrong. Sound familiar? The On-Call Reality Nobody Talks About Every SRE, DevOps engineer, and developer who's carried a pager knows this pain. When incidents hit, you're not solving problems - you're executing runbooks. Copy-paste this query. Check that dashboard. Run these az commands. Connect the dots between five different tools. It's tedious. It's error-prone at 3am. And honestly? It's work that doesn't require human creativity but requires human time. What if an AI agent could do this for you? Enter Azure SRE Agent + Runbook Automation Here's what I built: I gave SRE Agent a simple markdown runbook containing the same diagnostic steps I'd run manually during an incident. The agent executes those steps, collects evidence, and sends me an email with everything I need to take action. No more bouncing between terminals. No more forgetting a step because it's 3am and your brain is foggy. What My Runbook Contains Just the basics any on-call would run: az monitor metrics – CPU, memory, request rates Log Analytics queries – Error patterns, exception details, dependency failures App Insights data – Failed requests, stack traces, correlation IDs az containerapp logs – Revision logs, app configuration That's it. Plain markdown with KQL queries and CLI commands. Nothing fancy. What the Agent Does Reads the runbook from its knowledge base Executes each diagnostic step Collects results and evidence Sends me an email with analysis and findings I wake up to an email that says: "CPU spiked to 92% at 2:45am, triggering connection pool exhaustion. Top exception: SqlException (1,832 occurrences). Errors correlate with traffic spike. Recommend scaling to 5 replicas." All the evidence. All the queries used. All the timestamps. Ready for me to act. How to Set This Up (6 Steps) Here's how you can build this yourself: Step 1: Create SRE Agent Create a new SRE Agent in the Azure portal. No Azure resource groups to configure. If your apps run on Azure, the agent pulls context from the incident itself. If your apps run elsewhere, you don't need Azure resource configuration at all. Step 2: Grant Reader Permission (Optional) If your runbooks execute against Azure resources, assign Reader role to the SRE Agent's managed identity on your subscription. This allows the agent to run az commands and query metrics. Skip this if your runbooks target non-Azure apps. Step 3: Add Your Runbook to SRE Agent's Knowledge base You already have runbooks, they're in your wiki, Confluence, or team docs. Just add them as .md files to the agent's knowledge base. To learn about other ways to link your runbooks to the agent, read this Step 4: Connect Outlook Connect the agent to your Outlook so it can send you the analysis email with findings. Step 5: Create a Subagent Create a subagent with simple instructions like: "You are an expert in triaging and diagnosing incidents. When triggered, search the knowledge base for the relevant runbook, execute the diagnostic steps, collect evidence, and send an email summary with your findings." Assign the tools the agent needs: RunAzCliReadCommands – for az monitor, az containerapp commands QueryLogAnalyticsByWorkspaceId – for KQL queries against Log Analytics QueryAppInsightsByResourceId – for App Insights data SearchMemory – to find the right runbook SendOutlookEmail – to deliver the analysis Step 6: Set Up Incident Trigger Connect your incident management tool - PagerDuty, ServiceNow, or Azure Monitor alerts and setup the incident trigger to the subagent. When an incident fires, the agent kicks off automatically. That's it. Your agentic workflow now looks like this: This Works for Any App, Not Just Azure Here's the thing: SRE Agent is platform agnostic. It's executing your runbooks, whatever they contain. On-prem databases? Add your diagnostic SQL. Custom monitoring stack? Add those API calls. The agent doesn't care where your app runs. It cares about following your runbook and getting you answers. Why This Matters Lower MTTR. By the time you're awake and coherent, the analysis is done. Consistent execution. No missed steps. No "I forgot to check the dependencies" at 4am. Evidence for postmortems. Every query, every result, timestamped and documented. Focus on what matters. Your brain should be deciding what to do not gathering data. The Bottom Line On-call runbook execution is the most common, most tedious, and most automatable part of incident response. It's grunt work that pulls engineers away from the creative problem-solving they were hired for. SRE Agent offloads that work from your plate. You write the runbook once, and the agent executes it every time, faster and more consistently than any human at 3am. Stop running runbooks. Start reviewing results. Try it yourself: Create a markdown runbook with your diagnostic queries and commands, add it to your SRE Agent's knowledge base, and let the agent handle your next incident. Your 3am self will thank you.356Views0likes0CommentsCall Function App from Azure Data Factory with Managed Identity Authentication
Integrating Azure Function Apps into your Azure Data Factory (ADF) workflows is a common practice. To enhance security beyond the use of function API keys, leveraging managed identity authentication is strongly recommended. Given the fact that many existing guides were outdated with recent updates to Azure services, this article provides a comprehensive, up-to-date walkthrough on configuring managed identity in ADF to securely call Function Apps. The provided methods can also be adapted to other Azure services that need to call Function Apps with managed identity authentication. The high level process is: Enable Managed Identity on Data Factory Configure Microsoft Entra Sign-in on Azure Function App Configure Linked Service in Data Factory Assign Permissions to the Data Factory in Azure Function Step 1: Enable Managed Identity on Data Factory On the Data Factory’s portal, go to Managed Identities, and enable a system assigned managed identity. Step 2: Configure Microsoft Entra Sign-in on Azure Function App 1. Go to Function App portal and enable Authentication. Choose "Microsoft" as the identity provider. 2. Add an app registration to the app, it could be an existing one or you can choose to let the platform create a new app registration. 3. Next, allow the ADF as a client application to authenticate to the function app. This step is a new requirement from previous guides. If these settings are not correctly set, the 403 response will be returned: Add the Application ID of the ADF managed identity in Allowed client application and Object ID of the ADF managed identity in the Allowed identities. If the requests are only allowed from specific tenants, add the Tenant ID of the managed identity in the last box. 4. This part sets the response from function app for the unauthenticated requests. We should set the response as "HTTP 401 Unauthorized: recommended for APIs" as sign-in page is not feasible for API calls from ADF. 5. Then, click next and use the default permission option. 6. After everything is set, click "Add" to complete the configuration. Copy the generated App (client) id, as this is used in data factory to handle authorization. Step 3: Configure Linked Service in Data Factory 1. To use an Azure Function activity in a pipeline, follow the steps here: Create an Azure Function activity with UI 2. Then Edit or New a Azure Function Linked Service. 3. Change authentication method to System Assigned Managed Identity, and paste the copied client ID of function app identity provider from Step 2 into Resource ID. This step is necessary as authorization does not work without this. Step 4: Assign Permissions to the Data Factory in Azure Function 1. On the function app portal, go to Access control (IAM), and Add a new role assignment. 2. Assign reader role. 3. Assign the Data Factory’s Managed Identity to that role. After everything is set, test that the function app can be called from Azure Data Factory successfully. Reference: https://prodata.ie/2022/06/16/enabling-managed-identity-authentication-on-azure-functions-in-data-factory/ https://learn.microsoft.com/en-us/azure/data-factory/control-flow-azure-function-activity https://docs.azure.cn/en-us/app-service/overview-authentication-authorization1.7KViews0likes2CommentsIndustry-Wide Certificate Changes Impacting Azure App Service Certificates
Executive Summary In early 2026, industry-wide changes mandated by browser applications and the CA/B Forum will affect both how TLS certificates are issued as well as their validity period. The CA/B Forum is a vendor body that establishes standards for securing websites and online communications through SSL/TLS certificates. Azure App Service is aligning with these standards for both App Service Managed Certificates (ASMC, free, DigiCert-issued) and App Service Certificates (ASC, paid, GoDaddy-issued). Most customers will experience no disruption. Action is required only if you pin certificates or use them for client authentication (mTLS). Who Should Read This? App Service administrators Security and compliance teams Anyone responsible for certificate management or application security Quick Reference: What’s Changing & What To Do Topic ASMC (Managed, free) ASC (GoDaddy, paid) Required Action New Cert Chain New chain (no action unless pinned) New chain (no action unless pinned) Remove certificate pinning Client Auth EKU Not supported (no action unless cert is used for mTLS) Not supported (no action unless cert is used for mTLS) Transition from mTLS Validity No change (already compliant) Two overlapping certs issued for the full year None (automated) If you do not pin certificates or use them for mTLS, no action is required. Timeline of Key Dates Date Change Action Required Mid-Jan 2026 and after ASMC migrates to new chain ASMC stops supporting client auth EKU Remove certificate pinning if used Transition to alternative authentication if the certificate is used for mTLS Mar 2026 and after ASC validity shortened ASC migrates to new chain ASC stops supporting client auth EKU Remove certificate pinning if used Transition to alternative authentication if the certificate is used for mTLS Actions Checklist For All Users Review your use of App Service certificates. If you do not pin these certificates and do not use them for mTLS, no action is required. If You Pin Certificates (ASMC or ASC) Remove all certificate or chain pinning before their respective key change dates to avoid service disruption. See Best Practices: Certificate Pinning. If You Use Certificates for Client Authentication (mTLS) Switch to an alternative authentication method before their respective key change dates to avoid service disruption, as client authentication EKU will no longer be supported for these certificates. See Sunsetting the client authentication EKU from DigiCert public TLS certificates. See Set Up TLS Mutual Authentication - Azure App Service Details & Rationale Why Are These Changes Happening? These updates are required by major browser programs (e.g., Chrome) and apply to all public CAs. They are designed to enhance security and compliance across the industry. Azure App Service is automating updates to minimize customer impact. What’s Changing? New Certificate Chain Certificates will be issued from a new chain to maintain browser trust. Impact: Remove any certificate pinning to avoid disruption. Removal of Client Authentication EKU Newly issued certificates will not support client authentication EKU. This change aligns with Google Chrome’s root program requirements to enhance security. Impact: If you use these certificates for mTLS, transition to an alternate authentication method. Shortening of Certificate Validity Certificate validity is now limited to a maximum of 200 days. Impact: ASMC is already compliant; ASC will automatically issue two overlapping certificates to cover one year. No billing impact. Frequently Asked Questions (FAQs) Will I lose coverage due to shorter validity? No. For App Service Certificate, App Service will issue two certificates to span the full year you purchased. Is this unique to DigiCert and GoDaddy? No. This is an industry-wide change. Do these changes impact certificates from other CAs? Yes. These changes are an industry-wide change. We recommend you reach out to your certificates’ CA for more information. Do I need to act today? If you do not pin or use these certs for mTLS, no action is required. Glossary ASMC: App Service Managed Certificate (free, DigiCert-issued) ASC: App Service Certificate (paid, GoDaddy-issued) EKU: Extended Key Usage mTLS: Mutual TLS (client certificate authentication) CA/B Forum: Certification Authority/Browser Forum Additional Resources Changes to the Managed TLS Feature Set Up TLS Mutual Authentication Azure App Service Best Practices – Certificate pinning DigiCert Root and Intermediate CA Certificate Updates 2023 Sunsetting the client authentication EKU from DigiCert public TLS certificates Feedback & Support If you have questions or need help, please visit our official support channels or the Microsoft Q&A, where our team and the community can assist you.661Views1like0CommentsImportant Changes to App Service Managed Certificates: Is Your Certificate Affected?
Overview As part of an upcoming industry-wide change, DigiCert, the Certificate Authority (CA) for Azure App Service Managed Certificates (ASMC), is required to migrate to a new validation platform to meet multi-perspective issuance corroboration (MPIC) requirements. While most certificates will not be impacted by this change, certain site configurations and setups may prevent certificate issuance or renewal starting July 28, 2025. Update December 8, 2025 We’ve published an update in November about how App Service Managed Certificates can now be supported on sites that block public access. This reverses the limitation introduced in July 2025, as mentioned in this blog. Note: This blog post reflects a point-in-time update and will not be revised. For the latest and most accurate details on App Service Managed Certificates, please refer to official documentation or subsequent updates. Learn more about the November 2025 update here: Follow-Up to ‘Important Changes to App Service Managed Certificates’: November 2025 Update. August 5, 2025 We’ve published a Microsoft Learn documentation titled App Service Managed Certificate (ASMC) changes – July 28, 2025 that contains more in-depth mitigation guidance and a growing FAQ section to support the changes outlined in this blog post. While the blog currently contains the most complete overview, the documentation will soon be updated to reflect all blog content. Going forward, any new information or clarifications will be added to the documentation page, so we recommend bookmarking it for the latest guidance. What Will the Change Look Like? For most customers: No disruption. Certificate issuance and renewals will continue as expected for eligible site configurations. For impacted scenarios: Certificate requests will fail (no certificate issued) starting July 28, 2025, if your site configuration is not supported. Existing certificates will remain valid until their expiration (up to six months after last renewal). Impacted Scenarios You will be affected by this change if any of the following apply to your site configurations: Your site is not publicly accessible: Public accessibility to your app is required. If your app is only accessible privately (e.g., requiring a client certificate for access, disabling public network access, using private endpoints or IP restrictions), you will not be able to create or renew a managed certificate. Other site configurations or setup methods not explicitly listed here that restrict public access, such as firewalls, authentication gateways, or any custom access policies, can also impact eligibility for managed certificate issuance or renewal. Action: Ensure your app is accessible from the public internet. However, if you need to limit access to your app, then you must acquire your own SSL certificate and add it to your site. Your site uses Azure Traffic Manager "nested" or "external" endpoints: Only “Azure Endpoints” on Traffic Manager will be supported for certificate creation and renewal. “Nested endpoints” and “External endpoints” will not be supported. Action: Transition to using "Azure Endpoints". However, if you cannot, then you must obtain a different SSL certificate for your domain and add it to your site. Your site relies on *.trafficmanager.net domain: Certificates for *.trafficmanager.net domains will not be supported for creation or renewal. Action: Add a custom domain to your app and point the custom domain to your *.trafficmanager.net domain. After that, secure the custom domain with a new SSL certificate. If none of the above applies, no further action is required. How to Identify Impacted Resources? To assist with the upcoming changes, you can use Azure Resource Graph (ARG) queries to help identify resources that may be affected under each scenario. Please note that these queries are provided as a starting point and may not capture every configuration. Review your environment for any unique setups or custom configurations. Scenario 1: Sites Not Publicly Accessible This ARG query retrieves a list of sites that either have the public network access property disabled or are configured to use client certificates. It then filters for sites that are using App Service Managed Certificates (ASMC) for their custom hostname SSL bindings. These certificates are the ones that could be affected by the upcoming changes. However, please note that this query does not provide complete coverage, as there may be additional configurations impacting public access to your app that are not included here. Ultimately, this query serves as a helpful guide for users, but a thorough review of your environment is recommended. You can copy this query, paste it into Azure Resource Graph Explorer, and then click "Run query" to view the results for your environment. // ARG Query: Identify App Service sites that commonly restrict public access and use ASMC for custom hostname SSL bindings resources | where type == "microsoft.web/sites" // Extract relevant properties for public access and client certificate settings | extend publicNetworkAccess = tolower(tostring(properties.publicNetworkAccess)), clientCertEnabled = tolower(tostring(properties.clientCertEnabled)) // Filter for sites that either have public network access disabled // or have client certificates enabled (both can restrict public access) | where publicNetworkAccess == "disabled" or clientCertEnabled != "false" // Expand the list of SSL bindings for each site | mv-expand hostNameSslState = properties.hostNameSslStates | extend hostName = tostring(hostNameSslState.name), thumbprint = tostring(hostNameSslState.thumbprint) // Only consider custom domains (exclude default *.azurewebsites.net) and sites with an SSL certificate bound | where tolower(hostName) !endswith "azurewebsites.net" and isnotempty(thumbprint) // Select key site properties for output | project siteName = name, siteId = id, siteResourceGroup = resourceGroup, thumbprint, publicNetworkAccess, clientCertEnabled // Join with certificates to find only those using App Service Managed Certificates (ASMC) // ASMCs are identified by the presence of the "canonicalName" property | join kind=inner ( resources | where type == "microsoft.web/certificates" | extend certThumbprint = tostring(properties.thumbprint), canonicalName = tostring(properties.canonicalName) // Only ASMC uses the "canonicalName" property | where isnotempty(canonicalName) | project certName = name, certId = id, certResourceGroup = tostring(properties.resourceGroup), certExpiration = properties.expirationDate, certThumbprint, canonicalName ) on $left.thumbprint == $right.certThumbprint // Final output: sites with restricted public access and using ASMC for custom hostname SSL bindings | project siteName, siteId, siteResourceGroup, publicNetworkAccess, clientCertEnabled, thumbprint, certName, certId, certResourceGroup, certExpiration, canonicalName Scenario 2: Traffic Manager Endpoint Types For this scenario, please manually review your Traffic Manager profile configurations to ensure only “Azure Endpoints” are in use. We recommend inspecting your Traffic Manager profiles directly in the Azure portal or using relevant APIs to confirm your setup and ensure compliance with the new requirements. Scenario 3: Certificates Issued to *.trafficmanager.net Domains This ARG query helps you identify App Service Managed Certificates (ASMC) that were issued to *.trafficmanager.net domains. In addition, it also checks whether any web apps are currently using those certificates for custom domain SSL bindings. You can copy this query, paste it into Azure Resource Graph Explorer, and then click "Run query" to view the results for your environment. // ARG Query: Identify App Service Managed Certificates (ASMC) issued to *.trafficmanager.net domains // Also checks if any web apps are currently using those certificates for custom domain SSL bindings resources | where type == "microsoft.web/certificates" // Extract the certificate thumbprint and canonicalName (ASMCs have a canonicalName property) | extend certThumbprint = tostring(properties.thumbprint), canonicalName = tostring(properties.canonicalName) // Only ASMC uses the "canonicalName" property // Filter for certificates issued to *.trafficmanager.net domains | where canonicalName endswith "trafficmanager.net" // Select key certificate properties for output | project certName = name, certId = id, certResourceGroup = tostring(properties.resourceGroup), certExpiration = properties.expirationDate, certThumbprint, canonicalName // Join with web apps to see if any are using these certificates for SSL bindings | join kind=leftouter ( resources | where type == "microsoft.web/sites" // Expand the list of SSL bindings for each site | mv-expand hostNameSslState = properties.hostNameSslStates | extend hostName = tostring(hostNameSslState.name), thumbprint = tostring(hostNameSslState.thumbprint) // Only consider bindings for *.trafficmanager.net custom domains with a certificate bound | where tolower(hostName) endswith "trafficmanager.net" and isnotempty(thumbprint) // Select key site properties for output | project siteName = name, siteId = id, siteResourceGroup = resourceGroup, thumbprint ) on $left.certThumbprint == $right.thumbprint // Final output: ASMCs for *.trafficmanager.net domains and any web apps using them | project certName, certId, certResourceGroup, certExpiration, canonicalName, siteName, siteId, siteResourceGroup Ongoing Updates We will continue to update this post with any new queries or important changes as they become available. Be sure to check back for the latest information. Note on Comments We hope this information helps you navigate the upcoming changes. To keep this post clear and focused, comments are closed. If you have questions, need help, or want to share tips or alternative detection methods, please visit our official support channels or the Microsoft Q&A, where our team and the community can assist you.24KViews1like1CommentFollow-Up to ‘Important Changes to App Service Managed Certificates’: November 2025 Update
This post provides an update to the Tech Community article ‘Important Changes to App Service Managed Certificates: Is Your Certificate Affected?’ and covers the latest changes introduced since July 2025. With the November 2025 update, ASMC now remains supported even if the site is not publicly accessible, provided all other requirements are met. Details on requirements, exceptions, and validation steps are included below. Background Context to July 2025 Changes As of July 2025, all ASMC certificate issuance and renewals use HTTP token validation. Previously, public access was required because DigiCert needed to access the endpoint https://<hostname>/.well-known/pki-validation/fileauth.txt to verify the token before issuing the certificate. App Service automatically places this token during certificate creation and renewal. If DigiCert cannot access this endpoint, domain ownership validation fails, and the certificate cannot be issued. November 2025 Update Starting November 2025, App Service now allows DigiCert's requests to the https://<hostname>/.well-known/pki-validation/fileauth.txt endpoint, even if the site blocks public access. If there’s a request to create an App Service Managed Certificate (ASMC), App Service places the domain validation token at the validation endpoint. When DigiCert tries to reach the validation endpoint, App Service front ends present the token, and the request terminates at the front end layer. DigiCert's request does not reach the workers running the application. This behavior is now the default for ASMC issuance for initial certificate creation and renewals. Customers do not need to specifically allow DigiCert's IP addresses. Exceptions and Unsupported Scenarios This update addresses most scenarios that restrict public access, including App Service Authentication, disabling public access, IP restrictions, private endpoints, and client certificates. However, a public DNS record is still required. For example, sites using a private endpoint with a custom domain on a private DNS cannot validate domain ownership and obtain a certificate. Even with all validations now relying on HTTP token validation and DigiCert requests being allowed through, certain configurations are still not supported for ASMC: Sites configured as "Nested" or "External" endpoints behind Traffic Manager. Only "Azure" endpoints are supported. Certificates requested for domains ending in *.trafficmanager.net are not supported. Testing Customers can easily test whether their site’s configuration or set-up supports ASMC by attempting to create one for their site. If the initial request succeeds, renewals should also work, provided all requirements are met and the site is not listed in an unsupported scenario.6KViews1like0CommentsExpanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly – no sign up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below. 📖 Learn more about SRE Agent. 👉 Create your first SRE Agent (Azure login required) What’s New in Azure SRE Agent - October Update The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation—built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit product documentation for the latest updates. Here are a few highlights for this month: Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy from read-only insights to approval-gated actions to full automation without compromising control. Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for AZ CLI and kubectl, it works across all Azure services. But it doesn’t stop there—diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment. Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the Agent ingest alerts and trigger workflows that match your team’s existing tools—so you can respond faster, with less manual effort. Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the Agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve—from guided actions to trusted autonomy—without ever giving up control. Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them—accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable. Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps—complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation. Getting Started Start here: Create a new SRE Agent in the Azure portal (Azure login required) Blog: Announcing a flexible, predictable billing model for Azure SRE Agent Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview Product documentation Product home page Community & Support We’d love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team5.5KViews2likes3CommentsAnnouncing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.3.5KViews1like1CommentReimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach - drawing from proven practices in internal Azure operations - the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams operations, delivering strong ROI for teams seeking sustainable AIOps. An Operations Agent that adapts to your playbooks Azure SRE Agent is an AI powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues, to automating operational workflows and seamless integration with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update. What’s New: Automation, Integration, and Extensibility Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release: No-code Sub-Agent Builder: Rapidly create custom automations without writing code. Flexible, event-driven triggers: Instantly respond to incidents and operational changes. Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources. Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP. Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box. Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs. Sub-Agent Builder: Custom Automation, No Code Required Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions. Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage. Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more. Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires. Flexible Triggers: Automate on Your Terms Invoke the agent to respond automatically to mission-critical events, not wait for manual commands. This feature helps speed up incident response and eliminate missed opportunities for efficiency. Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to Incident Management and Ticketing systems like PagerDuty and ServiceNow, Observability Alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks. Additional trigger sources such as GitHub issues, Azure DevOps pipelines, email, etc. will be added over time. This means automation can start exactly when and where you need it. Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events - like deployments, incidents, or customer requests. Vital for reducing downtime, it ensures that business-critical actions happen automatically and promptly. Expanded Data Connectivity: Unified Observability and Troubleshooting Integrate data, enabling comprehensive diagnostics and troubleshooting and faster, more informed decision-making by eliminating silos and speeding up issue resolution. Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation. Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach utilizes your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents. Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments. Custom Actions: Automate Anything, Anywhere Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration. Out-of-the-Box Actions: Instantly automate common tasks like running azcli, kubectl, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead. Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication. Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence. Prebuilt Operations Scenarios Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction. Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure CosmosDB, Azure VMs, etc. Support for additional resource types is being added continually, please see product documentation for the latest information. Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance. Handle Complex Investigations: Enable the deep investigation mode that uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues. Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management. Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs - capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting, as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment. GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle. Ready to get started? Azure SRE Agent home page Product overview Pricing Page Pricing Calculator Pricing Blog Demo recordings Deployment samples What’s Next? Give us feedback: Your feedback is critical - You can Thumbs Up / Thumbs Down each interaction or thread, or go to the “Give Feedback” button in the agent to give us in-product feedback - or you can create issues or just share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent. We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!2.4KViews0likes0CommentsAzure SRE Agent: Expanding Observability and Multi-Cloud Resilience
The Azure SRE Agent continues to evolve as a cornerstone for operational excellence and incident management. Over the past few months, we have made significant strides in enabling integrations with leading observability platforms—Dynatrace, New Relic, and Datadog—through Model Context Protocol (MCP) Servers. These partnerships serve joint customers, enabling automated remediation across diverse environments. Deepening Integrations with MCP Servers Our collaboration with these partners is more than technical—it’s about delivering value at scale. Datadog, New Relic, and Dynatrace are all Azure Native ISV Service partners. With these integrations Azure Native customers can also choose to add these MCP servers directly from the Azure Native partners’ resource: Datadog: At Ignite, Azure SRE Agent was presented with the Datadog MCP Server, to demonstrate how our customers can streamline complex workflows. Customers can now bring their Datadog MCP Server into Azure SRE Agent, extending knowledge capabilities and centralizing logs and metrics. Find Datadog Azure Native offerings on Marketplace. New Relic: When an alert fires in New Relic, the Azure SRE Agent calls the New Relic MCP Server to provide Intelligent Observability insights. This agentic integration with the New Relic MCP Server offers over 35 specialized tools across, entity and account management, alerts and monitoring, data analysis and queries, performance analysis, and much more. The advanced remediation skills of the Azure SRE Agent + New Relic AI help our joint customers diagnose and resolve production issues faster. Find New Relic’s Azure Native offering on Marketplace Dynatrace: The Dynatrace integration bridges Microsoft Azure's cloud-native infrastructure management with Dynatrace's AI-powered observability platform, leveraging the Davis AI engine and remote MCP server capabilities for incident detection, root cause analysis, and remediation across hybrid cloud environments. Check out Dynatrace’s Azure Native offering on Marketplace. These integrations are made possible by Azure SRE Agent’s MCP connectors. The MCP connectors in Azure SRE Agent act as the bridge between the agent and MCP servers, enabling dynamic discovery and execution of specialized tools for observability and incident management across diverse environments. This feature allows customers to build their own custom sub-agents to leverage tools from MCP Servers from integrated platforms like Dynatrace, Datadog, and New Relic to complement the agent’s diagnostic and remediation capabilities. By connecting Azure SRE Agent to external MCP servers scenarios such as cross-platform telemetry analysis are unlocked. Looking Ahead: Multi-Agent Collaboration Azure SRE Agent isn’t stopping with MCP integrations We’re actively working with PagerDuty and NeuBird to support dynamic use cases via agent-to-agent collaboration: PagerDuty: PagerDuty’s PD Advance SRE Agent is an AI-powered assistant that triages incidents by analyzing logs, diagnostics, past incident history, and runbooks to surface relevant context and recommended remediations. At Ignite, PagerDuty and Microsoft demonstrated how Azure SRE Agent can ingest PagerDuty incidents and collaborate with PagerDuty’s SRE Agent to complement triage using historical patterns, runbook intelligence and Azure diagnostics. NeuBird: NeuBird’s Agentic AI SRE, Hawkeye, autonomously investigates and resolves incidents across hybrid, and multi-cloud environments. By connecting to telemetry sources like Azure Monitor, Prometheus, and GitHub, Hawkeye delivers real-time diagnosis and targeted fixes. Building on the work presented at SRE Day this partnership underscores our commitment to agentic ecosystems where specialized agents collaborate for complex scenarios. Sign up for the private preview to try the integration, here. Additionally, please check out NeuBird on Marketplace. These efforts reflect a broader vision: Azure SRE Agent as a hub for cross-platform reliability, enabling customers to manage incidents across Azure, on-premises, and other clouds with confidence. Why This Matters As organizations embrace distributed architectures, the need for integrated, intelligent, and multi-cloud-ready SRE solutions has never been greater. By partnering with industry leaders and pioneering agent-to-agent workflows, Azure SRE Agent is setting the stage for a future where resilience is not just reactive—it’s proactive and collaborative.681Views2likes0Comments