azure functions
Important Changes to App Service Managed Certificates: Is Your Certificate Affected?
Overview

As part of an upcoming industry-wide change, DigiCert, the Certificate Authority (CA) for Azure App Service Managed Certificates (ASMC), is required to migrate to a new validation platform to meet multi-perspective issuance corroboration (MPIC) requirements. While most certificates will not be impacted by this change, certain site configurations and setups may prevent certificate issuance or renewal starting July 28, 2025.

Update

December 8, 2025

In November, we published an update explaining how App Service Managed Certificates can now be supported on sites that block public access. This reverses the limitation introduced in July 2025 that is described in this blog. Note: This blog post reflects a point-in-time update and will not be revised. For the latest and most accurate details on App Service Managed Certificates, please refer to official documentation or subsequent updates. Learn more about the November 2025 update here: Follow-Up to ‘Important Changes to App Service Managed Certificates’: November 2025 Update.

August 5, 2025

We’ve published a Microsoft Learn article titled App Service Managed Certificate (ASMC) changes – July 28, 2025 that contains more in-depth mitigation guidance and a growing FAQ section to support the changes outlined in this blog post. While the blog currently contains the most complete overview, the documentation will soon be updated to reflect all blog content. Going forward, any new information or clarifications will be added to the documentation page, so we recommend bookmarking it for the latest guidance.

What Will the Change Look Like?

- For most customers: No disruption. Certificate issuance and renewals will continue as expected for eligible site configurations.
- For impacted scenarios: Certificate requests will fail (no certificate issued) starting July 28, 2025, if your site configuration is not supported. Existing certificates will remain valid until their expiration (up to six months after last renewal).
Impacted Scenarios

You will be affected by this change if any of the following apply to your site configurations:

- Your site is not publicly accessible: Public accessibility to your app is required. If your app is only accessible privately (e.g., requiring a client certificate for access, disabling public network access, using private endpoints or IP restrictions), you will not be able to create or renew a managed certificate. Other site configurations or setup methods not explicitly listed here that restrict public access, such as firewalls, authentication gateways, or any custom access policies, can also impact eligibility for managed certificate issuance or renewal.
  Action: Ensure your app is accessible from the public internet. However, if you need to limit access to your app, then you must acquire your own SSL certificate and add it to your site.
- Your site uses Azure Traffic Manager "nested" or "external" endpoints: Only "Azure Endpoints" on Traffic Manager will be supported for certificate creation and renewal. "Nested endpoints" and "External endpoints" will not be supported.
  Action: Transition to using "Azure Endpoints". However, if you cannot, then you must obtain a different SSL certificate for your domain and add it to your site.
- Your site relies on a *.trafficmanager.net domain: Certificates for *.trafficmanager.net domains will not be supported for creation or renewal.
  Action: Add a custom domain to your app and point the custom domain to your *.trafficmanager.net domain. After that, secure the custom domain with a new SSL certificate.

If none of the above applies, no further action is required.

How to Identify Impacted Resources?

To assist with the upcoming changes, you can use Azure Resource Graph (ARG) queries to help identify resources that may be affected under each scenario. Please note that these queries are provided as a starting point and may not capture every configuration. Review your environment for any unique setups or custom configurations.
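The three impacted scenarios above can also be condensed into a small checklist. The following is a minimal sketch in Python, not an official tool: the `SiteConfig` fields and the `asmc_blockers` function are hypothetical names introduced for illustration, and the rules encode only the July 2025 limitations as described in this post (the December 2025 update note at the top reports that the public-access limitation was later lifted).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteConfig:
    """Hypothetical summary of one app's setup; field names are illustrative only."""
    hostname: str                       # hostname the managed certificate would cover
    publicly_accessible: bool           # False for private endpoints, IP restrictions, client certs, etc.
    traffic_manager_endpoint: str = ""  # "", "azure", "nested", or "external"


def asmc_blockers(site: SiteConfig) -> List[str]:
    """Return the impacted scenarios from this post (July 2025 rules) that apply to a site."""
    blockers = []
    if not site.publicly_accessible:
        blockers.append("site is not publicly accessible")
    if site.traffic_manager_endpoint.lower() in ("nested", "external"):
        blockers.append("Traffic Manager endpoint type is not 'Azure Endpoints'")
    host = site.hostname.lower().rstrip(".")
    if host == "trafficmanager.net" or host.endswith(".trafficmanager.net"):
        blockers.append("hostname is a *.trafficmanager.net domain")
    return blockers


if __name__ == "__main__":
    # Sample hostnames are made up for the example.
    samples = [
        SiteConfig("www.contoso.com", True, "azure"),
        SiteConfig("api.contoso.com", False, "nested"),
        SiteConfig("contoso.trafficmanager.net", True),
    ]
    for sample in samples:
        print(sample.hostname, "->", asmc_blockers(sample) or "no known blockers")
```

An empty list only means none of the three listed scenarios matched; as the post notes, other configurations that restrict public access may still affect eligibility.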
Scenario 1: Sites Not Publicly Accessible

This ARG query retrieves a list of sites that either have the public network access property disabled or are configured to use client certificates. It then filters for sites that are using App Service Managed Certificates (ASMC) for their custom hostname SSL bindings. These certificates are the ones that could be affected by the upcoming changes. However, please note that this query does not provide complete coverage, as there may be additional configurations impacting public access to your app that are not included here. Ultimately, this query serves as a helpful guide for users, but a thorough review of your environment is recommended.

You can copy this query, paste it into Azure Resource Graph Explorer, and then click "Run query" to view the results for your environment.

    // ARG Query: Identify App Service sites that commonly restrict public access and use ASMC for custom hostname SSL bindings
    resources
    | where type == "microsoft.web/sites"
    // Extract relevant properties for public access and client certificate settings
    | extend
        publicNetworkAccess = tolower(tostring(properties.publicNetworkAccess)),
        clientCertEnabled = tolower(tostring(properties.clientCertEnabled))
    // Filter for sites that either have public network access disabled
    // or have client certificates enabled (both can restrict public access)
    | where publicNetworkAccess == "disabled"
        or clientCertEnabled != "false"
    // Expand the list of SSL bindings for each site
    | mv-expand hostNameSslState = properties.hostNameSslStates
    | extend
        hostName = tostring(hostNameSslState.name),
        thumbprint = tostring(hostNameSslState.thumbprint)
    // Only consider custom domains (exclude default *.azurewebsites.net) and sites with an SSL certificate bound
    | where tolower(hostName) !endswith "azurewebsites.net" and isnotempty(thumbprint)
    // Select key site properties for output
    | project siteName = name, siteId = id, siteResourceGroup = resourceGroup, thumbprint, publicNetworkAccess, clientCertEnabled
    // Join with certificates to find only those using App Service Managed Certificates (ASMC)
    // ASMCs are identified by the presence of the "canonicalName" property
    | join kind=inner (
        resources
        | where type == "microsoft.web/certificates"
        | extend
            certThumbprint = tostring(properties.thumbprint),
            canonicalName = tostring(properties.canonicalName) // Only ASMC uses the "canonicalName" property
        | where isnotempty(canonicalName)
        | project certName = name, certId = id, certResourceGroup = tostring(properties.resourceGroup), certExpiration = properties.expirationDate, certThumbprint, canonicalName
    ) on $left.thumbprint == $right.certThumbprint
    // Final output: sites with restricted public access and using ASMC for custom hostname SSL bindings
    | project siteName, siteId, siteResourceGroup, publicNetworkAccess, clientCertEnabled, thumbprint, certName, certId, certResourceGroup, certExpiration, canonicalName

Scenario 2: Traffic Manager Endpoint Types

For this scenario, please manually review your Traffic Manager profile configurations to ensure only "Azure Endpoints" are in use. We recommend inspecting your Traffic Manager profiles directly in the Azure portal or using relevant APIs to confirm your setup and ensure compliance with the new requirements.

Scenario 3: Certificates Issued to *.trafficmanager.net Domains

This ARG query helps you identify App Service Managed Certificates (ASMC) that were issued to *.trafficmanager.net domains. It also checks whether any web apps are currently using those certificates for custom domain SSL bindings.

You can copy this query, paste it into Azure Resource Graph Explorer, and then click "Run query" to view the results for your environment.
    // ARG Query: Identify App Service Managed Certificates (ASMC) issued to *.trafficmanager.net domains
    // Also checks if any web apps are currently using those certificates for custom domain SSL bindings
    resources
    | where type == "microsoft.web/certificates"
    // Extract the certificate thumbprint and canonicalName (ASMCs have a canonicalName property)
    | extend
        certThumbprint = tostring(properties.thumbprint),
        canonicalName = tostring(properties.canonicalName) // Only ASMC uses the "canonicalName" property
    // Filter for certificates issued to *.trafficmanager.net domains
    | where canonicalName endswith "trafficmanager.net"
    // Select key certificate properties for output
    | project certName = name, certId = id, certResourceGroup = tostring(properties.resourceGroup), certExpiration = properties.expirationDate, certThumbprint, canonicalName
    // Join with web apps to see if any are using these certificates for SSL bindings
    | join kind=leftouter (
        resources
        | where type == "microsoft.web/sites"
        // Expand the list of SSL bindings for each site
        | mv-expand hostNameSslState = properties.hostNameSslStates
        | extend
            hostName = tostring(hostNameSslState.name),
            thumbprint = tostring(hostNameSslState.thumbprint)
        // Only consider bindings for *.trafficmanager.net custom domains with a certificate bound
        | where tolower(hostName) endswith "trafficmanager.net" and isnotempty(thumbprint)
        // Select key site properties for output
        | project siteName = name, siteId = id, siteResourceGroup = resourceGroup, thumbprint
    ) on $left.certThumbprint == $right.thumbprint
    // Final output: ASMCs for *.trafficmanager.net domains and any web apps using them
    | project certName, certId, certResourceGroup, certExpiration, canonicalName, siteName, siteId, siteResourceGroup

Ongoing Updates

We will continue to update this post with any new queries or important changes as they become available. Be sure to check back for the latest information.
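Once results from either query have been exported (for example, as JSON from Resource Graph Explorer), the rows can be sorted by expiration so that the certificates due for renewal soonest are handled first. The following is a minimal sketch in Python under stated assumptions: each row is a dict carrying the `certName` and `certExpiration` columns projected by the queries, and `certExpiration` is an ISO 8601 timestamp without unusual fractional-second precision. The helper names `parse_expiration` and `expiring_soon` and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def parse_expiration(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are treated as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def expiring_soon(rows: List[Dict[str, str]], now: datetime, within_days: int = 30) -> List[Dict[str, str]]:
    """Return rows whose certificate expires on or before now + within_days, soonest first."""
    cutoff = now + timedelta(days=within_days)
    due = [row for row in rows if parse_expiration(row["certExpiration"]) <= cutoff]
    return sorted(due, key=lambda row: parse_expiration(row["certExpiration"]))


if __name__ == "__main__":
    # Made-up rows shaped like the query output.
    sample_rows = [
        {"certName": "cert-b", "certExpiration": "2025-08-20T00:00:00Z"},
        {"certName": "cert-a", "certExpiration": "2025-08-05T00:00:00+00:00"},
        {"certName": "cert-c", "certExpiration": "2026-01-01T00:00:00+00:00"},
    ]
    reference = datetime(2025, 8, 1, tzinfo=timezone.utc)
    for row in expiring_soon(sample_rows, reference):
        print(row["certName"], row["certExpiration"])
```

Passing `now` explicitly keeps the function deterministic; in practice it would be `datetime.now(timezone.utc)`.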
Note on Comments

We hope this information helps you navigate the upcoming changes. To keep this post clear and focused, comments are closed. If you have questions, need help, or want to share tips or alternative detection methods, please visit our official support channels or the Microsoft Q&A, where our team and the community can assist you.

Follow-Up to ‘Important Changes to App Service Managed Certificates’: November 2025 Update
This post provides an update to the Tech Community article ‘Important Changes to App Service Managed Certificates: Is Your Certificate Affected?’ and covers the latest changes introduced since July 2025. With the November 2025 update, ASMC now remains supported even if the site is not publicly accessible, provided all other requirements are met. Details on requirements, exceptions, and validation steps are included below.

Background Context to July 2025 Changes

As of July 2025, all ASMC certificate issuance and renewals use HTTP token validation. Previously, public access was required because DigiCert needed to access the endpoint https://<hostname>/.well-known/pki-validation/fileauth.txt to verify the token before issuing the certificate. App Service automatically places this token during certificate creation and renewal. If DigiCert cannot access this endpoint, domain ownership validation fails, and the certificate cannot be issued.

November 2025 Update

Starting November 2025, App Service now allows DigiCert's requests to the https://<hostname>/.well-known/pki-validation/fileauth.txt endpoint, even if the site blocks public access. If there’s a request to create an App Service Managed Certificate (ASMC), App Service places the domain validation token at the validation endpoint. When DigiCert tries to reach the validation endpoint, App Service front ends present the token, and the request terminates at the front end layer. DigiCert's request does not reach the workers running the application.

This behavior is now the default for ASMC issuance for initial certificate creation and renewals. Customers do not need to specifically allow DigiCert's IP addresses.

Exceptions and Unsupported Scenarios

This update addresses most scenarios that restrict public access, including App Service Authentication, disabling public access, IP restrictions, private endpoints, and client certificates. However, a public DNS record is still required. For example, sites using a private endpoint with a custom domain on a private DNS cannot validate domain ownership and obtain a certificate.

Even with all validations now relying on HTTP token validation and DigiCert requests being allowed through, certain configurations are still not supported for ASMC:

- Sites configured as "Nested" or "External" endpoints behind Traffic Manager. Only "Azure" endpoints are supported.
- Certificates requested for domains ending in *.trafficmanager.net are not supported.

Testing

Customers can easily test whether their site’s configuration or set-up supports ASMC by attempting to create one for their site. If the initial request succeeds, renewals should also work, provided all requirements are met and the site is not listed in an unsupported scenario.

Expanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly, with no sign-up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below.

📖 Learn more about SRE Agent.
👉 Create your first SRE Agent (Azure login required)

What’s New in Azure SRE Agent - October Update

The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation, built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit product documentation for the latest updates.

Here are a few highlights for this month:

- Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy, from read-only insights to approval-gated actions to full automation, without compromising control.
- Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for Azure CLI and kubectl, it works across all Azure services. But it doesn’t stop there: diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment.
- Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the Agent ingest alerts and trigger workflows that match your team’s existing tools, so you can respond faster, with less manual effort.
- Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the Agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve, from guided actions to trusted autonomy, without ever giving up control.
- Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them, accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable.
- Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps, complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation.
Getting Started

- Start here: Create a new SRE Agent in the Azure portal (Azure login required)
- Blog: Announcing a flexible, predictable billing model for Azure SRE Agent
- Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview
- Product documentation
- Product home page

Community & Support

We’d love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team.

Announcing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.

Reimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context-aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach that draws from proven practices in internal Azure operations, the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product team operations, delivering strong ROI for teams seeking sustainable AIOps.

An Operations Agent that adapts to your playbooks

Azure SRE Agent is an AI-powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues to automating operational workflows and integrating with the standards and processes used in your organization.

SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update.

What’s New: Automation, Integration, and Extensibility

Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release:

- No-code Sub-Agent Builder: Rapidly create custom automations without writing code.
- Flexible, event-driven triggers: Instantly respond to incidents and operational changes.
- Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources.
- Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP.
- Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box.

Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs.

Sub-Agent Builder: Custom Automation, No Code Required

Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions.

- Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage.
- Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more.
- Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires.

Flexible Triggers: Automate on Your Terms

Invoke the agent to respond automatically to mission-critical events rather than waiting for manual commands. This feature helps speed up incident response and eliminate missed opportunities for efficiency.

- Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to Incident Management and Ticketing systems like PagerDuty and ServiceNow, Observability Alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks. Additional trigger sources such as GitHub issues, Azure DevOps pipelines, email, etc. will be added over time. This means automation can start exactly when and where you need it.
- Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events such as deployments, incidents, or customer requests. Vital for reducing downtime, it ensures that business-critical actions happen automatically and promptly.

Expanded Data Connectivity: Unified Observability and Troubleshooting

Integrate data across sources to enable comprehensive diagnostics and troubleshooting and faster, better-informed decisions, eliminating silos and speeding up issue resolution.

- Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation.
- Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach utilizes your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents.

Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments.
Custom Actions: Automate Anything, Anywhere

Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration.

- Out-of-the-Box Actions: Instantly automate common tasks like running Azure CLI or kubectl commands, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead.
- Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication.
- Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence.

Prebuilt Operations Scenarios

Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction.

- Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure Cosmos DB, Azure VMs, etc. Support for additional resource types is being added continually; please see product documentation for the latest information.
- Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance.
- Handle Complex Investigations: Enable the deep investigation mode that uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues.
- Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management.
- Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs: capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting, as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment.
- GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle.

Ready to get started?

- Azure SRE Agent home page
- Product overview
- Pricing Page
- Pricing Calculator
- Pricing Blog
- Demo recordings
- Deployment samples

What’s Next?

Give us feedback: Your feedback is critical. You can Thumbs Up / Thumbs Down each interaction or thread, use the “Give Feedback” button in the agent to give us in-product feedback, or create issues or share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent.

We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!

Proactive Monitoring Made Simple with Azure SRE Agent
SRE teams strive for proactive operations, catching issues before they impact customers and reducing the chaos of incident response. While perfection may be elusive, the real goal is minimizing outages and gaining immediate line of sight into production environments.

Today, that’s harder than ever. It requires correlating countless signals and alerts, understanding how they relate (or don’t relate) to each other, and assigning the right sense of urgency and impact. Anything that shortens this cycle, accelerates detection, and enables automated remediation is what modern SRE teams crave.

What if you could skip the scripting and pipelines? What if you could simply describe what you want in plain language and let it run automatically on a schedule?

Scheduled Tasks for Azure SRE Agent

With Scheduled Tasks for Azure SRE Agent, that what-if scenario is now a reality. Scheduled tasks combine natural language prompts with Azure SRE Agent’s automation capabilities, so you can express intent, set a schedule, and let the agent do the rest, without writing a single line of code.

This means:

- ⚡ Faster incident response through early detection
- ✅ Better compliance with automated checks
- 🎯 More time for high-value engineering work and innovation

💡 The shift from reactive to proactive: Instead of waiting for alerts to fire or customers to report issues, you’re continuously monitoring, validating, and catching problems before they escalate.

How Scheduled Tasks Work Under the Hood

When you create a Scheduled Task, the process is more than just running a prompt on a timer. Here’s what happens:

1. Prompt Interpretation and Plan Creation: The SRE Agent takes your natural language prompt, such as “Scan all resources for security best practices”, and converts it into a structured execution plan. This plan defines the steps, tools, and data sources required to fulfill your request.
2. Built-In Tools and MCP Integration: The agent uses its built-in capabilities (Azure CLI, Log Analytics workspace, Application Insights) and can also leverage third-party data sources or tools via MCP server integration for extended functionality.
3. Results Analysis and Smart Summarization: After execution, the agent analyzes results, identifies anomalies or issues, and provides actionable summaries, not just raw data dumps.
4. Notification and Escalation: Based on findings, the agent can:
   - Post updates to Teams channels
   - Create or update incidents
   - Send email notifications
   - Trigger follow-up actions

Real-World Use Cases for Proactive Ops

Here’s where scheduled tasks shine for SRE teams:

| Use Case | Prompt Example | Schedule |
| --- | --- | --- |
| Security Posture Check | “Scan all subscriptions for resources with public endpoints and flag any that shouldn’t be exposed” | Daily |
| Cost Anomaly Detection | “Compare this week’s spend against last week and alert if any service exceeds 20% growth” | Weekly |
| Compliance Drift Detection | “Check all storage accounts for encryption settings and report any non-compliant resources” | Daily |
| Resource Health Summary | “Summarize the health status of all production VMs and highlight any degraded instances” | Every 4 hours |
| Incident Trend Analysis | “Analyze ICM incidents from the past week, identify patterns, and summarize top contributing services” | Weekly |

Getting Started in 3 Steps

Step 1: Define Your Intent

Write a natural language prompt describing what you want to monitor or check. Be specific about:

- What resources or scope
- What conditions to look for
- What action to take if issues are found

Example:

> “Every morning at 8 AM, check all production Kubernetes clusters for pods in CrashLoopBackOff state. If any are found, post a summary to the #sre-alerts Teams channel with cluster name, namespace, and pod details.”

Step 2: Set Your Schedule

Choose how often the task should run:

- Cron expressions for precise control
- Simple intervals (hourly, daily, weekly)

Step 3: Define Where to Receive Updates

Include in your prompt where you want results delivered when the task finishes execution. The agent can use its built-in tools and connectors to:

- Post summaries to a Teams channel
- Send email notifications
- Create or update ICM incidents

Example prompt with notification:

> "Check all production databases for long-running queries over 30 seconds. If any are found, post a summary to the #database-alerts Teams channel."

Why This Matters for Proactive Operations

Traditional monitoring approaches have limitations:

| Traditional Approach | With Scheduled Tasks |
| --- | --- |
| Write scripts, maintain pipelines | Describe in plain language |
| Static thresholds and rules | Contextual, AI-powered analysis |
| Alert fatigue from noisy signals | Smart summarization of what matters |
| Separate tools for check vs. action | Unified detection and response |
| Requires dedicated DevOps effort | Any SRE can create and modify |

The result? Your team spends less time building and maintaining monitoring infrastructure and more time on the work that truly requires human expertise.

Best Practices for Scheduled Tasks

- Start simple, iterate: Begin with one or two high-value checks and expand as you gain confidence
- Be specific in prompts: The more context you provide, the better the results
- Set appropriate frequencies: Not everything needs to run hourly; match the schedule to the risk
- Review and refine: Check task results periodically and adjust prompts for better accuracy

What’s Next?

Scheduled tasks are just the beginning. We’re continuing to invest in capabilities that help SRE teams shift left: catching issues earlier, automating routine checks, and freeing up time for strategic work.

Ready to Start?
Use this sample that shows how to create a scheduled health check sub-agent: https://github.com/microsoft/sre-agent/blob/main/samples/automation/samples/02-scheduled-health-check-sample.md

This example demonstrates:
- Building a HealthCheckAgent using built-in tools like Azure CLI and Log Analytics Workspace
- Scheduling daily health checks for a container app at 7 AM
- Sending email alerts when anomalies are detected

🔗 Explore more samples here: https://github.com/microsoft/sre-agent/tree/main/samples

More to Learn

Ignite 2025 announcements: https://aka.ms/ignite25/blog/sreagent
Documentation: https://aka.ms/sreagent/docs
Support & Feature Requests: https://github.com/microsoft/sre-agent/issues

Host remote MCP servers on Azure Functions
Model Context Protocol (MCP) servers allow AI agents to access external tools, data, and systems, greatly extending the capability and power of agents. When you’re ready to expose your MCP servers externally, within your organization or to the world, it’s important that the servers run in a secure, scalable, and reliable environment. Azure Functions provides such a platform for hosting your remote MCP servers, offering high scalability with the Flex Consumption plan, a built-in authentication feature for Microsoft Entra and OAuth, and a serverless billing model. The platform also offers two hosting options for added flexibility and convenience: you can host MCP servers built with the Azure Functions MCP extension or with the official MCP SDKs.

Azure Functions MCP Extension (GA)

The MCP extension allows you to build and host servers using the Azure Functions programming model, i.e. using triggers and bindings. The MCP tool trigger allows you to focus on implementing the tools you want to expose, instead of worrying about handling protocol and server logistics. The MCP extension launched as public preview back in April and is now generally available, with support for .NET, Java, JavaScript, Python, and TypeScript.

New features in the extension

Support for streamable-http transport

The extension now supports the newer streamable-http transport. Unless your client specifically requires the older Server-Sent Events (SSE) transport, you should use streamable-http. The two transports have different endpoints in the extension:

| Transport | Endpoint |
| --- | --- |
| Streamable HTTP | /runtime/webhooks/mcp |
| Server-Sent Events (SSE) | /runtime/webhooks/mcp/sse |

Defining server information

You can use the extensions.mcp section in host.json to define MCP server information.
{
  "version": "2.0",
  "extensions": {
    "mcp": {
      "instructions": "Some test instructions on how to use the server",
      "serverName": "TestServer",
      "serverVersion": "2.0.0",
      "encryptClientState": true,
      "messageOptions": {
        "useAbsoluteUriForEndpoint": false
      },
      "system": {
        "webhookAuthorizationLevel": "System"
      }
    }
  }
}

Built-in server authentication and authorization

The built-in feature implements the requirements of the MCP authorization protocol, such as issuing a 401 challenge and hosting the Protected Resource Metadata document. You can configure it to use identity providers like Microsoft Entra for server authentication. In addition to server authentication, you can also leverage this feature to implement on-behalf-of (OBO) auth flows, where the client invokes a tool that accesses downstream services on behalf of the user. Learn more about the built-in authentication and authorization feature.

Maven Build Plugin for Java

For Java applications, the Maven Build Plugin (version 1.40.0) parses and verifies MCP tool annotations at build time. This process automatically generates the correct MCP extension configuration, ensuring that the MCP tool defined by the user is properly set up. The build-time analysis is especially beneficial for Java apps, as it allows developers to use the MCP extension without concerns about increased cold start times. We'll continue to enhance the plugin’s capabilities. Upcoming improvements, such as property type inference, will reduce manual configuration and make it even easier to use the McpToolTrigger.

Get started

Check out the quickstarts to get an MCP extension server deployed in minutes:
- C# (.NET): remote-mcp-functions-dotnet
- Python: remote-mcp-functions-python
- TypeScript (Node.js): remote-mcp-functions-typescript
- Java: remote-mcp-functions-java

References

Learn more about the MCP extension and tool trigger in the official documentation.
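To make the two transport endpoints of the MCP extension concrete, here is a minimal Python sketch that builds the URL a client would connect to. The endpoint paths are the ones listed in this post; the helper name `mcp_endpoint_url` and the example app name are illustrative assumptions and are not part of the extension.

```python
# Endpoint paths as listed for the MCP extension in this post.
MCP_ENDPOINT_PATHS = {
    "streamable-http": "/runtime/webhooks/mcp",
    "sse": "/runtime/webhooks/mcp/sse",
}


def mcp_endpoint_url(base_url: str, transport: str = "streamable-http") -> str:
    """Return the MCP endpoint URL for a function app (hypothetical helper)."""
    if transport not in MCP_ENDPOINT_PATHS:
        raise ValueError(f"unknown transport: {transport!r}")
    # Strip a trailing slash so the joined URL has exactly one separator.
    return base_url.rstrip("/") + MCP_ENDPOINT_PATHS[transport]


if __name__ == "__main__":
    # "my-mcp-app" is a placeholder app name, not a real deployment.
    print(mcp_endpoint_url("https://my-mcp-app.azurewebsites.net"))
    print(mcp_endpoint_url("https://my-mcp-app.azurewebsites.net/", "sse"))
```

The default argument mirrors the post's recommendation to prefer streamable-http unless a client specifically needs SSE.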
Self‑hosted MCP server (public preview)

In addition to the MCP extension, Azure Functions also supports hosting MCP servers implemented with the official SDKs. This is a suitable option for teams that have existing SDK‑based servers or who favor the SDK experience over the Functions programming model. There is no need to modify your server code; you can lift and shift these MCP servers to Azure Functions, which is why they are termed self‑hosted. The hosting capability supports the following features:
- Stateless servers that use the streamable-http transport. If you need your server to be stateful, consider using the Functions MCP extension for now.
- Servers implemented with the Python, TypeScript, C#, or Java MCP SDK.
- Built-in server authentication and authorization, like the MCP extension.

Hosting requirement

Self-hosted MCP servers are deployed to the Azure Functions platform as custom handlers. You can think of custom handlers as lightweight web servers that receive events from the Functions host. The only requirement for hosting the MCP server is a file called host.json. Add this file to your project root to tell Functions how to run the server. An example host.json for a Python server looks like:

{
  "version": "2.0",
  "configurationProfile": "mcp-custom-handler",
  "customHandler": {
    "description": {
      "defaultExecutablePath": "python",
      "arguments": ["path to main python script, e.g. hello.py"]
    },
    "port": "8000"
  }
}

Get started

Check out the quickstarts to get your self-hosted MCP server deployed in minutes:
- C# (.NET): mcp-sdk-functions-hosting-dotnet
- Python: mcp-sdk-functions-hosting-python
- TypeScript (Node.js): mcp-sdk-functions-hosting-node
- Java: Coming soon!

References

Read the official documentation of self-hosted MCP servers and learn about integrations with Azure services like Foundry and API Center. For .NET developers: check out the overview of self-hosted MCP servers from the recent .NET Conference!

We’d love to hear from you!
Let us know your thoughts about hosting remote MCP servers on Azure Functions. Does either of the options meet your needs? What other MCP features are you looking for? Let us know what you’d like us to prioritize next!

Faster Python on Azure Functions with uvloop
Python 3.13+ apps on Azure Functions are now faster by default. By replacing the standard event loop with uvloop, the Functions Python worker delivers higher throughput and lower latency for asynchronous workloads, with no code changes required.

Introduction

Azure Functions powers millions of customer scenarios, from real-time APIs to event-driven automation. For Python developers, scalability often comes down to how efficiently the runtime handles I/O, concurrency, and asynchronous workloads. That’s why, starting with Python 3.13, the Azure Functions Python worker now uses uvloop as its default event loop. Built on top of libuv (the same library behind Node.js), uvloop provides a drop-in replacement for Python’s standard asyncio loop with measurable performance improvements. For customers, this means faster request handling and more responsive serverless applications, without having to update a single line of app code.

Why Event Loops Matter

The event loop is the backbone of any asynchronous Python application. It schedules coroutines, manages I/O events, and drives concurrency. In serverless workloads like Azure Functions, this loop runs continuously to:
- Handle incoming HTTP requests
- Dispatch and complete async tasks (like database queries or service calls)
- Manage parallel event processing (Event Hubs, Service Bus, etc.)

The default Python event loop (UnixSelectorEventLoop) is reliable, but it wasn’t designed for high-throughput scenarios at massive scale. uvloop, by contrast, is a high-performance reimplementation in Cython that consistently outperforms the built-in loop in both throughput and latency.
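As a small illustration of what the event loop does for a single request, the sketch below runs three simulated I/O calls concurrently on one loop using only the standard library. The function names and delays are made up for the example; `asyncio.sleep` stands in for a real database query or service call.

```python
import asyncio
import time


async def fake_io_call(name: str, delay: float) -> str:
    """Stand in for a database query or service call (hypothetical)."""
    # While this coroutine awaits, the event loop is free to run others.
    await asyncio.sleep(delay)
    return f"{name} done"


async def handle_request() -> list[str]:
    # gather schedules all three coroutines on the same loop and
    # returns their results in the order they were passed in.
    return await asyncio.gather(
        fake_io_call("db", 0.2),
        fake_io_call("cache", 0.1),
        fake_io_call("api", 0.2),
    )


def run_demo() -> tuple[list[str], float]:
    """Run one simulated request and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(handle_request())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

Because the waits overlap, the total time is close to the longest single delay rather than the sum of all three. That scheduling work is what the loop implementation (standard or uvloop) performs on every request.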
How It Works in Azure Functions

In Python 3.13+, the Azure Functions Python worker sets uvloop as the default event loop policy at startup:

    import asyncio
    import uvloop

    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

This means any async workload (whether you’re using async def in your functions, calling external APIs, or parallelizing work with asyncio.gather) benefits immediately from uvloop’s faster scheduling and I/O handling. uvloop is already available in the Functions runtime environment. No configuration changes, no requirements.txt edits, and no feature flags. If you’re running Functions on Python 3.13 or higher, uvloop is already in play.

Measuring the Performance Gains

We tested uvloop against the existing Unix event loop across several realistic workloads. For testing with Flex Consumption on Azure, the app with no uvloop is on Python 3.12, while the app with uvloop is on Python 3.13. The Flex Consumption app has an instance size of 2048 MB. Results were measured by taking the median of three runs for each test case.

Test 1: 10k Requests, 50 Virtual Users

| Environment | Event Loop | Average HTTP Request Time (ms) | Requests per second | % Diff vs unix |
| --- | --- | --- | --- | --- |
| Local | unix | 96.95 | 515 | - |
| Local | uvloop | 87.99 | 565 | +9.7% |
| Azure | unix | 54.34 | 882 | - |
| Azure | uvloop | 51.77 | 923 | +4.8% |

Test 2: Sustained Load, 100 Virtual Users (5 min)

| Environment | Event Loop | Number of Requests | Requests per second | % Diff vs unix |
| --- | --- | --- | --- | --- |
| Local | unix | 157,580 | 525 | - |
| Local | uvloop | 167,928 | 560 | +6.4% |
| Azure | unix | 571,797 | 1,898 | - |
| Azure | uvloop | 588,458 | 1,961 | +2.9% |

Test 3: Heavy Concurrency, 500 Virtual Users + 5 async tasks per request

| Environment | Event Loop | Number of Requests | Requests per second | % Diff vs unix |
| --- | --- | --- | --- | --- |
| Local | unix | 216,212 | 720 | - |
| Local | uvloop | 231,878 | 772 | +7% |
| Azure | unix | 1,791,600 | 5,696 | - |
| Azure | uvloop | 1,806,750 | 6,020 | +1% |

The Unix event loop started showing failures in both environments in ~2% of requests.

Across the board, uvloop delivered measurable improvements in throughput and latency, especially under high concurrency.

Why Only Python 3.13+?
While uvloop works with older versions of Python, we rolled it out as the default starting in 3.13 because:
- It ensured the change was strictly a net positive in performance and stability
- It made for an easier rollout across all available Azure Functions SKUs, without breaking existing customers
- Python 3.13 for the Azure Functions worker introduces a Proxy Worker, so uvloop provides an additional performance boost that helps offset the extra overhead

Older runtimes remain on the standard event loop to minimize compatibility risks.

Challenges and Lessons Learned

Integrating uvloop into the Functions Python worker surfaced a few interesting challenges:
- Compatibility: Ensuring uvloop worked seamlessly across Linux environments at scale
- Observability: Updating logs to confirm which event loop policy was active
- Benchmark design: Testing realistic workloads (HTTP requests, async fan-out) to validate improvements beyond microbenchmarks

Through this process, we confirmed uvloop consistently improved throughput and latency without regressions.

Future Directions

Switching to uvloop is just one step in making Azure Functions Python faster and more scalable. Looking ahead, we’re exploring:
- Deeper async optimizations: further tuning around asyncio and gRPC handling
- Serialization improvements: building on work like orjson for faster data processing
- Cold start performance: reducing startup overhead in Python workers

Conclusion

By adopting uvloop as the default event loop for Python 3.13+, Azure Functions makes async workloads faster, more reliable, and more scalable, all without requiring customers to change their code. If you’re upgrading to Python 3.13 for your Functions apps, uvloop is already running under the hood to give you better performance out of the box.
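If you want to confirm for yourself which event loop your code is running on, one possible approach is to inspect the class of the running loop from inside a coroutine. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it is not the logging the Functions worker itself uses.

```python
import asyncio


def loop_class_name(loop: asyncio.AbstractEventLoop) -> str:
    """Return the fully qualified class name of an event loop."""
    cls = type(loop)
    return f"{cls.__module__}.{cls.__qualname__}"


async def report_event_loop() -> dict:
    # get_running_loop() returns the loop that is executing this coroutine.
    name = loop_class_name(asyncio.get_running_loop())
    return {
        "loop_class": name,
        # uvloop's loop classes are defined in the "uvloop" package.
        "is_uvloop": name.split(".")[0] == "uvloop",
    }


if __name__ == "__main__":
    print(asyncio.run(report_event_loop()))
```

Run locally without uvloop, this reports one of the standard asyncio loop classes. Calling `report_event_loop()` from an async function in an app on Python 3.13+ would be expected to report a uvloop class, based on the behavior described in this post.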
Further Reading

- Azure Functions
- Azure Functions Python Developer Reference Guide
- Azure Functions Performance Optimizer
- Azure Functions Python Worker
- Azure Functions Python Library
- Azure Load Testing Overview

Announcing Azure Functions Durable Task Scheduler Dedicated SKU GA & Consumption SKU Public Preview
Earlier this year, we introduced the Durable Task Scheduler, our orchestration engine designed for complex workflows and intelligent agents. It automatically checkpoints progress and protects your orchestration state, enabling resilient and reliable execution. Today, we’re excited to announce a major milestone: Durable Task Scheduler is now Generally Available with the Dedicated SKU, and the Consumption SKU is entering Public Preview. These offerings provide advanced orchestration capabilities for cloud-native and AI applications, with predictable pricing for steady workloads on the Dedicated SKU and flexible, pay-as-you-go billing for dynamic, variable workloads on the Consumption SKU.

“The Durable Task Scheduler has been a game-changer for our projects. It keeps our workflows running reliably with minimal code, even as they grow in complexity. It automatically recovers from unexpected issues, so we don’t have to step in. It scales to handle millions of orchestrations, and the real-time dashboard makes it simple to monitor and manage everything as it happens.” – Pedram Rezaei, VP of Engineering for Copilot

What is the Durable Task Scheduler?

If you’re new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background on what it is and how/when to leverage it:
https://aka.ms/dts-public-preview
https://aka.ms/workflow-in-aca

In brief, the Durable Task Scheduler is a fully managed backend for durable execution on Azure. It can serve as the backend for a Durable Function App using the Durable Functions extension, or as the backend for an app leveraging the Durable Task SDKs in other compute environments, such as Azure Container Apps, Azure Kubernetes Service, or Azure App Service.
It simplifies the development and operation of complex, stateful, and long-running workflows by providing automatic orchestration state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing orchestration storage and failure recovery. The Durable Task Scheduler is designed to deliver the best possible experience by addressing the key challenges developers face when self-managing orchestration infrastructure, such as configuring storage accounts, checkpointing orchestration progress, troubleshooting unexpected orchestration behavior, and ensuring high reliability. As of today, the Durable Task Scheduler is available across all Function App SKUs and includes autoscaling support in options like Flex Consumption.

“Durable Task Scheduler has significantly accelerated execution of complex business logic which requires orchestration. We are observing up to 10 times faster speed as compared to the blob storage backend. We also love the dashboard view for our taskhubs, which gives us great visibility and helps us monitor, time and manage our workflows.” – Roney Varghese, Software Engineer at Pinnacle Tech

Dedicated and Consumption SKUs

Dedicated SKU (GA)

The Dedicated SKU, which has been available in public preview since March of this year, has now graduated to General Availability. It delivers predictable performance and high reliability with dedicated infrastructure, high throughput, and up to 90 days of orchestration data retention. It’s ideal for mission-critical workloads requiring consistent, high-scale throughput and for organizations that prefer predictable billing. Key features of the Dedicated SKU include:
- Dedicated Infrastructure: Runs on dedicated resources, guaranteeing isolation.
- Custom Scaling: Configure Capacity Units (CUs) to match your workload needs.
- High Availability: Available with multi-CU deployments.
- Data Retention: Up to 90 days.
- Performance: Each CU supports up to 2,000 actions per second and 50 GB of orchestration data.

What’s new in the Dedicated SKU?

More Capacity Units

As of today, the Dedicated SKU enables you to purchase additional capacity units for higher performance and more orchestration data storage.

High Availability

For mission-critical applications that require even higher availability, the Dedicated SKU now offers a High Availability feature. To enable high availability, you need at least 3 capacity units on your scheduler instance.

Learn more about the Dedicated SKU here: https://aka.ms/dts-dedicated-sku

Consumption SKU (Public Preview)

We’ve heard your feedback loud and clear. We understand that the Dedicated SKU isn’t the right fit for every scenario. That’s why we’re introducing a new pricing plan: the Consumption SKU, tailored for workloads that run intermittently or scale dynamically, and for requirements where flexibility and cost efficiency matter most. The Consumption SKU is perfect for variable workloads and development/test environments. It offers:
- Pay-Per-Use: Only pay for actions dispatched.
- No Upfront Costs: No minimum commitments.
- Data Retention: Up to 30 days.
- Performance: Up to 500 actions per second.

Learn more about the Consumption SKU here: https://aka.ms/dts-consumption-sku

Roadmap

We’re excited to reach this milestone, but we also have many plans for the future. Here’s a glimpse of the features you can expect to see in the Durable Task Scheduler in the near future:
- Private Endpoints
- Zone Redundancy in the Dedicated SKU
- Export API – Need your orchestration data for longer than the max retention limit? Use the Export API to move data out of DTS into a storage provider of your choice.
- Dynamic Scaling of Capacity Units – Set a minimum and maximum and allow DTS to dynamically scale up and down depending on orchestration throughput.
- Ability to handle payloads larger than 1 MB

Get started with the Durable Task Scheduler today

Documentation: https://aka.ms/dts-documentation
Samples: https://aka.ms/dts-samples
Getting Started: https://aka.ms/dts-getting-started
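As a rough planning aid for the Dedicated SKU, the per-CU figures quoted in this post (up to 2,000 actions per second and 50 GB of orchestration data per CU, and at least 3 CUs for high availability) can be turned into a back-of-the-envelope estimate. The sketch below assumes that capacity scales linearly with the number of CUs, which this post does not state; the function name is hypothetical and the result is not official sizing guidance.

```python
import math

# Per-CU figures quoted in this post for the Dedicated SKU.
ACTIONS_PER_SECOND_PER_CU = 2_000
STORAGE_GB_PER_CU = 50
MIN_CUS_FOR_HIGH_AVAILABILITY = 3


def estimate_capacity_units(peak_actions_per_second: float,
                            orchestration_data_gb: float,
                            high_availability: bool = False) -> int:
    """Rough CU estimate, ASSUMING capacity scales linearly with CU count."""
    if peak_actions_per_second < 0 or orchestration_data_gb < 0:
        raise ValueError("inputs must be non-negative")
    # Size for whichever dimension (throughput or storage) needs more CUs.
    by_throughput = math.ceil(peak_actions_per_second / ACTIONS_PER_SECOND_PER_CU)
    by_storage = math.ceil(orchestration_data_gb / STORAGE_GB_PER_CU)
    needed = max(1, by_throughput, by_storage)
    if high_availability:
        # The post states high availability needs at least 3 CUs.
        needed = max(needed, MIN_CUS_FOR_HIGH_AVAILABILITY)
    return needed


if __name__ == "__main__":
    print(estimate_capacity_units(3_000, 40))
    print(estimate_capacity_units(500, 10, high_availability=True))
```

Treat the output as a starting point for a load test rather than a guarantee, since real throughput depends on the workload.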
