Common Misconceptions When Running Locally vs. Deploying to Azure Linux-based Web Apps
TOC: Introduction · Environment Variable · Build Time · Compatible · Memory · Conclusion

1. Introduction

One of the most common issues during project development is the scenario where “the application runs perfectly in the local environment but fails after being deployed to Azure.” In most cases, deployment logs will clearly reveal the problem and allow you to fix it quickly. However, there are also more complicated situations where, due to the nature of the error itself, the relevant logs may be difficult to locate. This article introduces several common categories of such problems and explains how to troubleshoot them. We will demonstrate them using Python and popular AI-related packages, as these tend to exhibit compatibility-related behavior. Before you begin, it is recommended that you read Deployment and Build from Azure Linux based Web App | Microsoft Community Hub to learn how Azure Linux-based Web Apps perform deployments, so you have a basic understanding of the build process.

2. Environment Variable

Simulating a Local Flask + sklearn Project

First, let’s simulate a minimal Flask + sklearn project in any local environment (VS Code in this example). For simplicity, the sample code does not actually use any sklearn functions; it only displays plain text.

app.py

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello deploy environment variable"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

We also preset the environment variables required during Azure deployment, although these will not be used when running locally.

.deployment

[config]
SCM_DO_BUILD_DURING_DEPLOYMENT=false

As you may know, the old package name sklearn has long been deprecated in favor of scikit-learn. However, for the purpose of simulating a compatibility error, we will intentionally specify the outdated package name.
requirements.txt

Flask==3.1.0
gunicorn==23.0.0
sklearn

After running the project locally, you can open a browser and navigate to the target URL to verify the result.

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python app.py

Of course, you may encounter the same compatibility issue even in your local environment. Simply running the following command resolves it:

export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True

We will revisit this error and its solution shortly. For now, create a Linux Web App running Python 3.12 and configure the following environment variables. We will define Oryx Build as the deployment method.

SCM_DO_BUILD_DURING_DEPLOYMENT=false
WEBSITE_RUN_FROM_PACKAGE=false
ENABLE_ORYX_BUILD=true

After deploying the code and checking the Deployment Center, you should see an error similar to the following. From the detailed error message, the cause is clear: sklearn is deprecated and replaced by scikit-learn, so additional compatibility handling is now required by the Python runtime. The error message suggests the following solutions:

Install the newer scikit-learn package directly.
If your project is deeply coupled to the old sklearn package and cannot be refactored yet, enable compatibility by setting an environment variable that allows installation of the deprecated package.

Typically, this type of “works locally but fails on Azure” behavior happens because the deprecated package was installed in the local environment long ago, at the start of the project, and everything has been running smoothly since. Package compatibility issues like this are very common across various languages on Linux. When a project becomes tightly coupled to an outdated package, you may not be able to upgrade it immediately. In these cases, compatibility workarounds are often the only practical short-term solution.
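For reference, the failure and the workaround can be reproduced end-to-end in a throwaway shell session. This is a sketch that assumes internet access to PyPI; the temporary directory name is arbitrary:

```shell
# Fresh virtual environment to reproduce the deprecated-package failure.
python3 -m venv /tmp/sklearn-demo
source /tmp/sklearn-demo/bin/activate

# Without the opt-in variable, installing the deprecated name fails with a
# message pointing to scikit-learn.
pip install sklearn || echo "install failed as expected"

# Opt in to the deprecated package, then retry.
export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
pip install sklearn
```

Running the same export before `pip install` is exactly what the startup-script approach below does on the Azure side.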
In our example, we will add the environment variable:

SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True

However, here comes the real problem: this variable is needed during the build phase, but the environment variables set in the Azure Portal’s Application Settings only take effect at runtime. So what should we do? The answer is simple: shift the Oryx Build process from build time to runtime. First, open Azure Portal → Configuration and disable Oryx Build.

ENABLE_ORYX_BUILD=false

Next, modify the project by adding a startup script.

run.sh

#!/bin/bash
export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python app.py

The startup script works just like the commands you run locally before executing the application. The difference is that you can inject the necessary compatibility environment variables before running pip install or starting the app. After that, return to Azure Portal and add the following Startup Command under Stack Settings. This ensures that your compatibility environment variables and build steps run before the runtime starts.

bash run.sh

Your overall project structure will now look like this. Once redeployed, everything should work correctly.

3. Build Time

Build-Time Errors Caused by AI-Related Packages

Many build-time failures are caused by AI-related packages, whose installation processes can be extremely time-consuming. You can investigate these issues by reviewing the deployment logs at the following maintenance URL:

https://<YOUR_APP_NAME>.scm.azurewebsites.net/newui

Compatible

Let’s simulate a Flask + numpy project. The code is shown below.

app.py

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello deploy compatible"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

We reuse the same environment variables from the sklearn example.
.deployment

[config]
SCM_DO_BUILD_DURING_DEPLOYMENT=false

This time, we simulate the incompatibility between numpy==1.21.0 and Python 3.10.

requirements.txt

Flask==3.1.0
gunicorn==23.0.0
numpy==1.21.0

We will skip the local execution part and move directly to creating a Linux Web App running Python 3.10. Configure the same environment variables as before, and define the deployment method as runtime build.

SCM_DO_BUILD_DURING_DEPLOYMENT=false
WEBSITE_RUN_FROM_PACKAGE=false
ENABLE_ORYX_BUILD=false

After deployment, Deployment Center shows a successful publish. However, the actual website displays an error. At this point, you must check the deployment log files mentioned earlier. You will find two key logs:

1. docker.log: displays real-time logs of the platform creating and starting the container. In this case, you will see that the health probe exceeded the default 230-second startup window, causing container startup failure. This tells us the root cause is a container startup timeout. To determine why it timed out, we must inspect the second file.

2. default_docker.log: contains the internal execution logs of the container. It is not generated in real time and is usually delayed by around 15 minutes. Therefore, if docker.log shows a timeout error, wait at least 15 minutes to allow the logs to be written here. In this example, the internal log shows that numpy was being compiled during pip install, and the compilation step took too long.

We now have a concrete diagnosis: numpy 1.21.0 ships no prebuilt wheel for Python 3.10, which forces pip to compile it from source. The compilation exceeds the platform’s startup time limit (230 seconds) and causes the container to fail. We can verify this by checking numpy’s page on PyPI (numpy · PyPI): numpy 1.21.0 only provides wheels for cp37, cp38, and cp39, but not cp310 (Python 3.10). Thus, compilation becomes unavoidable.
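You can confirm the wheel situation locally without deploying anything. The sketch below asks pip to resolve only prebuilt binary wheels for a given interpreter version; the flags are standard pip options, and the target directory is arbitrary:

```shell
# No cp310 wheel exists for numpy 1.21.0, so restricting pip to binary
# wheels for Python 3.10 makes the resolution fail fast instead of
# falling back to a slow source build.
pip download numpy==1.21.0 --only-binary=:all: --python-version 3.10 --no-deps -d /tmp/wheels

# The same request for 1.25.0 succeeds, because cp310 wheels are published.
pip download numpy==1.25.0 --only-binary=:all: --python-version 3.10 --no-deps -d /tmp/wheels
```

This is a cheap pre-deployment check: if the first command fails for your pinned version and target Python, the platform build will have to compile from source.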
Possible Solutions

Set the environment variable WEBSITES_CONTAINER_START_TIME_LIMIT to increase the allowed container startup time.
Downgrade Python to 3.9 or earlier.
Upgrade numpy to a newer version that provides wheels for Python 3.10. In this example, we choose this option.

After upgrading numpy to version 1.25.0 (which supports Python 3.10) by specifying it in requirements.txt and redeploying, the issue is resolved.

requirements.txt

Flask==3.1.0
gunicorn==23.0.0
numpy==1.25.0

Memory

The final example concerns the App Service SKU. AI packages such as Streamlit, PyTorch, and others require significant memory. Any one of these packages may cause the build process to fail due to insufficient memory. The error messages vary widely each time. If you repeatedly encounter unexplained build failures, check Deployment Center or default_docker.log for Exit Code 137, which indicates that the system ran out of memory during the build. The only solution in such cases is to scale up.

4. Conclusion

This article introduced several common troubleshooting techniques for resolving Linux Web App issues caused during the build stage. Most of these problems relate to package compatibility, although the symptoms may vary greatly. By understanding the debugging process demonstrated in these examples, you will be better prepared to diagnose and resolve similar issues in future deployments.

Azure SRE Agent: Expanding Observability and Multi-Cloud Resilience
The Azure SRE Agent continues to evolve as a cornerstone for operational excellence and incident management. Over the past few months, we have made significant strides in enabling integrations with leading observability platforms (Dynatrace, New Relic, and Datadog) through Model Context Protocol (MCP) servers. These partnerships serve joint customers, enabling automated remediation across diverse environments.

Deepening Integrations with MCP Servers

Our collaboration with these partners is more than technical: it is about delivering value at scale. Datadog, New Relic, and Dynatrace are all Azure Native ISV Service partners. With these integrations, Azure Native customers can also choose to add these MCP servers directly from the Azure Native partners’ resource:

Datadog: At Ignite, Azure SRE Agent was presented with the Datadog MCP Server to demonstrate how our customers can streamline complex workflows. Customers can now bring their Datadog MCP Server into Azure SRE Agent, extending knowledge capabilities and centralizing logs and metrics. Find Datadog’s Azure Native offerings on Marketplace.

New Relic: When an alert fires in New Relic, the Azure SRE Agent calls the New Relic MCP Server to provide Intelligent Observability insights. This agentic integration with the New Relic MCP Server offers over 35 specialized tools across entity and account management, alerts and monitoring, data analysis and queries, performance analysis, and much more. The advanced remediation skills of the Azure SRE Agent + New Relic AI help our joint customers diagnose and resolve production issues faster. Find New Relic’s Azure Native offering on Marketplace.

Dynatrace: The Dynatrace integration bridges Microsoft Azure's cloud-native infrastructure management with Dynatrace's AI-powered observability platform, leveraging the Davis AI engine and remote MCP server capabilities for incident detection, root cause analysis, and remediation across hybrid cloud environments.
Check out Dynatrace’s Azure Native offering on Marketplace.

These integrations are made possible by Azure SRE Agent’s MCP connectors. The MCP connectors in Azure SRE Agent act as the bridge between the agent and MCP servers, enabling dynamic discovery and execution of specialized tools for observability and incident management across diverse environments. This feature allows customers to build their own custom sub-agents that leverage tools from MCP servers on integrated platforms like Dynatrace, Datadog, and New Relic to complement the agent’s diagnostic and remediation capabilities. By connecting Azure SRE Agent to external MCP servers, scenarios such as cross-platform telemetry analysis are unlocked.

Looking Ahead: Multi-Agent Collaboration

Azure SRE Agent isn’t stopping with MCP integrations. We’re actively working with PagerDuty and NeuBird to support dynamic use cases via agent-to-agent collaboration:

PagerDuty: PagerDuty’s PD Advance SRE Agent is an AI-powered assistant that triages incidents by analyzing logs, diagnostics, past incident history, and runbooks to surface relevant context and recommended remediations. At Ignite, PagerDuty and Microsoft demonstrated how Azure SRE Agent can ingest PagerDuty incidents and collaborate with PagerDuty’s SRE Agent to complement triage using historical patterns, runbook intelligence, and Azure diagnostics.

NeuBird: NeuBird’s agentic AI SRE, Hawkeye, autonomously investigates and resolves incidents across hybrid and multi-cloud environments. By connecting to telemetry sources like Azure Monitor, Prometheus, and GitHub, Hawkeye delivers real-time diagnosis and targeted fixes. Building on the work presented at SRE Day, this partnership underscores our commitment to agentic ecosystems where specialized agents collaborate on complex scenarios. Sign up for the private preview to try the integration here. Additionally, please check out NeuBird on Marketplace.
These efforts reflect a broader vision: Azure SRE Agent as a hub for cross-platform reliability, enabling customers to manage incidents across Azure, on-premises, and other clouds with confidence.

Why This Matters

As organizations embrace distributed architectures, the need for integrated, intelligent, and multi-cloud-ready SRE solutions has never been greater. By partnering with industry leaders and pioneering agent-to-agent workflows, Azure SRE Agent is setting the stage for a future where resilience is not just reactive but proactive and collaborative.

Announcing Advanced Kubernetes Troubleshooting Agent Capabilities (preview) in Azure Copilot
What’s new?

Today, we're announcing Kubernetes troubleshooting agent capabilities in Azure Copilot, offering an intuitive, guided agentic experience that helps users detect, triage, and resolve common Kubernetes issues in their AKS clusters. The agent can provide root cause analysis for Kubernetes clusters and resources and is triggered by Kubernetes-specific keywords. It can detect problems like resource failures and scaling bottlenecks, intelligently correlates signals across metrics and events by running `kubectl` commands as it reasons, and provides actionable solutions. By simplifying complex diagnostics and offering clear next steps, the agent empowers users to troubleshoot independently.

How it works

With the Kubernetes troubleshooting agent, Azure Copilot automatically investigates issues in your cluster by running targeted `kubectl` commands and analyzing your cluster’s configuration and current state. For instance, it identifies failing or pending pods, cluster events, resource utilization metrics, and configuration details to build a complete picture of what’s causing the issue. Azure Copilot then determines the most effective mitigation steps for your specific environment. It provides clear, step-by-step guidance and, in many cases, offers a one-click fix to resolve the issue automatically. If Azure Copilot can’t fully resolve the problem, it can generate a pre-populated support request with all the diagnostic details Microsoft Support needs. You’ll be able to review and confirm everything before the request is submitted. This agent is available via Azure Copilot in the Azure Portal. Learn more about how Azure Copilot works.

How to Get Started

To start using agents, your global administrator must request access to the agents preview at the tenant level in the Azure Copilot admin center. This confirms your interest in the preview and allows us to enable access.
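The `kubectl` side of such an investigation is the same one you could run by hand. A minimal triage sketch, where `$POD` and `$NS` are placeholders for the pod and namespace under investigation:

```shell
# Pods that cannot be scheduled, across all namespaces.
kubectl get pods -A --field-selector=status.phase=Pending

# Recent cluster events, oldest first; scheduling and scaling problems
# usually surface here.
kubectl get events -A --sort-by=.lastTimestamp

# Probe failures, ImagePullBackOff details, and restart counts for one pod.
kubectl describe pod "$POD" -n "$NS"

# Resource utilization (requires metrics-server in the cluster).
kubectl top pods -n "$NS"

# Logs from the previous container instance after a crash loop.
kubectl logs "$POD" -n "$NS" --previous
```

The agent correlates the output of commands like these automatically; knowing the manual equivalents helps you verify its findings.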
Once approved, users will see the Agent mode toggle in Azure Copilot chat and can then start using Copilot agents. Capacity is limited, so sign up early for the best chance to participate. Additionally, if you are interested in helping shape the future of agentic cloud ops and the role Copilot will play in it, please join our customer feedback program by filling out this form. Agents (preview) in Azure Copilot | Microsoft Learn

Troubleshooting sample prompts

From an AKS cluster resource, click Kubernetes troubleshooting with Copilot to automatically open Azure Copilot in the context of the resource you want to troubleshoot.

Try These Prompts to Get Started

Here are a few examples of the kinds of prompts you can use. If you're not already working in the context of a resource, you may need to provide the specific resource that you want to troubleshoot.

"My pod keeps restarting, can you help me figure out why?"
"Pods are stuck pending, what is blocking them from being scheduled?"
"I am getting ImagePullBackOff, how do I fix this?"
"One of my nodes is NotReady, what is causing it?"
"My service cannot reach the backend pod, what should I check?"

Note: When using these kinds of prompts, be sure agent mode is enabled by selecting the icon in the chat window.

Learn More

Troubleshooting agent capabilities in Agents (preview) in Azure Copilot | Microsoft Learn
Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips - AKS Engineering Blog
Microsoft Copilot in Azure Series - Kubectl | Microsoft Community Hub

Compose for Agents on Azure Container Apps and Serverless GPU (public preview)
Empowering intelligent applications

The next wave of AI is agentic: systems that can reason, plan, and act on our behalf. Whether you’re building a virtual assistant that books travel or a multi-model workflow that triages support tickets, these applications rely on multiple models, tools, and services working together. Unfortunately, building them has not been easy:

Tooling sprawl. Developers must wire together LLMs, vector databases, MCP (Model Context Protocol) tools, and orchestration logic, often across disparate SDKs and running processes. Keeping those dependencies in sync for local development and production is tedious and error-prone.

Specialized hardware. Large language models and agent orchestration frameworks often require GPUs to run effectively. Procuring and managing GPU instances can be costly, particularly for prototypes and small teams.

Operational complexity. Agentic applications are typically composed of many services. Scaling them, managing health and secure connectivity, and reproducing the same environment from a developer laptop into production quickly becomes a full-time job.

Why Azure Container Apps is the right home

With Azure Container Apps (ACA), you can now tackle these challenges without sacrificing the familiar Docker Compose workflow that so many developers love. We’re excited to announce that Compose for Agents is in public preview on Azure Container Apps. This integration brings the power of Docker’s new agentic tooling to a platform that was built for serverless containers. Here’s why ACA is the natural home for agentic workloads:

Serverless GPUs with per-second billing. Container Apps offers serverless GPU compute. Your agentic workloads can run on GPUs only when they need to, and you only pay for the seconds your container is actually running. This makes it economical to prototype and scale complex models without upfront infrastructure commitments.
Media reports on the preview note that Docker’s Offload service uses remote GPUs via cloud providers such as Microsoft to overcome local hardware limits, and ACA brings that capability directly into the Azure-native experience.

Sandboxed dynamic sessions for tools. Many agentic frameworks execute user-provided code as part of their workflows. ACA’s dynamic sessions provide secure, short-lived sandboxes for running these tasks. This means untrusted or transient code (for example, evaluation scripts or third-party plugins) runs in an isolated environment, keeping your production services safe.

Fully managed scaling and operations. Container Apps automatically scales each service based on traffic and queue length, and it can scale down to zero when idle. You get built-in service discovery, ingress, rolling updates, and revision management without having to operate your own orchestrator. Developers can focus on building agents rather than patching servers.

First-class Docker Compose support. Compose remains a favourite tool for developers’ inner loop and for orchestrating multi-container systems. Compose for Agents extends the format to declare open-source models, agents, and tools alongside your microservices. By pointing docker compose up at ACA, the same YAML file you use locally now deploys automatically to a fully managed container environment.

Model Runner and MCP Gateway built in. Docker’s Model Runner lets you pull open-weight language models from Docker Hub and exposes them via OpenAI-compatible endpoints, and the MCP (Model Context Protocol) Gateway connects your agents to curated tools. ACA integrates these components into your Compose stack, giving you everything you need for retrieval-augmented generation, vector search, or domain-specific tool invocation.

What this means for developers

The Compose for Agents public preview on Container Apps brings together the simplicity of Docker Compose and the operational power of Azure’s serverless compute platform.
Developers can now:

Define agent stacks declaratively. Instead of cobbling together scripts, you describe your entire agentic application in a single compose.yaml file. Compose already supports popular frameworks like LangGraph, Embabel, Vercel AI SDK, Spring AI, Crew AI, Google ADK, and Agno. You can mix and match these frameworks with your own microservices, databases, and queues.

Run anywhere with the same configuration. Docker emphasizes that you can “define your open models, agents and MCP-compatible tools, then spin up your full agentic stack with a simple docker compose up”. By bringing this workflow to ACA, Microsoft ensures that the same compose file runs unchanged on your laptop and in the cloud.

Scale seamlessly. Large language models and multi-agent orchestration can be compute-intensive. News coverage notes that Docker’s Offload service provides remote GPUs for these workloads. ACA extends that capability with serverless GPUs and automated scaling, letting you test locally and then burst to the cloud with no changes to your YAML.

Collaboration with Docker

This preview is the result of close collaboration between Microsoft and Docker. Docker has always been focused on simplifying complex developer workflows. “With Compose for Agents, we’re extending that same experience that developers know and love from containers to agents, bringing the power of Compose to the emerging world of AI-native, agentic applications. It delivers the same simplicity and predictability to prototyping, testing, and deploying across local and cloud environments,” said Elyi Aleyner, VP of Strategy and Head of Tech Alliances at Docker. “We’re excited to partner with Microsoft to bring this innovation to Azure Container Apps, enabling developers to go from ‘compose up’ on their laptops to secure, GPU-backed workloads in the cloud with zero friction.”

Empowering choice

Every team has its own favourite frameworks and tools.
We’ve ensured that Compose for Agents on ACA is framework-agnostic: you can use LangGraph for complex workflows, CrewAI for multi-agent coordination, or Spring AI to integrate with your existing Java stack. Want to run a vector store from the MCP catalog alongside your own service? Simply add it to your Compose file. Docker’s curated catalog provides over a hundred ready-to-use tools and services for retrieval, document summarization, database access, and more. ACA’s flexibility means you’re free to choose the stack that best fits your problem.

Get started today

The public preview of Compose for Agents support in Azure Container Apps is available now. You can:

Install the latest Azure Container Apps extension.
Define your application in a compose.yaml file, including models, tools, and agent code, and deploy to ACA via az containerapp compose up. ACA will provision GPU resources, dynamic sessions, and auto-scaling infrastructure automatically.
Iterate locally using standard docker compose up commands, then push the same configuration to the cloud when you’re ready.

For more detailed instructions, please go to https://aka.ms/aca/compose-for-agents

Security Where It Matters: Runtime Context and AI Fixes Now Integrated in Your Dev Workflow
Security teams and developers face the same frustrating cycle: thousands of alerts, limited time, and no clear way to know which issues matter most. Applications suffer attacks as often as once every three minutes [1], emphasizing the importance of proactive security that prioritizes critical, exploitable vulnerabilities. Microsoft is leading this shift with new integrations in the end-to-end solution that combines GitHub Advanced Security’s developer-first application security tooling with Microsoft Defender for Cloud's runtime protection, enhanced by agentic remediation. It is now available in public preview. This integration empowers organizations to secure code to cloud and accelerates the tackling of security issues in their software portfolio using agentic remediation and runtime context-based vulnerability prioritization. The result: fewer distractions, faster fixes, better collaboration, and more proactive security from code to cloud.

The DevSecOps Dilemma: too many alerts, not enough action

Over the past decade, the application security industry has made significant strides in improving detection accuracy and fostering collaboration between security teams and developers. These advances have enabled both groups to work together on real issues and drive meaningful progress. However, despite these improvements, remediation trends across the industry have remained stagnant. Quarter after quarter, year after year, vulnerability counts continue to rise, with critical and high vulnerabilities constituting 17.4% of vulnerability backlogs and a mean time to remediation (MTTR) of 116 days [2].

Today, three big challenges slow teams down:

Security teams are drowning in alert fatigue, struggling to distinguish real, exploitable risks from noise. At the same time, AI is rapidly introducing new threat vectors that defenders have little time to research or understand, leaving organizations vulnerable to missed threats and evolving attack techniques.
Developers lack clear prioritization while remediation is slow, so they lose time fixing issues that may never be exploited. Remediation cycles drag on, leaving systems exposed to potential attacks while teams debate which issues matter most or search for the right person to fix them.

Both teams rely on separate, non-integrated tools, making collaboration slow and frustrating. Development and security teams frequently operate in silos, reducing efficiency and creating blind spots.

This leads to wasted time, unresolved threats, and growing backlogs. Teams are stuck reacting to noise instead of solving real problems.

DevSecOps reimagined in the era of AI

Your app is live and serving thousands of customers. Defender for Cloud detects a vulnerability in an internet-facing API that handles sensitive data. In the past, this alert would age in a dashboard while developers worked on unrelated fixes because they didn’t know it was the critical one. Now, with the new integration, a security campaign can be created in GitHub filtering for runtime risk (internet exposure, sensitive data, etc.), notifying the developer to prioritize this issue. The developer views the issue in their workflow, understands why it matters, and uses Copilot Autofix to apply an AI-suggested fix in minutes. The developer can then select these risks in bulk and assign the GitHub Copilot coding agent to create a draft PR for a multi-merge fix ready for human review.

Virtual Registry: Code-to-Runtime Mapping

Code-to-runtime mapping is possible with the Virtual Registry, which makes GitHub a trusted source for artifact metadata. Integrated with Microsoft Defender for Cloud, the Virtual Registry enables smarter risk prioritization and faster incident response. Teams can quickly answer: Is this vulnerability running in production? Is it exposed to sensitive workloads? Do I need to act now? By combining runtime and repository context, the Virtual Registry streamlines alert triage and incident response.
We shipped a new set of filters to Code Scanning, Dependabot, and Security Campaigns that are based on the artifact metadata stored in the Virtual Registry.

Faster fixes with agentic remediation

The integration includes Copilot Autofix, an AI-powered tool that suggests code changes to fix security problems. It checks that the fixes work and helps developers resolve issues quickly, without switching tools. To complete the agentic workflow, these autofixes can be bulk-assigned to the GitHub Copilot coding agent to create a draft pull request awaiting human review.

Why this matters

Fewer alerts to sort through: Focus only on what’s exploitable in production.

Faster fixes: AI-powered fix suggestions through GitHub Copilot Autofix have been shown to fix 50% of alerts within the PR, with a 70% reduction in mean time to remediation [3].

Better teamwork: Developers and security teams collaborate seamlessly. With collaborative security now powered by connected context, we’ve seen 68% of alerts remediated using GitHub Advanced Security’s security campaigns [3].

Try it now

This feature is available in public preview and will be showcased at Microsoft Ignite. If your team builds cloud-native applications, this integration helps you protect code to cloud more effectively, without slowing down development.

Customer FAQs

How do I start using the integration?

From Microsoft Defender for Cloud: Go to the environment section in the Defender for Cloud portal. Grant a new GitHub connector or update an existing one to provide consent to scan your source code. If you use GitHub, setup is one click. You’ll immediately see initial scan results and recommended fixes.

From GitHub: You will be able to filter alerts by runtime context in addition to receiving AI-suggested fixes.

How do I purchase this integration?
For GitHub: GitHub Advanced Security (GHAS) is available as:

Code Security SKU: $30 per committer/month (available April 2025)
GHAS Bundle: $49 per committer/month (available now)
GitHub Enterprise Cloud
GitHub Copilot

For Microsoft Defender for Cloud CSPM:

Defender CSPM: $5 per billable resource/month

Both can be enabled through the Azure Portal as Azure meters.

[1]: Software Under Siege | AppSec Threat Report 2025 | Contrast Security
[2]: Edgescan | Vulnerability Statistics Report 2025
[3]: GitHub Internal Data

Reimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent, context-aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach, drawing from proven practices in internal Azure operations, the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams' operations, delivering strong ROI for teams seeking sustainable AIOps.

An Operations Agent that adapts to your playbooks

Azure SRE Agent is an AI-powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues to automating operational workflows and integrating seamlessly with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update.

What’s New: Automation, Integration, and Extensibility

Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release:

No-code Sub-Agent Builder: Rapidly create custom automations without writing code.
Flexible, event-driven triggers: Instantly respond to incidents and operational changes.
Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources.
Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP.
- Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box.

Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs.

Sub-Agent Builder: Custom Automation, No Code Required

Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions.

- Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage.
- Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more.
- Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires.

Flexible Triggers: Automate on Your Terms

Invoke the agent to respond automatically to mission-critical events rather than waiting for manual commands. This speeds up incident response and eliminates missed opportunities for efficiency.

- Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to incident management and ticketing systems like PagerDuty and ServiceNow, observability alerting systems like Azure Monitor alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks.
Additional trigger sources such as GitHub issues, Azure DevOps pipelines, and email will be added over time. This means automation can start exactly when and where you need it.

- Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events - like deployments, incidents, or customer requests. Vital for reducing downtime, this ensures that business-critical actions happen automatically and promptly.

Expanded Data Connectivity: Unified Observability and Troubleshooting

Integrate data to enable comprehensive diagnostics, faster troubleshooting, and more informed decision-making by eliminating silos and speeding up issue resolution.

- Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation.
- Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through remote MCP servers, enabling it to retrieve needed files on its own. This approach leverages your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents.

Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments.
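To make the Knowledge Integration capability above concrete, here is a rough illustration of the kind of TSG/runbook file an agent could ingest. The service name, alert name, and thresholds are hypothetical examples, not taken from the product documentation:

```markdown
# TSG: API latency spikes on orders-service

## Symptoms
- p95 latency exceeds 2 seconds on the /orders endpoint
- Azure Monitor alert "orders-latency-high" fires

## Diagnosis steps
1. Check recent deployments to the orders-service container app.
2. Query Log Analytics for 5xx responses in the last 30 minutes.
3. Inspect CPU and memory pressure on the underlying compute.

## Mitigation
- If a deployment in the last hour correlates with the spike, roll back to the previous revision.
- Otherwise, scale out replicas and escalate to the on-call engineer.
```

Because the steps are written as an ordered, unambiguous checklist, an agent can turn each numbered step into a discrete action in its execution plan rather than guessing at intent.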
Custom Actions: Automate Anything, Anywhere

Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration.

- Out-of-the-Box Actions: Instantly automate common tasks like running az CLI or kubectl commands, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead.
- Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication.
- Bring Your Own Actions: Drop in your own remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence.

Prebuilt Operations Scenarios

Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction.

- Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation across your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types, including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure CosmosDB, Azure VMs, etc. Support for additional resource types is being added continually; please see the product documentation for the latest information.
- Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis, including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance.
- Handle Complex Investigations: Enable the deep investigation mode, which uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues.
- Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management.
- Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs - capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. You automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment.
- GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as the GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle.

Ready to get started?
- Azure SRE Agent home page
- Product overview
- Pricing Page
- Pricing Calculator
- Pricing Blog
- Demo recordings
- Deployment samples

What’s Next?

Give us feedback: Your feedback is critical. You can thumbs-up or thumbs-down each interaction or thread, use the “Give Feedback” button in the agent to give us in-product feedback, or create issues and share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent.

We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!

Azure DevOps for Container Apps: End-to-End CI/CD with Self-Hosted Agents
Join this hands-on session to learn how to build a complete CI/CD pipeline for containerized applications on Azure Container Apps using Azure DevOps. You'll discover how to leverage self-hosted agents running as event-driven Container Apps jobs to deploy a full-stack web application with frontend and backend components. In this practical demonstration, you'll see how to create an automated deployment pipeline that builds, tests, and deploys containerized applications to Azure Container Apps. You'll learn how self-hosted agents in Container Apps jobs provide a serverless, cost-effective solution that scales automatically with your pipeline demands—you only pay for the time your agents are running. Don't miss your spot!

Disciplined Guardrail Development in enterprise application with GitHub Copilot
What Is Disciplined Guardrail-Based Development?

In AI-assisted software development, approaches like Vibe Coding—which prioritize momentum and intuition—often fail to ensure code quality and maintainability. To address this, Disciplined Guardrail-Based Development introduces structured rules ("guardrails") that guide AI systems during coding and maintenance tasks, ensuring consistent quality and reliability.

To get AI (LLMs) to generate appropriate code, developers must provide clear and specific instructions. Two key elements are essential:

- What to build – Clarifying requirements and breaking down tasks
- How to build it – Defining the application architecture

The way these two elements are handled depends on the development methodology or process being used. Here are some examples.

How to Set Up Disciplined Guardrails in GitHub Copilot

To implement disciplined guardrail-based development with GitHub Copilot, two key configuration features are used:

1. Custom Instructions (.github/copilot-instructions.md): This file allows you to define persistent instructions that GitHub Copilot will always refer to when generating code.

- Purpose: Establish coding standards, architectural rules, naming conventions, and other quality guidelines.
- Best Practice: Instead of placing all instructions in a single file, split them into multiple modular files and reference them accordingly. This improves maintainability and clarity.
- Example Use: You might define rules like using camelCase for variables, enforcing error boundaries in React, or requiring TypeScript for all new code.

https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions

2. Chat Modes (.github/chatmodes/*.chatmode.md): These files define specialized chat modes tailored to specific tasks or workflows.

- Purpose: Customize Copilot’s behavior for different development contexts (e.g., debugging, writing tests, refactoring).
- Structure: Each .chatmode.md file includes metadata and instructions that guide Copilot’s responses in that mode.
- Example Use: A debug.chatmode.md might instruct Copilot to focus on identifying and resolving runtime errors, while a test.chatmode.md could prioritize generating unit tests with specific frameworks.

https://code.visualstudio.com/docs/copilot/customization/custom-chat-modes

The files to be created and their relationships are as follows. Next, let's look at how to create each one.

#1: Custom Instructions

With custom instructions, you can define commands that are always provided to GitHub Copilot. The prepared files are always referenced during chat sessions and passed to the LLM (this can also be confirmed from the chat history). An important note: split the content into several files and include links to those files within the .github/copilot-instructions.md file, because a single file can become too long if everything is written in it.

There are mainly two types of content that should be described in custom instructions:

A: Development Process (≒ outcome + creation method)
- What documents or code will be created: requirements specification, design documents, task breakdown tables, implementation code, etc.
- In what order and by whom they will be created: for example, proceed in the order of requirements definition → design → task breakdown → coding.

B: Application Architecture
- How will the outcomes defined in A be created? What technology stack and component structure will be used?

A concrete example of copilot-instructions.md is shown below.

# Development Rules

## Architecture
- When performing design and coding tasks, always refer to the following architecture documents and strictly follow them as rules.
### Product Overview - Document the product overview in `.github/architecture/product.md` ### Technology Stack - Document the technologies used in `.github/architecture/techstack.md` ### Coding Standards - Document coding standards in `.github/architecture/codingrule.md` ### Project Structure - Document the project directory structure in `.github/architecture/structure.md` ### Glossary (Japanese-English) - Document the list of terms used in the project in `.github/architecture/dictionary.md` ## Development Flow - Follow a disciplined development flow and execute the following four stages in order (proceed to the next stage only after completing the current one): 1. Requirement Definition 2. Design 3. Task Breakdown 4. Coding ### 1. Requirement Definition - Document requirements in `docs/[subsystem_name]/[business_name]/requirement.md` - Use `requirement.chatmode.md` to define requirements - Focus on clarifying objectives, understanding the current situation, and setting success criteria - Once requirements are defined, obtain user confirmation before proceeding to the next stage ### 2. Design - Document design in `docs/[subsystem_name]/[business_name]/design.md` - Use `design.chatmode.md` to define the design - Define UI, module structure, and interface design - Once the design is complete, obtain user confirmation before proceeding to the next stage ### 3. Task Breakdown - Document tasks in `docs/[subsystem_name]/[business_name]/tasks.md` - Use `tasks.chatmode.md` to define tasks - Break down tasks into executable units and set priorities - Once task breakdown is complete, obtain user confirmation before proceeding to the next stage ### 4. Coding - Implement code under `src/[subsystem_name]/[business_name]/` - Perform coding task by task - Update progress in `docs/[subsystem_name]/[business_name]/tasks.md` - Report to the user upon completion of each task Note: The only file that is always sent to the LLM is `copilot-instructions.md`. 
Documents linked from there (such as `product.md` or `techstack.md`) are not guaranteed to be read by the LLM. That said, a reasonably capable LLM will usually review these files before proceeding with the work. If the LLM does not properly reference each file, you may explicitly add these architecture documents to the context. Another approach is to instruct the LLM to review these files in the **chat mode settings**, which will be described later.

There are various “schools of thought” regarding application architecture, and it is still an ongoing challenge to determine exactly what should be defined and what documents should be created. The choice of architecture depends on factors such as the business context, development scale, and team structure, so it is difficult to prescribe a one-size-fits-all approach. That said, as a general guideline, it is desirable to summarize the following:

- Product Overview: Overview of the product, service, or business, including its overall characteristics
- Technology Stack: What technologies will be used to develop the application?
- Project Structure: How will folders and directories be organized during development?
- Module Structure: How will the application be divided into modules?
- Coding Rules: Rules for handling exceptions, naming conventions, and other coding practices

Writing all of this from scratch can be challenging. A practical approach is to create template information with the help of Copilot and then refine it. Specifically, you can:

- Use tools like M365 Copilot Researcher to create content based on general principles
- Analyze a prototype application and have the architecture information summarized (using Ask mode or Edit mode, feed the solution files to a capable LLM for analysis)

However, in most cases, the output cannot be used as-is.
- The structure may not be analyzed correctly (hallucinations may occur)
- Project-specific practices and rules may not be captured

Use the generated content as a starting point, and then refine it to create architecture documentation tailored to your own project.

When creating architecture documents for enterprise-scale application development, a useful approach is to distinguish between the foundational parts and the individual application parts. Discipline-based guardrail development is particularly effective when building multiple applications in a “cookie-cutter” style on top of a common foundation. A clear example of this is Data-Oriented Architecture (DOA). In DOA, individual business applications are built on top of a shared database that serves as the overall common foundation. In this case, the foundational parts (the database layer) should not be modified arbitrarily by individual developers. Instead, focus on how to standardize the development of the individual application parts (the blue-framed sections) while ensuring consistency. Architecture documentation should be organized with this distinction in mind, emphasizing the uniformity of application-level development built upon the stable foundation.

#2: Chat Mode

By default, GitHub Copilot provides three chat modes: Ask, Edit, and Agent. However, by creating files under .github/chatmodes/*.chatmode.md, you can customize the Agent mode to create chat modes tailored for specific tasks. Specifically, you can configure the following three aspects.
Functionally, this allows you to perform a specific task without having to manually change the model or tools, or write detailed instructions each time:

- model: Specify the default LLM to use (Note: The user can still manually switch to another LLM if desired)
- tools: Restrict which tools can be used (Note: The user can still manually select other tools if desired)
- custom instructions: Provide custom instructions specific to this chat mode

A concrete example of .github/chatmodes/*.chatmode.md is shown below.

---
description: This mode is used for requirement definition tasks.
model: Claude Sonnet 4
tools: ['changes', 'codebase', 'editFiles', 'fetch', 'findTestFiles', 'githubRepo', 'new', 'openSimpleBrowser', 'runCommands', 'search', 'searchResults', 'terminalLastCommand', 'terminalSelection', 'usages', 'vscodeAPI', 'mssql_connect', 'mssql_disconnect', 'mssql_list_servers', 'mssql_show_schema']
---

# Requirement Definition Mode

In this mode, requirement definition tasks are performed. Specifically, the project requirements are clarified, and necessary functions and specifications are defined. Based on instructions or interviews with the user, document the requirements according to the format below. If any specifications are ambiguous or unclear, Copilot should ask the user questions to clarify them.
## File Storage Location

Save the requirement definition file in the following location:

- Save as `requirement.md` under the directory `docs/[subsystem_name]/[business_name]/`

## Requirement Definition Format

While interviewing the user, document the following items in the Markdown file:

- **Subsystem Name**: The name of the subsystem to which this business belongs
- **Business Name**: The name of the business
- **Overview**: A summary of the business
- **Use Cases**: Clarify who uses this business, when/under what circumstances, and for what purpose, using the following structure:
  - **Who (Persona)**: User or system roles
  - **When/Under What Circumstances (Scenario)**: Timing when the business is executed
  - **Purpose (Goal)**: Objectives or expected outcomes of the business
- **Importance**: The importance of the business (e.g., High, Medium, Low)
- **Acceptance Criteria**: Conditions that must be satisfied for the requirement to be considered met
- **Status**: Current state of the requirement (e.g., In Progress, Completed)

## After Completion

- Once requirement definition is complete, obtain user confirmation and proceed to the next stage (Design).

Tips for Creating Chat Modes

Here are some tips for creating custom chat modes:

- Align with the development process: Create chat modes based on the workflow and the deliverables.
- Instruct the LLM to ask the user when unsure: Direct the LLM to request clarification from the user if any information is missing.
- Clarify what deliverables to create and where to save them: Make it explicit which outputs are expected and their storage locations.

The second point is particularly important. Many LLMs tend to respond to user prompts in a sycophantic manner (known as sycophancy). As a result, they may fill in unspecified requirements or perform tasks that were not requested, often with the intention of being helpful.
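Given that tendency, it helps to make the "ask when unsure" guardrail explicit rather than implied. As an illustrative sketch (the exact wording below is hypothetical, not from the article), such a rule can be written as a short reusable block inside any chat mode file:

```markdown
## Clarification Rules

- If any requirement, constraint, or file location is ambiguous, do NOT guess.
- Ask the user a concrete question, offering two or three candidate interpretations.
- Only proceed with the task after the user confirms an interpretation.
- Record the confirmed decision in the deliverable so it remains traceable later.
```

Placing this block in each chat mode file keeps the agent's questioning behavior consistent across all phases of the development flow.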
The key difference between Ask/Edit modes and Agent mode is that Agent mode allows the LLM to proactively ask questions and engage in dialogue with the user. However, unless the user explicitly includes a prompt such as “ask if you don’t know,” the AI rarely initiates questions on its own. By creating a custom chat mode and instructing the LLM to “ask the user when unsure,” you can fully leverage the benefits of Agent mode.

About Tools

You can easily check tool names from the list of available tools in the command palette. Alternatively, as shown in the diagram below, it can be convenient to open the custom chat mode file and specify the tool configuration. You can specify not only the MCP server functionality but also built-in tools and Copilot Extensions.

Example of Actual Operation

An example interaction when using this chat mode is as follows:

- The LLM behaves according to the custom instructions defined in the chat mode.
- When you answer questions from GHC, the LLM uses that information to reason and proceed with the task.
- However, the output is not guaranteed to be correct (hallucinations may occur) → a human should review the output and make any necessary corrections before committing.

The basic approach to disciplined guardrail-based development has been covered above. In actual business application development, it is also helpful to understand the following two points:

- Referencing the database schema
- Integrated management of design documents and implementation code (important)

Reading the Database Schema

In business application development, requirements definition and functional design are often based on the schema information of entities. There are two main ways to allow the system to read schema information:

1. Dynamically read the schema from a development/test DB server using MCP or similar tools.
2. Include a file containing schema information within the project and read from it.
For the first approach, a development/test database can be prepared, and schema information can be read via an MCP server or Copilot Extensions. For SQL Server or Azure SQL Database, an MCP server is available, but its setup can be cumbersome, so of the two, Copilot Extensions are often easier. Even so, this dynamic approach is often seen online but is not recommended here, for the following reasons:

- Setting up an MCP server or Copilot Extensions can be cumbersome (installation, connection string management, etc.)
- It is time-consuming (the LLM needs schema information → reads the schema → writes code based on it)

Connecting to a DB server via MCP or similar tools is useful for scenarios such as “querying a database in natural language” for non-engineers performing data analysis. However, if the goal is simply to obtain the schema information of entities needed for business application development, the method described below is much simpler.

Storing Schema Information Within the Project

Place a file containing the schema information inside the project, in any of the following formats, and write custom instructions so that development refers to this file:

- DDL (full CREATE DATABASE scripts)
- O/R mapper files (e.g., Entity Framework context files)
- Text files documenting schema information, etc.

DDL files are difficult for humans to read, but LLMs can easily read and accurately understand them. In .NET + SQL development, it is recommended to include both the DDL and EF O/R mapper files. Additionally, if you include links to these files in your architecture documents and chat mode instructions, the LLM can generate code while understanding the schema with high accuracy.

Integrated Management of Design Documents and Implementation Code

Disciplined guardrail-based development with LLMs has made it practical to synchronize and manage design documents and implementation code together—something that was traditionally very difficult.
In long-standing systems, it is common for old design documents to become largely useless. During maintenance, code changes are often prioritized. As a result, updating and maintaining design documents tends to be neglected, leading to a significant divergence between design documents and the actual code. For these reasons, the following have been considered best practices (though often not followed in reality):

- Limit requirements and external design documents to the minimum necessary.
- Do not create internal design documents; instead, document within the code itself.
- Always update design documents before making changes to the implementation code.

When using LLMs, guardrail-based development makes it easier to enforce a “write the documentation first” workflow. Following the flow of defining specifications, updating the documents, and then writing code also helps the LLM generate appropriate code more reliably. Even if code is written first, LLM-assisted code analysis can significantly reduce the effort required to update the documentation afterward. However, the following points should be noted when doing this:

- Create and manage design documents as text files, not Word, Excel, or PowerPoint.
- Use text-based technologies like Mermaid for diagrams.
- Clearly define how design documents correspond to the code.

The last point is especially important. It is crucial to align the structure of requirements and design documents with the structure of the implementation code. For example:

- Place design documents directly alongside the implementation code.
- Align folder structures, e.g., /doc and /src.
- Information about grouping methods and folder mapping should be explicitly included in the custom instructions.

Conclusion of Disciplined Guardrail-Based Development with GHC

Formalizing and Applying Guardrails

- Define the development flow and architecture documents in .github/copilot-instructions.md using split references.
- Prepare .github/chatmodes/* for each development phase, enforcing “ask the AI if anything is unclear.”

Synchronization of Documents and Implementation Code

- Update docs first → use the diff as the basis for implementation (Doc-first).
- Keep docs in text format (Markdown/Mermaid).
- Fix folder correspondence between /docs and /src.

Handling Schemas

- Store DDL/O-R mapper files (e.g., EF) in the repository and have the LLM reference them.
- Minimize dynamic DB connections, prioritizing speed, reproducibility, and security.

This disciplined guardrail-based development technique is an AI-assisted approach that significantly improves the quality, maintainability, and team efficiency of enterprise business application development. Adapt it appropriately to each project to maximize productivity in application development.

Agentic Power for AKS: Introducing the Agentic CLI in Public Preview
We are excited to announce the agentic CLI for AKS, available now in public preview directly through the Azure CLI. A huge thank you to all our private preview customers who took the time to try out our beta releases and provide feedback to our team. The agentic CLI is now available for everyone to try--continue reading to learn how you can get started.

Why we built the agentic CLI for AKS

The way we build software is changing with the democratization of coding agents. We believe the same should happen for how users manage their Kubernetes environments. With this feature, we want to simplify the management and troubleshooting of AKS clusters, while reducing the barrier to entry for startups and developers by bridging the knowledge gap.

The agentic CLI for AKS is designed to simplify this experience by bringing agentic capabilities to your cluster operations and observability, translating natural language into actionable guidance and analysis. Whether you need to right-size your infrastructure, troubleshoot complex networking issues like DNS or outbound connectivity, or ensure smooth K8s upgrades, the agentic CLI helps you make informed decisions quickly and confidently. Our goal: streamline cluster operations and empower teams to ask questions like “Why is my pod restarting?” or “How can I optimize my cluster for cost?” and get instant, actionable answers.

The agentic CLI for AKS is built on the open-source HolmesGPT project, which has recently been accepted as a CNCF Sandbox project. With a pluggable LLM endpoint structure and open-source backing, the agentic CLI is purpose-built for customizability and data privacy.

From private to public preview: what's new?

Earlier this year, we launched the agentic CLI in private beta for a small group of AKS customers. Their feedback has shaped what's new in our public preview release, which we are excited to share with the broader AKS community.
Let’s dig in:

- Simplified setup: One-time initialization for LLM parameters with 'az aks agent-init'. Configure your LLM parameters such as API key and model through a simple, guided user interface.
- AKS MCP integration: Enable the agent to install and run the AKS MCP server locally (directly in your CLI client) for advanced context-aware operations. The AKS MCP server includes tools for AKS clusters and associated Azure resources. Try it out: az aks agent "list all my unhealthy nodepools" --aks-mcp -n <cluster-name> -g <resource-group>
- Deeper investigations: A new "Task List" feature helps the agent plan and execute complex investigations, with a checklist-style tracker that keeps you updated on the agent's progress and planned tool calls.
- Provide in-line feedback: Share insights directly from the CLI about the agent's performance using /feedback. Provide a rating of the agent's analysis and optional written feedback directly to the agentic CLI team. Your feedback is highly appreciated and will help us improve the agentic CLI's capabilities.
- Performance and security improvements: Minor improvements for faster load times and reduced latency, as well as hardened initialization and token handling.

Getting Started

1. Install the extension: az extension add --name aks-agent
2. Set up your LLM endpoint: az aks agent-init
3. Start asking questions. Some recommended scenarios to try out:
   - Troubleshoot cluster health: az aks agent "Give me an overview of my cluster's health"
   - Right-size your cluster: az aks agent "How can I optimize my node pool for cost?"
   - Try out the AKS MCP integration: az aks agent "Show me CPU and memory usage trends" --aks-mcp -n <cluster-name> -g <resource-group>
   - Get upgrade guidance: az aks agent "What should I check before upgrading my AKS cluster?"
4. Update the agentic CLI extension: az extension update --name aks-agent

Join the Conversation

We’d love your feedback!
Use the built-in /feedback command or visit our GitHub repository to share ideas and issues.

Learn more: https://aka.ms/aks/agentic-cli
Share feedback: https://aka.ms/aks/agentic-cli/issues

Expanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly, with no sign-up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience.

You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the links below.

📖 Learn more about SRE Agent.
👉 Create your first SRE Agent (Azure login required)

What’s New in Azure SRE Agent - October Update

The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation, built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit the product documentation for the latest updates.

Here are a few highlights for this month:

- Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. It also uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. Teams can choose their desired level of autonomy, from read-only insights to approval-gated actions to full automation, without compromising control.
- Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for the Azure CLI and kubectl, it works across all Azure services.
  But it doesn’t stop there: diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment.
- Automating incident management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the agent ingest alerts and trigger workflows that match your team’s existing tools, so you can respond faster with less manual effort.
- Engineered for extensibility: The Azure SRE Agent's incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve from guided actions to trusted autonomy without ever giving up control.
- Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them, accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable.
- Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps, complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation.
Getting Started

- Start here: Create a new SRE Agent in the Azure portal (Azure login required)
- Blog: Announcing a flexible, predictable billing model for Azure SRE Agent
- Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview
- Product documentation
- Product home page

Community & Support

We’d love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team.