devops
79 TopicsSecurity Where It Matters: Runtime Context and AI Fixes Now Integrated in Your Dev Workflow
Security teams and developers face the same frustrating cycle: thousands of alerts, limited time, and no clear way to know which issues matter most. Applications suffer attacks as quickly as once every three minutes, 1 emphasizing the importance of proactive security that prioritizes critical, exploitable vulnerabilities. Microsoft is leading this shift with new integrations in the end-to-end solution that combines GitHub Advanced Security’s developer-first application security tool with Microsoft Defender for Cloud's runtime protection, enhanced by agentic remediation. Now available in public preview. This integration empowers organizations to secure code to cloud and accelerates tackling of security issues in their software portfolio using agentic remediation and runtime context-based vulnerability prioritization. The result: fewer distractions, faster fixes, better collaboration and more proactive security from code to cloud. The DevSecOps Dilemma— too many alerts, not enough action Over the past decade, the application security industry has made significant strides in improving detection accuracy and fostering collaboration between security teams and developers. These advances have enabled both groups to work together on real issues and drive meaningful progress. However, despite these improvements, remediation trends across the industry have remained stagnant. Quarter after quarter, year after year, vulnerability counts continue to rise with critical / high vulnerabilities constituting 17.4% of vulnerability backlogs and a mean-time-to-remediation (MTTR) of 116 days 2 Today, three big challenges slow teams down: Security teams are drowning in alert fatigue, struggling to distinguish real, exploitable risks from noise. At the same time, AI is rapidly introducing new threat vectors that defenders have little time to research or understand—leaving organizations vulnerable to missed threats and evolving attack techniques. Developers lack clear prioritization while remediation takes long, so they lose time fixing issues that may never be exploited. Remediation cycles are slow, leaving systems exposed to potential attacks while teams debate which issues matter most or search for the right person to fix them Both teams rely on separate, non-integrated tools, making collaboration slow and frustrating. Development and security teams frequently operate in silos, reducing efficiency and creating blind spots. This leads to wasted time, unresolved threats, and growing backlogs. Teams are stuck reacting to noise instead of solving real problems. DevSecOps reimagined in the era of AI Your app is live and serving thousands of customers. Defender for Cloud detects a vulnerability in an internet-facing API that handles sensitive data. In the past, this alert would age in a dashboard while developers worked on unrelated fixes because they didn’t know this was the critical one. Now, with the new integration, a security campaign can be created in GitHub filtering for runtime risk (internet exposed, sensitive data etc.) notifying the developer to prioritize this issue. The developer views the issue in their workflow, understands why it matters, and uses Copilot Autofix to apply an AI-suggested fix in minutes. The developer can then select these risks at bulk and assign the GitHub Copilot coding agent to create a draft PR for a multi merge fix ready for human review. Virtual Registry: Code-to-Runtime Mapping Code to runtime mapping is possible with the Virtual Registry which makes GitHub a trusted source for artifact metadata. Integrated with Microsoft Defender for Cloud, the Virtual Registry enables smarter risk prioritization and faster incident response. Teams can quickly answer: Is this vulnerability running in production? Is it exposed to sensitive workloads? Do I need to act now? By combining runtime and repository context, the Virtual Registry streamlines alert triage and incident response. We shipped a new set of filters to both Code Scanning and Dependabot and Security Campaigns that are based on the artifact metadata that is stored in the Virtual Registry. Faster fixes with agentic remediation The integration includes Copilot Autofix, an AI-powered tool that suggests code changes to fix security problems. It checks that the fixes work and helps developers resolve issues quickly, without switching tools. To complete the agentic work flow we can be bulk assign these autofixes to GitHub Copilot Coding agent to create a draft Pull Request awaiting human review. Why this matters Fewer alerts to sort through: Focus only on what’s exploitable in production. Faster fixes: AI-powered fix suggestions through GitHub Copilot Autofix have shown to fix 50% of alerts within the PR with a 70% reduction in mean time-to-remediation 3 Better teamwork: Developers and security teams collaborate seamlessly. With collaborative security now powered by connected context, we’ve seen 68% of alert remediated using GitHub Advanced Security’s security campaigns. 3 Try it now This feature is available in public preview and will be showcased at Microsoft Ignite. If your team builds cloud-native applications, this integration helps you protect code to cloud more effectively—without slowing down development. Customer FAQs How do I start using the integration? From Microsoft Defender for Cloud: Go to the environment section in the Defender for Cloud portal. Grant a new GitHub connector or update an existing one to provide consent to scan your source code. If you use GitHub, setup is one click. You’ll immediately see initial scan results and recommended fixes. From GitHub: You will be able to filter alerts by runtime context in addition to receiving AI-suggested fixes. How do I purchase this integration? For GitHub: GitHub Advanced Security (GHAS) is available as: Code Security SKU: $30 per committer/month (available April 2025) GHAS Bundle: $49 per committer/month (available now) GitHub Enterprise Cloud GitHub Copilot For Microsoft Defender for Cloud CSPM: Defender CSPM: $5 per billable resource/month Both can be enabled through the Azure Portal as Azure meters. [1]: Software Under Siege | AppSec Threat Report 2025 | Contrast Security [2]: Edgescan | Vulnerability Statistics Report 2025 [3]: GitHub Internal Data1.2KViews2likes1CommentNever Explain Context Twice: Introducing Azure SRE Agent memory
In our recent blog post, we highlighted how Azure SRE Agent has evolved into an extensible AI-powered operations platform. One of the most requested capabilities from customers has been the ability for agents to retain knowledge across sessions-learning from past incidents, remembering team preferences, and continuously improving troubleshooting accuracy. Today, we're excited to dive deeper into the Azure SRE Agent memory, a powerful feature that transforms how your operations teams work with AI. Why Memory Matters for AI Operations Every seasoned SRE knows that institutional knowledge is invaluable. The most effective on-call engineers aren't just technically skilled, they remember the quirks of specific services, recall solutions from past incidents, and know the team's preferred diagnostic approaches. Until now, AI assistants started every conversation from scratch, forcing teams to repeatedly explain context that experienced engineers would simply know. The SRE Agent Memory changes this paradigm. It enables agents to: Remember team facts, preferences, and context across all conversations Retrieve relevant runbooks and documentation during troubleshooting Learn from past sessions to improve future responses Share knowledge across your entire team automatically Context Engineering: The Key to Better AI Outcomes At the heart of the memory is a concept we call context engineering, the practice of purposefully curating and optimizing the information you provide to the agent to get better results. Rather than hoping the AI figures things out, you systematically build a knowledge foundation that makes every interaction smarter. The workflow is simple: Identify gaps: Use Session Insights to see where the agent struggled or lacked knowledge Add targeted context: Upload runbooks to the Knowledge Base or save facts with User Memories Track improvement: Review subsequent sessions to measure whether your additions improved outcomes Iterate: Continuously refine your context based on real session data This feedback loop transforms ad-hoc troubleshooting into a systematically improving process, where each session makes future sessions more effective. Memory Components at a Glance The memory consists of three complementary components that work together to give your agents comprehensive knowledge: 🧠 User Memories: Quick Chat Commands for Team Knowledge Save facts, preferences, and context using simple chat commands. User Memories are ideal for team standards, service configurations, and workflow patterns that should persist across all conversations. Key benefits: ✅ Instant setup-no configuration required ✅ Managed directly in chat with #remember, #forget, and #retrieve commands ✅ Shared across all team members automatically ✅ Works across all conversations and agents Example commands: #remember Team owns app-service-prod in East US region #remember For latency issues, check Redis cache first #remember Production deployments happen Tuesdays at 2 PM PST When you save a memory, it's instantly available across all your team's conversations. The agent automatically retrieves relevant memories during reasoning, no additional configuration needed. Saving team knowledge with the #remember command Use #retrieve to search and display your saved memories: Retrieving saved memories with the #retrieve command 📚 Knowledge Base: Direct Document Uploads for Runbooks and Guides Upload markdown and text files directly to the agent's knowledge base. Documents are automatically indexed using semantic search and available for agent retrieval during troubleshooting. The Knowledge Base uses intelligent indexing that combines keyword matching with semantic similarity. Documents are automatically split into optimal chunks, so agents retrieve the most relevant sections, not entire documents. Key benefits: ✅ Supports .md and .txt files (up to 16MB per file) ✅ Automatic chunking and semantic indexing ✅ Simple file upload interface ✅ Instant availability after upload Best for: Static runbooks, troubleshooting guides, internal documentation, and configuration templates. Navigate to Settings > Knowledge Base to access document management. There you will find Add File, allows you to upload txt and md file(s) and Delete, allows you to delete individual or bulk files. 📊 Session Insights: Automated Analysis of Your Troubleshooting Sessions Get automated feedback on your troubleshooting sessions with timelines, performance analysis, and key learnings. Session Insights help you understand what happened, learn from mistakes, and continuously improve. Key benefits: ✅ Automatic analysis after conversations complete ✅ Chronological timeline of actions taken ✅ Performance scoring with specific improvement suggestions ✅ Key learnings for future sessions Navigate to Settings > Session Insights to view your troubleshooting analysis: Session Insights dashboard showing analysis of past troubleshooting sessions You can also manually trigger insight generation for any conversation by clicking the Generate Session Insights icon in the chat footer: Manually triggering Session Insights generation Each insight includes: Timeline: A chronological narrative showing what actions were taken and their outcomes What Went Well: Highlights correct understanding and effective actions Areas for Improvement: Shows what could be done better with specific remediation steps Key Learnings: Actionable takeaways for future sessions Investigation Quality Score: Sessions rated on a 1-5 scale for completeness How Azure SRE Agent Use Memory: The SearchMemory Tool During conversations, incident handling, and scheduled tasks, Azure SRE Agents search across memory sources to retrieve relevant context using the SearchMemory tool. Enabling Memory Retrieval in Custom Sub-Agents When building custom sub-agents with the Sub-Agent Builder, you can enable memory retrieval by adding the SearchMemory tool to your sub-agent's toolset. This allows your custom automation to leverage all the knowledge stored in User Memories and the Knowledge Base. How it works: In the Sub-Agent Builder, add the SearchMemory tool to your sub-agent's available tools The tool automatically searches across all memory sources using intelligent retrieval Your sub-agent receives relevant context to inform its responses and actions This means your custom sub-agents, whether handling specific incident types, automating runbook execution, or performing scheduled health checks, can all benefit from your team's accumulated knowledge. Choosing the Right Memory Type Feature User Memories Knowledge Base Setup Instant (chat commands) Quick (file upload) Management Chat commands Portal UI Content Size Short facts Documents (up to 16MB) Best Use Case Team preferences Static runbooks Team Sharing ✅ Shared ✅ Shared Quick guidance: User Memories: Short, focused facts (1-2 sentences) for immediate team context Knowledge Base: Well-structured documents with clear headers for procedural knowledge Getting Started in Minutes 1. Start with User Memories Open any chat with your Azure SRE Agent and save immediate team knowledge: #remember Team owns services: app-service-prod, redis-cache-prod, and sql-db-prod #remember For latency issues, check Redis cache health first #remember Team uses East US for production workloads That's it, these facts are now available across all conversations. 2. Upload Key Documents Add critical runbooks and guides to the Knowledge Base: Navigate to Settings > Knowledge Base Upload .md or .txt files Files are automatically indexed and available immediately 3. Review Session Insights After troubleshooting sessions, check Settings > Session Insights to see what went well and where the agent needs more context. Use this feedback to identify gaps and add targeted memories or documentation. Best Practices for Building Agent Memory Content Organization Keep memories focused and specific Use consistent terminology across your team Avoid duplication, choose one source of truth for each piece of information Security Never store: ❌ Credentials, API keys, or secrets ❌ Personal identifiable information (PII) ❌ Customer data or logs ❌ Confidential business information Maintenance Regularly review and update memories Remove outdated information using #forget Consolidate duplicate entries Use #retrieve to audit what's been saved The Impact: Smarter Troubleshooting, Lower MTTR The Azure SRE Agent memory delivers measurable improvements: Faster troubleshooting: Agents immediately understand your environment and preferences Reduced toil: No more repeatedly explaining the same context Institutional knowledge capture: Critical team knowledge persists even as team members change Continuous improvement: Each session makes future sessions more effective By systematically building your agent's knowledge foundation, you create an operations assistant that truly understands your environment, reducing mean time to resolution (MTTR) and freeing your team to focus on high-value work. Ready to Get Started? Azure SRE Agent home page Product documentation Pricing information Demo recordings What's Next? We're continually enhancing the memory based on customer feedback. Your input is critical, use the thumbs up/down feedback in the agent, or share your thoughts in our GitHub repo. What operational knowledge would you like your AI agent to remember? Let us know! This blog post is part of our ongoing series on Azure SRE Agent capabilities. See our previous post on automation, integration, and extensibility features.311Views0likes0CommentsExpanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly – no sign up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below. 📖 Learn more about SRE Agent. 👉 Create your first SRE Agent (Azure login required) What’s New in Azure SRE Agent - October Update The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation—built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit product documentation for the latest updates. Here are a few highlights for this month: Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy from read-only insights to approval-gated actions to full automation without compromising control. Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for AZ CLI and kubectl, it works across all Azure services. But it doesn’t stop there—diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment. Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the Agent ingest alerts and trigger workflows that match your team’s existing tools—so you can respond faster, with less manual effort. Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the Agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve—from guided actions to trusted autonomy—without ever giving up control. Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them—accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable. Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps—complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation. Getting Started Start here: Create a new SRE Agent in the Azure portal (Azure login required) Blog: Announcing a flexible, predictable billing model for Azure SRE Agent Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview Product documentation Product home page Community & Support We’d love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team5.5KViews2likes3CommentsAnnouncing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.3.4KViews1like1CommentReimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach - drawing from proven practices in internal Azure operations - the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams operations, delivering strong ROI for teams seeking sustainable AIOps. An Operations Agent that adapts to your playbooks Azure SRE Agent is an AI powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues, to automating operational workflows and seamless integration with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update. What’s New: Automation, Integration, and Extensibility Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release: No-code Sub-Agent Builder: Rapidly create custom automations without writing code. Flexible, event-driven triggers: Instantly respond to incidents and operational changes. Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources. Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP. Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box. Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs. Sub-Agent Builder: Custom Automation, No Code Required Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature helps address the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions. Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage. Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch, and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more. Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires. Flexible Triggers: Automate on Your Terms Invoke the agent to respond automatically to mission-critical events, not wait for manual commands. This feature helps speed up incident response and eliminate missed opportunities for efficiency. Multi-Source Triggers: Go beyond chat-based interactions, and trigger the agent to automatically respond to Incident Management and Ticketing systems like PagerDuty and ServiceNow, Observability Alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks. Additional trigger sources such as GitHub issues, Azure DevOps pipelines, email, etc. will be added over time. This means automation can start exactly when and where you need it. Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events - like deployments, incidents, or customer requests. Vital for reducing downtime, it ensures that business-critical actions happen automatically and promptly. Expanded Data Connectivity: Unified Observability and Troubleshooting Integrate data, enabling comprehensive diagnostics and troubleshooting and faster, more informed decision-making by eliminating silos and speeding up issue resolution. Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation. Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach utilizes your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents. Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments. Custom Actions: Automate Anything, Anywhere Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration. Out-of-the-Box Actions: Instantly automate common tasks like running azcli, kubectl, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead. Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication. Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence. Prebuilt Operations Scenarios Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction. Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure CosmosDB, Azure VMs, etc. Support for additional resource types is being added continually, please see product documentation for the latest information. Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance. Handle Complex Investigations: Enable the deep investigation mode that uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues. Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management. Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs - capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting, as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment. GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle. Ready to get started? Azure SRE Agent home page Product overview Pricing Page Pricing Calculator Pricing Blog Demo recordings Deployment samples What’s Next? Give us feedback: Your feedback is critical - You can Thumbs Up / Thumbs Down each interaction or thread, or go to the “Give Feedback” button in the agent to give us in-product feedback - or you can create issues or just share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent. We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!2KViews0likes0CommentsAzure SRE Agent: Expanding Observability and Multi-Cloud Resilience
The Azure SRE Agent continues to evolve as a cornerstone for operational excellence and incident management. Over the past few months, we have made significant strides in enabling integrations with leading observability platforms—Dynatrace, New Relic, and Datadog—through Model Context Protocol (MCP) Servers. These partnerships serve joint customers, enabling automated remediation across diverse environments. Deepening Integrations with MCP Servers Our collaboration with these partners is more than technical—it’s about delivering value at scale. Datadog, New Relic, and Dynatrace are all Azure Native ISV Service partners. With these integrations Azure Native customers can also choose to add these MCP servers directly from the Azure Native partners’ resource: Datadog: At Ignite, Azure SRE Agent was presented with the Datadog MCP Server, to demonstrate how our customers can streamline complex workflows. Customers can now bring their Datadog MCP Server into Azure SRE Agent, extending knowledge capabilities and centralizing logs and metrics. Find Datadog Azure Native offerings on Marketplace. New Relic: When an alert fires in New Relic, the Azure SRE Agent calls the New Relic MCP Server to provide Intelligent Observability insights. This agentic integration with the New Relic MCP Server offers over 35 specialized tools across, entity and account management, alerts and monitoring, data analysis and queries, performance analysis, and much more. The advanced remediation skills of the Azure SRE Agent + New Relic AI help our joint customers diagnose and resolve production issues faster. Find New Relic’s Azure Native offering on Marketplace Dynatrace: The Dynatrace integration bridges Microsoft Azure's cloud-native infrastructure management with Dynatrace's AI-powered observability platform, leveraging the Davis AI engine and remote MCP server capabilities for incident detection, root cause analysis, and remediation across hybrid cloud environments. Check out Dynatrace’s Azure Native offering on Marketplace. These integrations are made possible by Azure SRE Agent’s MCP connectors. The MCP connectors in Azure SRE Agent act as the bridge between the agent and MCP servers, enabling dynamic discovery and execution of specialized tools for observability and incident management across diverse environments. This feature allows customers to build their own custom sub-agents to leverage tools from MCP Servers from integrated platforms like Dynatrace, Datadog, and New Relic to complement the agent’s diagnostic and remediation capabilities. By connecting Azure SRE Agent to external MCP servers scenarios such as cross-platform telemetry analysis are unlocked. Looking Ahead: Multi-Agent Collaboration Azure SRE Agent isn’t stopping with MCP integrations We’re actively working with PagerDuty and NeuBird to support dynamic use cases via agent-to-agent collaboration: PagerDuty: PagerDuty’s PD Advance SRE Agent is an AI-powered assistant that triages incidents by analyzing logs, diagnostics, past incident history, and runbooks to surface relevant context and recommended remediations. At Ignite, PagerDuty and Microsoft demonstrated how Azure SRE Agent can ingest PagerDuty incidents and collaborate with PagerDuty’s SRE Agent to complement triage using historical patterns, runbook intelligence and Azure diagnostics. NeuBird: NeuBird’s Agentic AI SRE, Hawkeye, autonomously investigates and resolves incidents across hybrid, and multi-cloud environments. By connecting to telemetry sources like Azure Monitor, Prometheus, and GitHub, Hawkeye delivers real-time diagnosis and targeted fixes. Building on the work presented at SRE Day this partnership underscores our commitment to agentic ecosystems where specialized agents collaborate for complex scenarios. Sign up for the private preview to try the integration, here. Additionally, please check out NeuBird on Marketplace. These efforts reflect a broader vision: Azure SRE Agent as a hub for cross-platform reliability, enabling customers to manage incidents across Azure, on-premises, and other clouds with confidence. Why This Matters As organizations embrace distributed architectures, the need for integrated, intelligent, and multi-cloud-ready SRE solutions has never been greater. By partnering with industry leaders and pioneering agent-to-agent workflows, Azure SRE Agent is setting the stage for a future where resilience is not just reactive—it’s proactive and collaborative.610Views2likes0CommentsProactive Monitoring Made Simple with Azure SRE Agent
SRE teams strive for proactive operations, catching issues before they impact customers and reducing the chaos of incident response. While perfection may be elusive, the real goal is minimizing outages and gaining immediate line of sight into production environments. Today, that’s harder than ever. It requires correlating countless signals and alerts, understanding how they relate—or don’t relate—to each other, and assigning the right sense of urgency and impact. Anything that shortens this cycle, accelerates detection, and enables automated remediation is what modern SRE teams crave. What if you could skip the scripting and pipelines? What if you could simply describe what you want in plain language and let it run automatically on a schedule? Scheduled Tasks for Azure SRE Agent With Scheduled Tasks for Azure SRE Agent, that what-if scenario is now a reality. Scheduled tasks combine natural language prompts with Azure SRE Agent’s automation capabilities, so you can express intent, set a schedule, and let the agent do the rest—without writing a single line of code. This means: ⚡ Faster incident response through early detection ✅ Better compliance with automated checks 🎯 More time for high-value engineering work and innovation 💡 The shift from reactive to proactive: Instead of waiting for alerts to fire or customers to report issues, you’re continuously monitoring, validating, and catching problems before they escalate. How Scheduled Tasks Work Under the Hood When you create a Scheduled Task, the process is more than just running a prompt on a timer. Here’s what happens: 1. Prompt Interpretation and Plan Creation The SRE Agent takes your natural language prompt—such as “Scan all resources for security best practices”—and converts it into a structured execution plan. This plan defines the steps, tools, and data sources required to fulfill your request. 2. Built-In Tools and MCP Integration The agent uses its built-in capabilities (Azure CLI, Log Analytics workspace, Appinsights) and can also leverage 3 rd party data sources or tools via MCP server integration for extended functionality. 3. Results Analysis and Smart Summarization After execution, the agent analyzes results, identifies anomalies or issues, and provides actionable summaries not just raw data dumps. 4. Notification and Escalation Based on findings, the agent can: Post updates to Teams channels Create or update incidents Send email notifications Trigger follow-up actions Real-World Use Cases for Proactive Ops Here’s where scheduled tasks shine for SRE teams: Use Case Prompt Example Schedule Security Posture Check “Scan all subscriptions for resources with public endpoints and flag any that shouldn’t be exposed” Daily Cost Anomaly Detection “Compare this week’s spend against last week and alert if any service exceeds 20% growth” Weekly Compliance Drift Detection “Check all storage accounts for encryption settings and report any non-compliant resources” Daily Resource Health Summary “Summarize the health status of all production VMs and highlight any degraded instances” Every 4 hours Incident Trend Analysis “Analyze ICM incidents from the past week, identify patterns, and summarize top contributing services” Weekly Getting Started in 3 Steps Step 1: Define Your Intent Write a natural language prompt describing what you want to monitor or check. Be specific about: - What resources or scope - What conditions to look for - What action to take if issues are found Example: > “Every morning at 8 AM, check all production Kubernetes clusters for pods in CrashLoopBackOff state. If any are found, post a summary to the #sre-alerts Teams channel with cluster name, namespace, and pod details.” Step 2: Set Your Schedule Choose how often the task should run: - Cron expressions for precise control - Simple intervals (hourly, daily, weekly) Step 3: Define Where to Receive Updates Include in your prompt where you want results delivered when the task finishes execution. The agent can use its built-in tools and connectors to: - Post summaries to a Teams channel - Send email notifications - Create or update ICM incidents Example prompt with notification: > "Check all production databases for long-running queries over 30 seconds. If any are found, post a summary to the #database-alerts Teams channel." Why This Matters for Proactive Operations Traditional monitoring approaches have limitations: Traditional Approach With Scheduled Tasks Write scripts, maintain pipelines Describe in plain language Static thresholds and rules Contextual, AI-powered analysis Alert fatigue from noisy signals Smart summarization of what matters Separate tools for check vs. action Unified detection and response Requires dedicated DevOps effort Any SRE can create and modify The result? Your team spends less time building and maintaining monitoring infrastructure and more time on the work that truly requires human expertise. Best Practices for Scheduled Tasks Start simple, iterate — Begin with one or two high-value checks and expand as you gain confidence Be specific in prompts — The more context you provide, the better the results Set appropriate frequencies — Not everything needs to run hourly; match the schedule to the risk Review and refine — Check task results periodically and adjust prompts for better accuracy What’s Next? Scheduled tasks are just the beginning. We’re continuing to invest in capabilities that help SRE teams shift left—catching issues earlier, automating routine checks, and freeing up time for strategic work. Ready to Start? Use this sample that shows how to create a scheduled health check sub-agent: https://github.com/microsoft/sre-agent/blob/main/samples/automation/samples/02-scheduled-health-check-sample.md This example demonstrates: - Building a HealthCheckAgent using built-in tools like Azure CLI and Log Analytics Workspace - Scheduling daily health checks for a container app at 7 AM - Sending email alerts when anomalies are detected 🔗 Explore more samples here: https://github.com/microsoft/sre-agent/tree/main/samples More to Learn Ignite 2025 announcements: https://aka.ms/ignite25/blog/sreagent Documentation: https://aka.ms/sreagent/docs Support & Feature Requests: https://github.com/microsoft/sre-agent/issues493Views0likes0CommentsDeploying a Bun + Hono + Vite app to Azure App Service
TOC Introduction Local Environment Deployment Conclusion 1. Introduction Anthropic, the company behind Claude, recently acquired the JavaScript runtime startup Bun, marking one of the most significant shifts in the modern JavaScript ecosystem since the arrival of Node.js and Deno. This acquisition signals more than a business move, it represents a strategic consolidation of performance-oriented tooling, developer ergonomics, and the future of AI-accelerated software development. At the center of this momentum lies a powerful trio: Bun, Hono, and Vite. Bun is a next-generation JavaScript runtime that reimagines everything from dependency installation to HTTP servers, bundling, and execution speed. On top of Bun, frameworks like Hono provide an elegant, lightweight approach to building APIs and full web applications. Hono embraces the Web Standard API model while optimizing for speed and minimal footprint. For front-end development, Vite completes the trio. Vite provides lightning-fast local development through native ES modules and an optimized build pipeline. When paired with Bun, the developer experience becomes even smoother, as Bun accelerates not only the dev server but also the entire build process. The result is a full-stack workflow where front-end and back-end both benefit from a consistent, high-performance environment. This article will guide you through deploying a Bun + Hono + Vite application to an Azure Linux Web App. 2. Local Environment The development environment used in this example is a Docker container. All project creation, modification, and testing will take place inside this environment. We will build an application using Bun + Hono + Vite, containing three endpoints: / : The root endpoint, which displays Hello Bun Hono Vite /api/hello : A static endpoint /api/backend : A runtime endpoint executed on the server, returning computed results for the frontend to display Create the Docker development environment This command generates a Bun + Hono + Vite project. docker run --rm -it -v "$PWD":/app -w /app oven/bun:latest bunx create-vite . --template vanilla-ts Follow the prompts as shown in the image and make the appropriate selections. After completing the prompts, the corresponding project files will be created. At this point, you may press Ctrl + C to stop the running Bun server. Next, install Hono: docker run --rm -it -v "$PWD":/app -w /app oven/bun:latest bun add hono Create 4 files and edit 2 files Create .vscode/settings.json This file serves two purposes: To prevent the large node_modules folder from being uploaded during deployment. Uploading it wastes bandwidth and may cause compatibility issues between local and production environments. To disable ORYX BUILD from interfering in the deployment process. We will use a custom startup script to handle all build tasks instead. { "appService.zipIgnorePattern": [ "node_modules{,/**}", ".git{,/**}", ".vscode{,/**}" ], "appService.showBuildDuringDeployPrompt": false } Create vite.config.ts This file configures Vite’s base path to use relative paths instead of absolute paths. Although this does not affect the production environment, it is essential locally when URLs and ports may differ. import { defineConfig } from 'vite'; export default defineConfig({ base: './', }); Create server.ts This file configures backend routing using Hono and sets up the Bun server. It includes several test endpoints, some static, and some executed at runtime. import { Hono } from 'hono'; import { serveStatic } from 'hono/bun'; const app = new Hono(); app.get('/api/hello', (c) => { return c.text('this is api/hello'); }); app.get('/api/backend', (c) => { const result = 1 + 1; return c.json({ message: "this is /api/backend", calc: `1 + 1 = ${result}`, value: result, }); }); app.use('/assets/*', serveStatic({ root: './dist' })); app.use('/*', serveStatic({ root: './dist' })); app.get('/', serveStatic({ path: './dist/index.html' })); const port = Number(process.env.PORT ?? 3000); export default { port, fetch: app.fetch, }; Create startup.sh This script serves several important roles: Remove any node_modules folders or tar archives created by ORYX during deployment Install the Bun runtime Fully take over the ORYX build process Start the Bun server as the final step #!/bin/bash set -e echo "===== Startup script running =====" cd /home/site/wwwroot echo "Cleaning up Oryx-generated node_modules..." if [ -d /node_modules ]; then echo "Removing /node_modules ..." rm -rf /node_modules fi if [ -f node_modules.tar.gz ]; then echo "Removing node_modules.tar.gz ..." rm -f node_modules.tar.gz fi echo "Oryx cleanup complete." export BUN_INSTALL=/home/site/wwwroot/.bun export PATH="$BUN_INSTALL/bin:/home/site/wwwroot/node_modules/.bin:$PATH" export NODE_PATH="/home/site/wwwroot/node_modules" if [ ! -f "$BUN_INSTALL/bin/bun" ]; then echo "Bun not found. Installing..." curl -fsSL https://bun.sh/install | bash else echo "Bun already installed at $BUN_INSTALL" fi echo "Using Bun version:" bun --version echo "Running bun install ..." bun install echo "Running bun run build ..." bun run build echo "Starting server with bun run start ..." bun run start Modify src/main.ts Simplify the default welcome page so it only displays our test text. // src/main.ts document.querySelector<HTMLDivElement>('#app')!.innerHTML = ` <h1>Hello Bun Hono Vite</h1> `; Modify package.json Update the scripts section so that the project uses the newly created server.ts as the server entry point. { "name": "app", "private": true, "version": "0.0.0", "type": "module", "scripts": { "dev:vite": "vite", "build": "vite build", "preview": "vite preview", "start": "bun server.ts" }, "devDependencies": { "typescript": "~5.9.3", "vite": "^7.2.4" }, "dependencies": { "hono": "^4.10.7" } } Build the project locally This command generates a dist directory containing all Vite-built static assets. docker run --rm -it -v "$PWD":/app -w /app oven/bun:latest bun run build Run the server locally for testing docker run --rm -it -v "$PWD":/app -w /app -p 3000:3000 -e PORT=3000 oven/bun:latest bun run start You can press Ctrl + C to stop the server when finished. http://127.0.0.1:3000/ http://127.0.0.1:3000/api/hello http://127.0.0.1:3000/api/backend 3. Deployment We create a Linux Web App with a minimum SKU of Premium. Add Environment Variables SCM_DO_BUILD_DURING_DEPLOYMENT=false Purpose: Prevents the deployment environment from packaging during publish. This must also be set in the deployment environment itself. WEBSITE_RUN_FROM_PACKAGE=false Purpose: Instructs Azure Web App not to run the app from a prepackaged file. ENABLE_ORYX_BUILD=false Purpose: Prevents Azure Web App from building after deployment. All build tasks will instead execute during the startup script. Add Startup Command bash /home/site/wwwroot/startup.sh After that, we can return to VS Code and deploy the project. Once the deployment is complete, wait about five minutes for the build process to finish, and then you can begin testing. / /api/hello /api/backend 4. Conclusion Bun + Hono + Vite form a cohesive ecosystem that embodies the next era of JavaScript development: fast, compact, ergonomic, and deeply aligned with modern infrastructure needs. It is particularly well-suited for AI applications, where latency, concurrency, and rapid iteration matter more than ever. From streaming inference endpoints to vector database integrations, this stack offers the responsiveness and scalability essential for AI-powered systems.317Views0likes0CommentsCommon Misconceptions When Running Locally vs. Deploying to Azure Linux-based Web Apps
TOC Introduction Environment Variable Build Time Compatible Memory Conclusion 1. Introduction One of the most common issues during project development is the scenario where “the application runs perfectly in the local environment but fails after being deployed to Azure.” In most cases, deployment logs will clearly reveal the problem and allow you to fix it quickly. However, there are also more complicated situations where "due to the nature of the error itself" relevant logs may be difficult to locate. This article introduces several common categories of such problems and explains how to troubleshoot them. We will demonstrate them using Python and popular AI-related packages, as these tend to exhibit compatibility-related behavior. Before you begin, it is recommended that you read Deployment and Build from Azure Linux based Web App | Microsoft Community Hub on how Azure Linux-based Web Apps perform deployments so you have a basic understanding of the build process. 2. Environment Variable Simulating a Local Flask + sklearn Project First, let’s simulate a minimal Flask + sklearn project in any local environment (VS Code in this example). For simplicity, the sample code does not actually use any sklearn functions; it only displays plain text. app.py from flask import Flask app = Flask(__name__) @app.route("/") def index(): return "hello deploy environment variable" if __name__ == "__main__": app.run(host="0.0.0.0", port=8000) We also preset the environment variables required during Azure deployment, although these will not be used when running locally. .deployment [config] SCM_DO_BUILD_DURING_DEPLOYMENT=false As you may know, the old package name sklearn has long been deprecated in favor of scikit-learn. However, for the purpose of simulating a compatibility error, we will intentionally specify the outdated package name. requirements.txt Flask==3.1.0 gunicorn==23.0.0 sklearn After running the project locally, you can open a browser and navigate to the target URL to verify the result. python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt python app.py Of course, you may encounter the same compatibility issue even in your local environment. Simply running the following command resolves it: export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True We will revisit this error and its solution shortly. For now, create a Linux Web App running Python 3.12 and configure the following environment variables. We will define Oryx Build as the deployment method. SCM_DO_BUILD_DURING_DEPLOYMENT=false WEBSITE_RUN_FROM_PACKAGE=false ENABLE_ORYX_BUILD=true After deploying the code and checking the Deployment Center, you should see an error similar to the following. From the detailed error message, the cause is clear: sklearn is deprecated and replaced by scikit-learn, so additional compatibility handling is now required by the Python runtime. The error message suggests the following solutions: Install the newer scikit-learn package directly. If your project is deeply coupled to the old sklearn package and cannot be refactored yet, enable compatibility by setting an environment variable to allow installation of the deprecated package. Typically, this type of “works locally but fails on Azure” behavior happens because the deprecated package was installed in the local environment a long time ago at the start of the project, and everything has been running smoothly since. Package compatibility issues like this are very common across various languages on Linux. When a project becomes tightly coupled to an outdated package, you may not be able to upgrade it immediately. In these cases, compatibility workarounds are often the only practical short-term solution. In our example, we will add the environment variable: SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True However, here comes the real problem: This variable is needed during the build phase, but the environment variables set in Azure Portal’s Application Settings only take effect at runtime. So what should we do? The answer is simple, shift the Oryx Build process from build-time to runtime. First, open Azure Portal → Configuration and disable Oryx Build. ENABLE_ORYX_BUILD=false Next, modify the project by adding a startup script. run.sh #!/bin/bash export SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True python -m venv .venv source .venv/bin/activate pip install -r requirements.txt python app.py The startup script works just like the commands you run locally before executing the application. The difference is that you can inject the necessary compatibility environment variables before running pip install or starting the app. After that, return to Azure Portal and add the following Startup Command under Stack Settings. This ensures that your compatibility environment variables and build steps run before the runtime starts. bash run.sh Your overall project structure will now look like this. Once redeployed, everything should work correctly. 3. Build Time Build-Time Errors Caused by AI-Related Packages Many build-time failures are caused by AI-related packages, whose installation processes can be extremely time-consuming. You can investigate these issues by reviewing the deployment logs at the following maintenance URL: https://<YOUR_APP_NAME>.scm.azurewebsites.net/newui Compatible Let’s simulate a Flask + numpy project. The code is shown below. app.py from flask import Flask app = Flask(__name__) @app.route("/") def index(): return "hello deploy compatible" if __name__ == "__main__": app.run(host="0.0.0.0", port=8000) We reuse the same environment variables from the sklearn example. .deployment [config] SCM_DO_BUILD_DURING_DEPLOYMENT=false This time, we simulate the incompatibility between numpy==1.21.0 and Python 3.10. requirements.txt Flask==3.1.0 gunicorn==23.0.0 numpy==1.21.0 We will skip the local execution part and move directly to creating a Linux Web App running Python 3.10. Configure the same environment variables as before, and define the deployment method as runtime build. SCM_DO_BUILD_DURING_DEPLOYMENT=false WEBSITE_RUN_FROM_PACKAGE=false ENABLE_ORYX_BUILD=false After deployment, Deployment Center shows a successful publish. However, the actual website displays an error. At this point, you must check the deployment log files mentioned earlier. You will find two key logs: 1. docker.log Displays real-time logs of the platform creating and starting the container. In this case, you will see that the health probe exceeded the default 230-second startup window, causing container startup failure. This tells us the root cause is container startup timeout. To determine why it timed out, we must inspect the second file. 2. default_docker.log Contains the internal execution logs of the container. Not generated in real time, usually delayed around 15 minutes. Therefore, if docker.log shows a timeout error, wait at least 15 minutes to allow the logs to be written here. In this example, the internal log shows that numpy was being compiled during pip install, and the compilation step took too long. We now have a concrete diagnosis: numpy 1.21.0 is not compatible with Python 3.10, which forces pip to compile from source. The compilation exceeds the platform’s startup time limit (230 seconds) and causes the container to fail. We can verify this by checking numpy’s official site: numpy · PyPI numpy 1.21.0 only provides wheels for cp37, cp38, cp39 but not cp310 (which is python 3.10). Thus, compilation becomes unavoidable. Possible Solutions Set the environment variable WEBSITES_CONTAINER_START_TIME_LIMIT to increase the allowed container startup time. Downgrade Python to 3.9 or earlier. Upgrade numpy to 1.21.0+, where suitable wheels for Python 3.10 are available. In this example, we choose this option. After upgrading numpy to version 1.25.0 (which supports Python 3.10) from specifying in requirements.txt and redeploying, the issue is resolved. numpy · PyPI requirements.txt Flask==3.1.0 gunicorn==23.0.0 numpy==1.25.0 Memory The final example concerns the App Service SKU. AI packages such as Streamlit, PyTorch, and others require significant memory. Any one of these packages may cause the build process to fail due to insufficient memory. The error messages vary widely each time. If you repeatedly encounter unexplained build failures, check Deployment Center or default_docker.log for Exit Code 137, which indicates that the system ran out of memory during the build. The only solution in such cases is to scale up. 4. Conclusion This article introduced several common troubleshooting techniques for resolving Linux Web App issues caused during the build stage. Most of these problems relate to package compatibility, although the symptoms may vary greatly. By understanding the debugging process demonstrated in these examples, you will be better prepared to diagnose and resolve similar issues in future deployments.444Views0likes1CommentAnnouncing Advanced Kubernetes Troubleshooting Agent Capabilities (preview) in Azure Copilot
What’s new? Today, we're announcing Kubernetes troubleshooting agent capabilities in Azure Copilot, offering an intuitive, guided agentic experience that helps users detect, triage, and resolve common Kubernetes issues in their AKS clusters. The agent can provide root cause analysis for Kubernetes clusters and resources and is triggered by Kubernetes-specific keywords. It can detect problems like resource failures and scaling bottlenecks and intelligently correlates signals across metrics and events using `kubectl` commands when reasoning and provides actionable solutions. By simplifying complex diagnostics and offering clear next steps, the agent empowers users to troubleshoot independently. How it works With Kubernetes troubleshooting agent, Azure Copilot automatically investigates issues in your cluster by running targeted `kubectl` commands and analyzing your cluster’s configuration and current state. For instance, it identifies failing or pending pods, cluster events, resource utilization metrics, and configuration details to build a complete picture of what’s causing the issue. Azure Copilot then determines the most effective mitigation steps for your specific environment. It provides clear, step-by-step guidance, and in many cases, offers a one-click fix to resolve the issue automatically. If Azure Copilot can’t fully resolve the problem, it can generate a pre-populated support request with all the diagnostic details Microsoft Support needs. You’ll be able to review and confirm everything before the request is submitted. This agent is available via Azure Copilot in the Azure Portal. Learn more about how Azure Copilot works. How to Get Started To start using agents, your global administrator must request access to the agents preview at the tenant level in the Azure Copilot admin center. This confirms your interest in the preview and allows us to enable access. Once approved, users will see the Agent mode toggle in Azure Copilot chat and can then start using Copilot agents. Capacity is limited, so sign up early for the best chance to participate. Additionally, if you are interested in helping shape the future of agentic cloud ops and the role Copilot will play in it, please join our customer feedback program by filling up this form. Agents (preview) in Azure Copilot | Microsoft Learn Troubleshooting sample prompts From an AKS cluster resource, click Kubernetes troubleshooting with Copilot to automatically open Azure Copilot in context of the resource you want to troubleshoot: Try These Prompts to Get Started: Here are a few examples of the kinds of prompts you can use. If you're not already working in the context of a resource, you may need to provide the specific resource that you want to troubleshoot. "My pod keeps restarting can you help me figure out why" "Pods are stuck pending what is blocking them from being scheduled" "I am getting ImagePullBackOff how do I fix this" "One of my nodes is NotReady what is causing it" "My service cannot reach the backend pod what should I check" Note: When using these kinds of prompts, be sure agent mode is enabled by selecting the icon in the chat window: Learn More Troubleshooting agent capabilities in Agents (preview) in Azure Copilot | Microsoft Learn Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips - AKS Engineering Blog Microsoft Copilot in Azure Series - Kubectl | Microsoft Community Hub415Views3likes0Comments