Azure Kubernetes Service

Announcing Advanced Kubernetes Troubleshooting Agent Capabilities (preview) in Azure Copilot
What's new?

Today, we're announcing Kubernetes troubleshooting agent capabilities in Azure Copilot, offering an intuitive, guided agentic experience that helps users detect, triage, and resolve common Kubernetes issues in their AKS clusters. The agent provides root cause analysis for Kubernetes clusters and resources and is triggered by Kubernetes-specific keywords. It can detect problems like resource failures and scaling bottlenecks, intelligently correlates signals across metrics and events by running `kubectl` commands as it reasons, and provides actionable solutions. By simplifying complex diagnostics and offering clear next steps, the agent empowers users to troubleshoot independently.

How it works

With the Kubernetes troubleshooting agent, Azure Copilot automatically investigates issues in your cluster by running targeted `kubectl` commands and analyzing your cluster's configuration and current state. For instance, it identifies failing or pending pods, cluster events, resource utilization metrics, and configuration details to build a complete picture of what's causing the issue. Azure Copilot then determines the most effective mitigation steps for your specific environment. It provides clear, step-by-step guidance and, in many cases, offers a one-click fix to resolve the issue automatically. If Azure Copilot can't fully resolve the problem, it can generate a pre-populated support request with all the diagnostic details Microsoft Support needs. You'll be able to review and confirm everything before the request is submitted. This agent is available via Azure Copilot in the Azure Portal. Learn more about how Azure Copilot works.

How to Get Started

To start using agents, your global administrator must request access to the agents preview at the tenant level in the Azure Copilot admin center. This confirms your interest in the preview and allows us to enable access. Once approved, users will see the Agent mode toggle in Azure Copilot chat and can then start using Copilot agents. Capacity is limited, so sign up early for the best chance to participate. Additionally, if you are interested in helping shape the future of agentic cloud ops and the role Copilot will play in it, please join our customer feedback program by filling out this form: Agents (preview) in Azure Copilot | Microsoft Learn

Troubleshooting sample prompts

From an AKS cluster resource, click Kubernetes troubleshooting with Copilot to automatically open Azure Copilot in the context of the resource you want to troubleshoot.

Try These Prompts to Get Started: Here are a few examples of the kinds of prompts you can use. If you're not already working in the context of a resource, you may need to specify the resource that you want to troubleshoot.

- "My pod keeps restarting can you help me figure out why"
- "Pods are stuck pending what is blocking them from being scheduled"
- "I am getting ImagePullBackOff how do I fix this"
- "One of my nodes is NotReady what is causing it"
- "My service cannot reach the backend pod what should I check"

Note: When using these kinds of prompts, be sure agent mode is enabled by selecting the icon in the chat window.

Learn More

- Troubleshooting agent capabilities in Agents (preview) in Azure Copilot | Microsoft Learn
- Announcing the CLI Agent for AKS: Agentic AI-powered operations and diagnostics at your fingertips - AKS Engineering Blog
- Microsoft Copilot in Azure Series - Kubectl | Microsoft Community Hub

Building AI apps and agents for the new frontier
Every new wave of applications brings with it the promise of reshaping how we work, build and create. From digitization to web, from cloud to mobile, these shifts have made us all more connected, more engaged and more powerful. The incoming wave of agentic applications, estimated to number 1.3 billion over the next 2 years[1], is no different. But the expectations of these new services are unprecedented, in part for how they will uniquely operate with both intelligence and agency, how they will act on our behalf, integrated as members of our teams and as part of our everyday lives. The businesses already achieving the greatest impact from agents are what we call Frontier Organizations.

This week at Microsoft Ignite we're showcasing what the best frontier organizations are delivering, for their employees, for their customers and for their markets. And we're introducing an incredible slate of innovative services and tools that will help every organization achieve this same frontier transformation.

What excites me most is how frontier organizations are applying AI to achieve their greatest level of creativity and problem solving. Beyond incremental increases in efficiency or cost savings, frontier firms use AI to accelerate the pace of innovation, shortening the gap from prototype to production and continuously refining services to drive market fit. Frontier organizations aren't just moving faster; they are using AI and agents to operate in novel ways, redefining traditional business processes, evolving traditional roles and using agent fleets to augment and expand their workforce. To do this they build with intent, build for impact and ground services in a deep, continuously evolving context of you, your organization and your market that makes every service, every interaction, hyper-personalized, relevant and engaging.

Today we're announcing new capabilities that help you build what was previously impossible; launch and scale fleets of agents in an open system across models, tools, and knowledge; and run and operate agents with the confidence that every service is secure, governed and trusted. The question is, how do you get there? How do you build the AI apps and agents fueling the future? Read further for just a few highlights of how Microsoft can help you become frontier.

Build with agentic DevOps

Perhaps the greatest area of agentic innovation today is in service of developers. Microsoft's strategy for agentic DevOps is redefining the developer experience to be AI-native, extending the power of AI to every stage of the software lifecycle and integrating AI services into the tools embraced by millions of developers. At Ignite, we're helping every developer build faster, build with greater quality and security, and deliver increasingly innovative apps that will shape their businesses. Across our developer services, AI agents now operate like active members of your development and operations teams, collaborating, automating, and accelerating every phase of the software development lifecycle. From planning and coding to deployment and production, agents are reshaping how we build. And developers can now orchestrate fleets of agents, assigning them tasks such as code reviews, testing, defect resolution, and even modernization of legacy Java and .NET applications.
We continue to take this strategy forward with a new generation of AI-powered tools: GitHub Agent HQ will soon make coding agents like Codex, Claude Code, and Jules available directly in GitHub and Visual Studio Code, Custom Agents let teams encode domain expertise, and "bring your own models" support empowers teams to adapt and innovate. These advancements have made GitHub Copilot the world's most popular AI pair programmer, serving over 26 million users and helping organizations like Pantone, Ahold Delhaize USA, and Commerzbank streamline processes and save time.

Within Microsoft's own developer teams, we're seeing transformative results with agentic DevOps. GitHub Copilot coding agent is now a top contributor, not only to GitHub's core application but also to our major open-source projects like the Microsoft Agent Framework and Aspire. Copilot is reducing task completion time from hours to minutes and eliminating up to two weeks of manual development effort for complex work. Across Microsoft, 90% of pull requests are now covered by GitHub Copilot code review, increasing the pace of PR completion. Our AI-powered assistant for Microsoft's engineering ecosystem is deeply integrated into VS Code, Teams, and other tools, giving engineers and product managers real-time, context-aware answers where they work, saving 2.2k developer days in September alone. For app modernization, GitHub Copilot has reduced modernization project timelines by as much as 88%. In production environments, Azure SRE Agent has handled over 7K incidents and collected diagnostics on over 18K incidents, saving over 10,000 hours for on-call engineers. These results underscore how agentic workflows are redefining speed, scale, and reliability across the software lifecycle at Microsoft.

Launch at speed and scale with a full-stack AI app and agent platform

We're making it easier to build, run, and scale AI agents that deliver real business outcomes. To accelerate the path to production for advanced AI applications and agents, we're delivering a complete and flexible foundation that helps every organization move with speed and intelligence without compromising security, governance or operations. Microsoft Foundry helps organizations move from experimentation to execution at scale, providing the organization-wide observability and control that production AI requires. More than 80,000 customers, including 80% of the Fortune 500, use Microsoft Foundry to build, optimize, and govern AI apps and agents today. Foundry supports open frameworks like the Microsoft Agent Framework for orchestration, standard protocols like Model Context Protocol (MCP) for tool calling, and expansive integrations that enable context-aware, action-oriented agents. Companies like Nasdaq, Softbank, Sierra AI, and Blue Yonder are shipping innovative solutions with speed and precision. New at Ignite this year:

- Foundry Models: With more than 11,000 models like OpenAI's GPT-5, Anthropic's Claude, and Microsoft's Phi at developers' fingertips, Foundry delivers the broadest model selection on any cloud. Developers have the power to benchmark, compare, and dynamically route models to optimize performance for every task. Model router is now generally available in Microsoft Foundry and in public preview in Foundry Agent Service.
- Foundry IQ: Delivers the deep context needed to make every agent grounded, productive, and reliable.
Foundry IQ, now available in public preview, reimagines retrieval-augmented generation (RAG) as a dynamic reasoning process rather than a one-time lookup. Powered by Azure AI Search, it centralizes RAG workflows into a single grounding API, simplifying orchestration and improving response quality while respecting user permissions and data classifications.

- Foundry Agent Service: Now offers Hosted Agents, multi-agent workflows, built-in memory, and the ability to deploy agents directly to Microsoft 365 and Agent 365, all in public preview.
- Foundry Tools: Empowers developers to create agents with secure, real-time access to business systems, business logic, and multimodal capabilities through secure, governed integration with 1,400+ systems and APIs.
- Foundry Control Plane: Now in public preview, centralizes identity, policy, observability, and security signals and capabilities for AI developers in one portal.

Build on an AI-Ready foundation for all applications

Managed Instance on Azure App Service lets organizations migrate existing .NET web applications directly into a fully managed platform-as-a-service (PaaS) environment without the cost or effort of rewriting code. With Managed Instance, organizations can keep operating applications with critical dependencies on local Windows services, third-party vendor libraries, and custom runtimes without requiring any code changes. The result is faster modernization with lower overhead, plus access to cloud-native scalability, built-in security and Azure's AI capabilities.

MCP Governance with Azure API Management now delivers a unified control plane for APIs and MCP servers, enabling enterprises to extend their existing API investments directly into the agentic ecosystem with trusted governance, secure access, and full observability.

Agent Loop and native AI integrations in Azure Logic Apps enable customers to move beyond rigid workflows to intelligent, adaptive automation that saves time and reduces complexity. These capabilities make it easier to build AI-powered, context-aware applications using low-code tools, accelerating innovation without heavy development effort.

Azure Functions now supports hosting production-ready, reliable AI agents with stateful sessions, durable tool calls, and deterministic multi-agent orchestrations through the durable extension for Microsoft Agent Framework. Developers gain automatic session management, built-in HTTP endpoints, and elastic scaling from zero to thousands of instances, all with pay-per-use pricing and automated infrastructure.

Azure Container Apps agents and security supercharges agentic workloads with automated deployment of multi-container agents, on-demand dynamic execution environments, and built-in security for runtime protection and data confidentiality.

Run and operate agents with confidence

New at Ignite, we're also expanding the use of agents to keep every application secure, managed and operating without compromise. Expanded agentic capabilities protect applications from code to cloud and continuously monitor and remediate production issues, while minimizing the effort required from developers, operators and security teams.

Microsoft Defender for Cloud and GitHub Advanced Security: With the rise of multi-agent systems, the security threat surface continues to expand.
Increased alert volumes, unprioritized threat signals, unresolved threats, and a growing backlog of vulnerabilities are increasing risk for businesses, while security teams and developers often operate in disconnected tools, making collaboration and remediation even more challenging. The new Defender for Cloud and GitHub Advanced Security integration closes this gap, connecting runtime context to code so teams can prioritize the risks that matter most and fix issues faster with AI-powered remediation. When Defender for Cloud finds a threat exposed in production, it can now link to the exact code in GitHub. Developers receive AI-suggested fixes directly inside GitHub, while security teams track progress in Defender for Cloud in real time. This gives both sides a faster, more connected way to identify issues, drive remediation, and keep AI systems secure throughout the app lifecycle.

Azure SRE Agent is an always-on, AI-powered partner for cloud reliability, enabling production environments to become self-healing, proactively resolve issues, and optimize performance. Seamlessly integrated with Azure Monitor, GitHub Copilot, and incident management tools, Azure SRE Agent reduces operational toil. The latest update introduces no-code automation, empowering teams to tailor processes to their unique environments with minimal engineering overhead. Event-driven triggers enable proactive checks and faster incident response, helping minimize downtime. Expanded observability across Azure and third-party sources is designed to help teams troubleshoot production issues more efficiently, while orchestration capabilities support integration with MCP-compatible tools for comprehensive process automation. Finally, its adaptive memory system is designed to learn from interactions, helping improve incident handling and reduce operational toil, so organizations can achieve greater reliability and cost efficiency.

The future is yours to build

We are living in an extraordinary time, and across Microsoft we're focused on helping every organization shape their future with AI. Today's announcements are a big step forward on this journey. Whether you're a startup fostering the next great concept or a global enterprise shaping your future, we can help you deliver on this vision. The frontier is open. Let's build beyond expectations and build the future! Check out all the learning at Microsoft Ignite on-demand and read more about the announcements making it happen at:

Recommended sessions
- BRK113: Connected, managed, and complete
- BRK103: Modernize your apps in days, not months, with GitHub Copilot
- BRK110: Build AI Apps fast with GitHub and Microsoft Foundry in action
- BRK100: Best practices to modernize your apps and databases at scale
- BRK114: AI Agent architectures, pitfalls and real-world business impact
- BRK115: Inside Microsoft's AI transformation across the software lifecycle

Announcements
- aka.ms/AgentFactory
- aka.ms/AppModernizationBlog
- aka.ms/SecureCodetoCloudBlog
- aka.ms/AppPlatformBlog

[1] IDC Info Snapshot, sponsored by Microsoft, 1.3 Billion AI Agents by 2028, #US53361825, May 2025.

Reimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent, context-aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach, drawing from proven practices in internal Azure operations, the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams' operations, delivering strong ROI for teams seeking sustainable AIOps.

An Operations Agent that adapts to your playbooks

Azure SRE Agent is an AI-powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues to automating operational workflows and integrating seamlessly with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what's new and what has changed since the last update.

What's New: Automation, Integration, and Extensibility

Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here's what's new in this release:

- No-code Sub-Agent Builder: Rapidly create custom automations without writing code.
- Flexible, event-driven triggers: Instantly respond to incidents and operational changes.
- Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources.
- Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP.
- Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box.

Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization's needs.

Sub-Agent Builder: Custom Automation, No Code Required

Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature addresses the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions.

- Modular Sub-Agents: Easily create custom sub-agents tailored to your team's needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage.
- Prebuilt System Tools: Skip building basic automation from scratch and choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more.
- Custom Logic: Align automation with your unique business processes by defining your own automation logic and prompts, teaching the agent to act exactly as your workflow requires.

Flexible Triggers: Automate on Your Terms

The agent can respond automatically to mission-critical events rather than waiting for manual commands.
This feature helps speed up incident response and eliminate missed opportunities for efficiency.

- Multi-Source Triggers: Go beyond chat-based interactions and trigger the agent automatically from incident management and ticketing systems like PagerDuty and ServiceNow, observability alerting systems like Azure Monitor Alerts, or even a cron-based schedule for proactive monitoring and best-practice checks. Additional trigger sources such as GitHub issues, Azure DevOps pipelines, and email will be added over time. This means automation can start exactly when and where you need it.
- Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events, like deployments, incidents, or customer requests. Vital for reducing downtime, this ensures that business-critical actions happen automatically and promptly.

Expanded Data Connectivity: Unified Observability and Troubleshooting

Integrated data enables comprehensive diagnostics and troubleshooting, and faster, more informed decision-making by eliminating silos and speeding up issue resolution.

- Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, Datadog, and more can be added via remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation.
- Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through remote MCP servers, enabling it to retrieve needed files on its own. This approach leverages your organization's existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents. Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments.

Custom Actions: Automate Anything, Anywhere

Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration.

- Out-of-the-Box Actions: Instantly automate common tasks like running az CLI or kubectl commands, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead.
- Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication.
- Bring Your Own Actions: Drop in your own remote MCP servers to extend the agent's capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence.

Prebuilt Operations Scenarios

Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction.

- Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation across your workload stack.
The agent has built-in runbooks for common issues across many Azure resource types, including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure Cosmos DB, and Azure VMs. Support for additional resource types is being added continually; please see the product documentation for the latest information.

- Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis, including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance.
- Handle Complex Investigations: Enable deep investigation mode, which uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues.
- Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management.
- Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs, capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. The result is a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment.
- GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as the GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle.

Ready to get started?

- Azure SRE Agent home page
- Product overview
- Pricing Page
- Pricing Calculator
- Pricing Blog
- Demo recordings
- Deployment samples

What's Next?

Give us feedback: Your feedback is critical. You can thumbs up or thumbs down each interaction or thread, use the "Give Feedback" button in the agent to give us in-product feedback, or create issues and share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent. We're just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios, as we continue to grow and improve our agentic AI platforms and services to address customer needs.
Let us know what Ops toil you want to automate next!

Agentic Power for AKS: Introducing the Agentic CLI in Public Preview
We are excited to announce the agentic CLI for AKS, available now in public preview directly through the Azure CLI. A huge thank you to all our private preview customers who took the time to try out our beta releases and provide feedback to our team. The agentic CLI is now available for everyone to try; continue reading to learn how you can get started.

Why we built the agentic CLI for AKS

The way we build software is changing with the democratization of coding agents. We believe the same should happen for how users manage their Kubernetes environments. With this feature, we want to simplify the management and troubleshooting of AKS clusters, while reducing the barrier to entry for startups and developers by bridging the knowledge gap. The agentic CLI for AKS is designed to simplify this experience by bringing agentic capabilities to your cluster operations and observability, translating natural language into actionable guidance and analysis. Whether you need to right-size your infrastructure, troubleshoot complex networking issues like DNS or outbound connectivity, or ensure smooth K8s upgrades, the agentic CLI helps you make informed decisions quickly and confidently. Our goal: streamline cluster operations and empower teams to ask questions like "Why is my pod restarting?" or "How can I optimize my cluster for cost?" and get instant, actionable answers. The agentic CLI for AKS is built on the open-source HolmesGPT project, which has recently been accepted as a CNCF Sandbox project. With a pluggable LLM endpoint structure and open-source backing, the agentic CLI is purpose-built for customizability and data privacy.

From private to public preview: what's new?

Earlier this year, we launched the agentic CLI in private beta for a small group of AKS customers. Their feedback has shaped what's new in our public preview release, which we are excited to share with the broader AKS community. Let's dig in:

- Simplified setup: One-time initialization of LLM parameters with `az aks agent-init`. Configure parameters such as API key and model through a simple, guided user interface.
- AKS MCP integration: Enable the agent to install and run the AKS MCP server locally (directly in your CLI client) for advanced context-aware operations. The AKS MCP server includes tools for AKS clusters and associated Azure resources. Try it out:

az aks agent "list all my unhealthy nodepools" --aks-mcp -n <cluster-name> -g <resource-group>

- Deeper investigations: A new "Task List" feature helps the agent plan and execute complex investigations, with a checklist-style tracker that keeps you updated on the agent's progress and planned tool calls.
- Provide in-line feedback: Share insights about the agent's performance directly from the CLI using /feedback. Provide a rating of the agent's analysis and optional written feedback directly to the agentic CLI team. Your feedback is highly appreciated and will help us improve the agentic CLI's capabilities.
- Performance and security improvements: Minor improvements for faster load times and reduced latency, as well as hardened initialization and token handling.

Getting Started

Install the extension:

az extension add --name aks-agent

Set up your LLM endpoint:

az aks agent-init

Start asking questions. Some recommended scenarios to try out:

Troubleshoot cluster health:

az aks agent "Give me an overview of my cluster's health"

Right-size your cluster:

az aks agent "How can I optimize my node pool for cost?"
Try out the AKS MCP integration:

az aks agent "Show me CPU and memory usage trends" --aks-mcp -n <cluster-name> -g <resource-group>

Get upgrade guidance:

az aks agent "What should I check before upgrading my AKS cluster?"

Update the agentic CLI extension:

az extension update --name aks-agent

Join the Conversation

We'd love your feedback! Use the built-in /feedback command or visit our GitHub repository to share ideas and issues.

Learn more: https://aka.ms/aks/agentic-cli
Share feedback: https://aka.ms/aks/agentic-cli/issues

Microsoft Azure at KubeCon North America 2025 | Atlanta, GA - Nov 10-13
KubeCon + CloudNativeCon North America is back - this time in Atlanta, Georgia, and the excitement is real. Whether you're a developer, operator, architect, or just Kubernetes-curious, Microsoft Azure is showing up with a packed agenda, hands-on demos, and plenty of ways to connect and learn with our team of experts. Read on for all the ways you can connect with our team!

Kick off with Azure Day with Kubernetes (Nov 10)

Before the main conference even starts, join us for Azure Day with Kubernetes on November 10. It's a full day of learning, best practices, deep-dive discussions, and hands-on labs, all designed to help you build cloud-native and AI apps with Kubernetes on Azure. You'll get to meet Microsoft experts, dive into technical sessions, roll up your sleeves in the afternoon labs, or have focused deep-dive discussions in our whiteboarding sessions. If you're looking to sharpen your skills or just want to chat with folks who live and breathe Kubernetes on Azure, this is the place to be. Spots are limited, so register today at: https://aka.ms/AzureKubernetesDay

Catch up with our experts at Booth #500

The Microsoft booth is more than just a spot to grab swag (though, yes, there will be swag and stickers!). It's a central hub for connecting with product teams, setting up meetings, and seeing live demos. Whether you want to learn how to troubleshoot Kubernetes with agentic AI tools, explore open-source projects, or just talk shop, you'll find plenty of friendly faces ready to help. We will be running a variety of theatre sessions and demos out of the booth all week on topics including AKS Automatic, agentic troubleshooting, Azure Verified Modules, networking, app modernization, hybrid deployments, storage, and more. 🔥Hot tip: join us for our live Kubernetes Trivia Show at the Microsoft Azure booth during the KubeCrawl on Tuesday to win exclusive swag!

Microsoft sessions at KubeCon NA 2025

Here's a quick look at all the sessions with Microsoft speakers that you won't want to miss. Click the titles for full details and add them to your schedule!

Keynotes

Date: Thu November 13, 2025
Start Time: 9:49 AM
Room: Exhibit Hall B2
Title: Scaling Smarter: Simplifying Multicluster AI with KAITO and KubeFleet
Speaker: Jorge Palma
Abstract: As demand for AI workloads on Kubernetes grows, multicluster inferencing has emerged as a powerful yet complex architectural pattern. While multicluster support offers benefits in terms of geographic redundancy, data sovereignty, and resource optimization, it also introduces significant challenges around orchestration, traffic routing, cost control, and operational overhead. To address these challenges, we'll introduce two CNCF projects, KAITO and KubeFleet, that work together to simplify and optimize multicluster AI operations. KAITO provides a declarative framework for managing AI inference workflows with built-in support for model versioning and performance telemetry. KubeFleet complements this by enabling seamless workload distribution across clusters based on cost, latency, and availability. Together, these tools reduce operational complexity, improve cost efficiency, and ensure consistent performance at scale.

Date: Thu November 13, 2025
Start Time: 9:56 AM
Room: Exhibit Hall B2
Title: Cloud Native Back to the Future: The Road Ahead
Speakers: Jeremy Rickard (Microsoft), Alex Chircop (Akamai)
Abstract: The Cloud Native Computing Foundation (CNCF) turns 10 this year, now home to more than 200 projects across the cloud native landscape.
As we look ahead, the community faces new demands around security, sustainability, complexity, and emerging workloads like AI inference and agents. As many areas of the ecosystem transition to mature foundational building blocks, we are excited to explore the next evolution of cloud native development. The TOC will highlight how these challenges open opportunities to shape the next generation of applications and ensure the ecosystem continues to thrive. How are new projects addressing these new emerging workloads? How will these new projects impact security hygiene in the ecosystem? How will existing projects adapt to meet new realities? How is the CNCF evolving to support this next generation of computing? Join us as we reflect on the first decade of cloud native, and look ahead to how this community will power the age of AI, intelligent systems, and beyond.

Featured Demo

Date: Wed November 12, 2025
Start Time: 2:15-2:35 PM
Room: Expo Demo Area
Title: HolmesGPT: Agentic K8s troubleshooting in your terminal
Speakers: Pavneet Singh Ahluwalia (Microsoft), Arik Alon (Robusta)
Abstract: Troubleshooting Kubernetes shouldn't require hopping across dashboards, logs, and docs. With open-source tools like HolmesGPT and the Model Context Protocol (MCP) server, you can now bring an agentic experience directly into your CLI. In this demo, we'll show how this OSS stack can run everywhere, from lightweight kind clusters on your laptop to production-grade clusters at scale. The experience supports any LLM provider: in-cluster, local, or cloud, ensuring data never leaves your environment and costs remain predictable. We will showcase how users can ask natural-language questions (e.g., "why is my pod Pending?") and get grounded reasoning, targeted diagnostics, and safe, human-in-the-loop remediation steps -- all without leaving the terminal. Whether you're experimenting locally or running mission-critical workloads, you'll walk away knowing how to extend these OSS components to build your own agentic workflows in Kubernetes.

All sessions (Microsoft speaker(s) - session)

- Will Case - No Kubectl, No Problem: The Future With Conversational Kubernetes
- Ana Maria Lopez Moreno - Smarter Together: Orchestrating Multi-Agent AI Systems With A2A and MCP on Container
- Neha Aggarwal - 10 Years of Cilium: Connecting, Securing, and Simplifying the Cloud Native Stack
- Yi Zha - Strengthening Supply Chain for Kubernetes: Cross-Cloud SLSA Attestation Verification
- Joaquim Rocha & Oleksandr Dubenko - Contribfest: Power up Your CNCF Tools With Headlamp
- Jeremy Rickard - Shaping LTS Together: What We've Learned the Hard Way
- Feynman Zhou - Shipping Secure, Reusable, and Composable Infrastructure as Code: GE HealthCare's Journey With ORAS
- Jackie Maertens & Nilekh Chaudhari - No Joke: Two Security Maintainers Walk Into a Cluster
- Paul Yu, Sachi Desai - Rage Against the Machine: Fighting AI Complexity with Kubernetes simplicity
- Dipti Pai - Flux - The GitLess GitOps Edition
- Trask Stalnaker - OpenTelemetry: Unpacking 2025, Charting 2026
- Mike Morris - Gateway API: Table Stakes
- Anish Ramasekar, Mo Khan, Stanislav Láznička, Rita Zhang & Peter Engelbert - Strengthening Kubernetes Trust: SIG Auth's Latest Security Enhancements
- Ernest Wong - AI Models Are Huge, but Your GPUs Aren't: Mastering multi-mode distributed inference on Kubernetes
- Rita Zhang - Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes fit?
- Suraj Deshmukh - LLMs on Kubernetes: Squeeze 5x GPU Efficiency with cache, route, repeat!
- Aman Singh - Drasi: A New Take on Change-driven Architectures
- Ganeshkumar Ashokavardhanan & Qinghui Zhuang - Agent-Driven MCP for AI Workloads on Kubernetes
- Steven Jin - Contribfest: From Farm (Fork) To Table (Feature): Growing Your First (Free-range Organic) Istio PR
- Jack Francis - SIG Autoscaling Projects Update
- Mark Rossetti - Kubernetes SIG-Windows Updates
- Apurup Chevuru & Michael Zappa - Portable MTLS for Kubernetes: A QUIC-Based Plugin Compatible With Any CNI
- Ciprian Hacman - The Next Decoupling: From Monolithic Cluster, To Control-Plane With Nodes
- Keith Mattix - Istio Project Updates: AI Inference, Ambient Multicluster & Default Deny
- Jonathan Smith - How Comcast Leverages Radius in Their Internal Developer Platform
- Jon Huhn - Lightning Talk: Getting (and Staying) up To Speed on DRA With the DRA Example Driver
- Rita Zhang, Jaydip Gabani - Open Policy Agent (OPA) Intro & Deep Dive
- Bridget Kromhout - SIG Cloud Provider Deep Dive: Expanding Our Mission
- Pavneet Ahluwalia - Beyond ChatOps: Agentic AI in Kubernetes—What Works, What Breaks, and What's Next
- Ryan Zhang - Finally, a Cluster Inventory I Can USE!
- Michael Katchinskiy, Yossi Weizman - You Deployed What?! Data-driven lesson on Unsafe Helm Chart Defaults
- Mauricio Vásquez Bernal & Jose Blanquicet - Contribfest: Inspektor Gadget Contribfest: Enhancing the Observability and Security of Your K8s Clusters Through an easy to use Framework
- Wei Fu - etcd V3.6 and Beyond + Etcd-operator Updates
- Jeremy Rickard - GitHub Actions: Project Usage and Deep Dive
- Dor Serero & Michael Katchinskiy - What Doesn't Kill You Makes You Stronger: The Vulnerabilities that Redefined Kubernetes Security

We can't wait to see you in Atlanta! Microsoft's presence is all about empowering developers and operators to build, secure, and scale modern applications. You'll see us leading sessions, sharing open-source contributions, and hosting roundtables on how cloud native powers AI in production. We're here to learn from you, too - so bring your questions, ideas, and feedback.

Azure Container Registry Repository Permissions with Attribute-based Access Control (ABAC)
General Availability announcement

Today marks the general availability of Azure Container Registry (ACR) repository permissions with Microsoft Entra attribute-based access control (ABAC). ABAC augments the familiar Azure RBAC model with namespace- and repository-level conditions so platform teams can express least-privilege access at the granularity of specific repositories or entire logical namespaces. This capability is designed for modern multi-tenant platform engineering patterns where a central registry serves many business domains. With ABAC, CI/CD systems and runtime consumers like Azure Kubernetes Service (AKS) clusters get least-privilege access to ACR registries.

Why this matters

Enterprises are converging on a central container registry pattern that hosts artifacts and container images for multiple business units and application domains. In this model:

- CI/CD pipelines from different parts of the business push container images and artifacts only to approved namespaces and repositories within a central registry.
- AKS clusters, Azure Container Apps (ACA), Azure Container Instances (ACI), and other consumers pull only from authorized repositories within a central registry.

With ABAC, these repository and namespace permission boundaries become explicit and enforceable using standard Microsoft Entra identities and role assignments. This aligns with cloud-native zero trust, supply chain hardening, and least-privilege permissions.

What ABAC in ACR means

ACR registries now support a registry permissions mode called "RBAC Registry + ABAC Repository Permissions." Configuring a registry to this mode makes it ABAC-enabled. When a registry is ABAC-enabled, registry administrators can optionally add ABAC conditions during standard Azure RBAC role assignments. These optional ABAC conditions scope the role assignment's effect to specific repositories or namespace prefixes. ABAC can be enabled on all new and existing ACR registries across all SKUs, either during registry creation or on existing registries.

ABAC-enabled built-in roles

Once a registry is ABAC-enabled (configured to "RBAC Registry + ABAC Repository Permissions"), registry admins can use these ABAC-enabled built-in roles to grant repository-scoped permissions:

- Container Registry Repository Reader: grants image pull and metadata read permissions, including tag resolution and referrer discoverability.
- Container Registry Repository Writer: grants Repository Reader permissions, as well as image and tag push permissions.
- Container Registry Repository Contributor: grants Repository Reader and Writer permissions, as well as image and tag delete permissions.

Note that these roles do not grant repository list permissions. The separate Container Registry Repository Catalog Lister role must be assigned to grant repository list permissions. The Catalog Lister role does not support ABAC conditions; assigning it grants permission to list all repositories in a registry.

Important role behavior changes in ABAC mode

When a registry is ABAC-enabled (its permissions mode is set to "RBAC Registry + ABAC Repository Permissions"):

- Legacy data-plane roles such as AcrPull, AcrPush, and AcrDelete are not honored in ABAC-enabled registries. For ABAC-enabled registries, use the ABAC-enabled roles listed above.
- Broad roles like Owner, Contributor, and Reader previously granted full control plane and data plane permissions, which is typically an overprivileged role assignment.
In ABAC-enabled registries, these broad roles only grant control plane permissions to the registry. They no longer grant data plane permissions, such as image push, pull, delete, or repository list permissions.
- ACR Tasks, Quick Tasks, Quick Builds, and Quick Runs no longer inherit default data-plane access to source registries; assign the ABAC-enabled roles above to the calling identity as needed.

Identities you can assign

ACR ABAC uses standard Microsoft Entra role assignments. Assign RBAC roles with optional ABAC conditions to users, groups, service principals, and managed identities, including AKS kubelet and workload identities, ACA and ACI identities, and more.

Next steps

Start using ABAC repository permissions in ACR to enforce least-privilege artifact push, pull, and delete boundaries across your CI/CD systems and container image workloads. This model is now the recommended approach for multi-tenant platform engineering patterns and central registry deployments. To get started, follow the step-by-step guides in the official ACR ABAC documentation: https://aka.ms/acr/auth/abac
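To make the repository scoping concrete, here is a rough sketch of what an ABAC-conditioned role assignment can look like with the Azure CLI. The `az role assignment create` flags shown are standard, but the action and attribute names inside the condition string are illustrative assumptions, so verify the exact strings against the ACR ABAC documentation linked above before relying on them:

```bash
# Sketch: grant pull access only to repositories under the "team-a/" namespace.
# The action and attribute names in the condition are assumptions; confirm them
# in the ACR ABAC docs (https://aka.ms/acr/auth/abac).
az role assignment create \
  --role "Container Registry Repository Reader" \
  --assignee "<principal-object-id>" \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerRegistry/registries/<registry>" \
  --condition-version "2.0" \
  --condition "((!(ActionMatches{'Microsoft.ContainerRegistry/registries/repositories/content/read'})) OR (@Resource[Microsoft.ContainerRegistry/registries/repositories:name] StringStartsWith 'team-a/'))"
```

The same pattern applies to the Repository Writer and Contributor roles; only the role name and the actions referenced in the condition change.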
Expanding the Public Preview of the Azure SRE Agent

We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly - no sign-up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we've baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below.

📖 Learn more about SRE Agent.
👉 Create your first SRE Agent (Azure login required)

What's New in Azure SRE Agent - October Update

The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation, built for scale. It can even resolve incidents autonomously by following your team's runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit the product documentation for the latest updates. Here are a few highlights for this month:

- Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy, from read-only insights to approval-gated actions to full automation, without compromising control.
- Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for az CLI and kubectl, it works across all Azure services. But it doesn't stop there: diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment.
- Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the agent ingest alerts and trigger workflows that match your team's existing tools, so you can respond faster with less manual effort.
- Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve, from guided actions to trusted autonomy, without ever giving up control.
- Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them, accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable.
- Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps, complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation.

Getting Started

- Start here: Create a new SRE Agent in the Azure portal (Azure login required)
- Blog: Announcing a flexible, predictable billing model for Azure SRE Agent
- Blog: Enterprise-ready and extensible - Update on the Azure SRE Agent preview
- Product documentation
- Product home page

Community & Support

We'd love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team.

Leveraging Low Priority Pods for Rapid Scaling in AKS
If you're running workloads in Kubernetes, you'll know that scalability is key to keeping things available and responsive. But there's a problem: when your cluster runs out of resources, the node autoscaler needs to spin up new nodes, and this takes anywhere from 5 to 10 minutes. That's a long time to wait when you're dealing with a traffic spike. One way to handle this is using low priority pods to create buffer nodes that can be preempted when your actual workloads need the resources.

The Problem

Cloud-native applications are dynamic, and workload demands can spike quickly. Automatic scaling helps, but the delay in scaling up nodes when you run out of capacity can leave you vulnerable, especially in production. When a cluster runs out of available nodes, the autoscaler provisions new ones, and during that 5-10 minute wait you're facing:

- Increased Latency: Users experience lag or downtime whilst they're waiting for resources to become available.
- Resource Starvation: High-priority workloads don't get the resources they need, leading to degraded performance or failed tasks.
- Operational Overhead: SREs end up manually intervening to manage resource loads, which takes them away from more important work.

This is enough reason to look at creating spare capacity in your cluster, and that's where low priority pods come in.

The Solution

The idea is pretty straightforward: you run low priority pods in your cluster that don't actually do any real work - they're just placeholders consuming resources. These pods are sized to take up enough space that the cluster autoscaler provisions additional nodes for them. Effectively, you're creating a buffer of "standby" nodes that are ready and waiting. When your real workloads need resources and the cluster is under pressure, Kubernetes kicks out these low priority pods to make room - this is called preemption. Essentially, Kubernetes looks at what's running, sees the low priority pods, and terminates them to free up the nodes. This happens almost immediately, and your high-priority workloads can use that capacity straight away. Meanwhile, those evicted low priority pods sit in a pending state, which triggers the autoscaler to spin up new nodes to replace the buffer you just used. The whole thing is self-maintaining.

How Preemption Actually Works

When a high-priority pod needs to be scheduled but there aren't enough resources, the Kubernetes scheduler kicks off preemption. This happens almost instantly compared to the 5-10 minute wait for new nodes. Here's what happens:

1. Identification: The scheduler works out which low priority pods need to be evicted to make room. It picks the lowest priority pods first.
2. Graceful Termination: The selected pods get a termination signal (SIGTERM) and a grace period (usually 30 seconds by default) to shut down cleanly.
3. Resource Release: Once the low priority pods terminate, their resources are immediately released and available for scheduling. The high-priority pod can then be scheduled onto the node, typically within seconds.
4. Buffer Pod Rescheduling: After preemption, the evicted low priority pods try to reschedule. If there's capacity on existing nodes, they'll land there. If not, they'll sit in a pending state, which triggers the cluster autoscaler to provision new nodes.

This gives you a dual benefit: your critical workloads get immediate access to the nodes that were running low priority pods, and the system automatically replenishes the buffer in the background.
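One prerequisite worth calling out: preemption only happens when the pending pod's priority is strictly higher than the buffer pods'. Since pods without a priorityClassName default to priority 0 (the same value the buffer class uses in Step 1 below), your real workloads need an explicit higher-priority class. A minimal sketch, where the class name and value are assumptions rather than part of the original setup:

```yaml
# Sketch: a PriorityClass for real workloads so they outrank the buffer pods.
# The name and value here are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                          # Strictly higher than the buffer class's value of 0
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # The default; allows evicting lower-priority pods
description: "Priority for workloads allowed to preempt buffer pods"
```

Workloads opt in by setting priorityClassName: high-priority in their pod spec.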
Whilst your high-priority workloads are running on the newly freed capacity, the autoscaler is already provisioning replacement nodes for the evicted buffer pods. Your buffer capacity is continuously maintained without any manual work, so you're always ready for the next spike. The key advantage here is speed. Whilst provisioning a new node takes 5-10 minutes, preempting a low priority pod and scheduling a high-priority pod in its place typically completes in under a minute.

Why This Approach Works Well

Now that you understand how the solution works, let's look at why it's effective:

- Immediate Resource Availability: You maintain a pool of ready nodes that can rapidly scale up when needed. There's always capacity available to handle sudden load spikes without waiting for new nodes.
- Seamless Scaling: High-priority workloads never face resource starvation, even during traffic surges. They get immediate access to capacity, whilst the buffer automatically replenishes itself in the background.
- Self-Maintaining: Once set up, the system handles everything automatically. You don't need to manually manage the buffer or intervene when workloads spike.

The Trade-Off

Whilst low priority pods offer significant advantages for keeping your cluster responsive, you need to understand the cost implications. By maintaining buffer nodes with low priority pods, you're running machines that aren't hosting active, productive workloads. You're paying for additional infrastructure just for availability and responsiveness. These buffer nodes consume compute resources you're paying for, even though they're only running placeholder workloads. The decision for your organisation comes down to whether the improved responsiveness and elimination of that 5-10 minute scaling delay justifies the extra cost. For production environments with strict SLA requirements or where downtime is expensive, this trade-off is usually worth it. However, you'll want to carefully size your buffer capacity to balance cost with availability needs.

Setting It Up

Step 1: Define Your Low Priority Pod Configurations

Start by defining low priority pods using the PriorityClass resource. This is where you create configurations that designate certain workloads as low priority.
Here's what that configuration looks like:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 0
globalDefault: false
description: "Priority class for buffer pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: buffer-pods
  namespace: default
spec:
  replicas: 3  # Adjust based on how much buffer capacity you need
  selector:
    matchLabels:
      app: buffer
  template:
    metadata:
      labels:
        app: buffer
    spec:
      priorityClassName: low-priority
      containers:
      - name: buffer-container
        image: registry.k8s.io/pause:3.9  # Lightweight image that does nothing
        resources:
          requests:
            cpu: "1000m"    # Size these based on your typical workload needs
            memory: "2Gi"   # Large enough to trigger node creation
          limits:
            cpu: "1000m"
            memory: "2Gi"
```

The key things to note here:

- The PriorityClass has a value of 0, which is lower than the default priority for regular pods (typically 1000+).
- We're using a Deployment rather than individual pods so we can easily scale the buffer size.
- The pause image is a minimal container that does basically nothing - perfect for a placeholder.
- The resource requests are what matter - these determine how much space each buffer pod takes up.
- You'll want to size the CPU and memory requests based on your actual workload needs.

Step 2: Deploy the Low Priority Pods

Next, deploy these low priority pods across your cluster. Use affinity configurations to spread them out and let Kubernetes manage them.

Step 3: Monitor and Adjust

You'll want to monitor your deployment to make sure your buffer nodes are scaling up when needed and scaling down during idle periods to save costs. Tools like Prometheus and Grafana work well for monitoring resource usage and pod status so you can refine your setup over time.

Best Practices

- Right-Sizing Your Buffer Pods: The resource requests for your low priority pods need careful thought. They need to be big enough to consume sufficient capacity that additional buffer nodes actually get provisioned by the autoscaler. But they shouldn't be so large that you end up over-provisioning beyond your required buffer size. Think about your typical workload resource requirements and size your buffer pods to create exactly the number of standby nodes you need.
- Regular Assessment: Keep assessing your scaling strategies and adjust based on what you're seeing with workload patterns and demands. Monitor how often your buffer pods are getting evicted and whether the buffer size makes sense for your traffic patterns.
- Communication and Documentation: Make sure your team understands what low priority pods do in your deployment and what this means for your SLAs. Document the cost of running your buffer nodes and why you're justifying this overhead.
- Automated Alerts: Set up alerts for when pod eviction happens so you can react quickly and make sure critical workloads aren't being affected. Also alert on buffer pod status to ensure your buffer capacity stays available.

Wrapping Up

Leveraging low priority pods to create buffer nodes is an effective way to handle resource constraints when you need rapid scaling and can't afford to wait for the node autoscaler. This approach is particularly valuable if you're dealing with workloads that experience sudden, unpredictable traffic spikes and need to scale up immediately - think scenarios like flash sales, breaking news events, or user-facing applications with strict SLA requirements. However, this isn't a one-size-fits-all solution.
### Step 3: Monitor and Adjust

You'll want to monitor your deployment to make sure your buffer nodes are scaling up when needed and scaling down during idle periods to save costs. Tools like Prometheus and Grafana work well for monitoring resource usage and pod status so you can refine your setup over time.

## Best Practices

- **Right-Sizing Your Buffer Pods:** The resource requests for your low priority pods need careful thought. They need to be big enough to consume sufficient capacity that the autoscaler actually provisions additional buffer nodes, but not so large that you end up over-provisioning beyond your required buffer size. Think about your typical workload resource requirements and size your buffer pods to create exactly the number of standby nodes you need.
- **Regular Assessment:** Keep assessing your scaling strategy and adjust it as workload patterns and demands change. Monitor how often your buffer pods are getting evicted and whether the buffer size still makes sense for your traffic patterns.
- **Communication and Documentation:** Make sure your team understands what low priority pods do in your deployment and what this means for your SLAs. Document the cost of running your buffer nodes and why that overhead is justified.
- **Automated Alerts:** Set up alerts for pod evictions so you can react quickly and make sure critical workloads aren't being affected. Also alert on buffer pod status to ensure your buffer capacity stays available.

## Wrapping Up

Leveraging low priority pods to create buffer nodes is an effective way to handle resource constraints when you need rapid scaling and can't afford to wait for the node autoscaler. This approach is particularly valuable if you're dealing with workloads that experience sudden, unpredictable traffic spikes and need to scale up immediately - think scenarios like flash sales, breaking news events, or user-facing applications with strict SLA requirements.

However, this isn't a one-size-fits-all solution. If your workloads are fairly static, or you can tolerate the 5-10 minute wait for new nodes to provision, you probably don't need this. The buffer comes at an additional cost since you're running nodes that aren't doing productive work, so you need to weigh whether the improved responsiveness justifies the extra spend for your specific use case.

If you do decide this approach fits your needs, keep monitoring and iterating on your configuration to get the best resource management. By maintaining a buffer of low priority pods, you can address resource scarcity before it becomes a problem, reduce latency, and provide a much better experience for your users. This approach makes your cluster more responsive and frees up your operational capacity to focus on improving services instead of constantly firefighting resource issues.

# Choosing the Right Azure Containerisation Strategy: AKS, App Service, or Container Apps?
## Azure Kubernetes Service (AKS)

**What is it?** AKS is Microsoft's managed Kubernetes offering, providing full access to the Kubernetes API and control plane. It's designed for teams that want to run complex, scalable, and highly customisable container workloads, with direct control over orchestration, networking, and security.

**When to choose AKS:**

- You need advanced orchestration, custom networking, or integration with third-party tools.
- Your team has Kubernetes expertise and wants granular control.
- You're running large-scale, multi-service, or hybrid/multi-cloud workloads.
- You require Windows container support (with some limitations).

**Advantages:**

- Full Kubernetes API access and ecosystem compatibility.
- Supports both Linux and Windows containers.
- Highly customisable (networking, storage, security, scaling).
- Suitable for complex, stateful, or regulated workloads.

**Disadvantages:**

- Steeper learning curve; requires Kubernetes knowledge.
- You manage cluster upgrades, scaling, and security patches (though Azure automates much of this).
- Potential for over-provisioning and higher operational overhead.

## Azure App Service

**What is it?** App Service is a fully managed Platform-as-a-Service (PaaS) for hosting web apps, APIs, and backends. It supports both code and container deployments, but is optimised for web-centric workloads.

**When to choose App Service:**

- You're building traditional web apps, REST APIs, or mobile backends.
- You want to deploy quickly with minimal infrastructure management.
- Your team prefers a PaaS experience with built-in scaling, SSL, and CI/CD.
- You need to run Windows containers (with some limitations).

**Advantages:**

- Easiest to use, minimal configuration, fast deployments.
- Built-in scaling, SSL, custom domains, and staging slots.
- Tight integration with Azure DevOps, GitHub Actions, and other Azure services.
- Handles infrastructure, patching, and scaling for you.

**Disadvantages:**

- Less flexibility for complex microservices or custom orchestration.
- Limited access to underlying infrastructure and networking.
- Not ideal for event-driven or non-HTTP workloads.

## Azure Container Apps

**What is it?** Container Apps is a fully managed, serverless container platform built on Kubernetes and open-source tech like Dapr and KEDA. It abstracts away Kubernetes complexity, letting you focus on microservices, event-driven workloads, or background jobs.

**When to choose Container Apps:**

- You want to run microservices or event-driven workloads without managing Kubernetes.
- You need automatic scaling (including scale to zero) based on HTTP traffic or events.
- You want to use Dapr for service discovery, pub/sub, or state management.
- You're building modern, cloud-native apps but don't need direct Kubernetes API access.

**Advantages:**

- Serverless scaling (including to zero); pay only for what you use.
- Built-in support for microservices patterns, event-driven architectures, and background jobs.
- No cluster management; Azure handles the infrastructure.
- Integrates with Azure DevOps and GitHub Actions, and supports Linux containers from any registry.

**Disadvantages:**

- No direct access to Kubernetes APIs or custom controllers.
- Linux containers only (no Windows container support).
- Some advanced networking and customisation options are limited compared to AKS.
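To make the serverless scaling model concrete, here's a rough sketch of the kind of YAML spec you could pass to `az containerapp create --yaml`. The app name, image, resource group, and scaling thresholds are all hypothetical, and the schema is abbreviated - double-check the current Container Apps YAML reference before relying on it:

```yaml
# Hypothetical Container App: an HTTP API that scales to zero when idle
# and out to 10 replicas under load (all names and values are placeholders).
type: Microsoft.App/containerApps
name: my-api
resourceGroup: my-rg
location: westeurope
properties:
  environmentId: /subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.App/managedEnvironments/my-env
  configuration:
    ingress:
      external: true
      targetPort: 8080
  template:
    containers:
      - name: my-api
        image: myregistry.azurecr.io/my-api:latest # any Linux image from any registry
        resources:
          cpu: 0.5
          memory: 1Gi
    scale:
      minReplicas: 0          # scale to zero when idle - you stop paying for compute
      maxReplicas: 10
      rules:
        - name: http-scale    # KEDA-backed HTTP scaling rule
          http:
            metadata:
              concurrentRequests: "50" # add replicas as concurrent requests grow
```

The `scale` block is what sets Container Apps apart from the other two services: KEDA-backed rules let the platform add and remove replicas, all the way down to zero, without you managing any cluster.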
## Key Differences

| Feature | Azure Kubernetes Service (AKS) | Azure App Service | Azure Container Apps |
|---|---|---|---|
| Best for | Complex, scalable, custom workloads | Web apps, APIs, backends | Microservices, event-driven, jobs |
| Management | You manage (with Azure help) | Fully managed | Fully managed, serverless |
| Scaling | Manual/auto (pods, nodes) | Auto (HTTP traffic) | Auto (HTTP/events, scale to zero) |
| API Access | Full Kubernetes API | No infra access | No Kubernetes API |
| OS Support | Linux & Windows | Linux & Windows | Linux only |
| Networking | Advanced, customisable | Basic (web-centric) | Basic, with VNet integration |
| Use Cases | Hybrid/multi-cloud, regulated, large-scale | Web, REST APIs, mobile | Microservices, event-driven, background jobs |
| Learning Curve | Steep (Kubernetes skills needed) | Low | Low-medium |
| Pricing | Pay for nodes (even idle) | Pay for plan (fixed/auto) | Pay for usage (scale to zero) |
| CI/CD Integration | Azure DevOps, GitHub, custom | Azure DevOps, GitHub | Azure DevOps, GitHub |

## How to Decide?

- **Start with App Service** if you're building a straightforward web app or API and want the fastest path to production.
- **Choose Container Apps** for modern microservices, event-driven, or background processing workloads where you want serverless scaling and minimal ops.
- **Go with AKS** when you need full Kubernetes power, advanced customisation, or are running at enterprise scale with a skilled team.

## Conclusion

Azure's containerisation portfolio is broad, but each service is optimised for different scenarios. For most new cloud-native projects, Container Apps offers the best balance of simplicity and power. For web-centric workloads, App Service remains the fastest route. For teams needing full control and scale, AKS is unmatched.

**Tip:** Start simple, and only move to more complex platforms as your requirements grow. Azure's flexibility means you can mix and match these services as your architecture evolves.