azure kubernetes service
Agentic Power for AKS: Introducing the Agentic CLI in Public Preview
We are excited to announce the agentic CLI for AKS, available now in public preview directly through the Azure CLI. A huge thank you to all our private preview customers who took the time to try out our beta releases and provide feedback to our team. The agentic CLI is now available for everyone to try. Continue reading to learn how you can get started.

Why we built the agentic CLI for AKS

The way we build software is changing with the democratization of coding agents. We believe the same should happen for how users manage their Kubernetes environments. With this feature, we want to simplify the management and troubleshooting of AKS clusters, while reducing the barrier to entry for startups and developers by bridging the knowledge gap.

The agentic CLI for AKS is designed to simplify this experience by bringing agentic capabilities to your cluster operations and observability, translating natural language into actionable guidance and analysis. Whether you need to right-size your infrastructure, troubleshoot complex networking issues like DNS or outbound connectivity, or ensure smooth K8s upgrades, the agentic CLI helps you make informed decisions quickly and confidently. Our goal: streamline cluster operations and empower teams to ask questions like “Why is my pod restarting?” or “How can I optimize my cluster for cost?” and get instant, actionable answers.

The agentic CLI for AKS is built on the open-source HolmesGPT project, which has recently been accepted as a CNCF Sandbox project. With a pluggable LLM endpoint structure and open-source backing, the agentic CLI is purpose-built for customizability and data privacy.

From private to public preview: what's new?

Earlier this year, we launched the agentic CLI in private beta for a small group of AKS customers. Their feedback has shaped what's new in our public preview release, which we are excited to share with the broader AKS community.
Let's dig in:

- Simplified setup: One-time initialization for LLM parameters with 'az aks agent-init'. Configure your LLM parameters, such as API key and model, through a simple, guided user interface.
- AKS MCP integration: Enable the agent to install and run the AKS MCP server locally (directly in your CLI client) for advanced context-aware operations. The AKS MCP server includes tools for AKS clusters and associated Azure resources. Try it out:

    az aks agent "list all my unhealthy nodepools" --aks-mcp -n <cluster-name> -g <resource-group>

- Deeper investigations: A new "Task List" feature helps the agent plan and execute complex investigations. A checklist-style tracker lets you stay updated on the agent's progress and planned tool calls.
- Provide in-line feedback: Share insights directly from the CLI about the agent's performance using /feedback. Provide a rating of the agent's analysis and optional written feedback directly to the agentic CLI team. Your feedback is highly appreciated and will help us improve the agentic CLI's capabilities.
- Performance and security improvements: Minor improvements for faster load times and reduced latency, as well as hardened initialization and token handling.

Getting Started

Install the extension:

    az extension add --name aks-agent

Set up your LLM endpoint:

    az aks agent-init

Start asking questions. Some recommended scenarios to try out:

- Troubleshoot cluster health:

    az aks agent "Give me an overview of my cluster's health"

- Right-size your cluster:

    az aks agent "How can I optimize my node pool for cost?"

- Try out the AKS MCP integration:

    az aks agent "Show me CPU and memory usage trends" --aks-mcp -n <cluster-name> -g <resource-group>

- Get upgrade guidance:

    az aks agent "What should I check before upgrading my AKS cluster?"

Update the agentic CLI extension:

    az extension update --name aks-agent

Join the Conversation

We'd love your feedback!
Use the built-in '/feedback' command or visit our GitHub repository to share ideas and issues.

Learn more: https://aka.ms/aks/agentic-cli
Share feedback: https://aka.ms/aks/agentic-cli/issues

Microsoft Azure at KubeCon North America 2025 | Atlanta, GA - Nov 10-13
KubeCon + CloudNativeCon North America is back - this time in Atlanta, Georgia, and the excitement is real. Whether you're a developer, operator, architect, or just Kubernetes-curious, Microsoft Azure is showing up with a packed agenda, hands-on demos, and plenty of ways to connect and learn with our team of experts. Read on for all the ways you can connect with our team!

Kick off with Azure Day with Kubernetes (Nov 10)

Before the main conference even starts, join us for Azure Day with Kubernetes on November 10. It's a full day of learning, best practices, deep-dive discussions, and hands-on labs, all designed to help you build cloud-native and AI apps with Kubernetes on Azure. You'll get to meet Microsoft experts, dive into technical sessions, and either roll up your sleeves in the afternoon labs or have focused deep-dive discussions in our whiteboarding sessions. If you're looking to sharpen your skills or just want to chat with folks who live and breathe Kubernetes on Azure, this is the place to be. Spots are limited, so register today at: https://aka.ms/AzureKubernetesDay

Catch up with our experts at Booth #500

The Microsoft booth is more than just a spot to grab swag (though, yes, there will be swag and stickers!). It's a central hub for connecting with product teams, setting up meetings, and seeing live demos. Whether you want to learn how to troubleshoot Kubernetes with agentic AI tools, explore open-source projects, or just talk shop, you'll find plenty of friendly faces ready to help. We will be running a variety of theatre sessions and demos out of the booth all week on topics including AKS Automatic, agentic troubleshooting, Azure Verified Modules, networking, app modernization, hybrid deployments, storage, and more.

🔥 Hot tip: join us for our live Kubernetes Trivia Show at the Microsoft Azure booth during the KubeCrawl on Tuesday to win exclusive swag!
Microsoft sessions at KubeCon NA 2025

Here's a quick look at all the sessions with Microsoft speakers that you won't want to miss. Click the titles for full details and add them to your schedule!

Keynotes

Date: Thu November 13, 2025
Start Time: 9:49 AM
Room: Exhibit Hall B2
Title: Scaling Smarter: Simplifying Multicluster AI with KAITO and KubeFleet
Speaker: Jorge Palma
Abstract: As demand for AI workloads on Kubernetes grows, multicluster inferencing has emerged as a powerful yet complex architectural pattern. While multicluster support offers benefits in terms of geographic redundancy, data sovereignty, and resource optimization, it also introduces significant challenges around orchestration, traffic routing, cost control, and operational overhead. To address these challenges, we'll introduce two CNCF projects, KAITO and KubeFleet, that work together to simplify and optimize multicluster AI operations. KAITO provides a declarative framework for managing AI inference workflows with built-in support for model versioning and performance telemetry. KubeFleet complements this by enabling seamless workload distribution across clusters, based on cost, latency, and availability. Together, these tools reduce operational complexity, improve cost efficiency, and ensure consistent performance at scale.

Date: Thu November 13, 2025
Start Time: 9:56 AM
Room: Exhibit Hall B2
Title: Cloud Native Back to the Future: The Road Ahead
Speakers: Jeremy Rickard (Microsoft), Alex Chircop (Akamai)
Abstract: The Cloud Native Computing Foundation (CNCF) turns 10 this year, now home to more than 200 projects across the cloud native landscape. As we look ahead, the community faces new demands around security, sustainability, complexity, and emerging workloads like AI inference and agents. As many areas of the ecosystem transition to mature foundational building blocks, we are excited to explore the next evolution of cloud native development. The TOC will highlight how these challenges open opportunities to shape the next generation of applications and ensure the ecosystem continues to thrive. How are new projects addressing these new emerging workloads? How will these new projects impact security hygiene in the ecosystem? How will existing projects adapt to meet new realities? How is the CNCF evolving to support this next generation of computing? Join us as we reflect on the first decade of cloud native, and look ahead to how this community will power the age of AI, intelligent systems, and beyond.

Featured Demo

Date: Wed November 12, 2025
Start Time: 2:15-2:35 PM
Room: Expo Demo Area
Title: HolmesGPT: Agentic K8s troubleshooting in your terminal
Speakers: Pavneet Singh Ahluwalia (Microsoft), Arik Alon (Robusta)
Abstract: Troubleshooting Kubernetes shouldn't require hopping across dashboards, logs, and docs. With open-source tools like HolmesGPT and the Model Context Protocol (MCP) server, you can now bring an agentic experience directly into your CLI. In this demo, we'll show how this OSS stack can run everywhere, from lightweight kind clusters on your laptop to production-grade clusters at scale. The experience supports any LLM provider (in-cluster, local, or cloud), ensuring data never leaves your environment and costs remain predictable. We will showcase how users can ask natural-language questions (e.g., “why is my pod Pending?”) and get grounded reasoning, targeted diagnostics, and safe, human-in-the-loop remediation steps, all without leaving the terminal. Whether you're experimenting locally or running mission-critical workloads, you'll walk away knowing how to extend these OSS components to build your own agentic workflows in Kubernetes.
All sessions

Microsoft Speaker(s) | Session
Will Case | No Kubectl, No Problem: The Future With Conversational Kubernetes
Ana Maria Lopez Moreno | Smarter Together: Orchestrating Multi-Agent AI Systems With A2A and MCP on Container
Neha Aggarwal | 10 Years of Cilium: Connecting, Securing, and Simplifying the Cloud Native Stack
Yi Zha | Strengthening Supply Chain for Kubernetes: Cross-Cloud SLSA Attestation Verification
Joaquim Rocha & Oleksandr Dubenko | Contribfest: Power up Your CNCF Tools With Headlamp
Jeremy Rickard | Shaping LTS Together: What We’ve Learned the Hard Way
Feynman Zhou | Shipping Secure, Reusable, and Composable Infrastructure as Code: GE HealthCare’s Journey With ORAS
Jackie Maertens & Nilekh Chaudhari | No Joke: Two Security Maintainers Walk Into a Cluster
Paul Yu, Sachi Desai | Rage Against the Machine: Fighting AI Complexity with Kubernetes simplicity
Dipti Pai | Flux - The GitLess GitOps Edition
Trask Stalnaker | OpenTelemetry: Unpacking 2025, Charting 2026
Mike Morris | Gateway API: Table Stakes
Anish Ramasekar, Mo Khan, Stanislav Láznička, Rita Zhang & Peter Engelbert | Strengthening Kubernetes Trust: SIG Auth's Latest Security Enhancements
Ernest Wong | AI Models Are Huge, but Your GPUs Aren’t: Mastering multi-mode distributed inference on Kubernetes
Rita Zhang | Navigating the Rapid Evolution of Large Model Inference: Where does Kubernetes fit?
Suraj Deshmukh | LLMs on Kubernetes: Squeeze 5x GPU Efficiency with cache, route, repeat!
Aman Singh | Drasi: A New Take on Change-driven Architectures
Ganeshkumar Ashokavardhanan & Qinghui Zhuang | Agent-Driven MCP for AI Workloads on Kubernetes
Steven Jin | Contribfest: From Farm (Fork) To Table (Feature): Growing Your First (Free-range Organic) Istio PR
Jack Francis | SIG Autoscaling Projects Update
Mark Rossetti | Kubernetes SIG-Windows Updates
Apurup Chevuru & Michael Zappa | Portable MTLS for Kubernetes: A QUIC-Based Plugin Compatible With Any CNI
Ciprian Hacman | The Next Decoupling: From Monolithic Cluster, To Control-Plane With Nodes
Keith Mattix | Istio Project Updates: AI Inference, Ambient Multicluster & Default Deny
Jonathan Smith | How Comcast Leverages Radius in Their Internal Developer Platform
Jon Huhn | Lightning Talk: Getting (and Staying) up To Speed on DRA With the DRA Example Driver
Rita Zhang, Jaydip Gabani | Open Policy Agent (OPA) Intro & Deep Dive
Bridget Kromhout | SIG Cloud Provider Deep Dive: Expanding Our Mission
Pavneet Ahluwalia | Beyond ChatOps: Agentic AI in Kubernetes—What Works, What Breaks, and What’s Next
Ryan Zhang | Finally, a Cluster Inventory I Can USE!
Michael Katchinskiy, Yossi Weizman | You Deployed What?! Data-driven lesson on Unsafe Helm Chart Defaults
Mauricio Vásquez Bernal & Jose Blanquicet | Contribfest: Inspektor Gadget Contribfest: Enhancing the Observability and Security of Your K8s Clusters Through an easy to use Framework
Wei Fu | etcd V3.6 and Beyond + Etcd-operator Updates
Jeremy Rickard | GitHub Actions: Project Usage and Deep Dive
Dor Serero & Michael Katchinskiy | What Doesn't Kill You Makes You Stronger: The Vulnerabilities that Redefined Kubernetes Security

We can't wait to see you in Atlanta!

Microsoft's presence is all about empowering developers and operators to build, secure, and scale modern applications. You'll see us leading sessions, sharing open-source contributions, and hosting roundtables on how cloud native powers AI in production.
We're here to learn from you, too - so bring your questions, ideas, and feedback.

Azure Container Registry Repository Permissions with Attribute-based Access Control (ABAC)
General Availability announcement

Today marks the general availability of Azure Container Registry (ACR) repository permissions with Microsoft Entra attribute-based access control (ABAC). ABAC augments the familiar Azure RBAC model with namespace and repository-level conditions so platform teams can express least-privilege access at the granularity of specific repositories or entire logical namespaces. This capability is designed for modern multi-tenant platform engineering patterns where a central registry serves many business domains. With ABAC, CI/CD systems and runtime consumers like Azure Kubernetes Service (AKS) clusters can be granted least-privilege access to ACR registries.

Why this matters

Enterprises are converging on a central container registry pattern that hosts artifacts and container images for multiple business units and application domains. In this model:

- CI/CD pipelines from different parts of the business push container images and artifacts only to approved namespaces and repositories within a central registry.
- AKS clusters, Azure Container Apps (ACA), Azure Container Instances (ACI), and other consumers pull only from authorized repositories within a central registry.

With ABAC, these repository and namespace permission boundaries become explicit and enforceable using standard Microsoft Entra identities and role assignments. This aligns with cloud-native zero trust, supply chain hardening, and least-privilege permissions.

What ABAC in ACR means

ACR registries now support a registry permissions mode called “RBAC Registry + ABAC Repository Permissions.” Configuring a registry to this mode makes it ABAC-enabled. When a registry is ABAC-enabled, registry administrators can optionally add ABAC conditions during standard Azure RBAC role assignments. These optional conditions scope the role assignment's effect to specific repositories or namespace prefixes.
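To make the scoping idea concrete, here is a toy Python model of how a prefix condition narrows a role assignment. It is only an illustration under stated assumptions: the `ScopedAssignment` class, the `is_allowed` helper, the simplified `pull`/`push`/`delete` labels, and the principal and repository names are all hypothetical. The real evaluation is done by Azure using its own condition syntax, not by anything resembling this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedAssignment:
    """Hypothetical stand-in for a role assignment with a prefix condition."""
    principal: str
    role: str
    repo_prefix: str  # assumed condition: repository name must start with this


# Simplified labels, not real Azure action names. The role names are the
# built-in repository roles named in this article.
ROLE_ACTIONS = {
    "Container Registry Repository Reader": {"pull"},
    "Container Registry Repository Writer": {"pull", "push"},
    "Container Registry Repository Contributor": {"pull", "push", "delete"},
}


def is_allowed(assignments, principal, action, repository):
    """Allow the action only if some assignment covers both role and prefix."""
    return any(
        a.principal == principal
        and action in ROLE_ACTIONS[a.role]
        and repository.startswith(a.repo_prefix)
        for a in assignments
    )


assignments = [
    ScopedAssignment("ci-backend", "Container Registry Repository Writer", "backend/"),
    ScopedAssignment("aks-frontend", "Container Registry Repository Reader", "frontend/"),
]

print(is_allowed(assignments, "ci-backend", "push", "backend/api"))     # True
print(is_allowed(assignments, "ci-backend", "push", "frontend/web"))    # False
print(is_allowed(assignments, "aks-frontend", "push", "frontend/web"))  # False
```

The point of the sketch is that the role decides which actions are possible while the condition decides where they apply, so the same pipeline identity can push to its own namespace and nothing else.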
ABAC can be enabled on all new and existing ACR registries across all SKUs, either during registry creation or by reconfiguring an existing registry.

ABAC-enabled built-in roles

Once a registry is ABAC-enabled (configured to “RBAC Registry + ABAC Repository Permissions”), registry admins can use these ABAC-enabled built-in roles to grant repository-scoped permissions:

- Container Registry Repository Reader: grants image pull and metadata read permissions, including tag resolution and referrer discoverability.
- Container Registry Repository Writer: grants Repository Reader permissions, as well as image and tag push permissions.
- Container Registry Repository Contributor: grants Repository Reader and Writer permissions, as well as image and tag delete permissions.

Note that these roles do not grant repository list permissions. The separate Container Registry Repository Catalog Lister role must be assigned to grant repository list permissions. The Container Registry Repository Catalog Lister role does not support ABAC conditions; assigning it grants permissions to list all repositories in a registry.

Important role behavior changes in ABAC mode

When a registry is ABAC-enabled by configuring its permissions mode to “RBAC Registry + ABAC Repository Permissions”:

- Legacy data-plane roles such as AcrPull, AcrPush, and AcrDelete are not honored. Use the ABAC-enabled built-in roles listed above instead.
- Broad roles like Owner, Contributor, and Reader previously granted full control plane and data plane permissions, which is typically an overprivileged role assignment. In ABAC-enabled registries, these broad roles grant only control plane permissions to the registry. They no longer grant data plane permissions, such as image push, pull, delete, or repository list permissions.
- ACR Tasks, Quick Tasks, Quick Builds, and Quick Runs no longer inherit default data-plane access to source registries; assign the ABAC-enabled roles above to the calling identity as needed.

Identities you can assign

ACR ABAC uses standard Microsoft Entra role assignments. Assign RBAC roles with optional ABAC conditions to users, groups, service principals, and managed identities, including AKS kubelet and workload identities, ACA and ACI identities, and more.

Next steps

Start using ABAC repository permissions in ACR to enforce least-privilege artifact push, pull, and delete boundaries across your CI/CD systems and container image workloads. This model is now the recommended approach for multi-tenant platform engineering patterns and central registry deployments. To get started, follow the step-by-step guides in the official ACR ABAC documentation: https://aka.ms/acr/auth/abac

Expanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly – no sign up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below. 📖 Learn more about SRE Agent. 👉 Create your first SRE Agent (Azure login required) What’s New in Azure SRE Agent - October Update The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation—built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit product documentation for the latest updates. Here are a few highlights for this month: Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy from read-only insights to approval-gated actions to full automation without compromising control. Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for AZ CLI and kubectl, it works across all Azure services. 
But it doesn’t stop there—diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment. Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the Agent ingest alerts and trigger workflows that match your team’s existing tools—so you can respond faster, with less manual effort. Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the Agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve—from guided actions to trusted autonomy—without ever giving up control. Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them—accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable. Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps—complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation. 
Getting Started

- Start here: Create a new SRE Agent in the Azure portal (Azure login required)
- Blog: Announcing a flexible, predictable billing model for Azure SRE Agent
- Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview
- Product documentation
- Product home page

Community & Support

We'd love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team.

Leveraging Low Priority Pods for Rapid Scaling in AKS
If you're running workloads in Kubernetes, you'll know that scalability is key to keeping things available and responsive. But there's a problem: when your cluster runs out of resources, the node autoscaler needs to spin up new nodes, and this takes anywhere from 5 to 10 minutes. That's a long time to wait when you're dealing with a traffic spike. One way to handle this is using low priority pods to create buffer nodes that can be preempted when your actual workloads need the resources.

The Problem

Cloud-native applications are dynamic, and workload demands can spike quickly. Automatic scaling helps, but the delay in scaling up nodes when you run out of capacity can leave you vulnerable, especially in production. When a cluster runs out of available nodes, the autoscaler provisions new ones, and during that 5-10 minute wait you're facing:

- Increased Latency: Users experience lag or downtime whilst they're waiting for resources to become available.
- Resource Starvation: High-priority workloads don't get the resources they need, leading to degraded performance or failed tasks.
- Operational Overhead: SREs end up manually intervening to manage resource loads, which takes them away from more important work.

This is enough reason to look at creating spare capacity in your cluster, and that's where low priority pods come in.

The Solution

The idea is pretty straightforward: you run low priority pods in your cluster that don't actually do any real work - they're just placeholders consuming resources. These pods are sized to take up enough space that the cluster autoscaler provisions additional nodes for them. Effectively, you're creating a buffer of "standby" nodes that are ready and waiting.

When your real workloads need resources and the cluster is under pressure, Kubernetes kicks out these low priority pods to make room - this is called preemption. Essentially, Kubernetes looks at what's running, sees the low priority pods, and terminates them to free up the nodes. This happens almost immediately, and your high-priority workloads can use that capacity straight away. Meanwhile, those evicted low priority pods sit in a pending state, which triggers the autoscaler to spin up new nodes to replace the buffer you just used. The whole thing is self-maintaining.

How Preemption Actually Works

When a high-priority pod needs to be scheduled but there aren't enough resources, the Kubernetes scheduler kicks off preemption. This happens almost instantly compared to the 5-10 minute wait for new nodes. Here's what happens:

- Identification: The scheduler works out which low priority pods need to be evicted to make room. It picks the lowest priority pods first.
- Graceful Termination: The selected pods get a termination signal (SIGTERM) and a grace period (30 seconds by default) to shut down cleanly.
- Resource Release: Once the low priority pods terminate, their resources are immediately released and available for scheduling. The high-priority pod can then be scheduled onto the node, typically within seconds.
- Buffer Pod Rescheduling: After preemption, the evicted low priority pods try to reschedule. If there's capacity on existing nodes, they'll land there. If not, they'll sit in a pending state, which triggers the cluster autoscaler to provision new nodes.

This gives you a dual benefit: your critical workloads get immediate access to the nodes that were running low priority pods, and the system automatically replenishes the buffer in the background. Whilst your high-priority workloads are running on the newly freed capacity, the autoscaler is already provisioning replacement nodes for the evicted buffer pods. Your buffer capacity is continuously maintained without any manual work, so you're always ready for the next spike.

The key advantage here is speed. Whilst provisioning a new node takes 5-10 minutes, preempting a low priority pod and scheduling a high-priority pod in its place typically completes in under a minute.
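The identification step can be sketched as a small Python simulation. This is a deliberately simplified model, not the Kubernetes scheduler's actual algorithm: it looks at a single node and CPU only, and ignores memory, PodDisruptionBudgets, affinity, and node selection. The `Pod` class, the `pick_victims` function, and the pod names and sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int
    cpu_millis: int  # CPU request in millicores


def pick_victims(running, incoming, node_capacity_millis):
    """Choose pods to evict so `incoming` fits, lowest priority first.

    Returns [] if the pod already fits, or None if evicting every
    lower-priority pod would still not free enough CPU.
    """
    free = node_capacity_millis - sum(p.cpu_millis for p in running)
    if free >= incoming.cpu_millis:
        return []

    # Only pods with strictly lower priority than the incoming pod qualify.
    candidates = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    victims = []
    for pod in candidates:
        victims.append(pod)
        free += pod.cpu_millis
        if free >= incoming.cpu_millis:
            return victims
    return None


# A 4-core node that is full: one real workload plus two buffer pods.
running = [
    Pod("api", priority=1000, cpu_millis=2000),
    Pod("buffer-1", priority=-1, cpu_millis=1000),
    Pod("buffer-2", priority=-1, cpu_millis=1000),
]
incoming = Pod("checkout", priority=1000, cpu_millis=1500)

victims = pick_victims(running, incoming, node_capacity_millis=4000)
print([p.name for p in victims])  # ['buffer-1', 'buffer-2']
```

In this model the `api` pod is never a candidate because its priority is not lower than the incoming pod's, which is the property that makes buffer pods safe to run alongside real workloads.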
Why This Approach Works Well

Now that you understand how the solution works, let's look at why it's effective:

- Immediate Resource Availability: You maintain a pool of ready nodes that can rapidly scale up when needed. There's always capacity available to handle sudden load spikes without waiting for new nodes.
- Seamless Scaling: High-priority workloads never face resource starvation, even during traffic surges. They get immediate access to capacity, whilst the buffer automatically replenishes itself in the background.
- Self-Maintaining: Once set up, the system handles everything automatically. You don't need to manually manage the buffer or intervene when workloads spike.

The Trade-Off

Whilst low priority pods offer significant advantages for keeping your cluster responsive, you need to understand the cost implications. By maintaining buffer nodes with low priority pods, you're running machines that aren't hosting active, productive workloads. You're paying for additional infrastructure just for availability and responsiveness. These buffer nodes consume compute resources you're paying for, even though they're only running placeholder workloads.

The decision for your organisation comes down to whether the improved responsiveness and elimination of that 5-10 minute scaling delay justifies the extra cost. For production environments with strict SLA requirements or where downtime is expensive, this trade-off is usually worth it. However, you'll want to carefully size your buffer capacity to balance cost with availability needs.

Setting It Up

Step 1: Define Your Low Priority Pod Configurations

Start by defining low priority pods using the PriorityClass resource. This is where you create configurations that designate certain workloads as low priority.
Here's what that configuration looks like:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: low-priority
    value: -1            # Below the default pod priority of 0
    globalDefault: false
    description: "Priority class for buffer pods"
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: buffer-pods
      namespace: default
    spec:
      replicas: 3        # Adjust based on how much buffer capacity you need
      selector:
        matchLabels:
          app: buffer
      template:
        metadata:
          labels:
            app: buffer
        spec:
          priorityClassName: low-priority
          containers:
          - name: buffer-container
            image: registry.k8s.io/pause:3.9   # Lightweight image that does nothing
            resources:
              requests:
                cpu: "1000m"     # Size these based on your typical workload needs
                memory: "2Gi"    # Large enough to trigger node creation
              limits:
                cpu: "1000m"
                memory: "2Gi"

The key things to note here:

- The PriorityClass has a value of -1, which is lower than the priority of 0 that pods receive when they have no PriorityClass and no global default is defined, so regular workloads can preempt the buffer pods. Avoid going below -10: by default the cluster autoscaler treats pods under that priority as expendable and will not add nodes for them.
- We're using a Deployment rather than individual pods so we can easily scale the buffer size.
- The pause image is a minimal container that does basically nothing - perfect for a placeholder.
- The resource requests are what matter - these determine how much space each buffer pod takes up.
- You'll want to size the CPU and memory requests based on your actual workload needs.

Step 2: Deploy the Low Priority Pods

Next, deploy these low priority pods across your cluster. Use affinity settings (for example, pod anti-affinity) to spread them out and let Kubernetes manage them.

Step 3: Monitor and Adjust

You'll want to monitor your deployment to make sure your buffer nodes are scaling up when needed and scaling down during idle periods to save costs. Tools like Prometheus and Grafana work well for monitoring resource usage and pod status so you can refine your setup over time.

Best Practices

Right-Sizing Your Buffer Pods: The resource requests for your low priority pods need careful thought. They need to be big enough to consume sufficient capacity that additional buffer nodes actually get provisioned by the autoscaler.
But they shouldn't be so large that you end up over-provisioning beyond your required buffer size. Think about your typical workload resource requirements and size your buffer pods to create exactly the number of standby nodes you need.

Regular Assessment: Keep assessing your scaling strategies and adjust based on what you're seeing with workload patterns and demands. Monitor how often your buffer pods are getting evicted and whether the buffer size makes sense for your traffic patterns.

Communication and Documentation: Make sure your team understands what low priority pods do in your deployment and what this means for your SLAs. Document the cost of running your buffer nodes and the justification for this overhead.

Automated Alerts: Set up alerts for when pod eviction happens so you can react quickly and make sure critical workloads aren't being affected. Also alert on buffer pod status to ensure your buffer capacity stays available.

Wrapping Up

Leveraging low priority pods to create buffer nodes is an effective way to handle resource constraints when you need rapid scaling and can't afford to wait for the node autoscaler. This approach is particularly valuable if you're dealing with workloads that experience sudden, unpredictable traffic spikes and need to scale up immediately - think scenarios like flash sales, breaking news events, or user-facing applications with strict SLA requirements.

However, this isn't a one-size-fits-all solution. If your workloads are fairly static or you can tolerate the 5-10 minute wait for new nodes to provision, you probably don't need this. The buffer comes at an additional cost since you're running nodes that aren't doing productive work, so you need to weigh whether the improved responsiveness justifies the extra spend for your specific use case.

If you do decide this approach fits your needs, remember to keep monitoring and iterating on your configuration for the best resource management.
By maintaining a buffer of low priority pods, you can address resource scarcity before it becomes a problem, reduce latency, and provide a much better experience for your users. This approach will make your cluster more responsive and free up your operational capacity to focus on improving services instead of constantly firefighting resource issues.

Choosing the Right Azure Containerisation Strategy: AKS, App Service, or Container Apps?
Azure Kubernetes Service (AKS)

What is it?
AKS is Microsoft’s managed Kubernetes offering, providing full access to the Kubernetes API while Azure runs the control plane. It’s designed for teams that want to run complex, scalable, and highly customisable container workloads, with direct control over orchestration, networking, and security.

When to choose AKS:
- You need advanced orchestration, custom networking, or integration with third-party tools.
- Your team has Kubernetes expertise and wants granular control.
- You’re running large-scale, multi-service, or hybrid/multi-cloud workloads.
- You require Windows container support (with some limitations).

Advantages:
- Full Kubernetes API access and ecosystem compatibility.
- Supports both Linux and Windows containers.
- Highly customisable (networking, storage, security, scaling).
- Suitable for complex, stateful, or regulated workloads.

Disadvantages:
- Steeper learning curve; requires Kubernetes knowledge.
- You manage cluster upgrades, scaling, and security patches (though Azure automates much of this).
- Potential for over-provisioning and higher operational overhead.

Azure App Service

What is it?
App Service is a fully managed Platform-as-a-Service (PaaS) for hosting web apps, APIs, and backends. It supports both code and container deployments, but is optimised for web-centric workloads.

When to choose App Service:
- You’re building traditional web apps, REST APIs, or mobile backends.
- You want to deploy quickly with minimal infrastructure management.
- Your team prefers a PaaS experience with built-in scaling, SSL, and CI/CD.
- You need to run Windows containers (with some limitations).

Advantages:
- Easiest to use, minimal configuration, fast deployments.
- Built-in scaling, SSL, custom domains, and staging slots.
- Tight integration with Azure DevOps, GitHub Actions, and other Azure services.
- Handles infrastructure, patching, and scaling for you.

Disadvantages:
- Less flexibility for complex microservices or custom orchestration.
- Limited access to underlying infrastructure and networking.
- Not ideal for event-driven or non-HTTP workloads.

Azure Container Apps

What is it?
Container Apps is a fully managed, serverless container platform built on Kubernetes and open-source tech like Dapr and KEDA. It abstracts away Kubernetes complexity, letting you focus on microservices, event-driven workloads, or background jobs.

When to choose Container Apps:
- You want to run microservices or event-driven workloads without managing Kubernetes.
- You need automatic scaling (including scale to zero) based on HTTP traffic or events.
- You want to use Dapr for service discovery, pub/sub, or state management.
- You’re building modern, cloud-native apps but don’t need direct Kubernetes API access.

Advantages:
- Serverless scaling (including to zero), pay only for what you use.
- Built-in support for microservices patterns, event-driven architectures, and background jobs.
- No cluster management: Azure handles the infrastructure.
- Integrates with Azure DevOps and GitHub Actions, and supports Linux containers from any registry.

Disadvantages:
- No direct access to Kubernetes APIs or custom controllers.
- Linux containers only (no Windows container support).
- Some advanced networking and customisation options are limited compared to AKS.
Key Differences

| Feature | Azure Kubernetes Service (AKS) | Azure App Service | Azure Container Apps |
|---|---|---|---|
| Best for | Complex, scalable, custom workloads | Web apps, APIs, backends | Microservices, event-driven, jobs |
| Management | You manage (with Azure help) | Fully managed | Fully managed, serverless |
| Scaling | Manual/auto (pods, nodes) | Auto (HTTP traffic) | Auto (HTTP/events, scale to zero) |
| API Access | Full Kubernetes API | No infra access | No Kubernetes API |
| OS Support | Linux & Windows | Linux & Windows | Linux only |
| Networking | Advanced, customisable | Basic (web-centric) | Basic, with VNet integration |
| Use Cases | Hybrid/multi-cloud, regulated, large-scale | Web, REST APIs, mobile | Microservices, event-driven, background jobs |
| Learning Curve | Steep (Kubernetes skills needed) | Low | Low-medium |
| Pricing | Pay for nodes (even idle) | Pay for plan (fixed/auto) | Pay for usage (scale to zero) |
| CI/CD Integration | Azure DevOps, GitHub, custom | Azure DevOps, GitHub | Azure DevOps, GitHub |

How to Decide?

- Start with App Service if you’re building a straightforward web app or API and want the fastest path to production.
- Choose Container Apps for modern microservices, event-driven, or background processing workloads where you want serverless scaling and minimal ops.
- Go with AKS when you need full Kubernetes power, advanced customisation, or are running at enterprise scale with a skilled team.

Conclusion

Azure’s containerisation portfolio is broad, but each service is optimised for different scenarios. For most new cloud-native projects, Container Apps offers the best balance of simplicity and power. For web-centric workloads, App Service remains the fastest route. For teams needing full control and scale, AKS is unmatched.

Tip: Start simple, and only move to more complex platforms as your requirements grow. Azure’s flexibility means you can mix and match these services as your architecture evolves.

Public preview: Confidential containers on AKS
We are proud to announce the preview of confidential containers on AKS, which brings confidential computing capabilities to containerized workloads on AKS. This offering provides strong pod-level isolation, memory encryption, and AMD SEV-SNP hardware-based attestation for containerized application code and data while in use, building on the existing security, scalability, and resiliency benefits offered by AKS.
Azure Monitor managed service for Prometheus now includes native Grafana dashboards
We are excited to announce that Azure Monitor managed service for Prometheus now includes native Grafana dashboards within the Azure portal at no additional cost. This integration marks a major milestone in our mission to simplify observability, reducing the administrative overhead and complexity compared to deploying and maintaining your own Grafana instances.

The use of open-source observability tools continues to grow for cloud-native scenarios such as application and infrastructure monitoring using Prometheus metrics and OpenTelemetry logs and traces. For these scenarios, DevOps and SRE teams need streamlined and cost-effective access to industry-standard tooling like Prometheus metrics and Grafana dashboards within their cloud-hosted environments. For many teams, this usually means deploying and managing separate monitoring stacks with some combination of self-hosted or partner-managed Prometheus and Grafana. However, Azure Monitor's latest integration with Grafana provides this capability out of the box by enabling you to view Prometheus metrics and other Azure observability data in Grafana dashboards fully integrated into the Azure portal.

Azure Monitor dashboards with Grafana deliver powerful visualization and data transformation capabilities on Prometheus metrics, Azure resource metrics, logs, and traces stored in Azure Monitor. Pre-built dashboards are included for several key scenarios like Azure Kubernetes Service, Azure Container Apps, Container Insights, and Application Insights.

Why Grafana in Azure portal?

Grafana dashboards are a widely adopted visualization tool used with Prometheus metrics and cloud-native observability tools. Embedding them natively in the Azure portal offers:

- Unified Azure experience: No additional RBAC or network configuration is required; users' Azure login credentials and Azure RBAC are used to access dashboards and data. View Grafana dashboards alongside all your other Azure resources and Azure Monitor views in the same portal.
- No management overhead or compute costs: Dashboards with Grafana use a fully SaaS model built into Azure Monitor, where you do not have to administer the Grafana server or the compute on which it runs.
- Access to community dashboards: Open-source and Grafana community dashboards using Prometheus or Azure Monitor data sources can be imported with no modifications.

These capabilities mean faster troubleshooting, deeper insights, and a more consistent observability platform for Azure-centric workloads.

Figure 1: Dashboards with Grafana landing page in the context of Azure Monitor Workspace in the Azure portal

Getting Started

To get started, enable Managed Prometheus for your AKS cluster, then navigate to the Azure Monitor workspace or AKS cluster in the Azure portal and select Monitoring > Dashboards with Grafana (preview). From this page you can view, edit, create, and import Grafana dashboards. Simply click on one of the pre-built dashboards to get started. You can use these dashboards as provided, or edit and add panels, update visualizations, and create variables to build your own custom dashboards.

With this approach, no Grafana servers or additional Azure resources need to be provisioned or maintained. Teams can quickly leverage and customize Grafana dashboards within the Azure portal, reducing their deployment and management time while still gaining the benefits of dashboards and visualizations to improve monitoring and troubleshooting times.

Figure 2: Kubernetes Compute Resources dashboard being viewed in the context of Azure Monitor Workspace in the Azure portal

When to upgrade to Azure Managed Grafana?

Dashboards with Grafana in the Azure portal cover most common Prometheus scenarios, but Azure Managed Grafana remains the right choice for several advanced use cases, including:

- Extended data source support for non-Azure data sources, e.g. open-source and third-party data stores
- Private networking and advanced authentication options
- Multi-cloud, hybrid, and on-premises data source connectivity

See When to use Azure Managed Grafana for more details. Get started with Azure Monitor dashboards with Grafana today.

Generally Available - High scale mode in Azure Monitor - Container Insights
Container Insights is Azure Monitor’s solution for collecting logs from your Azure Kubernetes Service (AKS) clusters. As the adoption of AKS continues to grow, we are seeing an increasing number of customers with log scaling needs that hit the limits of log collection in Container Insights. Last August, we announced the public preview of High Scale mode in Container Insights to help customers achieve a higher log collection throughput from their AKS clusters. Today, we are happy to announce the General Availability of High Scale mode.

High Scale mode is ideal for customers approaching or above 10,000 logs/sec from a single node. When High Scale mode is enabled, Container Insights makes several configuration changes that lead to a higher overall throughput. These include using a more powerful agent setup, using a different data pipeline, allocating more memory for the agent, and more. All these changes are made in the background by the service and do not require input or configuration from customers. High Scale mode impacts only the data collection layer (with a new DCR); the rest of the experience remains the same. Data flows to our existing tables, and your queries and alerts work as before.

High Scale mode is available to all customers. Today, High Scale mode is turned off by default. In the future, we plan to enable High Scale mode by default for all customers to reduce the chances of log loss when workloads scale. To get started with High Scale mode, please see our documentation at https://aka.ms/cihsmode

Securing Cloud Shell Access to AKS
Azure Cloud Shell is an online shell hosted by Microsoft that provides instant access to a command-line interface, enabling users to manage Azure resources without needing local installations. Cloud Shell comes equipped with popular tools and programming languages, including Azure CLI, PowerShell, and the Kubernetes command-line tool (kubectl).

Using Cloud Shell can provide several benefits for administrators who need to work with AKS, especially if they need quick access from anywhere or are in locked-down environments:

- Immediate Access: There’s no need for local setup; you can start managing Azure resources directly from your web browser.
- Persistent Storage: Cloud Shell offers a file share in Azure, keeping your scripts and files accessible across multiple sessions.
- Pre-Configured Environment: It includes built-in tools, saving time on installation and configuration.

The Challenge of Connecting to AKS

By default, Cloud Shell traffic to AKS originates from a random Microsoft-managed IP address, rather than from within your network. As a result, the AKS API server must be publicly accessible with no IP restrictions, which poses a security risk as anyone on the internet can attempt to reach it. While credentials are still required, restricting access to the API server significantly enhances security. Fortunately, there are ways to lock down the API server while still enabling access via Cloud Shell, which we’ll explore in the rest of this article.

Options for Securing Cloud Shell Access to AKS

Several approaches can be taken to secure access to your AKS cluster while using Cloud Shell:

IP Allow Listing

On AKS clusters with a public API server, it is possible to lock down access to the API server with an IP allow list. Each Cloud Shell instance is given a randomly selected outbound IP from the Azure address space whenever a new session is deployed. This means we cannot allow access to these IPs in advance, but we can add the IP once our session is running, and it will work for the duration of that session. Below is an example script that you could run from Cloud Shell to check the current outbound IP address and add it to your AKS cluster's authorised IP list.

#!/usr/bin/env bash
set -euo pipefail

# Arguments: <resource-group> <aks-cluster-name>
RG="${1:?resource group required}"
AKS="${2:?AKS cluster name required}"

# Look up the current outbound public IP of this Cloud Shell session
IP="$(curl -fsS https://api.ipify.org)"
echo "Adding ${IP} to allow list"

# Read the existing authorised ranges, one per line, dropping blank lines
CUR="$(az aks show -g "$RG" -n "$AKS" --query "apiServerAccessProfile.authorizedIpRanges" -o tsv | tr '\t' '\n' | awk 'NF')"

# Merge the existing ranges with the new /32, drop blanks and duplicates, join with commas
NEW="$(printf "%s\n%s/32\n" "$CUR" "$IP" | awk 'NF' | sort -u | paste -sd, -)"

if az aks update -g "$RG" -n "$AKS" --api-server-authorized-ip-ranges "$NEW" >/dev/null; then
  echo "IP ${IP} applied successfully"
else
  echo "Failed to apply IP ${IP}" >&2
  exit 1
fi

This method comes with some caveats:

- The users running the script need to be granted permission to update the authorised IP ranges in AKS. This permission could be used to add any IP address.
- The script needs to be run each time a Cloud Shell session is created, and can take a few minutes to run.
- The script only deals with adding IPs to the allow list. You also need to implement a process to remove these IPs on a regular basis to avoid building up a long list of IPs that are no longer needed.
- Adding Cloud Shell IPs in bulk, through Service Tags or similar, will result in your API server being accessible to a much larger range of IP addresses, and should be avoided.

Command Invoke

Azure provides a feature known as Command Invoke that allows you to send commands to be run in AKS, without the need for direct network connectivity. This method executes a container within AKS to run your command and then returns the result, and works well from within Cloud Shell. This is probably the simplest approach that works with a locked-down API server, and the quickest to implement.
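As a rough illustration of what Command Invoke looks like from Cloud Shell, the sketch below wraps a kubectl command in az aks command invoke. The helper name aks_invoke, the DRY_RUN switch, and the resource names in the usage line are hypothetical; only the az aks command invoke call with its --resource-group, --name, and --command flags comes from the Azure CLI.

```shell
#!/usr/bin/env bash
# aks_invoke RESOURCE_GROUP CLUSTER_NAME COMMAND...
# Hypothetical helper: runs COMMAND inside the cluster via Command Invoke.
# Set DRY_RUN=1 to print the az command instead of executing it.
aks_invoke() {
  local rg="$1" aks="$2"
  shift 2
  # Everything after the first two arguments becomes the in-cluster command
  local cmd=(az aks command invoke --resource-group "$rg" --name "$aks" --command "$*")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage (placeholder names):
#   aks_invoke my-rg my-aks kubectl get pods -n kube-system
```

With DRY_RUN=1 set, the helper only prints the command it would run, which is a convenient way to check the wrapping before sending anything to the cluster.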
However, there are some downsides:

- Commands take longer to run. When you execute a command, AKS needs to start a container, execute the command, and then return the result.
- You only get exitCode and text output, and you lose API-level details.
- All commands must be run within the context of the az aks command invoke CLI command, making them much longer and more complex to execute than with direct kubectl access.

Command Invoke can be a practical solution for occasional access to AKS, especially when the cost or complexity of alternative methods isn't justified. However, its user experience may fall short if relied upon as a daily tool.

Further Details: Access a private Azure Kubernetes Service (AKS) cluster using the command invoke or Run command feature - Azure Kubernetes Service | Microsoft Learn

Cloud Shell vNet Integration

It is possible to deploy Cloud Shell into a virtual network (vNet), allowing it to route traffic via the vNet. Cloud Shell can then reach resources over the private network or through Private Endpoints, and can reach public resources through a NAT Gateway or Firewall that gives it a consistent outbound IP address. This approach uses Azure Relay to provide secure access to the vNet from Cloud Shell, without the need to open additional ports. Using Cloud Shell in this way does introduce additional cost for the Azure Relay service.

This solution requires two different approaches, depending on whether you are using a private or public API server:

- When using a private API server, which is either directly connected to the vNet or configured with Private Endpoints, Cloud Shell will be able to connect directly to the private IP of this service over the vNet.
- When using a public API server, with a public IP, traffic for this will still leave the vNet and go to the internet. The benefit is that we can control the public IP used for the outbound traffic using a NAT Gateway or Azure Firewall. Once this is configured, we can then allow-list this fixed IP on the AKS API server authorised IP ranges.

Further Details: Use Cloud Shell in an Azure virtual network | Microsoft Learn

Azure Bastion

Azure Bastion provides secure and seamless RDP and SSH connectivity to your virtual machines (VMs) directly from the Azure portal, without exposing them to the public internet. Recently, Bastion has also added support for direct connection to AKS with SSH, rather than needing to connect to a jump box and then use kubectl from there. This greatly simplifies connecting to AKS, and also reduces the cost.

Using this approach, we can deploy a Bastion into the vNet hosting AKS. From Cloud Shell we can then use the following command to create a tunnel to AKS:

az aks bastion --name <aks name> --resource-group <resource group name> --bastion <bastion resource ID>

Once this tunnel is connected, we can run kubectl commands without any need for further configuration.

As with Cloud Shell vNet integration, we take two slightly different approaches depending on whether the API server is public or private:

- When using a private API server, which is either directly connected to the vNet or configured with Private Endpoints, Cloud Shell sessions connected via Bastion will be able to connect directly to the private IP of this service over the vNet.
- When using a public API server, with a public IP, traffic for this will still leave the vNet and go to the internet. As with Cloud Shell vNet integration, we can configure this to use a static outbound IP and allow-list it on the API server. With Bastion, we can still use NAT Gateway or Azure Firewall to achieve this, but we can also allow-list the public IP assigned to the Bastion, removing the cost of NAT Gateway or Azure Firewall if these are not required for anything else.

Connecting to AKS directly from Bastion requires the Standard or Premium SKU of Bastion, which has an additional cost over the Developer or Basic SKU. This feature also requires that you enable native client support.

Further details: Connect to AKS Private Cluster Using Azure Bastion (Preview) - Azure Bastion | Microsoft Learn

Summary of Options

IP Allow Listing

The outbound IP addresses for Cloud Shell instances can be added to the authorised IP list for your API server. As these IPs are dynamically assigned to sessions, they need to be added at runtime, to avoid adding a large list of IPs and reducing security. This can be achieved with a script. While easy to implement, this requires additional time to run the script with every new session, and increases the overhead of managing the authorised IP list to remove unused IPs.

Command Invoke

Command Invoke allows you to run commands against AKS without requiring direct network access or any setup. This is a convenient option for occasional tasks or troubleshooting, but it’s not designed for regular use due to its limited user experience and flexibility.

Cloud Shell vNet Integration

This approach connects Cloud Shell directly to your virtual network, enabling secure access to AKS resources. It’s well-suited for environments where Cloud Shell is the primary access method and offers a more secure and consistent experience than default configurations. It does involve additional cost for Azure Relay.

Azure Bastion

Azure Bastion provides a secure tunnel to AKS that can be used from Cloud Shell or by users running the CLI locally. It offers strong security by eliminating public exposure of the API server and supports flexible access for different user scenarios, though it does require setup and may incur additional cost.

Cloud Shell is a great tool for providing pre-configured, easily accessible CLI instances, but in the default configuration it can require some security compromises. With a little work, it is possible to make Cloud Shell work with a more secure configuration that limits how much exposure is needed for your AKS API server.
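For the IP allow listing option, the clean-up of stale entries mentioned above could be sketched as follows. This is a minimal illustration, not a complete process: the function name remove_ip is hypothetical, it assumes entries were added as /32 ranges as in the earlier script, and the az aks update call is shown only as a comment.

```shell
#!/usr/bin/env bash
# remove_ip COMMA_SEPARATED_RANGES IP
# Hypothetical helper: prints the list with the IP's /32 entry removed.
remove_ip() {
  local list="$1" ip="$2"
  printf '%s\n' "$list" \
    | tr ',' '\n' \
    | awk -v drop="${ip}/32" 'NF && $0 != drop' \
    | paste -sd, -
}

# Possible use at the end of a session, reusing RG, AKS, CUR and IP from
# the earlier script (CUR joined with commas first):
#   NEW="$(remove_ip "$CUR_CSV" "$IP")"
#   # An empty value disables authorised IP ranges entirely, so skip the update then
#   [ -n "$NEW" ] && az aks update -g "$RG" -n "$AKS" --api-server-authorized-ip-ranges "$NEW"
```

The guard against an empty list matters because passing an empty range to --api-server-authorized-ip-ranges turns the restriction off rather than blocking all access.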