Azure Kubernetes Service
Expanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly – no sign up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below.

📖 Learn more about SRE Agent.
👉 Create your first SRE Agent (Azure login required)

What’s New in Azure SRE Agent - October Update

The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation—built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit the product documentation for the latest updates. Here are a few highlights for this month:

- Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy, from read-only insights to approval-gated actions to full automation, without compromising control.
- Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for AZ CLI and kubectl, it works across all Azure services. But it doesn’t stop there—diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment.
- Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the Agent ingest alerts and trigger workflows that match your team’s existing tools—so you can respond faster, with less manual effort.
- Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the Agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve—from guided actions to trusted autonomy—without ever giving up control.
- Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them—accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable.
- Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps—complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation.

Getting Started

- Start here: Create a new SRE Agent in the Azure portal (Azure login required)
- Blog: Announcing a flexible, predictable billing model for Azure SRE Agent
- Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview
- Product documentation
- Product home page

Community & Support

We’d love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team.

Choosing the Right Azure Containerisation Strategy: AKS, App Service, or Container Apps?
Azure Kubernetes Service (AKS)

What is it? AKS is Microsoft’s managed Kubernetes offering, providing full access to the Kubernetes API and control plane. It’s designed for teams that want to run complex, scalable, and highly customisable container workloads, with direct control over orchestration, networking, and security.

When to choose AKS:
- You need advanced orchestration, custom networking, or integration with third-party tools.
- Your team has Kubernetes expertise and wants granular control.
- You’re running large-scale, multi-service, or hybrid/multi-cloud workloads.
- You require Windows container support (with some limitations).

Advantages:
- Full Kubernetes API access and ecosystem compatibility.
- Supports both Linux and Windows containers.
- Highly customisable (networking, storage, security, scaling).
- Suitable for complex, stateful, or regulated workloads.

Disadvantages:
- Steeper learning curve; requires Kubernetes knowledge.
- You manage cluster upgrades, scaling, and security patches (though Azure automates much of this).
- Potential for over-provisioning and higher operational overhead.

Azure App Service

What is it? App Service is a fully managed Platform-as-a-Service (PaaS) for hosting web apps, APIs, and backends. It supports both code and container deployments, but is optimised for web-centric workloads.

When to choose App Service:
- You’re building traditional web apps, REST APIs, or mobile backends.
- You want to deploy quickly with minimal infrastructure management.
- Your team prefers a PaaS experience with built-in scaling, SSL, and CI/CD.
- You need to run Windows containers (with some limitations).

Advantages:
- Easiest to use, minimal configuration, fast deployments.
- Built-in scaling, SSL, custom domains, and staging slots.
- Tight integration with Azure DevOps, GitHub Actions, and other Azure services.
- Handles infrastructure, patching, and scaling for you.

Disadvantages:
- Less flexibility for complex microservices or custom orchestration.
- Limited access to underlying infrastructure and networking.
- Not ideal for event-driven or non-HTTP workloads.

Azure Container Apps

What is it? Container Apps is a fully managed, serverless container platform built on Kubernetes and open-source tech like Dapr and KEDA. It abstracts away Kubernetes complexity, letting you focus on microservices, event-driven workloads, or background jobs.

When to choose Container Apps:
- You want to run microservices or event-driven workloads without managing Kubernetes.
- You need automatic scaling (including scale to zero) based on HTTP traffic or events.
- You want to use Dapr for service discovery, pub/sub, or state management.
- You’re building modern, cloud-native apps but don’t need direct Kubernetes API access.

Advantages:
- Serverless scaling (including to zero), pay only for what you use.
- Built-in support for microservices patterns, event-driven architectures, and background jobs.
- No cluster management—Azure handles the infrastructure.
- Integrates with Azure DevOps and GitHub Actions, and supports Linux containers from any registry.

Disadvantages:
- No direct access to Kubernetes APIs or custom controllers.
- Linux containers only (no Windows container support).
- Some advanced networking and customisation options are limited compared to AKS.
Key Differences

| Feature | Azure Kubernetes Service (AKS) | Azure App Service | Azure Container Apps |
|---|---|---|---|
| Best for | Complex, scalable, custom workloads | Web apps, APIs, backends | Microservices, event-driven, jobs |
| Management | You manage (with Azure help) | Fully managed | Fully managed, serverless |
| Scaling | Manual/auto (pods, nodes) | Auto (HTTP traffic) | Auto (HTTP/events, scale to zero) |
| API Access | Full Kubernetes API | No infra access | No Kubernetes API |
| OS Support | Linux & Windows | Linux & Windows | Linux only |
| Networking | Advanced, customisable | Basic (web-centric) | Basic, with VNet integration |
| Use Cases | Hybrid/multi-cloud, regulated, large-scale | Web, REST APIs, mobile | Microservices, event-driven, background jobs |
| Learning Curve | Steep (Kubernetes skills needed) | Low | Low-medium |
| Pricing | Pay for nodes (even idle) | Pay for plan (fixed/auto) | Pay for usage (scale to zero) |
| CI/CD Integration | Azure DevOps, GitHub, custom | Azure DevOps, GitHub | Azure DevOps, GitHub |

How to Decide?

- Start with App Service if you’re building a straightforward web app or API and want the fastest path to production.
- Choose Container Apps for modern microservices, event-driven, or background processing workloads where you want serverless scaling and minimal ops.
- Go with AKS when you need full Kubernetes power, advanced customisation, or are running at enterprise scale with a skilled team.

Conclusion

Azure’s containerisation portfolio is broad, but each service is optimised for different scenarios. For most new cloud-native projects, Container Apps offers the best balance of simplicity and power. For web-centric workloads, App Service remains the fastest route. For teams needing full control and scale, AKS is unmatched.

Tip: Start simple, and only move to more complex platforms as your requirements grow. Azure’s flexibility means you can mix and match these services as your architecture evolves.
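To make the comparison concrete, here is a rough sketch of the quickest path to a running container on each service using the Azure CLI. All resource names are placeholders, the sample image is Microsoft's public hello-world container, and some parameters (particularly for az webapp create and the containerapp extension) vary between CLI versions, so treat this as a starting point rather than a recipe.

```bash
# Placeholder names throughout; adjust regions, plans and images to suit.

# Azure Container Apps: "az containerapp up" creates the environment and app in one step.
az containerapp up --resource-group demo-rg --name demo-aca \
  --image mcr.microsoft.com/azuredocs/containerapps-helloworld:latest \
  --ingress external --target-port 80

# Azure App Service: create a Linux plan, then a containerised web app.
az appservice plan create --resource-group demo-rg --name demo-plan --is-linux --sku B1
az webapp create --resource-group demo-rg --plan demo-plan --name demo-webapp \
  --deployment-container-image-name mcr.microsoft.com/azuredocs/containerapps-helloworld:latest

# AKS: create a small cluster, then deploy with kubectl.
az aks create --resource-group demo-rg --name demo-aks --node-count 2 --generate-ssh-keys
az aks get-credentials --resource-group demo-rg --name demo-aks
kubectl create deployment hello --image=mcr.microsoft.com/azuredocs/containerapps-helloworld:latest
```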
Public preview: Confidential containers on AKS

We are proud to announce the preview of confidential containers on AKS, which brings confidential computing capabilities to containerized workloads on AKS. This offering provides strong pod-level isolation, memory encryption, and AMD SEV-SNP hardware-based attestation for containerized application code and data while in use, building upon the existing security, scalability and resiliency benefits offered by AKS.
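For orientation, the preview is enabled per node pool and selected per pod through a runtime class. The sketch below reflects my reading of the preview documentation rather than anything stated in this announcement: the workload runtime value, VM size and runtime class name are assumptions and may have changed, so verify them against the current docs.

```bash
# Assumed preview workflow (verify flag values against the current documentation):
# add a node pool that uses the confidential container (Kata CC) runtime on an
# AMD SEV-SNP capable VM size.
az aks nodepool add \
  --resource-group demo-rg \
  --cluster-name demo-aks \
  --name confpool \
  --workload-runtime KataCcIsolation \
  --node-vm-size Standard_DC4as_cc_v5

# Pods then opt in by setting an assumed runtime class in their spec:
#   spec:
#     runtimeClassName: kata-cc-isolation
```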
Securing Cloud Shell Access to AKS

Azure Cloud Shell is an online shell hosted by Microsoft that provides instant access to a command-line interface, enabling users to manage Azure resources without needing local installations. Cloud Shell comes equipped with popular tools and programming languages, including Azure CLI, PowerShell, and the Kubernetes command-line tool (kubectl).

Using Cloud Shell can provide several benefits for administrators who need to work with AKS, especially if they need quick access from anywhere, or are in locked down environments:

- Immediate Access: There’s no need for local setup; you can start managing Azure resources directly from your web browser.
- Persistent Storage: Cloud Shell offers a file share in Azure, keeping your scripts and files accessible across multiple sessions.
- Pre-Configured Environment: It includes built-in tools, saving time on installation and configuration.

The Challenge of Connecting to AKS

By default, Cloud Shell traffic to AKS originates from a random Microsoft-managed IP address, rather than from within your network. As a result, the AKS API server must be publicly accessible with no IP restrictions, which poses a security risk as anyone on the internet can attempt to reach it. While credentials are still required, restricting access to the API server significantly enhances security. Fortunately, there are ways to lock down the API server while still enabling access via Cloud Shell, which we’ll explore in the rest of this article.

Options for Securing Cloud Shell Access to AKS

Several approaches can be taken to secure access to your AKS cluster while using Cloud Shell:

IP Allow Listing

On AKS clusters with a public API server, it is possible to lock down access to the API server with an IP allow list. Each Cloud Shell instance has a randomly selected outbound IP coming from the Azure address space whenever a new session is deployed. This means we cannot allow access to these IPs in advance, but we can apply them once our session is running, and this will work for the duration of our session. Below is an example script that you could run from Cloud Shell to check the current outbound IP address and allow it on your AKS cluster's authorised IP list.

```bash
#!/usr/bin/env bash
set -euo pipefail

RG="$1"; AKS="$2"

IP="$(curl -fsS https://api.ipify.org)"
echo "Adding ${IP} to allow list"

CUR="$(az aks show -g "$RG" -n "$AKS" --query "apiServerAccessProfile.authorizedIpRanges" -o tsv | tr '\t' '\n' | awk 'NF')"
NEW="$(printf "%s\n%s/32\n" "$CUR" "$IP" | sort -u | paste -sd, -)"

if az aks update -g "$RG" -n "$AKS" --api-server-authorized-ip-ranges "$NEW" >/dev/null; then
  echo "IP ${IP} applied successfully"
else
  echo "Failed to apply IP ${IP}" >&2
  exit 1
fi
```

This method comes with some caveats:

- The users running the script would need to be granted permission to update the authorised IP ranges in AKS - this permission could be used to add any IP address.
- This script will need to be run each time a Cloud Shell session is created, and can take a few minutes to run.
- The script only deals with adding IPs to the allow list; you would also need to implement a process to remove these IPs on a regular basis to avoid building up a long list of IPs that are no longer needed (see the sketch after this list).
- Adding Cloud Shell IPs in bulk, through Service Tags or similar, will result in your API server being accessible to a much larger range of IP addresses, and should be avoided.
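One way to handle that cleanup is to periodically reset the allow list to a known baseline of corporate ranges, discarding any temporary session IPs that were added along the way. A minimal sketch, assuming the baseline CIDRs and resource names are placeholders you would replace:

```bash
#!/usr/bin/env bash
set -euo pipefail

RG="$1"; AKS="$2"

# Baseline ranges that should always be allowed (placeholder values).
BASELINE="203.0.113.0/24,198.51.100.10/32"

# Overwrite the authorised IP ranges, dropping any temporary Cloud Shell IPs.
az aks update -g "$RG" -n "$AKS" --api-server-authorized-ip-ranges "$BASELINE"
```

Running this on a schedule, for example from a pipeline or automation account outside the hours when Cloud Shell sessions are expected, keeps the allow list from growing indefinitely.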
Command Invoke

Azure provides a feature known as Command Invoke that allows you to send commands to be run in AKS, without the need for direct network connectivity. This method executes a container within AKS to run your command and then return the result, and works well from within Cloud Shell. This is probably the simplest approach that works with a locked down API server, and the quickest to implement. However, there are some downsides:

- Commands take longer to run - when you execute the command, it needs to run a container in AKS, execute the command and then return the result.
- You only get an exit code and text output, and you lose API-level details.
- All commands must be run within the context of the az aks command invoke CLI command, making commands much longer and more complex to execute than direct access with kubectl.

Command Invoke can be a practical solution for occasional access to AKS, especially when the cost or complexity of alternative methods isn't justified. However, its user experience may fall short if relied upon as a daily tool.

Further Details: Access a private Azure Kubernetes Service (AKS) cluster using the command invoke or Run command feature - Azure Kubernetes Service | Microsoft Learn

Cloud Shell vNet Integration

It is possible to deploy Cloud Shell into a virtual network (vNet), allowing it to route traffic via the vNet and so access resources over the private network, through Private Endpoints, or even public resources, while using a NAT Gateway or Firewall for a consistent outbound IP address. This approach uses Azure Relay to provide secure access to the vNet from Cloud Shell, without the need to open additional ports. Using Cloud Shell in this way does introduce additional cost for the Azure Relay service.

Using this solution will require two different approaches, depending on whether you are using a private or public API server:

- When using a private API server, which is either directly connected to the vNet or configured with Private Endpoints, Cloud Shell will be able to connect directly to the private IP of this service over the vNet.
- When using a public API server, with a public IP, traffic for this will still leave the vNet and go to the internet. The benefit is that we can control the public IP used for the outbound traffic using a NAT Gateway or Azure Firewall. Once this is configured, we can then allow-list this fixed IP on the AKS API server authorised IP ranges.

Further Details: Use Cloud Shell in an Azure virtual network | Microsoft Learn

Azure Bastion

Azure Bastion provides secure and seamless RDP and SSH connectivity to your virtual machines (VMs) directly from the Azure portal, without exposing them to the public internet. Recently, Bastion has also added support for direct connection to AKS with SSH, rather than needing to connect to a jump box and then use kubectl from there. This greatly simplifies connecting to AKS, and also reduces the cost. Using this approach, we can deploy a Bastion into the vNet hosting AKS. From Cloud Shell we can then use the following command to create a tunnel to AKS.

```bash
az aks bastion --name <aks name> --resource-group <resource group name> --bastion <bastion resource ID>
```

Once this tunnel is connected, we can run kubectl commands without any need for further configuration.
As with Cloud Shell vNet integration, we take two slightly different approaches depending on whether the API server is public or private:

- When using a private API server, which is either directly connected to the vNet or configured with Private Endpoints, Cloud Shell sessions connected via Bastion will be able to connect directly to the private IP of this service over the vNet.
- When using a public API server, with a public IP, traffic for this will still leave the vNet and go to the internet. As with Cloud Shell vNet integration, we can configure this to use a static outbound IP and allow-list this on the API server. Using Bastion, we can still use a NAT Gateway or Azure Firewall to achieve this; however, you can also allow-list the public IP assigned to the Bastion, removing the cost of a NAT Gateway or Azure Firewall if these are not required for anything else.

Connecting to AKS directly from Bastion requires the Standard or Premium SKU of Bastion, which does have additional cost over the Developer or Basic SKU. This feature also requires that you enable native client support.

Further details: Connect to AKS Private Cluster Using Azure Bastion (Preview) - Azure Bastion | Microsoft Learn

Summary of Options

IP Allow Listing

The outbound IP addresses for Cloud Shell instances can be added to the authorised IP list for your API server. As these IPs are dynamically assigned to sessions, they would need to be added at runtime, to avoid adding a large list of IPs and reducing security. This can be achieved with a script. While easy to implement, this requires additional time to run the script with every new session, and increases the overhead of managing the authorised IP list to remove unused IPs.

Command Invoke

Command Invoke allows you to run commands against AKS without requiring direct network access or any setup. This is a convenient option for occasional tasks or troubleshooting, but it’s not designed for regular use due to its limited user experience and flexibility.

Cloud Shell vNet Integration

This approach connects Cloud Shell directly to your virtual network, enabling secure access to AKS resources. It’s well-suited for environments where Cloud Shell is the primary access method and offers a more secure and consistent experience than default configurations. It does involve additional cost for Azure Relay.

Azure Bastion

Azure Bastion provides a secure tunnel to AKS that can be used from Cloud Shell or by users running the CLI locally. It offers strong security by eliminating public exposure of the API server and supports flexible access for different user scenarios, though it does require setup and may incur additional cost.

Cloud Shell is a great tool for providing pre-configured, easily accessible CLI instances, but in the default configuration it can require some security compromises. With a little work, it is possible to make Cloud Shell work with a more secure configuration that limits how much exposure is needed for your AKS API server.

Announcing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.

Private Pod Subnets in AKS Without Overlay Networking
When deploying AKS clusters, a common concern is the amount of IP address space required. If you are deploying your AKS cluster into your corporate network, the IP address space you can obtain may be quite small, which can limit the number of pods you are able to deploy.

The simplest and most common solution to this is to use an overlay network, which is fully supported in AKS. In an overlay network, pods are deployed to a private, non-routed address space that can be as large as you want. Translation between the routable and non-routed networks is handled by AKS. For most people, this is the best option for dealing with IP addressing in AKS, and there is no need to complicate things further.

However, there are some limitations with overlay networking, primarily that you cannot address the pods directly from the rest of the network - all inbound communication must go via services. There are also some advanced features that are not supported, such as Virtual Nodes. If you are in a scenario where you need some of these features, and overlay networking will not work for you, it is possible to use the more traditional vNet-based deployment method, with some tweaks.

Azure CNI Pod Subnet

The alternative to using the Azure CNI Overlay is to use Azure CNI Pod Subnet. In this setup, you deploy a vNet with two subnets - one for your nodes and one for pods. You are in control of the IP address configuration for these subnets. To conserve IP addresses, you can create your pod subnet using an IP range that is not routable to the rest of your corporate network, allowing you to make it as large as you like. The node subnet remains routable from your corporate network. In this setup, if you want to talk to the pods directly, you would need to do so from within the AKS vNet, or peer another network to your pod subnet. You would not be able to address these pods from the rest of your corporate network, even though you are not using overlay networking.
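As a rough sketch of how such a cluster is provisioned, Azure CNI Pod Subnet is selected by pointing the cluster at separate node and pod subnets. The resource names below are placeholders, and the address ranges are simply chosen to line up with the masquerade configuration shown later in this article (node subnet and service CIDR inside 10.0.0.0/16, pods in 192.168.0.0/16).

```bash
# Placeholder names; a small routable node subnet and a large, non-routed pod subnet.
az network vnet create --resource-group demo-rg --name aks-vnet \
  --address-prefixes 10.0.0.0/17 192.168.0.0/16
az network vnet subnet create --resource-group demo-rg --vnet-name aks-vnet \
  --name nodes --address-prefixes 10.0.0.0/24
az network vnet subnet create --resource-group demo-rg --vnet-name aks-vnet \
  --name pods --address-prefixes 192.168.0.0/16

# Create the cluster with Azure CNI in pod subnet mode; the service CIDR sits
# next to the node ranges so that 10.0.0.0/16 covers both.
az aks create --resource-group demo-rg --name demo-aks \
  --network-plugin azure \
  --vnet-subnet-id "$(az network vnet subnet show -g demo-rg --vnet-name aks-vnet -n nodes --query id -o tsv)" \
  --pod-subnet-id "$(az network vnet subnet show -g demo-rg --vnet-name aks-vnet -n pods --query id -o tsv)" \
  --service-cidr 10.0.128.0/17 \
  --dns-service-ip 10.0.128.10 \
  --generate-ssh-keys
```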
The Routing Problem

When you deploy a setup using Azure CNI Pod Subnet, all the subnets in the vNet are configured with routes and can talk to each other. You may wish to connect this vNet to other Azure vNets via peering, or to your corporate network using ExpressRoute or VPN. However, where you will encounter an issue is if your pods try to connect to resources outside of your AKS vNet but inside your corporate network, or in any peered Azure vNets (which are not peered to this isolated subnet). In this scenario, the pods will route their traffic directly out of the vNet using their private IP address. This private IP is not a valid, routable IP, so the resources on the other network will not be able to reply, and the request will fail.

IP Masquerading

To resolve this issue, we need a way to have traffic going to other networks present a private IP that is routable within the network. This can be achieved through several methods. One method would be to introduce a separate solution for routing this traffic, such as Azure Firewall or another Network Virtual Appliance (NVA). Traffic is routable between the pod and node subnets, so the pod can send its requests to the firewall, and the requests to the remote network then come from the IP of the firewall, which is routable. This solution will work, but it does require another resource to be deployed, with additional costs. If you are already using an Azure Firewall for outbound traffic, then this may be something you could use, but we are looking for a simpler and more cost-effective solution.

Rather than implementing another device to present a routable IP, we can use the nodes of our AKS cluster. The AKS nodes are in the routable node subnet, so ideally we want outbound traffic from the pods to use the node IP when it needs to leave the vNet to go to the rest of the private network. There are several different ways you could achieve this goal. You could look at using Egress Gateway services through tools like Istio, or you could look at making changes to the iptables configuration on the nodes using a DaemonSet. In this article, we will focus on using a tool called ip-masq-agent-v2. This tool provides a means for traffic to "masquerade" as coming from the IP address of the node it is running on and have the node perform Network Address Translation (NAT).

If you deploy a cluster with an overlay network, this tool is already deployed and configured on your cluster; it is the tool that Microsoft uses to configure NAT for traffic leaving the overlay network. When using pod subnet clusters, this tool is not deployed, but you can deploy it yourself to provide the same functionality. Under the hood, this tool is making changes to iptables using a DaemonSet that runs on each node, so you could replicate this behaviour yourself—but this provides a simpler process that has been tested with AKS through overlay networking. The Microsoft v2 version is based on the original Kubernetes contribution, aiming to solve more specific networking cases, allow for more configuration options, and improve observability.

Deploy ip-masq-agent-v2

There are two parts to deploying the agent. First, we deploy the agent itself, which runs as a DaemonSet, spawning a pod on each node in the cluster. This is important, as each node needs to have its iptables altered by the tool, and it needs to run any time a new node is created. To deploy the agent, we need to create the DaemonSet in our cluster. The ip-masq-agent-v2 repo includes several examples, including an example of deploying the DaemonSet. The example is slightly out of date on the version of ip-masq-agent-v2 to use, so make sure you update this to the latest version. If you would prefer to build and manage your own containers for this, the repository also includes a Dockerfile to allow you to do this.

Below is an example deployment using the Microsoft-hosted images. It references the ConfigMap we will create in the next step, and it is important that the same name is used as is referenced here.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ip-masq-agent
  namespace: kube-system
  labels:
    component: ip-masq-agent
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: ip-masq-agent
  template:
    metadata:
      labels:
        k8s-app: ip-masq-agent
    spec:
      hostNetwork: true
      containers:
        - name: ip-masq-agent
          image: mcr.microsoft.com/aks/ip-masq-agent-v2:v0.1.15
          imagePullPolicy: Always
          securityContext:
            privileged: false
            capabilities:
              add: ["NET_ADMIN", "NET_RAW"]
          volumeMounts:
            - name: ip-masq-agent-volume
              mountPath: /etc/config
              readOnly: true
      volumes:
        - name: ip-masq-agent-volume
          projected:
            sources:
              - configMap:
                  name: ip-masq-agent-config
                  optional: true
                  items:
                    - key: ip-masq-agent
                      path: ip-masq-agent
                      mode: 0444
```

Once you deploy this DaemonSet, you should see instances of the agent running on each node in your cluster.
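A quick check, using the namespace and labels from the manifest above, confirms that one agent pod is scheduled and ready on every node:

```bash
# One pod per node should be running.
kubectl -n kube-system get daemonset ip-masq-agent
kubectl -n kube-system get pods -l k8s-app=ip-masq-agent -o wide
```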
Create Configuration

Next, we need to create a ConfigMap that contains any configuration we need to vary from the defaults deployed with the agent. The main thing we need to configure is the set of IP ranges that will be masqueraded behind the node IP. The default deployment of ip-masq-agent-v2 disables masquerading for all three private IP ranges specified by RFC 1918 (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16). In our example above, this will therefore not masquerade traffic to the 10.1.64.0/18 subnet in the app network, and our routing problem will still exist.

We need to amend the configuration so that these private IPs are masqueraded. However, we do want to avoid masquerading within our AKS network, as this traffic needs to come from the pod IPs. Therefore, we need to ensure we do not masquerade traffic going from the pods to:

- The pod subnet
- The node subnet
- The AKS service CIDR range, for internal networking in AKS

To do this, we need to add these IP ranges to the nonMasqueradeCIDRs array in the configuration. This is the list of IP addresses which, when traffic is sent to them, will continue to come from the pod IP and not the node IP. In addition, the configuration also allows us to define whether we masquerade the link-local IPs, which we do not want to do. Below is an example ConfigMap that works for the setup detailed above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent-config
  namespace: kube-system
  labels:
    component: ip-masq-agent
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  ip-masq-agent: |-
    nonMasqueradeCIDRs:
      - 10.0.0.0/16 # Entire VNet and service CIDR
      - 192.168.0.0/16
    masqLinkLocal: false
    masqLinkLocalIPv6: false
```

There are a couple of things to be aware of here:

- The node subnet and AKS service CIDR are two contiguous address spaces in my setup, so both are covered by 10.0.0.0/16. I could have called them out separately.
- 192.168.0.0/16 covers the whole of my pod subnet.
- I do not enable masquerading on link-local addresses.
- The ConfigMap needs to be created in the same namespace as the DaemonSet.
- The ConfigMap name needs to match what is used in the mount in the DaemonSet manifest.

Once you apply this configuration, the agent will pick up the changes within around 60 seconds. Once applied, you should find that traffic going to private addresses outside of the list of nonMasqueradeCIDRs now presents from the node IP.
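To check the behaviour end to end, one approach is to tail the agent logs to confirm the ConfigMap was picked up, and then call something you control on the peered network from a throwaway pod and see which source address it records. This is only a sketch: the label matches the DaemonSet above, while 10.1.64.10 stands in for a listener on the app network.

```bash
# Confirm the agent has reloaded its configuration.
kubectl -n kube-system logs -l k8s-app=ip-masq-agent --tail=20

# Call a listener you control on the app network (placeholder address) and check
# the source IP it logs - it should now be the node IP rather than the pod IP.
kubectl run masq-test --rm -it --restart=Never --image=busybox:1.36 -- \
  wget -qO- http://10.1.64.10:8080/
```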
Summary

If you’re deploying AKS into an IP-constrained environment, overlay networking is generally the best and simplest option. It allows you to use non-routed pod IP ranges, conserve address space, and avoid complex routing considerations without additional configuration. If you can use it, then this should be your default approach.

However, there are cases where overlay networking will not meet your needs. You might require features only available with pod subnet mode, such as the ability to send traffic directly to pods and nodes without tunnelling, or support for features like Virtual Nodes. In these situations, you can still keep your pod subnet private and non-routed by carefully controlling IP masquerading. With ip-masq-agent-v2, you can configure which destinations should (and should not) be NAT’d, ensuring isolated subnets while maintaining the functionality you need.

Simplifying Outbound Connectivity Troubleshooting in AKS with Connectivity Analysis (Preview)

Announcing the Connectivity Analysis feature for AKS, now available in public preview through the AKS portal. You can use the Connectivity Analysis (Preview) feature to quickly verify whether outbound traffic from your AKS nodes is being blocked by Azure network resources such as Azure Firewall, network security groups (NSGs), route tables, and more.

Azure at KubeCon India 2025 | Hyderabad, India – 6-7 August 2025
Welcome to KubeCon + CloudNativeCon India 2025! We’re thrilled to join this year’s event in Hyderabad as a Gold sponsor, where we’ll be highlighting the newest innovations in Azure and Azure Kubernetes Service (AKS) while connecting with India’s dynamic cloud-native community. We’re excited to share some powerful new AKS capabilities that bring AI innovation to the forefront, strengthen security and networking, and make it easier than ever to scale and streamline operations.

Innovate with AI

AI is increasingly central to modern applications and competitive innovation, and AKS is evolving to support intelligent agents more natively. The AKS Model Context Protocol (MCP) server, now in public preview, introduces a unified interface that abstracts Kubernetes and Azure APIs, allowing AI agents to manage clusters more easily across environments. This simplifies diagnostics and operations—even across multiple clusters—and is fully open-source, making it easier to integrate AI-driven tools into Kubernetes workflows.

Enhance networking capabilities

Networking is foundational to application performance and security. This wave of AKS improvements delivers more control, simplicity, and scalability in networking:

- Traffic between AKS services can now be filtered by HTTP methods, paths, and hostnames using Layer-7 network policies, enabling precise control and stronger zero-trust security.
- Built-in HTTP proxy management simplifies cluster-wide proxy configuration and allows easy disabling of proxies, reducing misconfigurations while preserving future settings.
- Private AKS clusters can be accessed securely through Azure Bastion integration, eliminating the need for VPNs or public endpoints by tunneling directly with kubectl.
- DNS performance and resilience are improved with LocalDNS for AKS, which enables pods to resolve names even during upstream DNS outages, with no changes to workloads.
- Outbound traffic from AKS can now use static egress IP prefixes, ensuring predictable IPs for compliance and smoother integration with external systems.
- Cluster scalability is enhanced by supporting multiple Standard Load Balancers, allowing traffic isolation and avoiding rule limits by assigning SLBs to specific node pools or services.
- Network troubleshooting is streamlined with Azure Virtual Network Verifier, which runs connectivity tests from AKS to external endpoints and identifies misconfigured firewalls or routes.

Strengthen security posture

Security remains a foundational priority for Kubernetes environments, especially as workloads scale and diversify. The following enhancements strengthen protection for data, infrastructure, and applications running in AKS—addressing key concerns around isolation, encryption, and visibility.

- Confidential VMs for Azure Linux enable containers to run on hardware-encrypted, isolated VMs using AMD SEV-SNP, providing data-in-use protection for sensitive workloads without requiring code changes.
- Confidential VMs for Ubuntu 24.04 combine AKS’s managed Kubernetes with memory encryption and VM-level isolation, offering enhanced security for Linux containers in Ubuntu-based clusters.
- Encryption in transit for NFS secures data between AKS pods and Azure Files NFS volumes using TLS 1.3, protecting sensitive information without modifying applications.
- Web Application Firewall for Containers adds OWASP rule-based protection to containerized web apps via Azure Application Gateway, blocking common exploits without separate WAF appliances.
- The AKS Security Dashboard in the Azure Portal centralizes visibility into vulnerabilities, misconfigurations, compliance gaps, and runtime threats, simplifying cluster security management through Defender for Cloud.

Simplify and scale operations

To streamline operations at scale, AKS is introducing new capabilities that automate resource provisioning, enforce deployment best practices, and simplify multi-tenant management—making it easier to maintain performance and consistency across complex environments.

- Node Auto-Provisioning improves resource efficiency by automatically adding and removing standalone nodes based on pod demand, eliminating the need for pre-created node pools during traffic spikes.
- Deployment Safeguards help prevent misconfigurations by validating Kubernetes manifests against best practices and optionally enforcing corrections to reduce instability and security risks.
- Managed Namespaces streamline multi-tenant cluster operations by providing a unified view of accessible namespaces across AKS clusters, along with quick access credentials via CLI, API, or Portal.

Maximize performance and visibility

To enhance performance and observability in large-scale environments, AKS is also rolling out infrastructure-level upgrades that improve monitoring capacity and control plane efficiency.

- Prometheus quotas in Azure Monitor can now be raised to 20 million samples per minute or active time series, ensuring full metric coverage for massive AKS deployments.
- Control plane performance has been improved with a backported Kubernetes enhancement (KEP-5116), reducing API server memory usage by ~10× during large listings and enabling faster kubectl responses with lower risk of OOM issues in AKS versions 1.31.9 and above.

Microsoft is at KubeCon India 2025 - come say hi!

Connect with us in Hyderabad! Microsoft has a strong on-site presence at KubeCon + CloudNativeCon India 2025. Here are some highlights of how you can connect with us at the event:

- August 6-7: Visit Microsoft at Booth G4 for live demos and expert Q&A throughout the conference. Microsoft engineers are also delivering several breakout sessions on AKS and cloud-native technologies.
- Microsoft Sessions: Throughout the conference, Microsoft engineers are speaking in various sessions, including:
  - Keynote: The Last Mile Problem: Why AI Won’t Replace You (Yet)
  - Lightning Talk: Optimizing SNAT Port and IP Address Management in Kubernetes
  - Smart Capacity-Aware Volume Provisioning for LVM Local Storage Across Multi-Cluster Kubernetes Fleet
  - Minimal OS, Maximum Impact: Journey To a Flatcar Maintainer

We’re thrilled to connect with you at KubeCon + CloudNativeCon India 2025. Whether you attend sessions, drop by our booth, or watch the keynote, we look forward to discussing these announcements and hearing your thoughts. Thank you for being part of the community, and happy KubeCon! 👋

Customising Node-Level Configuration in AKS
When you deploy AKS, you deploy the control plane, which is managed by Microsoft, and one or more node pools, which contain the worker nodes used to run your Kubernetes workloads. These node pools are usually deployed as Virtual Machine Scale Sets. These scale sets are visible in your subscription, but generally you would not make changes to them directly, as they are managed by AKS and all of their configuration and management is done through AKS.

However, there are some scenarios where you do need to make changes to the underlying node configuration to be able to handle the workloads you need to run. Whilst you can make some changes to these nodes, you need to make sure you do it in a supported manner, which will be applied consistently to all your nodes. An example of this requirement is a recent issue I saw with deploying Elasticsearch onto AKS. Let's take a look at this issue and see how it can be resolved, both for this specific issue and for any other scenario where you need to make changes to the nodes.

The Issue

For the rest of this article, we will use a specific scenario to illustrate the requirement to make node changes, but this could apply to any requirement to change the nodes. Elasticsearch requires an increased limit on the mmap count, due to the way it uses "mmapfs" for storing indices. The docs state you can resolve this by running:

```bash
sysctl -w vm.max_map_count=262144
```

This command needs to be run on the machine that is running the container, not inside the container. In our case, this is the AKS nodes. Whilst this is fairly easy to do on my laptop, it isn't really feasible to run manually on all of our AKS nodes, especially because nodes could be destroyed and recreated during updates or downtime. We need to make the changes consistently on all nodes, and automate the process so it is applied to all nodes, even new ones.

Changes to Avoid

Whilst we want to make changes to our nodes, we want to do so in a way that doesn't result in our nodes being in an unsupported state. One key example of this is making changes directly to the scale set. Using the IaaS/ARM APIs to make changes directly to the scale set, outside of Kubernetes, will result in your nodes being unsupported and should be avoided. This includes making changes to the CustomScriptExtension configured on the scale set.

Similarly, we want to avoid SSH'ing into the node's operating system and making the changes manually. Whilst this will apply the change you want, as soon as that node is destroyed and recreated, your change will be gone. Similarly, if you want to use the node autoscaler, any new nodes won't have your changes.

Solutions

There are a few different options that we could use to solve this issue and customise our node configuration. Let's take a look at them in order of ease of use.

1. Customised Node Configuration

The simplest method to customise node configuration is through the use of node configuration files that can be applied at the creation of a cluster or a node pool. Using these configuration files you are able to customise a specific set of configuration settings for both the node operating system and the kubelet configuration.
Below is an example of a Linux OS configuration:

```json
{
  "transparentHugePageEnabled": "madvise",
  "transparentHugePageDefrag": "defer+madvise",
  "swapFileSizeMB": 1500,
  "sysctls": {
    "netCoreSomaxconn": 163849,
    "netIpv4TcpTwReuse": true,
    "netIpv4IpLocalPortRange": "32000 60000"
  }
}
```

We would then apply this at the time of creating a cluster or node pool by providing the file to the CLI command. For example, creating a cluster:

```bash
az aks create --name myAKSCluster --resource-group myResourceGroup --linux-os-config ./linuxosconfig.json
```

Creating a node pool:

```bash
az aks nodepool add --name mynodepool1 --cluster-name myAKSCluster --resource-group myResourceGroup --kubelet-config ./linuxkubeletconfig.json
```

There are lots of different configuration settings that can be changed for both the OS and the kubelet, for both Linux and Windows nodes. The full list can be found here.

For our scenario, we want to change the vm.max_map_count setting, which is available as one of the configuration options in the virtual memory section. Our OS configuration would look like this:

```json
{
  "vmMaxMapCount": 262144
}
```

Note that the value used in the JSON is a camel case version of the property name, so vm.max_map_count becomes vmMaxMapCount.
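Tying this back to the Elasticsearch scenario, the setting can be applied to a dedicated node pool at creation time. A minimal sketch, assuming the vmMaxMapCount JSON above is saved as linuxosconfig.json; the pool name is a placeholder, and the name you choose is what AKS sets as the agentpool label on those nodes.

```bash
# Create a node pool for the Elasticsearch workload with the custom OS configuration applied.
az aks nodepool add \
  --cluster-name myAKSCluster \
  --resource-group myResourceGroup \
  --name elastic \
  --linux-os-config ./linuxosconfig.json
```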
2. Daemonsets

Another way we can make these changes using a Kubernetes-native method is through the use of Daemonsets. As you may know, Daemonsets provide a mechanism to run a pod on every node in your cluster. We can use a Daemonset to execute a script that sets the appropriate settings on the nodes when run, and the Daemonset will ensure that this is done on every node, including any new nodes that get created by the autoscaler or during updates. To be able to make changes to the node, we will need to run the Daemonset with some elevated privileges, so you may want to consider whether the node customisation file option listed above works for your scenario before using this option.

For this to work, we need two things: a container to run, and a Daemonset configuration.

Container

All our Daemonset does is run a container; it's the container that defines what is done. There are two options that we can use for our scenario:

- Create our own container that has the script to run defined in the Dockerfile.
- Use a pre-built container, like BusyBox, which accepts parameters defining what commands to run.

The first option is more secure, as the container is fixed to running only the command you want, and any malicious changes would require someone to re-build and publish a new image and update the Daemonset configuration to run it. The image we create is very basic; it just needs to have the tools you require for your script installed, and then run your script. The only caveat is that Daemonsets need to have their restart policy set to Always, so we can't just run our script and stop, as the container would just be restarted. To avoid this, we can have our container sleep once it is done. If the node is ever restarted or replaced, the container will still run again.

Here is the most simple Dockerfile we can use to solve our Elasticsearch issue:

```dockerfile
FROM alpine
CMD sysctl -w vm.max_map_count=262144; sleep 365d
```

Daemonset Configuration

To run our Daemonset, we need to configure our YAML to do the following:

- Run our custom container, or use a pre-built container with the right parameters
- Grant the Daemonset the required privileges to be able to make changes to the node
- Set the restart policy to Always

If we want, we can also restrict our Daemonset to only run on nodes that we know are going to run this workload. For example, we can restrict this to only run on a specific node pool in AKS using a node selector.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-config
spec:
  selector:
    matchLabels:
      name: node-config
  template:
    metadata:
      labels:
        name: node-config
    spec:
      containers:
        - name: node-config
          image: scdemo.azurecr.io/node-config:1
          securityContext:
            privileged: true
      restartPolicy: Always
      nodeSelector:
        agentpool: elasticsearch
```

Once we deploy this to our cluster, the Daemonset will run and make the changes we require.

When using a Daemonset or init container approach, pay special attention to security. This container will run in privileged mode, which gives it a high level of permissions, not just the ability to change the specific configuration setting you are interested in. Ensure that access to these containers and their configuration is restricted. Consider using init containers if possible, as their runtime is more limited.

3. Init Containers

This is a similar approach to Daemonsets, but instead of running on every node, we use an init container in our application so the script only runs on the nodes where our application is present. An init container allows us to specify that a specific container must run and complete successfully prior to our main application being run. We can take our container that runs our custom script, as with the Daemonset option, and run this as an init container instead.

The benefit of this approach is that the init container only runs once when the application is started, and then stops. This avoids needing the sleep command that keeps the process running at all times. The downside is that using an init container requires editing the YAML for the application you are deploying, which may be difficult or impossible if you are using a third-party application. Some third-party applications have Helm charts or similar that do allow passing in custom init containers, but many do not. If you are creating your own applications then this is easier.

Below is an example using this approach. In this example we use a pre-built container (BusyBox) for running our script, rather than a custom container. Either approach can be used.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
  labels:
    app.kubernetes.io/name: MyApp
spec:
  containers:
    - name: main-app
      image: scdemo.azurecr.io/main-app:1
  initContainers:
    - name: init-sysctl
      image: busybox
      command:
        - sysctl
        - -w
        - vm.max_map_count=262144
      imagePullPolicy: IfNotPresent
      securityContext:
        privileged: true
```
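Whichever of these approaches you use, it is worth verifying that the setting has actually landed on the node. Because vm.max_map_count is not a namespaced sysctl, any pod scheduled onto the node sees the host value, so a throwaway pod is enough to check it. A sketch, assuming a node pool whose agentpool label is elasticsearch, as in the Daemonset example above:

```bash
# Schedule a temporary pod onto the Elasticsearch node pool and read the host value.
kubectl run sysctl-check --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"agentpool":"elasticsearch"}}}' -- \
  cat /proc/sys/vm/max_map_count
```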
Conclusions

Making changes to underlying AKS nodes is something that most people won't need to do most of the time. However, there are some scenarios you may hit where this is important. AKS comes with functionality to do this in a controlled and supported manner via the use of configuration files. This approach is recommended if the configuration you need to change is supported, as it is simpler to implement, doesn't require creating custom containers, and is the most secure approach. If the change you need is not supported, then you still have a way to deal with this via Daemonsets or init containers, but special attention should be paid to security when using this solution.

Validating Change Requests with Kubernetes Admission Controllers
Promoting an application or infrastructure change into production often comes with a requirement to follow a change control process. This ensures that changes to production are properly reviewed and that they adhere to required approvals, change windows and QA processes. Often this change request (CR) process will be conducted using a system for recording and auditing the change request and the outcome.

When deploying a release, there will often be places in the process to go through this change control workflow. This may be part of a release pipeline, it may be managed in a pull request, or it may be a manual process. Ultimately, by the time the actual changes are made to production infrastructure or applications, they should already be approved. This relies on the appropriate controls and restrictions being in place to make sure this happens. When it comes to the point of deploying resources into production Kubernetes clusters, they should already have been through a CR process. However, what if you wanted a way to validate that this is the case, and block anything from being deployed that does not have an approved CR, providing a backstop to ensure that no unapproved resources get deployed? Let's take a look at how we can use an admission controller to do this.

Admission Controllers

A Kubernetes admission controller is a mechanism to provide a checkpoint during a deployment that validates resources and applies rules and policies before the resource is accepted into the cluster. Any request to create, update or delete (CRUD) a resource is first run through any applicable admission controllers to check if it violates any of the required rules. Only if all admission controllers allow the request is it then processed.

Kubernetes includes some built-in admission controllers, but you can also create your own. Admission controllers are essentially webhooks that are registered with the Kubernetes API server. When a CRUD request is processed by the API server, it calls any of these webhooks that are registered, and processes the response. When creating your own admission controller, you would usually implement the webhook as a pod running in the cluster. There are three types of admission controller webhooks:

- MutatingAdmissionWebhook: Can modify the incoming object before it is persisted (e.g., injecting sidecars).
- ValidatingAdmissionWebhook: Can only approve or reject the request based on validation logic.
- ValidatingAdmissionPolicy: Validation logic is embedded in the API server, rather than requiring a separate web service.

For our scenario we are going to use a validating admission webhook, as we only want to approve or reject a request based on its change request status.
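Before building anything, it can be useful to see which webhooks are already registered in a cluster; the same check also confirms your own registration once it is deployed:

```bash
# List the webhook registrations the API server will consult for matching requests.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
```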
Sample Code

In this article, we are not going to go line by line through the code for this admission controller; however, you can see an example implementation in this repo. In this example, we do not build out the full web service for validating change requests themselves. We have some pre-defined CR IDs with pre-configured statuses returned by the application. In a real-world implementation your web service would call out to your change management solution to get the current status of the change request. This does not impact how you would build the admission controller, just the business logic inside it.

Components

Our admission controller consists of several components:

Application

Our actual admission controller application, which runs an HTTP service that receives the request from the API server calling the webhook, processes it, applies business logic, and returns a response. In our example this service has been written in Go, but you can use whatever language you like. Your service must meet the API contract defined for the admission webhook.

Our application does the following:

1. Reads the incoming change body YAML and extracts the change ID from the change.company.com/id annotation that should be applied to the resource. We also support the argocd.argoproj.io/change-id and deployment.company.com/change-id annotations.

```go
func extractChangeID(req *admissionv1.AdmissionRequest) string {
	// Try to extract change ID from object annotations
	obj := req.Object.Raw
	var objMap map[string]interface{}
	if err := json.Unmarshal(obj, &objMap); err != nil {
		return ""
	}
	if metadata, ok := objMap["metadata"].(map[string]interface{}); ok {
		if annotations, ok := metadata["annotations"].(map[string]interface{}); ok {
			// Look for change ID in various annotation formats
			if changeID, ok := annotations["change.company.com/id"].(string); ok {
				return changeID
			}
			if changeID, ok := annotations["argocd.argoproj.io/change-id"].(string); ok {
				return changeID
			}
			if changeID, ok := annotations["deployment.company.com/change-id"].(string); ok {
				return changeID
			}
		}
	}
	return ""
}
```

2. If it does not find the required annotation, it immediately fails the validation, as no CR is present.

```go
if changeID == "" {
	// Reject resources without change ID annotation
	klog.Infof("No change ID found, rejecting request")
	ac.respond(w, &admissionReview, false, "Change ID annotation is required")
	return
}
```

3. If the CR is present, it validates it. In our demo application this is checked against a hard-coded list of CRs, but in the real world this is where you would make a call out to your external change management solution to get the CR with that ID. There are three possible outcomes here:

- The CR ID does not match an ID in our system: the validation fails.
- The CR does match an ID in our system, but the CR is not approved: the validation fails.
- The CR does match an ID in our system and the CR has been approved: the validation passes and the resources are created.

```go
changeRecord, err := ac.changeService.ValidateChange(changeID)
if err != nil {
	klog.Errorf("Change validation failed: %v", err)
	ac.respond(w, &admissionReview, false, fmt.Sprintf("Change validation failed: %v", err))
	return
}

if !changeRecord.Approved {
	klog.Infof("Change %s is not approved (status: %s)", changeID, changeRecord.Status)
	ac.respond(w, &admissionReview, false, fmt.Sprintf("Change %s is not approved (status: %s)", changeID, changeRecord.Status))
	return
}

klog.Infof("Change %s is approved, allowing deployment", changeID)
ac.respond(w, &admissionReview, true, fmt.Sprintf("Change %s approved by %s", changeID, changeRecord.Requester))
```

Container

To run our admission controller inside the AKS cluster we need to create a Docker container that runs our application. In the sample code you will find a Dockerfile used to build this container. We then push the container to a Docker registry, so we can consume the image when we run the webhook service.
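A minimal sketch of that build-and-push step, assuming an Azure Container Registry; the registry and image names here are placeholders rather than values from the sample repo:

```bash
# Build the webhook image from the repo's Dockerfile and push it to ACR.
az acr login --name myregistry
docker build -t myregistry.azurecr.io/change-admission-controller:0.1.0 .
docker push myregistry.azurecr.io/change-admission-controller:0.1.0
```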
Kubernetes Resources

To run our Docker container and set up a URL that the API server can call, we will deploy:

- A Kubernetes Deployment
- A Kubernetes Service
- A set of RBAC roles and bindings to grant access to the admission controller

Finally, we will deploy the webhook registration itself, as a ValidatingWebhookConfiguration resource. This resource tells the API server:

- Where to call the webhook.
- Which operations should require calling the webhook - in our demo application we look at create and update operations. If you wanted to validate that delete operations had a CR, you could also add that.
- Which resource types need to be validated - in our demo we are looking at Deployments, Services and ConfigMaps, but you could make this as wide or narrow as you require.
- Which namespaces to validate - we added a condition that only applies this validation to namespaces that have a label of change-validation set to enabled, so we can control where this is applied and avoid applying it to things like system namespaces. This is very important to ensure you don't break your core Kubernetes infrastructure. It also allows for differentiation between development and production namespaces, where you likely would not want to require change requests in development.
- What happens when the validation fails. There are two options: fail, which blocks the resource creation, and ignore, which ignores the failure and allows the resource to be created.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: change-validation-webhook
webhooks:
  - name: change-validation.company.com
    clientConfig:
      service:
        name: admission-controller
        namespace: admission-controller
        path: "/admit"
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["apps"]
        apiVersions: ["v1"]
        resources: ["deployments"]
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["services", "configmaps"]
    namespaceSelector:
      matchLabels:
        change-validation: "enabled"
    admissionReviewVersions: ["v1", "v1beta1"]
    sideEffects: None
    failurePolicy: Fail
```

Admission Controller In Action

Now that we have our admission controller set up, let's attempt to make a change to a resource. Using a Kubernetes Deployment resource, we will attempt to change the number of replicas from three to two. For this resource, the change.company.com/id annotation is set to CHG-2025-000, which is a change request that doesn't exist in our change management system.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: demo
  annotations:
    change.company.com/id: "CHG-2025-000"
  labels:
    app: demo-app
    environment: development
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
```

Once we attempt to deploy this, we will quickly see that the request to update the resource is denied:

one or more objects failed to apply, reason: error when patching "/dev/shm/1236013741": admission webhook "change-validation.company.com" denied the request: Change validation failed: change record not found,admission webhook "change-validation.company.com" denied the request: Change validation failed: change record not found.
Similarly, if we change the annotation to CHG-2025-999, which is a change request that does exist but has not been approved, we again see that the request is denied, but this time the error is clear that it is not approved:

one or more objects failed to apply, reason: error when patching "/dev/shm/28290353": admission webhook "change-validation.company.com" denied the request: Change CHG-2025-999 is not approved (status: pending),admission webhook "change-validation.company.com" denied the request: Change validation failed: change record not found.

Finally, we update the annotation to CHG-2025-002, which has been approved. This time our deployment update succeeds and the number of replicas is reduced to two.

Next Steps

What we have created so far works as a proof of concept to confirm that using an admission controller for this job will work. To move this into production use, we'd need to take a few more steps:

- Update our web API to call out to our external change management solution and retrieve real change requests
- Implement proper security for the admission controller, with SSL certificates and network restrictions inside the cluster
- Implement high availability with multiple replicas to ensure the service is always able to respond to requests
- Implement monitoring and log collection for our service to ensure we are aware of any issues
- Automate the build and release of this solution, including implementing its own set of change controls!

Conclusions

Controlling updates into production through a change control process is vital for stable, secure and audited production environments. Ideally these CR processes will happen early in the release pipeline, in a clear, automated process that avoids getting to the point where anyone tries to deploy unapproved changes into production. However, if you want to ensure that this cannot happen, and put some safeguards in place so that unapproved changes are always blocked, then the use of admission controllers is one way to do this. Creating a custom admission controller is relatively straightforward, and it allows you to integrate your business processes into the decision on whether a resource can be deployed or not. A change control admission controller should not be your only change control process, but it can form part of your layers of control and audit.

Further Reading

- Sample Code
- Admission Control in Kubernetes
- Manage Change in the Cloud Adoption Framework