eBPF
Retina 1.0 Is Now Available
We are excited to announce the first major release of Retina - a significant milestone for the project. This version brings many new features, enhancements, and bug fixes. The Retina maintainer team would like to thank all contributors, community members, and early adopters who helped make this 1.0 release possible.

What is Retina?
Retina is an open-source Kubernetes network observability platform. It enables you to continuously observe and measure network health, and to investigate network issues on demand with integrated Kubernetes-native workflows.

Why Retina?
Kubernetes networking failures are rarely isolated or easy to reproduce. Pods are ephemeral, services span multiple nodes, and network traffic crosses multiple layers (CNI, kube-proxy, node networking, policies), making crucial evidence difficult to capture. Manually connecting to nodes and stitching together logs or packet captures simply does not scale as clusters grow in size and complexity. A modern approach to observability must automate and centralize data collection while exposing rich, actionable insights.

Retina represents a major step forward in solving the complexities of Kubernetes observability by leveraging the power of eBPF. Its cloud-agnostic design, deep integration with Hubble, and support for both real-time metrics and on-demand packet captures make it an invaluable tool for DevOps, SecOps, and compliance teams across diverse environments.

What Does It Do?
Retina collects two types of telemetry: metrics and packet captures. In addition, the Retina shell enables ad-hoc troubleshooting via pre-installed networking tools.

Metrics
Metrics provide continuous observability. They can be exported to multiple storage options such as Prometheus or Azure Monitor, and visualized in a variety of ways, including Grafana or Azure Log Analytics.

Retina supports two control planes, Hubble and Standard, and both work regardless of the underlying CNI. The choice of control plane affects which metrics are collected:
- Hubble metrics
- Standard metrics

You can customize which metrics are collected by enabling or disabling their corresponding plugins. Examples of metrics include:
- Incoming/outgoing traffic
- Dropped packets
- TCP/UDP
- DNS
- API server latency
- Node/interface statistics

Packet Captures
Captures provide on-demand observability. They allow users to perform distributed packet captures across the cluster, based on specified Nodes/Pods and other supported filters. Captures can be triggered via the CLI or through the Capture CRD, and may be output to persistent storage options such as the host filesystem, a PVC, or a storage blob.

The result of a capture contains more than just a .pcap file. Retina also collects networking metadata such as iptables rules, socket statistics, kernel network information from /proc/net, and more.

Shell
The Retina shell enables deep ad-hoc troubleshooting by providing a suite of networking tools. The CLI command starts an interactive shell on a Kubernetes node that runs a container image with standard tools such as ping and curl, as well as specialized tools like bpftool, pwru, Inspektor Gadget, and more. The Retina shell is currently only available on Linux. Note that some tools require particular capabilities to execute; these can be passed as parameters through the CLI.

Use Cases
Debugging Pod Connectivity Issues: When services can't communicate, Retina enables rapid, automated distributed packet capture and drop metrics, drastically reducing troubleshooting time (a sketch of triggering such a capture follows below).
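As an illustration only, a capture like the one described above might be triggered from the Retina CLI roughly as follows. This is a minimal sketch: the subcommand, flag names, and duration format are illustrative and may differ between Retina releases, so check kubectl retina capture create --help for your installed version.

    # Hedged sketch: run a distributed packet capture on Linux nodes for about
    # 60 seconds and store the results on each node's local filesystem.
    # Flag names are assumptions based on the Retina CLI at the time of writing.
    kubectl retina capture create \
      --name connectivity-debug \
      --node-selectors "kubernetes.io/os=linux" \
      --duration 60s \
      --host-path /mnt/retina-captures

The same capture can also be expressed declaratively through the Capture CRD if you prefer GitOps-style workflows.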
The Retina shell also brings specialized tools for deep manual investigations.

Continuous Monitoring of Network Health: Operators can set up alerts and dashboards for DNS failures, API server latency, or packet drops, gaining ongoing visibility into cluster networking.

Security Auditing and Compliance: Flow logs (in Hubble mode) and metrics support security investigations and compliance reporting, enabling quick identification of unexpected connections or data transfers.

Multi-Cluster / Multi-Cloud Visibility: Retina standardizes network observability across clouds, supporting unified dashboards and processes for SRE teams.

Where Does It Run?
Retina is designed for broad compatibility across Kubernetes distributions, cloud providers, and operating systems. There are no Azure-specific dependencies - Retina runs anywhere Kubernetes does.

Operating Systems: Both Linux and Windows nodes are supported.
Kubernetes Distributions: Retina is distribution-agnostic, deployable on managed services (AKS, EKS, GKE) or self-managed clusters.
CNI / Network Stack: Retina works with any CNI, focusing on kernel-level events rather than CNI-specific logs.
Cloud Integration: Retina exports metrics to Azure Monitor and Log Analytics, with pre-built Grafana dashboards for AKS. Integration with AWS CloudWatch or GCP Stackdriver is possible via Prometheus.
Observability Stacks: Retina integrates with Prometheus & Grafana, Cilium Hubble (for flow logs and UI), and can be extended to other exporters.

Design Overview
Retina's architecture consists of two layers: a data collection layer in kernel space, and a processing layer in user space that converts low-level signals into Kubernetes-aware telemetry. When Retina is installed, each node in the cluster runs a Retina agent which collects raw network telemetry from the host kernel - backed by eBPF on Linux, and HNS/VFP on Windows. The agent processes the raw network data and enriches it with Kubernetes metadata, which is then exported for consumption by monitoring tools such as Prometheus, Grafana, or Hubble UI.

Modularity and extensibility are central to the design philosophy. Retina's plugin model lets you enable only the telemetry you need, and add new sources by implementing a common plugin interface. Built-in plugins include Drop Reason, DNS, Packet Forward, and more. Check out our architecture docs for a deeper dive into Retina's design.

Get Started
Thanks to Helm charts, deploying Retina is streamlined across all environments and can be done with one configurable command. For complete documentation, visit our installation docs.

To install Retina with the Standard control plane and Basic metrics mode:

    VERSION=$(curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
    helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
      --version $VERSION \
      --namespace kube-system \
      --set image.tag=$VERSION \
      --set operator.tag=$VERSION \
      --set logLevel=info \
      --set operator.enabled=true \
      --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\]"

Once Retina is running in your cluster, you can configure Prometheus and Grafana to scrape and visualize your metrics.

Install the Retina CLI with Krew:

    kubectl krew install retina

Get Involved
Retina is open-source under the MIT License and welcomes community contributions. Since its announcement in early 2024, the project has gained significant traction, with contributors from multiple organizations helping to expand its capabilities.
The project is hosted on GitHub at microsoft/retina, and documentation is available at retina.sh. If you would like to contribute to Retina, you can follow our contributor guide.

What's Next?
Retina 1.1, of course! We are also discussing the future roadmap and exploring the possibility of moving the project to community ownership. Stay tuned! In the meantime, we welcome you to raise an issue if you find any bugs, or start a discussion if you have any questions or suggestions. You can also reach out to the Retina team via email - we would love to hear from you!

References
- Retina
- Deep Dive into Retina Open-Source Kubernetes Network Observability
- Troubleshooting Network Issues with Retina
- Retina: Bridging Kubernetes Observability and eBPF Across the Clouds

Scaling DNS on AKS with Cilium: NodeLocal DNSCache, LRP, and FQDN Policies
Why Adopt NodeLocal DNSCache?
The primary drivers for adoption are usually:
- Eliminating conntrack pressure: In high-QPS UDP DNS scenarios, conntrack contention and UDP tracking can cause intermittent DNS response loss and retries; depending on resolver retry/timeout settings, this can appear as multi-second lookup delays and sometimes much longer tails.
- Reducing latency: By placing a cache on every node, you remove the network hop to the CoreDNS service. Responses are practically instantaneous for cached records.
- Offloading CoreDNS: A DaemonSet architecture effectively shards the DNS query load across the entire cluster, preventing the central CoreDNS deployment from becoming a single point of congestion during bursty scaling events.

Who needs this? You should prioritize this architecture if you run:
- Large clusters (hundreds of nodes or thousands of pods), where CoreDNS scaling becomes difficult to manage.
- High-churn endpoints, such as spot instances or frequent auto-scaling jobs that trigger massive waves of DNS queries.
- Real-time applications where multi-second (and occasionally longer) DNS lookup delays are unacceptable.

The Challenge with Cilium
Deploying NodeLocal DNSCache on a cluster managed by Cilium (CNI) requires a specific approach. Standard NodeLocal DNSCache relies on node-level interface/iptables setup. In Cilium environments, you can instead implement the interception via a Cilium Local Redirect Policy (LRP), which redirects traffic destined for the kube-dns ClusterIP service to a node-local backend pod.

This post details a production-ready deployment strategy aligned with Cilium's Local Redirect Policy model. It covers the configuration tweaks needed to avoid conflicts and explains how to maintain security filtering.

Architecture Overview
In a standard Kubernetes deployment, NodeLocal DNSCache creates a dummy network interface and uses extensive iptables rules to hijack traffic destined for the cluster DNS IP. When using Cilium, we can achieve this more elegantly and efficiently using Local Redirect Policies.
- DaemonSet: Runs node-local-dns on every node.
- Configuration: Configured to skip interface creation and iptables manipulation.
- Redirection: Cilium LRP intercepts traffic to the kube-dns Service IP and redirects it to the local pod on the same node.

1. The NodeLocal DNSCache DaemonSet
The critical difference in this manifest is the set of arguments passed to the node-local-dns binary. We must explicitly disable its networking setup functions to let Cilium handle the traffic.

The NodeLocal DNSCache deployment also requires the node-local-dns ConfigMap and the kube-dns-upstream Service (plus RBAC/ServiceAccount). For brevity, the snippet below shows only the DaemonSet arguments that differ in the Cilium/LRP approach. The node-cache binary reads the template Corefile (/etc/coredns/Corefile.base) and generates the active Corefile (/etc/Corefile); the -conf flag points CoreDNS at the active Corefile it should load. The node-cache binary also accepts -localip as an IP list; 0.0.0.0 is a valid value and makes it listen on all interfaces, which is appropriate for the LRP-based redirection model.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-local-dns
      namespace: kube-system
      labels:
        k8s-app: node-local-dns
    spec:
      selector:
        matchLabels:
          k8s-app: node-local-dns
      template:
        metadata:
          labels:
            k8s-app: node-local-dns
          annotations:
            # Optional: policy.cilium.io/no-track-port can be used to bypass conntrack for DNS.
            # Validate the impact on your Cilium version and your observability/troubleshooting needs.
            policy.cilium.io/no-track-port: "53"
        spec:
          # IMPORTANT for the "LRP + listen broadly" approach:
          # keep hostNetwork off so you don't hijack node-wide :53
          hostNetwork: false
          dnsPolicy: ClusterFirst
          containers:
            - name: node-cache
              image: registry.k8s.io/dns/k8s-dns-node-cache:1.15.16
              args:
                - "-localip"
                # Use a bind-all approach. Ensure server blocks bind broadly in your Corefile.
                - "0.0.0.0"
                - "-conf"
                - "/etc/Corefile"
                - "-upstreamsvc"
                - "kube-dns-upstream"
                # CRITICAL: Disable internal setup
                - "-skipteardown=true"
                - "-setupinterface=false"
                - "-setupiptables=false"
              ports:
                - containerPort: 53
                  name: dns
                  protocol: UDP
                - containerPort: 53
                  name: dns-tcp
                  protocol: TCP
              # Ensure your Corefile includes health :8080 so the liveness probe works
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 60
                timeoutSeconds: 5
              volumeMounts:
                - name: config-volume
                  mountPath: /etc/coredns
                - name: kube-dns-config
                  mountPath: /etc/kube-dns
          volumes:
            - name: kube-dns-config
              configMap:
                name: kube-dns
                optional: true
            - name: config-volume
              configMap:
                name: node-local-dns
                items:
                  - key: Corefile
                    path: Corefile.base

2. The Cilium Local Redirect Policy (LRP)
Instead of iptables, we define a CRD that tells Cilium: "When you see traffic for kube-dns, send it to the node-local-dns pod on this same node."

    apiVersion: "cilium.io/v2"
    kind: CiliumLocalRedirectPolicy
    metadata:
      name: "nodelocaldns"
      namespace: kube-system
    spec:
      redirectFrontend:
        # ServiceMatcher mode is for ClusterIP services
        serviceMatcher:
          serviceName: kube-dns
          namespace: kube-system
      redirectBackend:
        # The backend pods selected by localEndpointSelector must be in the same namespace as the LRP
        localEndpointSelector:
          matchLabels:
            k8s-app: node-local-dns
        toPorts:
          - port: "53"
            name: dns
            protocol: UDP
          - port: "53"
            name: dns-tcp
            protocol: TCP

This is an LRP-based NodeLocal DNSCache deployment: we disable node-cache's iptables/interface setup and let Cilium LRP handle local redirection. This differs from the upstream NodeLocal DNSCache manifest, which uses hostNetwork + dummy interface + iptables. LRP must be enabled in Cilium (e.g., localRedirectPolicies.enabled=true) before applying the CRD.

Official Cilium LRP doc

DNS-Based FQDN Policy Enforcement Flow
The diagram below illustrates how Cilium enforces FQDN-based egress policies using DNS observation and datapath programming. During the DNS resolution phase, queries are redirected to NodeLocal DNS (or CoreDNS), where responses are observed and used to populate Cilium's FQDN-to-IP cache. Cilium then programs these mappings into eBPF maps in the datapath. In the connection phase, when the client initiates an HTTPS connection to the resolved IP, the datapath checks the IP against the learned FQDN map and applies the policy decision before allowing or denying the connection.

The Network Policy "Gotcha"
If you use CiliumNetworkPolicy to restrict egress traffic, specifically for FQDN filtering, you typically allow access to CoreDNS like this:

      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: kube-dns
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY

This will break with local redirection. Why? Because LRP redirects the DNS request to the node-local-dns backend endpoint; strict egress policies must therefore allow both kube-dns (upstream) and node-local-dns (the redirected destination).
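Before moving to the repro, it can help to confirm that the redirect is actually in place. The commands below are a hedged sketch: the resource plural and the in-agent binary name (cilium vs. cilium-dbg) vary between Cilium releases, so treat the exact invocations as illustrative.

    # List the Local Redirect Policies known to Kubernetes.
    kubectl get ciliumlocalredirectpolicies -n kube-system

    # Ask one Cilium agent which redirect policies it is enforcing locally
    # (older releases ship the in-pod binary as `cilium` rather than `cilium-dbg`).
    kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg lrp list

If the policy shows up with node-local-dns backends, DNS traffic to the kube-dns ClusterIP is being redirected on that node.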
The Repro Setup
To demonstrate this failure, the cluster is configured with:
- NodeLocal DNSCache: Deployed as a DaemonSet (node-local-dns) to cache DNS requests locally on every node.
- Local Redirect Policy (LRP): An active LRP intercepts traffic destined for the kube-dns Service IP and redirects it to the local node-local-dns pod.
- Incomplete Network Policy: A strict CiliumNetworkPolicy (CNP) is enforced on the client pod. While it explicitly allows egress to kube-dns, it misses the corresponding rule for node-local-dns.

Reveal the issue using Hubble
In this scenario, the client pod dns-client is attempting to resolve the external domain github.com. When inspecting the traffic flows, you will see EGRESS DENIED verdicts. Crucially, notice the destination pod in the logs below: kube-system/node-local-dns, not kube-dns. Although the application originally sent the packet to the ClusterIP of CoreDNS, Cilium's Local Redirect Policy modified the destination to the local node cache. Since strictly defined network policies assume traffic is going to the kube-dns identity, this redirected traffic falls outside the allowed rules and is dropped by the default-deny stance.

The Fix
You must allow egress to both labels.

      - toEndpoints:
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: kube-dns
          # Add this selector for the local cache
          - matchLabels:
              k8s:io.kubernetes.pod.namespace: kube-system
              k8s:k8s-app: node-local-dns
        toPorts:
          - ports:
              - port: "53"
                protocol: ANY

Without this addition, pods protected by strict egress policies will time out resolving DNS, even though the cache is running.

Use Hubble to observe the network flows: after adding matchLabels: k8s:k8s-app: node-local-dns, the traffic is now allowed. Hubble confirms a policy verdict of EGRESS ALLOWED for UDP traffic on port 53. Because DNS resolution now succeeds, the response populates the Cilium FQDN cache, subsequently allowing the TCP traffic to github.com on port 443 as intended.

Real-World Example: Restricting Egress with FQDN Policies
Here is a complete CiliumNetworkPolicy that locks down a workload to only access api.example.com. Note how the DNS rule explicitly allows traffic to both kube-dns (for upstream) and node-local-dns (for the local cache).

    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: secure-workload-policy
    spec:
      endpointSelector:
        matchLabels:
          app: critical-workload
      egress:
        # 1. Allow DNS Resolution (REQUIRED for FQDN policies)
        - toEndpoints:
            - matchLabels:
                k8s:io.kubernetes.pod.namespace: kube-system
                k8s:k8s-app: kube-dns
            # Allow traffic to the local cache redirection target
            - matchLabels:
                k8s:io.kubernetes.pod.namespace: kube-system
                k8s:k8s-app: node-local-dns
          toPorts:
            - ports:
                - port: "53"
                  protocol: ANY
              rules:
                dns:
                  - matchPattern: "*"
        # 2. Allow specific FQDN traffic (populated via DNS lookups)
        - toFQDNs:
            - matchName: "api.example.com"
          toPorts:
            - ports:
                - port: "443"
                  protocol: TCP

Configuration & Upstream Loops
When configuring the ConfigMap for node-local-dns, use the standard placeholders provided by the image; the binary replaces them at runtime:
- __PILLAR__CLUSTER__DNS__: the upstream Service IP (kube-dns-upstream).
- __PILLAR__UPSTREAM__SERVERS__: the system resolvers (usually /etc/resolv.conf).

Ensure kube-dns-upstream exists as a Service selecting the CoreDNS pods so that cache misses are forwarded to the actual CoreDNS backends.

Alternative: AKS LocalDNS
LocalDNS is an Azure Kubernetes Service (AKS)-managed node-local DNS proxy/cache.
Pros:
- Managed lifecycle at the node pool level.
- Support for custom configuration via localdnsconfig.json (e.g., custom server blocks, cache tuning).
- No manual DaemonSet management required.

Cons & Limitations:
- Incompatibility with FQDN policies: As noted in the official documentation, LocalDNS isn't compatible with applied FQDN filter policies in ACNS/Cilium; if you rely on FQDN enforcement, prefer a DNS path that preserves FQDN learning/enforcement.
- Updating configuration requires reimaging the node pool.

For environments heavily relying on strict Cilium Network Policies and FQDN filtering, the manual deployment method described above (using LRP) can be more reliable and transparent. AKS recommends not enabling both upstream NodeLocal DNSCache and LocalDNS in the same node pool, as DNS traffic is routed through LocalDNS and results may be unexpected.

References
- Kubernetes Documentation: NodeLocal DNSCache
- Cilium Documentation: Local Redirect Policy
- AKS Documentation: Configure LocalDNS

Project Pavilion Presence at KubeCon NA 2025
KubeCon + CloudNativeCon NA took place in Atlanta, Georgia, from 10-13 November, and continued to highlight the ongoing growth of the open source, cloud-native community. Microsoft participated throughout the event and supported several open source projects in the Project Pavilion. Microsoft's involvement reflected our commitment to upstream collaboration, open governance, and enabling developers to build secure, scalable and portable applications across the ecosystem.

The Project Pavilion serves as a dedicated, vendor-neutral space on the KubeCon show floor reserved for CNCF projects. Unlike the corporate booths, it focuses entirely on open source collaboration. It brings maintainers and contributors together with end users for hands-on demos, technical discussions, and roadmap insights. This space helps attendees discover emerging technologies and understand how different projects fit into the cloud-native ecosystem. It plays a critical role for idea exchanges, resolving challenges and strengthening collaboration across CNCF approved technologies.

Why Our Presence Matters
KubeCon NA remains one of the most influential gatherings for developers and organizations shaping the future of cloud-native computing. For Microsoft, participating in the Project Pavilion helps advance our goals of:
- Open governance and community-driven innovation
- Scaling vital cloud-native technologies
- Secure and sustainable operations
- Learning from practitioners and adopters
- Enabling developers across clouds and platforms

Many of Microsoft's products and cloud services are built on or aligned with CNCF and open-source technologies. Being active within these communities ensures that we are contributing back to the ecosystem we depend on and designing by collaborating with the community, not just for it.

Microsoft-Supported Pavilion Projects

containerd
Representative: Wei Fu
The containerd team engaged with project maintainers and ecosystem partners to explore solutions for improving AI model workflows. A key focus was the challenge of handling large OCI artifacts (often 500+ GiB) used in AI training workloads. Current image-pulling flows require containerd to fetch and fully unpack blobs, which significantly delays pod startup for large models. Collaborators from Docker, NTT, and ModelPack discussed a non-unpacking workflow that would allow training workloads to consume model data directly. The team plans to prototype this behavior as an experimental feature in containerd. Additional discussions included updates related to nerdbox and next steps for the erofs snapshotter.

Copacetic
Representative: Joshua Duffney
The Copa booth attracted roughly 75 attendees, with strong representation from federal agencies and financial institutions, a sign of growing adoption in regulated industries. A lightning talk delivered at the conference significantly boosted traffic and engagement. Key feedback and insights included:
- High interest in customizable package update sources
- Demand for application-level patching beyond OS-level updates
- Need for clearer CI/CD integration patterns
- Expectations around in-cluster image patching
- Questions about runtime support, including Podman

The conversations revealed several documentation gaps and feature opportunities that will inform Copa's roadmap and future enablement efforts.

Drasi
Representative: Nandita Valsan
KubeCon NA 2025 marked Drasi's first in-person presence since its launch in October 2024 and its entry into the CNCF Sandbox in early 2025.
With multiple kiosk slots, the team interacted with ~70 visitors across shifts. Engagement highlights included:
- New community members joining the Drasi Discord and starring GitHub repositories
- Meaningful discussions with observability and incident management vendors interested in change-driven architectures
- Positive reception to Aman Singh's conference talk, which led attendees back to the booth for deeper technical conversations

Post-event follow-ups are underway with several sponsors and partners to explore collaboration opportunities.

Flatcar Container Linux
Representatives: Sudhanva Huruli and Vamsi Kavuru
The Flatcar project had some fantastic conversations at the pavilion. Attendees were eager to learn about bare metal provisioning, GPU support for AI workloads, and how Flatcar's fully automated build and test process keeps things simple and developer friendly. Questions around Talos vs. Flatcar and CoreOS sparked lively discussions, with the team emphasizing Flatcar's usability and independence from an OS-level API. Interest came from government agencies and financial institutions, and the preview of Flatcar on AKS opened the door to deeper conversations about real-world adoption. The Project Pavilion proved to be the perfect venue for authentic, technical exchanges.

Flux
Representative: Dipti Pai
The Flux booth was active throughout all three days of the Project Pavilion, where Microsoft joined other maintainers to highlight new capabilities in Flux 2.7, including improved multi-tenancy, enhanced observability, and streamlined cloud-native integrations. Visitors shared real-world GitOps experiences, both successes and challenges, which provided valuable insights for the project's ongoing development. Microsoft's involvement reinforced strong collaboration within the Flux community and continued commitment to advancing GitOps practices.

Headlamp
Representatives: Joaquim Rocha, Will Case, and Oleksandr Dubenko
Headlamp had a booth for all three days of the conference, engaging with both longstanding users and first-time attendees. The increased visibility from becoming a Kubernetes sub-project was evident, with many attendees sharing their usage patterns across large tech organizations and smaller industrial teams. The booth enabled maintainers to:
- Gather insights into how teams use Headlamp in different environments
- Introduce Headlamp to new users discovering it via talks or hallway conversations
- Build stronger connections with the community and understand evolving needs

Inspektor Gadget
Representatives: Jose Blanquicet and Mauricio Vásquez Bernal
Hosting a half-day kiosk session, Inspektor Gadget welcomed approximately 25 visitors. Attendees included newcomers interested in learning the basics and existing users looking for updates. The team showcased new capabilities, including the tcpdump gadget and Prometheus metrics export, and invited visitors to the upcoming contribfest to encourage participation.

Istio
Representatives: Keith Mattix, Jackie Maertens, Steven Jin Xuan, Niranjan Shankar, and Mike Morris
The Istio booth continued to attract a mix of experienced adopters and newcomers seeking guidance.
Technical discussions focused on:
- Enhancements to multicluster support in ambient mode
- Migration paths from sidecars to ambient
- Improvements in Gateway API availability and usage
- Performance and operational benefits for large-scale deployments

Users, including several Azure customers, expressed appreciation for Microsoft's sustained investment in Istio as part of their service mesh infrastructure.

Notary Project
Representatives: Feynman Zhou and Toddy Mladenov
The Notary Project booth saw significant interest from practitioners concerned with software supply chain security. Attendees discussed signing, verification workflows, and integrations with Azure services and Kubernetes clusters. The conversations will influence upcoming improvements across Notary Project and Ratify, reinforcing Microsoft's commitment to secure artifacts and verifiable software distribution.

Open Policy Agent (OPA) - Gatekeeper
Representative: Jaydip Gabani
The OPA/Gatekeeper booth enabled maintainers to connect with both new and existing users to explore use cases around policy enforcement, Rego/CEL authoring, and managing large policy sets. Many conversations surfaced opportunities around simplifying best practices and reducing management complexity. The team also promoted participation in an ongoing Gatekeeper/OPA survey to guide future improvements.

ORAS
Representatives: Feynman Zhou and Toddy Mladenov
ORAS engaged developers interested in OCI artifacts beyond container images, including AI/ML models, metadata, backups, and multi-cloud artifact workflows. Attendees appreciated ORAS's ecosystem integrations and found the booth examples useful for understanding how artifacts are tagged, packaged, and distributed. Many users shared how they leverage ORAS with Azure Container Registry and other OCI-compatible registries.

Radius
Representative: Zach Casper
The Radius booth attracted the attention of platform engineers looking for ways to simplify their developers' experience while enforcing enterprise-grade infrastructure and security best practices. Attendees saw demos on deploying a database to Kubernetes and using managed databases from AWS and Azure without modifying the application deployment logic. They also saw a preview of Radius integration with GitHub Copilot, enabling AI coding agents to autonomously deploy and test applications in the cloud.

Conclusion
KubeCon + CloudNativeCon North America 2025 reinforced the essential role of open source communities in driving innovation across cloud-native technologies. Through the Project Pavilion, Microsoft teams were able to exchange knowledge with other maintainers, gather user feedback, and support projects that form foundational components of modern cloud infrastructure. Microsoft remains committed to building alongside the community and strengthening the ecosystem that powers so much of today's cloud-native development.

For anyone interested in exploring or contributing to these open source efforts, please reach out directly to each project's community to get involved, or contact Lexi Nadolski at lexinadolski@microsoft.com for more information.

From Policy to Practice: Built-In CIS Benchmarks on Azure - Flexible, Hybrid-Ready
Security is more important than ever, and the industry standard for secure machine configuration is the Center for Internet Security (CIS) Benchmarks. These benchmarks provide consensus-based, prescriptive guidance to help organizations harden diverse systems, reduce risk, and streamline compliance with major regulatory frameworks and industry standards like NIST, HIPAA, and PCI DSS.

In our previous post, we outlined our plans to improve the Linux server compliance and hardening experience on Azure and shared a vision for integrating CIS Benchmarks. Today, that vision has turned into reality. We're now announcing the next phase of this work: Center for Internet Security (CIS) Benchmarks are now available on Azure for all Azure endorsed distros, at no additional cost to Azure and Azure Arc customers.

With today's announcement, you get access to the CIS Benchmarks on Azure with full parity to what's published by the Center for Internet Security (CIS). You can adjust parameters or define exceptions, tailoring security to your needs and applying consistent controls across cloud, hybrid, and on-premises environments - without having to implement every control manually. Thanks to this flexible architecture, you can truly manage compliance as code.

How we achieve parity
To ensure accuracy and trust, we rely on and ingest CIS machine-readable Benchmark content (OVAL/XCCDF files) as the source of truth. This guarantees that the controls and rules you apply in Azure match the official CIS specifications, reducing drift and ensuring compliance confidence.

What's new under the hood
At the core of this update is azure-osconfig's new compliance engine - a lightweight, open-source module developed by the Azure Core Linux team. It evaluates Linux systems directly against industry-standard benchmarks like CIS, supporting both audit and, in the future, auto-remediation. This enables accurate, scalable compliance checks across large Linux fleets. You can read more about azure-osconfig here.

Dynamic rule evaluation
The new compliance engine supports simple fact-checking operations, evaluation of logic operations on them (e.g., anyOf, allOf), and Lua-based scripting, which allows it to express the complex checks required by the CIS Critical Security Controls - all evaluated natively without external scripts.

Scalable architecture for large fleets
When the assignment is created, the Azure control plane instructs the machine to pull the latest policy package via the Machine Configuration agent. Azure-osconfig's compliance engine is integrated as a lightweight library in the package and is called by the Machine Configuration agent for evaluation, which happens every 15-30 minutes. This ensures near real-time compliance state without overwhelming resources and enables consistent evaluation across thousands of VMs and Azure Arc-enabled servers.

Future-ready for remediation and enforcement
While the Public Preview starts with audit-only mode, the roadmap includes per-rule remediation and enforcement using technologies like eBPF for kernel-level controls. This will allow proactive prevention of configuration drift and runtime hardening at scale. Please reach out if you are interested in auto-remediation or enforcement.

Extensibility beyond CIS Benchmarks
The architecture was designed to support other security and compliance standards as well and isn't limited to CIS Benchmarks. The compliance engine is modular, and we plan to extend the platform with STIG and other relevant industry benchmarks.
This positions Azure as a platform where you can manage your compliance from a single control plane without duplicating efforts elsewhere.

Collaboration with the CIS
This milestone reflects a close collaboration between Microsoft and the CIS to bring industry-standard security guidance into Azure as a built-in capability. Our shared goal is to make cloud-native compliance practical and consistent, while giving customers the flexibility to meet their unique requirements. We are committed to continuously supporting new Benchmark releases, expanding coverage with new distributions, and easing adoption through built-in workflows, such as moving from your current Benchmark version to a new version while preserving your custom configurations.

Certification and trust
We can proudly announce that azure-osconfig has met all the requirements and is officially certified by the CIS for Benchmark assessment, so you can trust compliance results as authoritative. Minor benchmark updates will be applied automatically, while major versions will be released separately. We will include workflows to help migrate customizations seamlessly across versions.

Key Highlights
- Built-in CIS Benchmarks for Azure endorsed Linux distributions
- Full parity with official CIS Benchmarks content, certified by the CIS for Benchmark Assessment
- Flexible configuration: adjust parameters, define exceptions, tune severity
- Hybrid support: enforce the same baseline across Azure, on-prem, and multi-cloud with Azure Arc
- Reporting format in CIS tooling style

Supported use cases
- Certified CIS Benchmarks for all Azure endorsed distros - audit only (L1/L2 server profiles)
- Hybrid/on-premises and other cloud machines with Azure Arc, for the supported distros
- Compliance as code (for example, via GitHub -> Azure OIDC auth and API integration)
- Compatible with the GuestConfig workbook

What's next?
Our next mission is to bring the previously announced auto-remediation capability into this experience, expand the distribution coverage, and elevate our workflows even further. We're focused on empowering you to resolve issues while honoring the unique operational complexity of your environments. Stay tuned!

Get Started
See the documentation for this capability. Enable CIS Benchmarks in Machine Configuration, select the "Official Center for Internet Security (CIS) Benchmarks for Linux Workloads" offering, then select the distributions for your assignment and customize as needed.
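If you prefer to manage the assignment as code rather than through the portal, the flow can be scripted with the Azure CLI. The sketch below is illustrative only: the policy definition ID is a placeholder (look up the actual built-in CIS machine configuration definitions with az policy definition list), and flag requirements may differ depending on the policy effect and scope you use.

    # Hedged sketch: assign a machine configuration policy at resource group
    # scope, then summarize compliance results for that assignment.
    # <cis-benchmark-policy-definition-id> is a placeholder, not a real ID.
    az policy assignment create \
      --name cis-linux-benchmark \
      --scope "/subscriptions/<subscription-id>/resourceGroups/<rg-name>" \
      --policy "<cis-benchmark-policy-definition-id>" \
      --mi-system-assigned \
      --location <region>

    # Review aggregated compliance state for the assignment.
    az policy state summarize \
      --resource-group <rg-name> \
      --filter "policyAssignmentName eq 'cis-linux-benchmark'"

This pairs naturally with the GitHub-to-Azure OIDC flow mentioned in the supported use cases, where the same commands run from a CI pipeline.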
If you would like an additional distribution supported or have any feedback for azure-osconfig, please open an Azure support case or a GitHub issue here.

Relevant Ignite 2025 session: Hybrid workload compliance from policy to practice on Azure

Connect with us at Ignite
Meet the Linux team and stop by the Linux on Azure booth to see these innovations in action:

| Session Type | Session Code | Session Name | Date/Time (PST) |
| --- | --- | --- | --- |
| Theatre | THR 712 | Hybrid workload compliance from policy to practice on Azure | Tue, Nov 18, 3:15 PM – 3:45 PM |
| Breakout | BRK 143 | Optimizing performance, deployments, and security for Linux on Azure | Thu, Nov 20, 1:00 PM – 1:45 PM |
| Breakout | BRK 144 | Build, modernize, and secure AKS workloads with Azure Linux | Wed, Nov 19, 1:30 PM – 2:15 PM |
| Breakout | BRK 104 | From VMs and containers to AI apps with Azure Red Hat OpenShift | Thu, Nov 20, 8:30 AM – 9:15 AM |
| Theatre | THR 701 | From Container to Node: Building Minimal-CVE Solutions with Azure Linux | Wed, Nov 19, 3:30 PM – 4:00 PM |
| Lab | Lab 505 | Fast track your Linux and PostgreSQL migration with Azure Migrate | Tue, Nov 18, 4:30 PM – 5:45 PM; Wed, Nov 19, 3:45 PM – 5:00 PM; Thu, Nov 20, 9:00 AM – 10:15 AM |

eBPF-Powered Observability Beyond Azure: A Multi-Cloud Perspective with Retina
Kubernetes simplifies container orchestration but introduces observability challenges due to dynamic pod lifecycles and complex inter-service communication. eBPF technology addresses these issues by providing deep system insights and efficient monitoring. The open-source Retina project leverages eBPF for comprehensive, cloud-agnostic network observability across AKS, GKE, and EKS, enhancing troubleshooting and optimization through real-world demo scenarios.

Enhancing Observability with Inspektor Gadget
Thorough observability is essential to a pain-free cloud experience. Azure provides many general-purpose observability tools, but you may want to create custom tooling. Inspektor Gadget is an open-source framework that makes customizable data collection easy. Microsoft recently contributed new features to Inspektor Gadget that further enhance its modular framework, making it even easier to meet your specific systems-inspection needs. Of course, we also made it easy for Azure Kubernetes Service (AKS) users to use.
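For readers who want to try it out, here is a minimal, hedged sketch of installing the Inspektor Gadget kubectl plugin and running a tracing gadget against a cluster. The invocation syntax has changed across releases (newer, image-based versions use kubectl gadget run <gadget> instead of the category/gadget form shown below), so consult the documentation for your installed version.

    # Install the CLI via krew and deploy the Inspektor Gadget agents.
    kubectl krew install gadget
    kubectl gadget deploy

    # Trace DNS queries made by pods in the default namespace
    # (older CLI syntax; newer releases use `kubectl gadget run trace_dns`).
    kubectl gadget trace dns -n default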