troubleshooting
7 TopicseBPF-Powered Observability Beyond Azure: A Multi-Cloud Perspective with Retina
Kubernetes simplifies container orchestration but introduces observability challenges due to dynamic pod lifecycles and complex inter-service communication. eBPF technology addresses these issues by providing deep system insights and efficient monitoring. The open-source Retina project leverages eBPF for comprehensive, cloud-agnostic network observability across AKS, GKE, and EKS, enhancing troubleshooting and optimization through real-world demo scenarios.926Views9likes0CommentsInnovations and Strengthening Platforms Reliability Through Open Source
The Linux Systems Group (LSG) at Microsoft is the team building OS innovations in Azure enabling secure and high-performance platforms that power millions of workloads worldwide. From providing the OS for Boost, optimizing Linux kernels for hyperscale environments or contributing to open-source projects like Rust-VMM and Cloud Hypervisor, LSG ensures customers get the best of Linux on Azure. Our work spans performance tuning, security hardening, and feature enablement for new silicon enablement and cutting-edge technologies, such as Confidential Computing, ARM64 and Nvidia Grace Blackwell all while strengthening the global open-source ecosystem. Our philosophy is simple: we develop in the open and upstream first, integrating improvements into our products after they’ve been accepted by the community. At Ignite we like to highlight a few open-source key contributions in 2025 that are the foundations for many product offerings and innovations you will see during the whole week. We helped bring seamless kernel update features (Kexec HandOver) to the Linux kernel, improved networking paths for AI platforms, strengthened container orchestration and security efforts, and shared engineering insights with global communities and conferences. This work reflects Microsoft’s long-standing commitment to open source, grounded in active upstream participation and close collaboration with partners across the ecosystem. Our engineers work side-by-side with maintainers, Linux distro partners, and silicon providers to ensure contributions land where they help the most, from kernel updates to improvements that support new silicon platforms. Linux Kernel Contributions Enabling Seamless Kernel Updates: Persistent uptime for critical services is a top priority. This year, Microsoft engineer Mike Rapoport successfully merged Kexec HandOver (KHO) into Linux 6.16 1 . KHO is a kernel mechanism that preserves memory state across a reboot (kexec), allowing systems to carry over important data when loading a new kernel. In practice, this means Microsoft can apply security patches or kernel updates to Azure platform and customers VMs without rebooting or with significantly reduced downtime. It’s a technical achievement with real impact: cloud providers and enterprises can update Linux on the fly, enhancing security and reliability for services that demand continuous availability. Optimizing Network Drivers for AI Scale: Massive AI models require massive bandwidth. Working closely with our partners deploying large AI workloads on Azure, LSG engineers delivered a breakthrough in Linux networking performance. LSG team rearchitected the receive path of the MANA network driver (used by our smart NICs) to eliminate wasted memory and enable recycling of buffers. 2x higher effective network throughput on 64 KB page systems 35% better memory efficiency for RX buffers 15% higher throughput and roughly half the memory use even on standard x86_64 VMs References MANA RX optimization patch: net: mana: Use page pool fragments for RX buffers LKML Linux Plumbers 2025 talk: Optimizing traffic receive (RX) path in Linux kernel MANA Driver for larger PAGE_SIZE systems Improving Reliability for Cloud Networking: In addition to raw performance, reliability got a boost. One critical fix addressed a race condition in the Hyper-V hv_netvsc driver that sometimes caused packet loss when a VM’s network channel initialized. By patching this upstream, we improved network stability for all Linux guests running on Hyper-V keeping customer VMs running smoothly during dynamic operations like scale-out or live migrations. Our engineers also upstreamed numerous improvements to Hyper-V device drivers (covering storage, memory, and general virtualization).We fixed interrupt handling bugs, eliminated outdated patches, and resolved issues affecting ARM64 architectures. Each of these fixes was contributed to the mainline kernel, ensuring that any Linux distribution running on Hyper-V or Azure benefits from the enhanced stability and performance. References Upstream fix: hv_netvsc race on early receive events: kernel.org commit referenced by Ubuntu bug Launchpad Ubuntu Azure backport write-up: Bug 2127705 – hv_netvsc: fix loss of early receive events from host during channel open Launchpad Older background on hv_netvsc packet-loss issues: kernel.org bug 81061 Strengthening Core Linux Infrastructure: Several of our contributions targeted fundamental kernel subsystems that all Linux users rely on. For example, we led significant enhancements to the Virtual File System (VFS) layer reworking how Linux handles process core dumps and expanding file management capabilities. These changes improve how Linux handles files and memory under the hood, benefiting scenarios from large-scale cloud storage to local development. We also continued upstream efforts to support advanced virtualization features.Our team is actively upstreaming the mshv_vtl driver (for managing secure partitions on Hyper-V) and improving Linux’s compatibility with nested virtualization on Azure’s Microsoft Hypervisor (MSHV). All this low-level work adds up to a more robust and feature-rich kernel for everyone. References Example VFS coredump work: split file coredumping into coredump_file() mshv_vtl driver patchset: Drivers: hv: Introduce new driver – mshv_vtl (v10) and v12 patch series on patchew Bolstering Linux Security in the Cloud: Security has been a major thread across our upstream contributions. One focus area is making container workloads easier to verify and control. Microsoft engineers proposed an approach for code integrity in containers built on containerd’s EROFS snapshotter, shared as an open RFC in the containerd project -GitHub. The idea is to use read-only images plus integrity metadata so that container file systems can be measured and checked against policy before they run. We also engaged deeply with industry partners on kernel vulnerability handling. Through the Cloud-LTS Linux CVE workgroup, cloud providers and vendors collaborate in the open on a shared analysis of Linux CVEs. The group maintains a public repository that records how each CVE affects various kernels and configurations, which helps reduce duplicated triage work and speeds up security responses. On the platform side, our engineers contributed fixes to the OP-TEE secure OS used in trusted execution and secure-boot scenarios, making sure that the cryptographic primitives required by Azure’s Linux boot flows behave correctly across supported devices. These changes help ensure that Linux verified boot chains remain reliable on Azure hardware. References containerd RFC: Code Integrity for OCI/containerd Containers using erofs-snapshotter GitHub Cloud-LTS public CVE analysis repo: cloud-lts/linux-cve-analysis Linux CVE workgroup session at Linux Plumbers 2025: Linux CVE workgroup OP-TEE project docs: OP-TEE documentation Developer Tools & Experience Smoother OS Management with Systemd: Ensuring Linux works seamlessly on Azure scale. The core init system systemd saw important improvements from our team this year. LSG contributed and merged upstream support for disk quota controls in systemd services. With new directives (like StateDirectoryQuota and CacheDirectoryQuota), administrators can easily enforce storage limits for service data, which is especially useful in scenarios like IoT devices with eMMC storage on Azure’s custom SoCs. In addition, Sea-Team added an auto-reload feature to systemd-journald, allowing log configuration changes to apply at runtime without restarting the logging service . These improvements, now part of upstream systemd, help Azure and other Linux environments perform updates or maintenance with minimal disruption to running services. These improvements help Azure and other environments roll out configuration updates with less impact on running workloads. References systemd quota directives: systemd.exec(5) – StateDirectoryQuota and related options systemd journald reload behavior: systemd-journald.service(8) Empowering Linux Quality at Scale: Running Linux on Azure at global scale requires extensive, repeatable testing. Microsoft continues to invest in LISA (Linux Integration Services Automation), an open-source framework that validates Linux kernels and distributions on Azure and other Hyper-V–based environments. Over the past year we expanded LISA with: New stress tests for rapid reboot sequences to catch elusive timing bugs Better failure diagnostics to make complex issues easier to root-cause Extended coverage for ARM64 scenarios and technologies like InfiniBand networking Integration of Azure VM SKU metadata and policy checks so that image validation can automatically confirm conformance to Azure requirements These changes help us qualify new kernels, distributions, and VM SKUs before they are shipped to customers. Because LISA is open source, partners and Linux vendors can run the same tests and share results, which raises quality across the ecosystem. References LISA GitHub repo: microsoft/lisa LISA documentation: Welcome to Linux Integration Services Automation LISA Documentation Community Engagement and Leadership Sharing Knowledge Globally: Open-source contribution is not just about code - it’s about people and knowledge exchange. Our team members took active roles in community events worldwide, reflecting Microsoft’s growing leadership in the Linux community. We were proud to be a Platinum Sponsor of the inaugural Open Source Summit India 2025 in Hyderabad, where LSG engineers served on the program committee and hosted technical sessions. At Linux Security Summit Europe 2025, Microsoft’s security experts shaped the agenda as program committee members, delivered talks (such as “The State of SELinux”), and even led panel discussions alongside colleagues from Intel, Arm, and others. And in Paris at Kernel Recipes 2025, our own SMEs shared kernel insights with fellow developers. By engaging in these events, Microsoft not only contributes code but also helps guide the conversation on the future of Linux. These relationships and public interactions build mutual trust and ensure that we remain closely aligned with community priorities. References Event: Open Source Summit India 2025 – Linux Foundation Paul Moore’s talk archive: LSS-EU 2025 Conference: Kernel Recipes 2025 and Kernel Recipes 2025 schedule Closing Thoughts Microsoft’s long-term commitment to open source remains strong, and the Linux Systems Group will continue contributing upstream, collaborating across the industry, and supporting the upstream communities that shape the technologies we rely on. Our work begins in upstream projects such as the Linux kernel, Kubernetes, and systemd, where improvements are shared openly before they reach Azure. The progress highlighted in this blog was made possible by the wider Linux community whose feedback, reviews, and shared ideas help refine every contribution. As we move ahead, we welcome maintainers, developers, and enterprise teams to engage with our projects, offer input, and collaborate with us. We will continue contributing code, sharing knowledge, and supporting the open-source technologies that power modern computing, working with the community to strengthen the foundation and shape a future that benefits everyone. References & Resources: Microsoft’s Open-Source Journey – Azure Blog https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/linux-and-open-source-on-azure-quarterly-update-february-2025/ba-p/4382722 Cloud Hypervisor Project Rust-VMM Community Microsoft LISA (Linux Integration Services Automation) Repository Cloud-LTS Linux CVE Analysis Project340Views1like0CommentsAutomating the Linux Quality Assurance with LISA on Azure
Introduction Building on the insights from our previous blog regarding how MSFT ensures the quality of Linux images, this article aims to elaborate on the open-source tools that are instrumental in securing exceptional performance, reliability, and overall excellence of virtual machines on Azure. While numerous testing tools are available for validating Linux kernels, guest OS images and user space packages across various cloud platforms, finding a comprehensive testing framework that addresses the entire platform stack remains a significant challenge. A robust framework is essential, one that seamlessly integrates with Azure's environment while providing the coverage for major testing tools, such as LTP and kselftest and covers critical areas like networking, storage and specialized workloads, including Confidential VMs, HPC, and GPU scenarios. This unified testing framework is invaluable for developers, Linux distribution providers, and customers who build custom kernels and images. This is where LISA (Linux Integration Services Automation) comes into play. LISA is an open-source tool specifically designed to automate and enhance the testing and validation processes for Linux kernels and guest OS images on Azure. In this blog, we will provide the history of LISA, its key advantages, the wide range of test cases it supports, and why it is an indispensable resource for the open-source community. Moreover, LISA is available under the MIT License, making it free to use, modify, and contribute. History of LISA LISA was initially developed as an internal tool by Microsoft to streamline the testing process of Linux images and kernel validations on Azure. Recognizing the value it could bring to the broader community, Microsoft open-sourced LISA, inviting developers and organizations worldwide to leverage and enhance its capabilities. This move aligned with Microsoft's growing commitment to open-source collaboration, fostering innovation and shared growth within the industry. LISA serves as a robust solution to validate and certify that Linux images meet the stringent requirements of modern cloud environments. By integrating LISA into the development and deployment pipeline, teams can: Enhance Quality Assurance: Catch and resolve issues early in the development cycle. Reduce Time to Market: Accelerate deployment by automating repetitive testing tasks. Build Trust with Users: Deliver stable and secure applications, bolstering user confidence. Collaborate and Innovate: Leverage community-driven improvements and share insights. Benefits of Using LISA Scalability: Designed to run large-scale test cases, from 1 test case to 10k test cases in one command. Multiple platform orchestration: LISA is created with modular design, to support run the same test cases on various platforms including Microsoft Azure, Windows HyperV, BareMetal, and other cloud-based platforms. Customization: Users can customize test cases, workflow, and other components to fit specific needs, allowing for targeted testing strategies. It’s like building kernels on-the-fly, sending results to custom database, etc. Community Collaboration: Being open source under the MIT License, LISA encourages community contributions, fostering continuous improvement and shared expertise. Extensive Test Coverage: It offers a rich suite of test cases covering various aspects of compatibility of Azure and Linux VMs, from kernel, storage, networking to middleware. How it works Infrastructure LISA is designed to be componentized and maximize compatibility with different distros. Test cases can focus only on test logic. Once test requirements (machines, CPU, memory, etc) are defined, just write the test logic without worrying about environment setup or stopping services on different distributions. Orchestration. LISA uses platform APIs to create, modify and delete VMs. For example, LISA uses Azure API to create VMs, run test cases, and delete VMs. During the test case running, LISA uses Azure API to collect serial log and can hot add/remove data disks. If other platforms implement the same serial log and data disk APIs, the test cases can run on the other platforms seamlessly. Ensure distro compatibility by abstracting over 100 commands in test cases, allowing focus on validation logic rather than distro compatibility. Pre-processing workflow assists in building the kernel on-the-fly, installing the kernel from package repositories, or modifying all test environments. Test matrix helps one run to test all. For example, one run can test different vm sizes on Azure, or different images, even different VM sizes and different images together. Anything is parameterizable, can be tested in a matrix. Customizable notifiers enable the saving of test results and files to any type of storage and database. Agentless and low dependency LISA operates test systems via SSH without requiring additional dependencies, ensuring compatibility with any system that supports SSH. Although some test cases require installing extra dependencies, LISA itself does not. This allows LISA to perform tests on systems with limited resources or even different operating systems. For instance, LISA can run on Linux, FreeBSD, Windows, and ESXi. Getting Started with LISA Ready to dive in? Visit the LISA project at aka.ms/lisa to access the documentation. Install: Follow the installation guide provided in the repository to set up LISA in your testing environment. Run: Follow the instructions to run LISA on local machine, Azure or existing systems. Extend: Follow the documents to extend LISA by test cases, data sources, tools, platform, workflow, etc. Join the Community: Engage with other users and contributors through forums and discussions to share experiences and best practices. Contribute: Modify existing test cases or create new ones to suit your needs. Share your contributions with the community to enhance LISA's capabilities. Conclusion LISA offers open-source collaborative testing solutions designed to operate across diverse environments and scenarios, effectively narrowing the gap between enterprise demands and community-led innovation. By leveraging LISA, customers can ensure their Linux deployments are reliable and optimized for performance. Its comprehensive testing capabilities, combined with the flexibility and support of an active community, make LISA an indispensable tool for anyone involved in Linux quality assurance and testing. Your feedback is invaluable, and we would greatly appreciate your insights.547Views1like0Comments