observability

8 Topics

eBPF-Powered Observability Beyond Azure: A Multi-Cloud Perspective with Retina
Kubernetes simplifies container orchestration but introduces observability challenges due to dynamic pod lifecycles and complex inter-service communication. eBPF technology addresses these issues by providing deep system insights and efficient monitoring. The open-source Retina project leverages eBPF for comprehensive, cloud-agnostic network observability across AKS, GKE, and EKS, enhancing troubleshooting and optimization through real-world demo scenarios.
Simone_Rodigari
Apr 17, 2025 Place Linux and Open Source Blog
983Views
9likes
0Comments
Beyond the Chat Window: How Change-Driven Architecture Enables Ambient AI Agents
AI agents are everywhere now. Powering chat interfaces, answering questions, helping with code. We've gotten remarkably good at this conversational paradigm. But while the world has been focused on chat experiences, something new is quietly emerging: ambient agents. These aren't replacements for chat, they're an entirely new category of AI system that operates in the background, sensing, processing, and responding to the world in real time. And here's the thing, this is a new frontier. The infrastructure we need to build these systems barely exists yet. Or at least, it didn't until now. Two Worlds: Conversational and Ambient Let me paint you a picture of the conversational AI paradigm we know well. You open a chat window. You type a question. You wait. The AI responds. Rinse and repeat. It's the digital equivalent of having a brilliant assistant sitting at a desk, ready to help when you tap them on the shoulder. Now imagine a completely different kind of assistant. One that watches for important changes, anticipates needs, and springs into action without being asked. That's the promise of ambient agents. AI systems that, as LangChain puts it: "listen to an event stream and act on it accordingly, potentially acting on multiple events at a time." This isn't an evolution of chat; it's a fundamentally different interaction paradigm. Both have their place. Chat is great for collaboration and back-and-forth reasoning. Ambient agents excel at continuous monitoring and autonomous response. Instead of human-initiated conversations, ambient agents operate through detecting changes in upstream systems and maintaining context across time without constant prompting. The use cases are compelling and distinct from chat. Imagine a project management assistant that operates in two modes: you can chat with it to ask, "summarize project status", but it also runs in the background, constantly monitoring new tickets that are created, or deployment pipelines that fail, automatically reassigning tasks. Or consider a DevOps agent that you can query conversationally ("what's our current CPU usage?") but also monitors your infrastructure continuously, detecting anomalies and starting remediation before you even know there's a problem. The Challenge: Real-Time Change Detection Here's where building ambient agents gets tricky. While chat-based agents work perfectly within the request-response paradigm, ambient agents need something entirely different: continuous monitoring and real-time change detection. How do you efficiently detect changes across multiple data sources? How do you avoid the performance nightmare of constant polling? How do you ensure your agent reacts instantly when something critical happens? Developers trying to build ambient agents hit the same wall: creating a reliable, scalable change detection system is hard. You either end up with: Polling hell: Constantly querying databases, burning through resources, and still missing changes between polls Legacy system rewrites: Massive expensive multi-year projects to re-write legacy systems so that they produce domain events Webhook spaghetti: Managing dozens of event sources, each with different formats and reliability guarantees This is where the story takes an interesting turn. Enter Drasi: The Change Detection Engine You Didn't Know You Needed Drasi is not another AI framework. Instead, it solves the problem that ambient agents need solved: intelligent change detection. Think of it as the sensory system for your AI agents, the infrastructure that lets them perceive changes in the world. Drasi is built around three simple components: Sources: Connectivity to the systems that Drasi can observe as sources of change (PostgreSQL, MySQL, Cosmos DB, Kubernetes, EventHub) Continuous Queries: Graph-based queries (using Cypher/GQL) that monitor for specific change patterns Reactions: What happens when a continuous query detects changes, or lack thereof But here's the killer feature: Drasi doesn't just detect that something changed. It understands what changed and why it matters, and even if something should have changed but did not. Using continuous queries, you can define complex conditions that your agents care about, and Drasi handles all the plumbing to deliver those insights in real time. The Bridge: langchain-drasi Integration Now, detecting changes is only part of the challenge. You need to connect those changes to your AI agents in a way that makes sense. That's where langchain-drasi comes in, a purpose-built integration that bridges Drasi's change detection with LangChain's agent frameworks. It achieves this by leveraging the Drasi MCP Reaction, which exposes Drasi continuous queries as MCP resources. The integration provides a simple Tool that agents can use to: Discover available queries automatically Read current query results on demand Subscribe to real-time updates that flow directly into agent memory and workflow Here's what this looks like in practice: from langchain_drasi import create_drasi_tool, MCPConnectionConfig # Configure connection to Drasi MCP server mcp_config = MCPConnectionConfig(server_url="http://localhost:8083") # Create the tool with notification handlers drasi_tool = create_drasi_tool( mcp_config=mcp_config, notification_handlers=[buffer_handler, console_handler] ) # Now your agent can discover and subscribe to data changes # No more polling, no more webhooks, just reactive intelligence The beauty is in the notification handlers: pre-built components that determine how changes flow into your agent's consciousness: BufferHandler: Queues changes for sequential processing LangGraphMemoryHandler: Automatically integrates changes into agent checkpoints LoggingHandler: Integrates with standard logging infrastructure This isn't just plumbing; it's the foundation for what we might call "change-driven architecture" for AI systems. Example: The Seeker Agent Has Entered the Chat Let's make this concrete with my favorite example from the langchain-drasi repository: a hide and seek inspired non-player character (NPC) AI agent that seeks human players in a multi-player game environment. The Scenario Imagine a game where players move around a 2D map, updating their positions in a PostgreSQL database. But here's the twist: the NPC agent doesn't have omniscient vision. It can only detect players under specific conditions: Stationary targets: When a player doesn't move for more than 3 seconds (they're exposed) Frantic movement: When a player moves more than once in less than a second (panicking reveals your position) This creates interesting strategic gameplay, players must balance staying still (safe from detection but vulnerable if found) with moving carefully (one move per second is the sweet spot). The NPC agent seeks based on these glimpses of player activity. These detection rules are defined as Drasi continuous queries that monitor the player positions table. For reference, these are the two continuous queries we will use: When a player doesn't move for more than 3 seconds, this is a great example of detecting the absence of change use the trueLater function: MATCH (p:player { type: 'human' }) WHERE drasi.trueLater( drasi.changeDateTime(p) <= (datetime.realtime() - duration( { seconds: 3 } )), drasi.changeDateTime(p) + duration( { seconds: 3 } ) ) RETURN p.id, p.x, p.y When a player moves more than once in less than a second is an example of using the previousValue function to compare that current state with a prior state: MATCH (p:player { type: 'human' }) WHERE drasi.changeDateTime(p).epochMillis - drasi.previousValue(drasi.changeDateTime(p).epochMillis) < 1000 RETURN p.id, p.x, p.y Here's the neat part: you can dynamically adjust the game's difficulty by adding or removing queries with different conditions; no code changes required, just deploy new Drasi queries. The traditional approach would have your agent constantly polling the data source checking these conditions: "Any player moves? How about now? Now? Now?" The Workflow in Action The agent operates through a LangGraph based state machine with two distinct phases: 1. Setup Phase (First Run Only) Setup queries prompt - Prompts the AI model to discover available Drasi queries Setup queries call model - AI model calls the Drasi tool with discover operation Setup queries tools - Executes the Drasi tool calls to subscribe to relevant queries This phase loops until the AI model has discovered and subscribed to all relevant queries 2. Main Seeking Loop (Continuous) Check sensors - Consumes any new Drasi notifications from the buffer into the workflow state Evaluate targets - Uses AI model to parse sensor data and extract target positions Select and plan - Selects closest target and plans path Execute move - Executes the next move via game API Loop continues indefinitely, reacting to new notifications No polling. No delays. No wasted resources checking positions that don't meet the detection criteria. Just pure, reactive intelligence flowing from meaningful data changes to agent actions. The continuous queries act as intelligent filters, only alerting the agent when relevant changes occur. Click here for the full implementation The Bigger Picture: Change-Driven Architecture What we're seeing with Drasi and ambient agents isn't just a new tool, it's a new architectural pattern for AI systems. The core idea is profound: AI agents can react to the world changing, not just wait to be asked about it. This pattern enables entirely new categories of applications that complement traditional chat interfaces. The example might seem playful, but it demonstrates that AI agents can perceive and react to their environment in real time. Today it's seeking players in a game. Tomorrow it could be: Managing city traffic flows based on real-time sensor data Coordinating disaster response as situations evolve Optimizing supply chains as demand patterns shift Protecting networks as threats emerge The change detection infrastructure is here. The patterns are emerging. The only question is: what will you build? Where to Go from Here Ready to dive deeper? Here are your next steps: Explore Drasi: Head to drasi.io and discover the power of the change detection platform Try langchain-drasi: Clone the GitHub repository and run the Hide-and-Seek example yourself Join the conversation: The space is new and needs diverse perspectives. Join the community on Discord. Let us know if you have built ambient agents and what challenges you faced with real-time change detection.
CollinBrian
Dec 03, 2025 Place Linux and Open Source Blog
266Views
2likes
0Comments
Troubleshooting Network Issues with Retina
An overview of how Retina solves key challenges of performing packet captures in a Kubernetes environment, and additional debug tools which are provided within the Retina Shell.
kamilp
Aug 22, 2025 Place Linux and Open Source Blog
939Views
2likes
0Comments
Project Pavilion Presence at KubeCon NA 2025
KubeCon + CloudNativeCon NA took place in Atlanta, Georgia, from 10-13 November, and continued to highlight the ongoing growth of the open source, cloud-native community. Microsoft participated throughout the event and supported several open source projects in the Project Pavilion. Microsoft’s involvement reflected our commitment to upstream collaboration, open governance, and enabling developers to build secure, scalable and portable applications across the ecosystem. The Project Pavilion serves as a dedicated, vendor-neutral space on the KubeCon show floor reserved for CNCF projects. Unlike the corporate booths, it focuses entirely on open source collaboration. It brings maintainers and contributors together with end users for hands-on demos, technical discussions, and roadmap insights. This space helps attendees discover emerging technologies and understand how different projects fit into the cloud-native ecosystem. It plays a critical role for idea exchanges, resolving challenges and strengthening collaboration across CNCF approved technologies. Why Our Presence Matters KubeCon NA remains one of the most influential gatherings for developers and organizations shaping the future of cloud-native computing. For Microsoft, participating in the Project Pavilion helps advance our goals of: Open governance and community-driven innovation Scaling vital cloud-native technologies Secure and sustainable operations Learning from practitioners and adopters Enabling developers across clouds and platforms Many of Microsoft’s products and cloud services are built on or aligned with CNCF and open-source technologies. Being active within these communities ensures that we are contributing back to the ecosystem we depend on and designing by collaborating with the community, not just for it. Microsoft-Supported Pavilion Projects containerd Representative: Wei Fu The containerd team engaged with project maintainers and ecosystem partners to explore solutions for improving AI model workflows. A key focus was the challenge of handling large OCI artifacts (often 500+ GiB) used in AI training workloads. Current image-pulling flows require containerd to fetch and fully unpack blobs, which significantly delays pod startup for large models. Collaborators from Docker, NTT, and ModelPack discussed a non-unpacking workflow that would allow training workloads to consume model data directly. The team plans to prototype this behavior as an experimental feature in containerd. Additional discussions included updates related to nerdbox and next steps for the erofs snapshotter. Copacetic Representative: Joshua Duffney The Copa booth attracted roughly 75 attendees, with strong representation from federal agencies and financial institutions, a sign of growing adoption in regulated industries. A lightning talk delivered at the conference significantly boosted traffic and engagement. Key feedback and insights included: High interest in customizable package update sources Demand for application-level patching beyond OS-level updates Need for clearer CI/CD integration patterns Expectations around in-cluster image patching Questions about runtime support, including Podman The conversations revealed several documentation gaps and feature opportunities that will inform Copa’s roadmap and future enablement efforts. Drasi Representative: Nandita Valsan KubeCon NA 2025 marked Drasi’s first in-person presence since its launch in October 2024 and its entry into the CNCF Sandbox in early 2025. With multiple kiosk slots, the team interacted with ~70 visitors across shifts. Engagement highlights included: New community members joining the Drasi Discord and starring GitHub repositories Meaningful discussions with observability and incident management vendors interested in change-driven architectures Positive reception to Aman Singh’s conference talk, which led attendees back to the booth for deeper technical conversations Post-event follow-ups are underway with several sponsors and partners to explore collaboration opportunities. Flatcar Container Linux Representatives: Sudhanva Huruli and Vamsi Kavuru The Flatcar project had some fantastic conversations at the pavilion. Attendees were eager to learn about bare metal provisioning, GPU support for AI workloads, and how Flatcar’s fully automated build and test process keeps things simple and developer friendly. Questions around Talos vs. Flatcar and CoreOS sparked lively discussions, with the team emphasizing Flatcar’s usability and independence from an OS-level API. Interest came from government agencies and financial institutions, and the preview of Flatcar on AKS opened the door to deeper conversations about real-world adoption. The Project Pavilion proved to be the perfect venue for authentic, technical exchanges. Flux Representatives: Dipti Pai The Flux booth was active throughout all three days of the Project Pavilion, where Microsoft joined other maintainers to highlight new capabilities in Flux 2.7, including improved multi-tenancy, enhanced observability, and streamlined cloud-native integrations. Visitors shared real-world GitOps experiences, both successes and challenges, which provided valuable insights for the project’s ongoing development. Microsoft’s involvement reinforced strong collaboration within the Flux community and continued commitment to advancing GitOps practices. Headlamp Representatives: Joaquim Rocha, Will Case, and Oleksandr Dubenko Headlamp had a booth for all three days of the conference, engaging with both longstanding users and first-time attendees. The increased visibility from becoming a Kubernetes sub-project was evident, with many attendees sharing their usage patterns across large tech organizations and smaller industrial teams. The booth enabled maintainers to: Gather insights into how teams use Headlamp in different environments Introduce Headlamp to new users discovering it via talks or hallway conversations Build stronger connections with the community and understand evolving needs Inspektor Gadget Representatives: Jose Blanquicet and Mauricio Vásquez Bernal Hosting a half-day kiosk session, Inspektor Gadget welcomed approximately 25 visitors. Attendees included newcomers interested in learning the basics and existing users looking for updates. The team showcased new capabilities, including the tcpdump gadget and Prometheus metrics export, and invited visitors to the upcoming contribfest to encourage participation. Istio Representatives: Keith Mattix, Jackie Maertens, Steven Jin Xuan, Niranjan Shankar, and Mike Morris The Istio booth continued to attract a mix of experienced adopters and newcomers seeking guidance. Technical discussions focused on: Enhancements to multicluster support in ambient mode Migration paths from sidecars to ambient Improvements in Gateway API availability and usage Performance and operational benefits for large-scale deployments Users, including several Azure customers, expressed appreciation for Microsoft’s sustained investment in Istio as part of their service mesh infrastructure. Notary Project Representative: Feynman Zhou and Toddy Mladenov The Notary Project booth saw significant interest from practitioners concerned with software supply chain security. Attendees discussed signing, verification workflows, and integrations with Azure services and Kubernetes clusters. The conversations will influence upcoming improvements across Notary Project and Ratify, reinforcing Microsoft’s commitment to secure artifacts and verifiable software distribution. Open Policy Agent (OPA) - Gatekeeper Representative: Jaydip Gabani The OPA/Gatekeeper booth enabled maintainers to connect with both new and existing users to explore use cases around policy enforcement, Rego/CEL authoring, and managing large policy sets. Many conversations surfaced opportunities around simplifying best practices and reducing management complexity. The team also promoted participation in an ongoing Gatekeeper/OPA survey to guide future improvements. ORAS Representative: Feynman Zhou and Toddy Mladenov ORAS engaged developers interested in OCI artifacts beyond container images which includes AI/ML models, metadata, backups, and multi-cloud artifact workflows. Attendees appreciated ORAS’s ecosystem integrations and found the booth examples useful for understanding how artifacts are tagged, packaged, and distributed. Many users shared how they leverage ORAS with Azure Container Registry and other OCI-compatible registries. Radius Representative: Zach Casper The Radius booth attracted the attention of platform engineers looking for ways to simplify their developer's experience while being able to enforce enterprise-grade infrastructure and security best practices. Attendees saw demos on deploying a database to Kubernetes and using managed databases from AWS and Azure without modifying the application deployment logic. They also saw a preview of Radius integration with GitHub Copilot enabling AI coding agents to autonomously deploy and test applications in the cloud. Conclusion KubeCon + CloudNativeCon North America 2025 reinforced the essential role of open source communities in driving innovation across cloud native technologies. Through the Project Pavilion, Microsoft teams were able to exchange knowledge with other maintainers, gather user feedback, and support projects that form foundational components of modern cloud infrastructure. Microsoft remains committed to building alongside the community and strengthening the ecosystem that powers so much of today’s cloud-native development. For anyone interested in exploring or contributing to these open source efforts, please reach out directly to each project’s community to get involved, or contact Lexi Nadolski at lexinadolski@microsoft.com for more information.
lexinadolski
Nov 25, 2025 Place Linux and Open Source Blog
211Views
1like
0Comments
Automating the Linux Quality Assurance with LISA on Azure
Introduction Building on the insights from our previous blog regarding how MSFT ensures the quality of Linux images, this article aims to elaborate on the open-source tools that are instrumental in securing exceptional performance, reliability, and overall excellence of virtual machines on Azure. While numerous testing tools are available for validating Linux kernels, guest OS images and user space packages across various cloud platforms, finding a comprehensive testing framework that addresses the entire platform stack remains a significant challenge. A robust framework is essential, one that seamlessly integrates with Azure's environment while providing the coverage for major testing tools, such as LTP and kselftest and covers critical areas like networking, storage and specialized workloads, including Confidential VMs, HPC, and GPU scenarios. This unified testing framework is invaluable for developers, Linux distribution providers, and customers who build custom kernels and images. This is where LISA (Linux Integration Services Automation) comes into play. LISA is an open-source tool specifically designed to automate and enhance the testing and validation processes for Linux kernels and guest OS images on Azure. In this blog, we will provide the history of LISA, its key advantages, the wide range of test cases it supports, and why it is an indispensable resource for the open-source community. Moreover, LISA is available under the MIT License, making it free to use, modify, and contribute. History of LISA LISA was initially developed as an internal tool by Microsoft to streamline the testing process of Linux images and kernel validations on Azure. Recognizing the value it could bring to the broader community, Microsoft open-sourced LISA, inviting developers and organizations worldwide to leverage and enhance its capabilities. This move aligned with Microsoft's growing commitment to open-source collaboration, fostering innovation and shared growth within the industry. LISA serves as a robust solution to validate and certify that Linux images meet the stringent requirements of modern cloud environments. By integrating LISA into the development and deployment pipeline, teams can: Enhance Quality Assurance: Catch and resolve issues early in the development cycle. Reduce Time to Market: Accelerate deployment by automating repetitive testing tasks. Build Trust with Users: Deliver stable and secure applications, bolstering user confidence. Collaborate and Innovate: Leverage community-driven improvements and share insights. Benefits of Using LISA Scalability: Designed to run large-scale test cases, from 1 test case to 10k test cases in one command. Multiple platform orchestration: LISA is created with modular design, to support run the same test cases on various platforms including Microsoft Azure, Windows HyperV, BareMetal, and other cloud-based platforms. Customization: Users can customize test cases, workflow, and other components to fit specific needs, allowing for targeted testing strategies. It’s like building kernels on-the-fly, sending results to custom database, etc. Community Collaboration: Being open source under the MIT License, LISA encourages community contributions, fostering continuous improvement and shared expertise. Extensive Test Coverage: It offers a rich suite of test cases covering various aspects of compatibility of Azure and Linux VMs, from kernel, storage, networking to middleware. How it works Infrastructure LISA is designed to be componentized and maximize compatibility with different distros. Test cases can focus only on test logic. Once test requirements (machines, CPU, memory, etc) are defined, just write the test logic without worrying about environment setup or stopping services on different distributions. Orchestration. LISA uses platform APIs to create, modify and delete VMs. For example, LISA uses Azure API to create VMs, run test cases, and delete VMs. During the test case running, LISA uses Azure API to collect serial log and can hot add/remove data disks. If other platforms implement the same serial log and data disk APIs, the test cases can run on the other platforms seamlessly. Ensure distro compatibility by abstracting over 100 commands in test cases, allowing focus on validation logic rather than distro compatibility. Pre-processing workflow assists in building the kernel on-the-fly, installing the kernel from package repositories, or modifying all test environments. Test matrix helps one run to test all. For example, one run can test different vm sizes on Azure, or different images, even different VM sizes and different images together. Anything is parameterizable, can be tested in a matrix. Customizable notifiers enable the saving of test results and files to any type of storage and database. Agentless and low dependency LISA operates test systems via SSH without requiring additional dependencies, ensuring compatibility with any system that supports SSH. Although some test cases require installing extra dependencies, LISA itself does not. This allows LISA to perform tests on systems with limited resources or even different operating systems. For instance, LISA can run on Linux, FreeBSD, Windows, and ESXi. Getting Started with LISA Ready to dive in? Visit the LISA project at aka.ms/lisa to access the documentation. Install: Follow the installation guide provided in the repository to set up LISA in your testing environment. Run: Follow the instructions to run LISA on local machine, Azure or existing systems. Extend: Follow the documents to extend LISA by test cases, data sources, tools, platform, workflow, etc. Join the Community: Engage with other users and contributors through forums and discussions to share experiences and best practices. Contribute: Modify existing test cases or create new ones to suit your needs. Share your contributions with the community to enhance LISA's capabilities. Conclusion LISA offers open-source collaborative testing solutions designed to operate across diverse environments and scenarios, effectively narrowing the gap between enterprise demands and community-led innovation. By leveraging LISA, customers can ensure their Linux deployments are reliable and optimized for performance. Its comprehensive testing capabilities, combined with the flexibility and support of an active community, make LISA an indispensable tool for anyone involved in Linux quality assurance and testing. Your feedback is invaluable, and we would greatly appreciate your insights.
KashanK
Jan 28, 2025 Place Linux and Open Source Blog
588Views
1like
0Comments