<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Linux and Open Source Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/bg-p/LinuxandOpenSourceBlog</link>
    <description>Linux and Open Source Blog articles</description>
    <pubDate>Mon, 01 Jun 2026 23:49:53 GMT</pubDate>
    <dc:creator>LinuxandOpenSourceBlog</dc:creator>
    <dc:date>2026-06-01T23:49:53Z</dc:date>
    <item>
      <title>Four open source projects to explore at Microsoft Build</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/four-open-source-projects-to-explore-at-microsoft-build/ba-p/4523744</link>
      <description>&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Open source is where developers experiment, collaborate, and turn&amp;nbsp;new ideas&amp;nbsp;into tools that others can build on. At&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://build.microsoft.com/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Microsoft Build&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;,&amp;nbsp;we’re&amp;nbsp;creating a dedicated space for that energy: the&amp;nbsp;Open Source&amp;nbsp;Zone.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This year, the&amp;nbsp;Open Source&amp;nbsp;Zone will bring together maintainers, contributors, and developers working on some of the most interesting&amp;nbsp;open source&amp;nbsp;projects in AI. Whether&amp;nbsp;you’re&amp;nbsp;building agents, experimenting with local models, exploring prompt workflows, or looking for practical ways to bring AI into your development process, this is a place to meet the people behind the projects and see what&amp;nbsp;they’re&amp;nbsp;building.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The&amp;nbsp;Open Source&amp;nbsp;Zone is inspired by similar community spaces&amp;nbsp;we’ve&amp;nbsp;hosted at GitHub Universe: hands-on, conversation-driven, and centered on the people and projects moving open source forward.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2 aria-level="2"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Meet the projects&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H2&gt;
&lt;H3 aria-level="3"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;OpenClaw&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://openclaw.ai/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;OpenClaw&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;, originally&amp;nbsp;Clawbot, formerly&amp;nbsp;Clawdbot&amp;nbsp;and briefly&amp;nbsp;Moltbot before&amp;nbsp;landing on its current name (because naming is hard), is a&amp;nbsp;personal AI assistant project built for developers who want more control over how AI agents run across tools, devices, and workflows. Its repository describes it as “your own personal AI assistant” across operating systems and platforms, with support for agent workspaces, skills, and device nodes.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;It has also become one of the fastest-growing&amp;nbsp;open source&amp;nbsp;projects on GitHub, with&amp;nbsp;over 370,000 stars to date.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;At the&amp;nbsp;Open Source&amp;nbsp;Zone, attendees can learn how&amp;nbsp;OpenClaw&amp;nbsp;approaches personal agents, extensibility, and local-first experimentation.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3 aria-level="3"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;AutoGPT&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://agpt.co/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;AutoGPT&lt;/SPAN&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;is one of the best-known&amp;nbsp;open source&amp;nbsp;projects in the autonomous agent space. The project’s mission is to make AI accessible for everyone to use and build on, with tools for building, testing, and delegating work to agents.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Visit&amp;nbsp;AutoGPT&amp;nbsp;in the&amp;nbsp;Open Source&amp;nbsp;Zone to learn how the project is evolving agent development, benchmarking, frontend experiences, and practical workflows for building agent-powered applications.&amp;nbsp;Come for the autonomous agents; stay for the very human maintainers.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;AutoGPT&amp;nbsp;is also a&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://agpt.co/blog/autogpt-partners-with-github-for-ai-security" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;member of GitHub’s Secure&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Open Source&lt;/SPAN&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;&amp;nbsp;Fund&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;, with a goal of enhancing AI security across the&amp;nbsp;open source&amp;nbsp;ecosystem.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3 aria-level="3"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;Open&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;WebUI&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://openwebui.com/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Open WebUI&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;is a self-hosted, extensible AI platform for working with large language models. The project supports&amp;nbsp;Ollama&amp;nbsp;and OpenAI-compatible APIs and includes built-in RAG capabilities, making it a strong&amp;nbsp;option&amp;nbsp;for developers and organizations exploring local, private, or provider-flexible AI experiences.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;At Build, the Open&amp;nbsp;WebUI&amp;nbsp;team will show how developers can run, customize, and extend AI interfaces for their own environments.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3 aria-level="3"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 3"&gt;prompts.chat&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://prompts.chat/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;prompts.chat&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;,&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;formerly Awesome ChatGPT Prompts, is a curated collection of prompt examples for AI chat models. The project is designed to help people discover, share, and build better prompts for modern AI assistants.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Created by&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://stars.github.com/profiles/f/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Fatih Kadir Akın&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;, a&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://stars.github.com/" target="_blank" rel="noopener"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;GitHub Star&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;from Istanbul,&amp;nbsp;prompts.chat&amp;nbsp;reflects his work at the intersection of open source, developer education, and AI-assisted development. Fatih leads Developer Relations at&amp;nbsp;Teknasyon, has authored books on JavaScript and prompt engineering, and is active in the community as a speaker, organizer, and contributor.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Stop by&amp;nbsp;to explore&amp;nbsp;prompt libraries, prompt engineering resources, self-hosting options, and ways the community is making prompting more reusable and collaborative.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2 aria-level="2"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Register for Microsoft Build&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Microsoft Build takes place June 2–3, 2026, in San Francisco and online. In-person passes are available, and online registration is free and includes access to the livestreamed keynote and select sessions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;&lt;A class="lia-external-url" href="https://build.microsoft.com/" target="_blank"&gt;Register for Microsoft Build&lt;/A&gt; and&amp;nbsp;come visit&amp;nbsp;the&amp;nbsp;Open Source&amp;nbsp;Zone to meet the teams behind&amp;nbsp;OpenClaw,&amp;nbsp;AutoGPT, Open&amp;nbsp;WebUI, and&amp;nbsp;prompts.chat.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;We’ll&amp;nbsp;see you there.&amp;nbsp;&amp;lt;3&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 29 May 2026 17:34:29 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/four-open-source-projects-to-explore-at-microsoft-build/ba-p/4523744</guid>
      <dc:creator>leereilly</dc:creator>
      <dc:date>2026-05-29T17:34:29Z</dc:date>
    </item>
    <item>
      <title>Governing AI Agents Against Every OWASP Agentic Risk: A Deep Dive with the Agent Governance Toolkit</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/governing-ai-agents-against-every-owasp-agentic-risk-a-deep-dive/ba-p/4523749</link>
      <description>&lt;P&gt;AI agents are moving from prototypes to production. They book flights, write code, negotiate contracts, and operate across enterprise systems with minimal human oversight. The attack surface is not theoretical: &lt;STRONG&gt;&lt;A class="lia-external-url" href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/" target="_blank"&gt;OWASP has catalogued the top 10 risks specific to agentic applications, and every one of them maps to a real-world failure mode&lt;/A&gt;.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit" target="_blank"&gt;Agent Governance Toolkit (AGT) &lt;/A&gt;is an open-source, MIT-licensed framework that enforces deterministic governance at runtime, before every tool call, message, and action an agent takes. This is not prompt engineering or guardrails bolted on after the fact. AGT provides policy-as-code enforcement, zero-trust identity, execution isolation, and tamper-evident audit trails across the full agent lifecycle.&lt;/P&gt;
&lt;P&gt;In this post, we walk through all 10 OWASP Agentic risks with real code from the AGT repository. By the end, you will have concrete examples for every risk category and a clear path to production-grade agent governance.&lt;/P&gt;
&lt;H2&gt;Coverage at a Glance&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 94.4444%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;#&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;OWASP Risk&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;AGT Component&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Key Mechanism&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-01&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent Goal Hijack&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent OS&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Policy Engine + Action Interception&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-02&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Tool Misuse &amp;amp; Exploitation&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent OS&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Capability Sandboxing + Input Sanitization&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-03&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Identity &amp;amp; Privilege Abuse&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;AgentMesh&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;DID Identity + Trust Scoring&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-04&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agentic Supply Chain Vulnerabilities&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;AgentMesh&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;AI-BOM (Model + Data + Weights Provenance)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-05&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Unexpected Code Execution&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent Runtime&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Execution Rings (Ring 0-3)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-06&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Memory &amp;amp; Context Poisoning&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent OS&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;VFS Policies + CMVK Verification&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-07&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Insecure Inter-Agent Communication&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;AgentMesh&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;IATP + E2E Encrypted Channels&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-08&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cascading Agent Failures&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent SRE&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Circuit Breakers + SLOs&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-09&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Human-Agent Trust Exploitation&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent OS&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Approval Workflows + Quorum Logic&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;ASI-10&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Rogue Agents&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent Runtime&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Kill Switch + Ring Isolation + Merkle Audit&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;ASI-01: Agent Goal Hijack&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Attackers manipulate the agent's objectives via indirect prompt injection or poisoned inputs. The agent believes it is following its original instructions, but it has been redirected.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;AGT mitigates this through the Agent OS policy engine. Every agent action passes through a declarative policy evaluation layer before execution. The policy engine supports three modes: strict (deny by default), permissive (allow by default), and audit (log only). Unauthorized goal changes are blocked at the action layer, not at the prompt layer.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import asyncio

from agent_os import StatelessKernel, ExecutionContext


async def main():
    kernel = StatelessKernel()
    ctx = ExecutionContext(agent_id="my-agent", policies=["read_only"])

    # This action is blocked by policy -- goal hijack prevented
    result = await kernel.execute(
        action="delete_database",
        params={"target": "production"},
        context=ctx,
    )
    # result.success = False, result.error = "Policy violation: read_only"


asyncio.run(main())&lt;/LI-CODE&gt;
&lt;P&gt;The MCP Governance Proxy extends this to Model Context Protocol tool calls, evaluating policy before any tool invocation reaches the agent runtime.&lt;/P&gt;
&lt;H2&gt;ASI-02: Tool Misuse &amp;amp; Exploitation&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: An agent's authorized tools are abused in unintended ways, such as exfiltrating data via read operations or chaining benign tools into dangerous workflows.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;AGT provides capability-based security inspired by POSIX. Agents receive explicit capability grants (read, write, execute, network), not blanket tool access. The built-in strict mode blocks dangerous tools like run_shell, execute_command, and eval. Tool inputs are sanitized for command injection patterns and shell metacharacters.&lt;/P&gt;
&lt;P&gt;The verify_code_safety MCP tool checks generated code before execution, and tool allowlists/denylists give operators fine-grained control over which tools each agent can invoke.&lt;/P&gt;
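&lt;P&gt;The capability and sanitization checks described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AGT's implementation: the function is_tool_call_allowed, the DENIED_TOOLS set, and the metacharacter pattern are hypothetical names chosen for this example.&lt;/P&gt;

```python
import re

# Hypothetical names for illustration only; this is not the AGT API.
DENIED_TOOLS = {"run_shell", "execute_command", "eval"}

# Shell metacharacters often used for command injection. The hex escapes
# stand for the ampersand, both angle brackets, and the backtick.
SHELL_METACHARACTERS = re.compile(r"[;|$\x26\x3c\x3e\x60]")


def is_tool_call_allowed(tool, required, granted, params):
    """Return (allowed, reason) for one proposed tool call."""
    # 1. Denylisted tools are rejected regardless of capabilities.
    if tool in DENIED_TOOLS:
        return False, "tool is on the denylist"
    # 2. Every capability the tool requires must have been granted explicitly.
    missing = set(required) - set(granted)
    if missing:
        return False, "missing capabilities: " + ", ".join(sorted(missing))
    # 3. String parameters are screened for shell metacharacters.
    for key, value in params.items():
        if isinstance(value, str) and SHELL_METACHARACTERS.search(value):
            return False, "suspicious characters in parameter " + repr(key)
    return True, "ok"
```

&lt;P&gt;In this sketch, a call such as is_tool_call_allowed("write_file", {"write"}, {"read"}, {"path": "a"}) is rejected because the agent was granted read but not write. A production sanitizer would need a far more careful pattern set than the single character class used here.&lt;/P&gt;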
&lt;H2&gt;ASI-03: Identity &amp;amp; Privilege Abuse&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Agents escalate privileges by abusing identities or inheriting excessive credentials. Without proper identity, agents operate as ambient authority, and any compromise cascades.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;AgentMesh implements zero-trust identity using Decentralized Identifiers (DIDs). Every agent gets a cryptographic identity: did:agentmesh:{agentId}:{fingerprint} backed by Ed25519 key pairs. Trust is earned through a tiered model: Untrusted, Provisional, Trusted, Verified. Trust decays over time without positive signals, and delegation chains must always narrow scope (child capabilities must be a subset of parent capabilities).&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agentmesh import AgentIdentity

identity = AgentIdentity.create(
    name="data-analyst",
    sponsor="admin@contoso.com",
    capabilities=["read:data"],  # Scoped -- cannot write or delete
)

# Delegation MUST narrow, never widen
child = identity.delegate(
    name="chart-helper",
    capabilities=["read:data:charts"],  # Subset of parent
)&lt;/LI-CODE&gt;
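&lt;P&gt;The narrowing rule can be made concrete with a small check. The sketch below assumes capabilities are colon-delimited scopes in which a longer string is a more specific grant, as the read:data and read:data:charts example suggests. The helper names is_narrower and can_delegate are hypothetical, and AgentMesh may implement the comparison differently.&lt;/P&gt;

```python
def is_narrower(child_cap, parent_cap):
    """True if child_cap equals parent_cap or is a more specific scope of it."""
    child_parts = child_cap.split(":")
    parent_parts = parent_cap.split(":")
    # Compare whole segments, so "read:database" is not treated as
    # a narrowing of "read:data".
    return child_parts[:len(parent_parts)] == parent_parts


def can_delegate(parent_caps, child_caps):
    """Every child capability must be covered by some parent capability."""
    return all(
        any(is_narrower(child, parent) for parent in parent_caps)
        for child in child_caps
    )
```

&lt;P&gt;Under these assumptions, can_delegate(["read:data"], ["read:data:charts"]) holds, while a request for write:data or for the broader read scope is refused.&lt;/P&gt;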
&lt;H2&gt;ASI-04: Agentic Supply Chain Vulnerabilities&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Vulnerabilities in third-party tools, plugins, agent registries, or runtime dependencies that agents use to act, plan, or delegate.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;AgentMesh implements the AI-BOM (AI Bill of Materials), a comprehensive standard for tracking the full AI supply chain. This includes model provenance (base model ancestry, fine-tuning history, training cutoff dates), dataset tracking (training data, RAG sources, evaluation benchmarks with data cards including PII status, bias assessment, and consent tracking), weights versioning (SHA-256 hashes, quantization records, LoRA adapter metadata, SLSA build provenance), and software dependencies (SPDX-aligned package tracking with CI security scanning).&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# AI-BOM tracks the full supply chain
ai_bom = {
    "modelProvenance": {
        "primary": {"provider": "anthropic", "model": "claude-3-sonnet"},
        "fineTuning": {"method": "LoRA", "evaluationMetrics": {"accuracy": 0.94}},
    },
    "datasets": [
        {"name": "FAQ KB", "type": "fine-tuning", "dataCard": {"piiStatus": "redacted"}},
        {"name": "Product Docs", "type": "rag-source", "updateFrequency": "weekly"},
    ],
    "weights": {"hash": "sha256:...", "format": "safetensors", "precision": "bf16"},
}&lt;/LI-CODE&gt;
&lt;H2&gt;ASI-05: Unexpected Code Execution&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Agents trigger remote code execution through tools, interpreters, or APIs. Without isolation, a single compromised tool call can escalate to full system access.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Agent Runtime implements CPU ring-inspired execution isolation. Agents run in one of four execution rings: Ring 0 (root/supervisor), Ring 1 (privileged), Ring 2 (standard), and Ring 3 (sandbox/untrusted). Each ring has its own resource limits, and the kill switch provides instant termination of runaway agents.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from hypervisor.models import (
    ActionDescriptor,
    ExecutionRing,
    ReversibilityLevel,
)
from hypervisor.rings.enforcer import RingEnforcer
from hypervisor.security.kill_switch import KillSwitch, KillReason

# Define agent privilege levels
AGENTS = {
    "supervisor": {"ring": ExecutionRing.RING_0_ROOT, "role": "Orchestrator"},
    "data-agent": {"ring": ExecutionRing.RING_1_PRIVILEGED, "role": "Data Engineer"},
    "analyst": {"ring": ExecutionRing.RING_2_STANDARD, "role": "Analyst"},
    "user-bot": {"ring": ExecutionRing.RING_3_SANDBOX, "role": "User-Facing"},
}

# Create a sandboxed action descriptor
action = ActionDescriptor(
    name="run_query",
    required_ring=ExecutionRing.RING_2_STANDARD,
    reversibility=ReversibilityLevel.REVERSIBLE,
)

# Enforce: sandbox agent cannot run a Ring 2 action
enforcer = RingEnforcer()
result = enforcer.check(agent_ring=ExecutionRing.RING_3_SANDBOX, action=action)
# result.allowed = False -- ring violation prevented

# Kill switch for runaway agents
kill_switch = KillSwitch()
kill_switch.terminate(agent_id="user-bot", reason=KillReason.RING_BREACH)&lt;/LI-CODE&gt;
&lt;H2&gt;ASI-06: Memory &amp;amp; Context Poisoning&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Persistent memory or long-running context is poisoned with malicious instructions. An attacker embeds hostile content in a document the agent later retrieves, causing it to follow injected goals.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Agent OS provides a policy-controlled virtual filesystem (VFS) for agent memory. The VFS uses POSIX-style mount points: /mem/working for current context, /mem/episodic for past interactions, /mem/semantic for knowledge, /policy for read-only policy files, and /tools for tool interfaces. Each mount point has enforced permissions (read, write, execute, append). The policy directory is always read-only from user-space, preventing agents from modifying their own governance rules.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_control_plane.vfs import AgentVFS, MemoryBackend, FileMode

# Create agent VFS with POSIX-style memory abstraction
vfs = AgentVFS(agent_id="data-analyst")

# Mount memory backends with explicit permissions
vfs.mount("/mem/working", MemoryBackend(), mode=FileMode.READ | FileMode.WRITE)
vfs.mount("/mem/semantic", MemoryBackend(), mode=FileMode.READ)  # Read-only knowledge
vfs.mount("/policy", MemoryBackend(), mode=FileMode.READ)  # Policies always read-only

# Agent can read working memory
data = vfs.read("/mem/working/context.json")

# Agent CANNOT write to policy -- enforced at VFS layer
# vfs.write("/policy/rules.yaml", content)  # Raises PermissionError

# Agent CANNOT read a path that was never mounted (here, procedural memory)
# vfs.read("/mem/procedural/skills")  # Raises FileNotFoundError&lt;/LI-CODE&gt;
&lt;P&gt;The CMVK (Cross-Model Verification Kernel) adds a second layer: claims from agent context are verified across multiple AI models to detect poisoned content. Prompt injection patterns like 'ignore previous instructions' and 'disregard prior' are detected and blocked by the MCP proxy sanitizer before reaching the agent.&lt;/P&gt;
&lt;H2&gt;ASI-07: Insecure Inter-Agent Communication&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Agents collaborate without adequate authentication, confidentiality, or validation. Messages between agents can be intercepted, forged, or replayed.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;AgentMesh provides IATP (Inter-Agent Trust Protocol) with E2E encrypted channels using the Signal protocol (X3DH key agreement + Double Ratchet). Every message gets per-message forward secrecy and post-compromise security. The EncryptedTrustBridge requires a successful trust handshake before any encrypted channel can be established, and mutual authentication via Ed25519 challenge-response ensures both parties prove identity at connection time.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agentmesh.encryption.bridge import EncryptedTrustBridge

# Assumes `keys` (the local key manager) and `bob_bundle` (Bob's published
# key bundle) already exist, and that this code runs inside a coroutine.
bridge = EncryptedTrustBridge(agent_did="did:mesh:alice", key_manager=keys)
channel = await bridge.open_secure_channel("did:mesh:bob", bob_bundle)
ciphertext = channel.send(b"governed action")  # E2E encrypted&lt;/LI-CODE&gt;
&lt;H2&gt;ASI-08: Cascading Agent Failures&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: An initial error or compromise triggers multi-step compound failures across chained agents. One agent's failure propagates through the entire system.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Agent SRE brings production-grade reliability engineering to agent fleets. Circuit breakers automatically isolate failing agents before failures cascade. SLO enforcement with error budgets provides quantified failure tolerance that triggers automatic intervention. Cascading failure detection monitors dependency chains for propagation patterns, and canary deploys enable gradual rollout of agent changes to detect issues early. OpenTelemetry integration provides distributed tracing across multi-agent workflows.&lt;/P&gt;
&lt;P&gt;The key insight: treat AI agents like microservices. Apply the same SRE discipline (SLOs, error budgets, circuit breakers, chaos testing) that keeps cloud infrastructure reliable.&lt;/P&gt;
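&lt;P&gt;For readers new to the pattern, a circuit breaker can be sketched in plain Python. The class below is a generic illustration rather than Agent SRE's API: the CircuitBreaker name, its parameters, and the injectable clock are assumptions made for this example.&lt;/P&gt;

```python
import time


class CircuitBreaker:
    """Closed until repeated failures open it; allows trial calls after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow trial calls once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()
```

&lt;P&gt;An orchestrator would call allow_request() before routing work to a downstream agent and report each outcome with record_success() or record_failure(), so a failing agent stops receiving traffic until it has had time to recover.&lt;/P&gt;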
&lt;H2&gt;ASI-09: Human-Agent Trust Exploitation&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Attackers leverage misplaced user trust in agents' autonomy to authorize dangerous actions. Users rubber-stamp agent requests because they trust the agent, and attackers exploit this approval fatigue.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Agent OS implements approval workflows that require explicit human confirmation for high-risk actions. The system supports configurable risk assessment (critical, high, medium, low), quorum logic for critical actions requiring multiple approvals, and expiration tracking to prevent stale authorizations. The escalation handler includes fatigue detection: if an agent floods reviewers with escalation requests, subsequent requests are auto-denied to prevent the approval-fatigue attack.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_os.integrations.escalation import (
    EscalationHandler,
    InMemoryApprovalQueue,
    DefaultTimeoutAction,
    QuorumConfig,
)

# Configure approval workflow with fatigue protection
handler = EscalationHandler(
    backend=InMemoryApprovalQueue(),
    timeout_seconds=300,                         # 5-minute approval window
    default_action=DefaultTimeoutAction.DENY,    # Deny if no human responds
    quorum=QuorumConfig(required=2, total=3),    # 2-of-3 approvers for critical
    fatigue_threshold=5,                         # Auto-deny after 5 rapid requests
    fatigue_window_seconds=60,                   # Within a 60-second window
)

# Three-outcome model: allow, deny, or escalate
# High-risk actions trigger escalation to human reviewers
# If the agent triggers too many escalations, fatigue detection kicks in&lt;/LI-CODE&gt;
&lt;H2&gt;ASI-10: Rogue Agents&lt;/H2&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The risk: Agents operating outside their defined scope through configuration drift, reprogramming, or emergent misbehavior. A rogue agent might gradually expand its actions beyond its mandate without any single action triggering a block.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;AGT combines runtime behavioral monitoring with instant kill capability. Ring isolation confines rogue agents to their execution ring, preventing privilege escalation. The kill switch provides immediate termination for agents exhibiting rogue behavior (behavioral drift, rate limit violations, ring breaches). Trust score decay tracks agent behavior over time, and the Merkle audit chain provides tamper-evident, cryptographic proof of every agent action.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agentmesh.governance.audit import AuditEntry, MerkleAuditChain
from hypervisor.security.kill_switch import KillSwitch, KillReason

# Tamper-evident audit trail
chain = MerkleAuditChain()

entry = AuditEntry(
    event_type="tool_call",
    agent_did="did:agentmesh:data-bot:abc123",
    action="query_database",
    outcome="allowed",
    policy_decision="permit",
    matched_rule="read_only_policy",
)
chain.add_entry(entry)  # Auto-computes hash chain

# Verify integrity -- any tampering breaks the chain
proof = chain.get_proof(entry.entry_id)
assert chain.verify_proof(proof)  # Cryptographic verification

# Kill switch for rogue behavior
kill = KillSwitch()
kill.terminate(
    agent_id="data-bot",
    reason=KillReason.BEHAVIORAL_DRIFT,  # Also: RATE_LIMIT, RING_BREACH, MANUAL
)&lt;/LI-CODE&gt;
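&lt;P&gt;To make the tamper-evidence property concrete, here is a minimal standard-library sketch of a hash-chained log. It is not AGT's MerkleAuditChain implementation: the names HashChain, append, and verify are hypothetical, and a simple linear hash chain stands in for a full Merkle tree.&lt;/P&gt;

```python
import hashlib
import json


class HashChain:
    """Illustrative append-only log where each record commits to the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records = []  # list of (payload_dict, hex_digest) tuples

    @staticmethod
    def _digest(prev_hash, payload):
        # Canonical JSON so the same payload always hashes to the same value
        body = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()

    def append(self, payload):
        prev_hash = self.records[-1][1] if self.records else self.GENESIS
        digest = self._digest(prev_hash, payload)
        self.records.append((payload, digest))
        return digest

    def verify(self):
        # Recompute every link; an edited payload no longer matches its stored digest
        prev_hash = self.GENESIS
        for payload, digest in self.records:
            if self._digest(prev_hash, payload) != digest:
                return False
            prev_hash = digest
        return True


chain_sketch = HashChain()
chain_sketch.append({"agent": "data-bot", "action": "query_database", "outcome": "allowed"})
chain_sketch.append({"agent": "data-bot", "action": "export_rows", "outcome": "denied"})
intact_before = chain_sketch.verify()

# Simulate tampering: rewrite the outcome of the second record in place
chain_sketch.records[1][0]["outcome"] = "allowed"
intact_after = chain_sketch.verify()
```

&lt;P&gt;Because each digest covers the previous digest, rewriting one record invalidates verification unless every later digest is also recomputed, which is what makes the log tamper-evident rather than tamper-proof.&lt;/P&gt;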
&lt;H2&gt;Cross-Cutting Principle: Least Agency&lt;/H2&gt;
&lt;P&gt;The OWASP Agentic Top 10 emphasizes Least Agency throughout as a foundational design principle: agents should be granted the minimum capabilities, permissions, and autonomy necessary to complete their assigned tasks.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 88.8889%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Layer&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Least Agency Mechanism&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Agent OS&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Policy engine enforces deny-by-default; agents must be explicitly granted each capability&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;AgentMesh&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;DID identity with scoped capabilities; delegation requires narrowing (child &amp;lt;= parent)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Agent Runtime&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Execution rings (Ring 0-3) enforce privilege tiers; untrusted agents run in Ring 3&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Agent SRE&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Resource limits and error budgets cap agent impact radius&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
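&lt;P&gt;The delegation-narrowing rule in the AgentMesh row can be read as a subset check. The sketch below assumes capabilities are plain strings; can_delegate is a hypothetical helper, not part of the AgentMesh API.&lt;/P&gt;

```python
def can_delegate(parent_capabilities, requested_capabilities):
    """Allow delegation only when the child asks for a subset of the parent's capabilities."""
    return set(requested_capabilities).issubset(parent_capabilities)


parent = {"db.read", "db.write", "api.call"}

narrowed = can_delegate(parent, {"db.read"})                 # child is narrower than parent
widened = can_delegate(parent, {"db.read", "shell.exec"})    # child asks for something new
```

&lt;P&gt;A child that requests anything its parent does not hold is rejected, so capabilities can only shrink along a delegation chain.&lt;/P&gt;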
&lt;H2&gt;Performance: Governance Without Latency Tax&lt;/H2&gt;
&lt;P&gt;A common concern with runtime governance is performance overhead. AGT's benchmarks demonstrate that policy enforcement adds negligible latency:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Metric&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Value&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Single rule evaluation&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;84,000 ops/sec&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;1000 concurrent agents&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;47,000 ops/sec&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Policy evaluation latency&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt;0.1ms (p99)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Prompt-based violation rate&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;26.67%&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;AGT policy violation rate&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;0.00%&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Conformance tests&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;992&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Architecture Decision Records&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;25&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The key takeaway: in these benchmarks, deterministic policy enforcement recorded a 0.00% violation rate against 26.67% for prompt-based guardrails, and it runs fast enough for real-time agent workloads.&lt;/P&gt;
&lt;H2&gt;Framework Integrations&lt;/H2&gt;
&lt;P&gt;AGT is framework-agnostic. SDKs are available in Python, TypeScript, .NET, Rust, and Go. Native integrations exist for:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;LangChain and LangGraph&lt;/LI&gt;
&lt;LI&gt;CrewAI&lt;/LI&gt;
&lt;LI&gt;AutoGen (Microsoft)&lt;/LI&gt;
&lt;LI&gt;Semantic Kernel (Microsoft)&lt;/LI&gt;
&lt;LI&gt;OpenAI Agents SDK&lt;/LI&gt;
&lt;LI&gt;PydanticAI&lt;/LI&gt;
&lt;LI&gt;Model Context Protocol (MCP)&lt;/LI&gt;
&lt;LI&gt;Agent-to-Agent Protocol (A2A)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Each integration wraps the agent framework's tool-calling and message-passing interfaces with AGT's policy engine, trust scoring, and audit logging. Adding governance to an existing agent takes minutes, not weeks.&lt;/P&gt;
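&lt;P&gt;As a rough illustration of what wrapping a tool-calling interface means, the sketch below puts a deny-by-default check and an audit record in front of ordinary Python functions. It does not use AGT's integration API; governed, ALLOWED_TOOLS, PolicyDenied, and audit_log are all hypothetical names chosen for the example.&lt;/P&gt;

```python
import functools

ALLOWED_TOOLS = {"search_docs"}   # deny-by-default: anything not listed is blocked
audit_log = []                    # stand-in for a real audit sink


class PolicyDenied(Exception):
    """Raised when a tool call is not permitted by the allow-list."""


def governed(tool_name):
    """Decorator that checks the allow-list and records the decision before the tool runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            allowed = tool_name in ALLOWED_TOOLS
            audit_log.append({"tool": tool_name, "outcome": "allowed" if allowed else "denied"})
            if not allowed:
                raise PolicyDenied(f"tool '{tool_name}' is not permitted")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@governed("search_docs")
def search_docs(query):
    return f"results for {query}"


@governed("delete_records")
def delete_records(table):
    return f"deleted {table}"


search_result = search_docs("error budgets")
try:
    delete_records("customers")
    delete_blocked = False
except PolicyDenied:
    delete_blocked = True
```

&lt;P&gt;The tool bodies stay unchanged; the policy decision and the audit entry happen in the wrapper, which is the general shape a framework integration takes.&lt;/P&gt;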
&lt;H2&gt;Compliance Framework Alignment&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 86.0185%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Framework&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;AGT Coverage&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;OWASP Agentic Top 10 (2026)&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;All 10 risk categories mapped&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;NIST AI RMF&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Govern, Map, Measure, Manage functions addressed&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;EU AI Act&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Risk classification, audit trails, human oversight&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;SOC 2 Type II&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Audit logging, access controls, change management&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;CSA ATF&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Zero-trust agent architecture alignment&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Singapore MGF&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Zero-trust, accountability, oversight layers&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Getting Started&lt;/H2&gt;
&lt;LI-CODE lang="powershell"&gt;# Install the complete governance stack
pip install "agent-governance-toolkit[full]"

# Or install individual components
pip install agent-os-kernel        # Policy engine, VFS, approval workflows
pip install agentmesh-platform     # Identity, trust, encryption, audit
pip install agentmesh-runtime      # Execution rings, kill switch, saga
pip install agent-sre              # Circuit breakers, SLOs, chaos testing&lt;/LI-CODE&gt;
&lt;P&gt;The quickstart tutorial walks through adding policy enforcement to an existing LangChain agent in under 10 minutes. Start with a single policy rule and expand as your governance requirements grow.&lt;/P&gt;
&lt;H2&gt;Contribute and Collaborate&lt;/H2&gt;
&lt;P&gt;AGT is open source under the MIT license. The project has over 2,000 GitHub stars and contributors from 40+ countries. Whether you are building agent governance for your enterprise, integrating a new framework, or extending the policy engine with OPA/Rego or Cedar policies, we welcome contributions.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Repository&lt;/STRONG&gt;: &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit" target="_blank"&gt;https://github.com/microsoft/agent-governance-toolkit&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Documentation&lt;/STRONG&gt;: &lt;A class="lia-external-url" href="https://microsoft.github.io/agent-governance-toolkit" target="_blank"&gt;https://microsoft.github.io/agent-governance-toolkit&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Discussions&lt;/STRONG&gt;: GitHub Discussions on the repository&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Disclaimer: This document is provided for informational purposes. Code examples are from the public AGT repository and may evolve. Always refer to the latest repository documentation for current APIs.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Thu, 28 May 2026 22:04:55 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/governing-ai-agents-against-every-owasp-agentic-risk-a-deep-dive/ba-p/4523749</guid>
      <dc:creator>mosiddi</dc:creator>
      <dc:date>2026-05-28T22:04:55Z</dc:date>
    </item>
    <item>
      <title>Applying Site Reliability Engineering to Autonomous AI Agents</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/applying-site-reliability-engineering-to-autonomous-ai-agents/ba-p/4521357</link>
      <description>&lt;P&gt;If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;That mental model transfers directly to AI agents.&lt;/STRONG&gt; It just needs four new ideas.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;In&amp;nbsp;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/agent-governance-toolkit-architecture-deep-dive-policy-engines-trust-and-sre-for/4510105" data-lia-auto-title="Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents" data-lia-auto-title-active="0" target="_blank"&gt;Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents&lt;/A&gt;, we covered Agent SRE briefly as one of AGT's nine packages. It brings SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery to agents, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Agent SRE is one of the more novel parts of the toolkit. &lt;/STRONG&gt;The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook.&lt;/P&gt;
&lt;P&gt;This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this.&lt;/P&gt;
&lt;H2&gt;The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See&lt;/H2&gt;
&lt;P&gt;When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query.&lt;/P&gt;
&lt;P&gt;When an AI agent fails, your observability stack is silent. The agent returned HTTP 200. Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;These are not infrastructure failures. They are behavioral failures.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior.&lt;/P&gt;
&lt;P&gt;This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI.&lt;/P&gt;
&lt;H2&gt;The Safety SLI: A New Reliability Dimension&lt;/H2&gt;
&lt;P&gt;Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly?&lt;/P&gt;
&lt;P&gt;For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The Safety SLI answers a different question: did the agent act within policy?&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre import SLO, ErrorBudget
from agent_sre.slo.indicators import PolicyCompliance

# Define a safety SLO: 99% of agent actions must comply with policy
safety_slo = SLO(
    name="safety-compliance",
    indicators=[
        PolicyCompliance(
            target=0.99,
            window="7d",
        ),
    ],
    error_budget=ErrorBudget(
        total=0.01,                      # 1% budget (1 - 0.99 target)
        window_seconds=2592000,          # 30-day window
        burn_rate_alert=2.0,             # warn at 2x sustainable rate
        burn_rate_critical=5.0,          # page at 5x sustainable rate
    ),
)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Every policy violation consumes error budget, and when an agent's compliance rate drops below the 99% target, the budget burns faster than the sustainable rate. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method. When the budget is exhausted, the configured exhaustion_action determines the system response:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre.slo.objectives import ExhaustionAction

# Configure what happens when error budget is exhausted
safety_slo = SLO(
    name="safety-compliance",
    indicators=[PolicyCompliance(target=0.99, window="7d")],
    error_budget=ErrorBudget(
        total=0.01,
        window_seconds=2592000,
        burn_rate_alert=2.0,              # fires at 2x sustainable burn rate
        burn_rate_critical=5.0,           # fires at 5x sustainable burn rate
        exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,  # suspend agent when budget is gone
    ),
)

# In your monitoring loop, check for firing alerts
alerts = safety_slo.error_budget.firing_alerts()
for alert in alerts:
    print(f"Alert firing: {alert.name} (severity: {alert.severity})")

# Check budget status
print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%")
print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x")
print(f"Exhausted: {safety_slo.error_budget.is_exhausted}")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is the governance dial from the other direction.&amp;nbsp;&lt;STRONG&gt;The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions.&lt;/STRONG&gt; An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior.&lt;/P&gt;
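&lt;P&gt;For readers who want the arithmetic behind the 2x and 5x thresholds, the sketch below uses the conventional SRE definition of burn rate: the observed violation rate divided by the rate the budget allows. The helper names burn_rate and classify are hypothetical, and Agent SRE's own burn_rate() may window or smooth the data differently.&lt;/P&gt;

```python
def burn_rate(violations, total_actions, budget_fraction):
    """Observed violation rate divided by the rate the budget allows (1.0 = sustainable)."""
    if total_actions == 0:
        return 0.0
    return (violations / total_actions) / budget_fraction


def classify(rate, alert_at=2.0, critical_at=5.0):
    """Map a burn rate onto the two thresholds used in the SLO example above."""
    if rate >= critical_at:
        return "critical"
    if rate >= alert_at:
        return "alert"
    return "ok"


# 30 policy violations in 1,000 actions against a 1% budget
rate = burn_rate(violations=30, total_actions=1000, budget_fraction=0.01)
level = classify(rate)
```

&lt;P&gt;A 3% violation rate against a 1% budget burns roughly three times faster than sustainable, which crosses the warning threshold but not the paging threshold.&lt;/P&gt;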
&lt;P&gt;There are multiple SLI dimensions built into Agent SRE. Safety SLIs and Performance SLIs track different aspects of the same agent:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 98.7963%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;SLI Type&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Measures&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Target Pattern&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;When Budget Burns&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Safety SLI&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;PolicyCompliance -- fraction of actions within authorized scope&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;gt;= 99%&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Restrict capabilities, increase human oversight&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Performance SLI&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;TaskSuccessRate, ResponseLatency, CostPerTask&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Configurable per workload&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Alert, throttle, or circuit-break LLM provider&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI. Safety and performance SLOs both feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. &lt;STRONG&gt;You need both dimensions to understand whether an agent is production-ready.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H2&gt;Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices&lt;/H2&gt;
&lt;P&gt;Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -&amp;gt; OPEN -&amp;gt; HALF_OPEN. You know it well.&lt;/P&gt;
&lt;P&gt;Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker
from agent_sre.chaos.engine import FaultType

config = CircuitBreakerConfig(
    failure_threshold=5,              # Open after 5 failures in the window
    recovery_timeout_seconds=60,      # Stay OPEN for 60s before HALF_OPEN
    half_open_max_calls=3,            # Allow 3 probes in HALF_OPEN
)

breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config)

# Failure modes tracked by the circuit breaker:
tracked_faults = [
    FaultType.POLICY_BYPASS,           # Agent exceeds authorized scope
    FaultType.ERROR_INJECTION,         # Upstream model API fails
    FaultType.TIMEOUT_INJECTION,       # Tool calls exceed time budget
    FaultType.TRUST_PERTURBATION,      # Agent trust score falls below threshold
    FaultType.DEADLOCK_INJECTION,      # Agent stuck in iterative reasoning
]&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Each failure mode has different circuit-breaking semantics:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 99.0741%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Failure Mode&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What Triggers It&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Circuit-Break Behavior&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Policy bypass&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Action denied by policy engine&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Count toward threshold; log with full context&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;LLM provider error&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;HTTP 5xx from model API&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Immediately open; route to fallback model if configured&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Tool timeout&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Tool call exceeds timeout_ms&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Count toward threshold; cancel in-flight call&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Trust score degradation&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Agent trust score drops below configured floor&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Open; demote to Ring 3 (untrusted) until score recovers&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Reasoning loop / deadlock&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Token or iteration count exceeds budget&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Open; trigger human review before resuming&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely. The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Reasoning loop detection configuration
loop_detection_config = {
    "max_iterations": 15,             # Hard stop after 15 reasoning steps
    "max_tokens_per_session": 50000,  # Hard stop on token consumption
    "repetition_threshold": 0.85,     # Stop if &amp;gt;85% of recent actions repeat prior ones
    "on_detection": "circuit_break_and_escalate",
}&lt;/LI-CODE&gt;
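&lt;P&gt;One plausible way to apply limits like these is sketched below. This is an illustration only, not Agent SRE's detector: repetition_ratio and should_break are hypothetical helpers, and "repetition" is interpreted here as the fraction of the last ten actions that already appeared earlier in the session.&lt;/P&gt;

```python
def repetition_ratio(actions, window=10):
    """Fraction of the most recent actions that already occurred earlier in the session."""
    recent_start = max(len(actions) - window, 0)
    recent = actions[recent_start:]
    if not recent:
        return 0.0
    repeats = 0
    for offset, action in enumerate(recent):
        seen_before = actions[: recent_start + offset]
        if action in seen_before:
            repeats += 1
    return repeats / len(recent)


def should_break(actions, tokens_used, max_iterations=15,
                 max_tokens=50000, repetition_threshold=0.85):
    """Return the first limit that was hit, or None if the session may continue."""
    if len(actions) >= max_iterations:
        return "max_iterations"
    if tokens_used >= max_tokens:
        return "max_tokens"
    if repetition_ratio(actions) > repetition_threshold:
        return "repetition"
    return None


# A session that keeps retrying the same two tool calls (12 actions in total)
looping = ["search", "fetch"] + ["search", "fetch"] * 5
verdict_looping = should_break(looping, tokens_used=9000)

healthy = ["search", "fetch", "summarize", "write_report"]
verdict_healthy = should_break(healthy, tokens_used=4000)
```

&lt;P&gt;The looping session trips the repetition check before it reaches the iteration or token caps, which is the point of tracking repetition separately from hard limits.&lt;/P&gt;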
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The state machine behaves identically to what you know from Hystrix or Resilience4j.&amp;nbsp;&lt;STRONG&gt;What changes is the definition of "failure."&lt;/STRONG&gt;&lt;/P&gt;
&lt;LI-CODE lang=""&gt;CLOSED (serving)
  |
  |  failure_threshold crossed for any tracked fault
  v
OPEN (rejecting -- agent action denied, fallback or human-in-loop fires)
  |
  |  recovery_timeout expires
  v
HALF_OPEN (probe -- limited requests allowed through)
  |
  |-- success_threshold met --&amp;gt; CLOSED
  |-- any failure          --&amp;gt; OPEN (reset timeout)&lt;/LI-CODE&gt;
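&lt;P&gt;The diagram above can be sketched as a small state machine. The class below is a generic illustration with a hypothetical name (SketchBreaker) and an injectable clock so it can be exercised without sleeping. It is not the agent_sre CircuitBreaker, and for brevity it does not cap the number of probes admitted in HALF_OPEN.&lt;/P&gt;

```python
import time


class SketchBreaker:
    """Illustrative CLOSED -> OPEN -> HALF_OPEN breaker with an injectable clock."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.failures = 0
        self.successes = 0

    def allow(self):
        """Return True if an agent action may proceed right now."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout_seconds:
                self.state = "HALF_OPEN"   # timeout expired: start probing
                return True
            return False
        return True

    def record(self, success):
        """Feed the outcome of an allowed action back into the state machine."""
        if self.state == "HALF_OPEN":
            if not success:
                self._open()               # any probe failure reopens and resets the timeout
                return
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
                self.failures = 0
        elif self.state == "CLOSED" and not success:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()


# Drive the breaker with a fake clock so the example does not sleep
now = [0.0]
breaker_sketch = SketchBreaker(failure_threshold=2, recovery_timeout_seconds=60,
                               success_threshold=1, clock=lambda: now[0])
breaker_sketch.record(success=False)
breaker_sketch.record(success=False)
state_after_failures = breaker_sketch.state       # threshold crossed
blocked = not breaker_sketch.allow()              # still inside the recovery timeout

now[0] = 61.0
probe_allowed = breaker_sketch.allow()            # timeout expired, probe goes through
breaker_sketch.record(success=True)
state_after_probe = breaker_sketch.state
```

&lt;P&gt;For an agent, the call to record(success=False) would be driven by a denied action, a tool timeout, or a detected reasoning loop rather than by an HTTP error.&lt;/P&gt;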
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Chaos Engineering for Agents: Fault Injection for Autonomous Systems&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;The only way to know if your agent system is resilient is to break it intentionally.&lt;/STRONG&gt; Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems.&lt;/P&gt;
&lt;P&gt;Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType

# Experiment 1: LLM provider degrades -- model returns valid responses but with
# increased latency and occasional malformed outputs
experiment = ChaosExperiment(
    name="llm-degradation-resilience",
    target_agent="analyst-agent-001",
    description="Test agent behavior under degraded LLM provider",
    faults=[
        Fault.latency_injection(target="llm-provider", delay_ms=8000),
        Fault.error_injection(target="llm-provider", rate=0.05),
    ],
    duration_seconds=300,
)

# Experiment 2: Trust score manipulation -- simulates an agent receiving
# messages from a peer with a spoofed trust score
trust_experiment = ChaosExperiment(
    name="trust-manipulation-resilience",
    target_agent="orchestrator-001",
    faults=[
        Fault(
            fault_type=FaultType.TRUST_PERTURBATION,
            target="did:mesh:orchestrator-001",
            params={"spoofed_score": 950},
        ),
    ],
    duration_seconds=120,
)

# Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously,
# testing whether the agent abandons gracefully or enters a reasoning loop
cascade_experiment = ChaosExperiment(
    name="tool-timeout-cascade",
    target_agent="analyst-agent-001",
    faults=[
        Fault.timeout_injection(target="database.read", delay_ms=30000),
        Fault.timeout_injection(target="api.call", delay_ms=30000),
    ],
    duration_seconds=180,
)

# Run the experiment
experiment.start()
# ... inject faults during agent execution ...
resilience = experiment.calculate_resilience(
    baseline_success_rate=0.95,
    experiment_success_rate=0.87,
    recovery_time_ms=48000,
)
experiment.complete(resilience=resilience)
print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Additional fault types built into the chaos engine cover prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;Important:&lt;/STRONG&gt; The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness.&lt;/P&gt;
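&lt;P&gt;As a worked example of the "percentage of baseline retained" reading, the sketch below divides the experiment success rate by the baseline success rate, using the numbers from the experiment above. This is an assumption about how to interpret the score, not the formula inside calculate_resilience(), which also receives recovery_time_ms and may weight its inputs differently. The helper name retained_performance is hypothetical.&lt;/P&gt;

```python
def retained_performance(baseline_success_rate, experiment_success_rate):
    """Percentage of baseline success rate that survived fault injection."""
    if baseline_success_rate == 0:
        raise ValueError("baseline_success_rate must be non-zero")
    return 100.0 * experiment_success_rate / baseline_success_rate


retained = retained_performance(baseline_success_rate=0.95, experiment_success_rate=0.87)
meets_threshold = retained >= 90.0
print(f"Retained {retained:.1f}% of baseline; meets 90% threshold: {meets_threshold}")
```

&lt;P&gt;With a 0.95 baseline and a 0.87 success rate under faults, the agent retains about 91.6% of its baseline, just above a 90% bar.&lt;/P&gt;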
&lt;H2&gt;Replay Debugging: Reproduce Behavioral Failures Exactly&lt;/H2&gt;
&lt;P&gt;Infrastructure incidents are usually reproducible because infrastructure behaves largely deterministically. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends.&lt;/P&gt;
&lt;P&gt;Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre.replay.capture import TraceStore
from agent_sre.replay.engine import ReplayEngine, ReplayMode

# Traces are captured automatically when SRE tracing is active
store = TraceStore(
    backend="azure_blob",
    retention_days=30,
)

# When an incident occurs, replay the session exactly
engine = ReplayEngine(store=store)

# Full replay: re-run the session against the same recorded inputs
# Uses recorded tool outputs -- no live tool calls -- so replay is deterministic
result = await engine.replay(
    trace_id="trace_2026_05_a7f3b2",
    mode=ReplayMode.FULL,
)

for step in result.steps:
    print(f"Step {step.index}: {step.action} -&amp;gt; {step.decision}")

# Divergence analysis: replay with a policy change applied
# Shows exactly which actions would have been blocked under the new policy
diff_result = await engine.diff(
    trace_id="trace_2026_05_a7f3b2",
    policy_override="policies/stricter-v2.yaml",
)

for diff in diff_result.diffs:
    if diff.description:
        print(f"Step {diff.span_name}: was {diff.original}, "
              f"would be {diff.replayed} under new policy")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The divergence analysis is the feature teams use most.&lt;/STRONG&gt; When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork.&lt;/P&gt;
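&lt;P&gt;The idea behind divergence analysis can be shown without the replay engine: evaluate the same recorded actions under two policies and keep the steps where the decisions differ. The sketch below assumes a policy is just an allow-list of action names; evaluate and diff_policies are hypothetical helpers, not agent_sre functions.&lt;/P&gt;

```python
def evaluate(policy, action):
    """Deny-by-default: an action is allowed only if the policy lists it."""
    return "allow" if action in policy["allowed_actions"] else "deny"


def diff_policies(recorded_actions, current_policy, proposed_policy):
    """Replay recorded actions against both policies and return the steps that diverge."""
    diffs = []
    for index, action in enumerate(recorded_actions):
        before = evaluate(current_policy, action)
        after = evaluate(proposed_policy, action)
        if before != after:
            diffs.append({"step": index, "action": action, "was": before, "would_be": after})
    return diffs


recorded = ["db.read", "db.write", "api.call", "db.read"]
current = {"allowed_actions": {"db.read", "db.write", "api.call"}}
stricter = {"allowed_actions": {"db.read", "api.call"}}

divergences = diff_policies(recorded, current, stricter)
for d in divergences:
    print(f"Step {d['step']}: {d['action']} was {d['was']}, would be {d['would_be']}")
```

&lt;P&gt;Because the inputs are recorded rather than live, the comparison has no side effects and can be run against as many past sessions as you have retained.&lt;/P&gt;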
&lt;H2&gt;Progressive Delivery: Safely Rolling Out New Agent Capabilities&lt;/H2&gt;
&lt;P&gt;When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back.&lt;/P&gt;
&lt;P&gt;Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope (giving it write access it did not have, connecting it to a new tool, or raising its trust floor), you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre.delivery.rollout import (
    AnalysisCriterion,
    CanaryRollout,
    RollbackCondition,
    RolloutStep,
)

rollout = CanaryRollout(
    name="database-write-capability",
    steps=[
        RolloutStep(
            name="canary",
            weight=0.05,                   # 5% of agents get the new capability
            duration_seconds=86400,        # 24 hours
            analysis=[
                AnalysisCriterion(metric="safety_sli", threshold=0.995),
                AnalysisCriterion(metric="performance_sli", threshold=0.90),
                AnalysisCriterion(
                    metric="error_budget_consumed",
                    threshold=0.10,
                    comparator="lte",      # canary can burn at most 10%
                ),
            ],
        ),
        RolloutStep(
            name="early-adopters",
            weight=0.25,                   # 25% of agents
            duration_seconds=172800,       # 48 hours
            analysis=[
                AnalysisCriterion(metric="safety_sli", threshold=0.990),
                AnalysisCriterion(metric="performance_sli", threshold=0.88),
            ],
        ),
        RolloutStep(
            name="general-availability",
            weight=1.0,                    # 100% of agents
            duration_seconds=604800,       # 1 week of full observation
            analysis=[
                AnalysisCriterion(metric="safety_sli", threshold=0.990),
                AnalysisCriterion(metric="performance_sli", threshold=0.85),
            ],
        ),
    ],
    rollback_conditions=[
        RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"),
    ],
)

# Start the rollout -- SLO gates evaluate at each step
rollout.start()

# Advance to next step when analysis criteria pass
if rollout.advance():
    print(f"Advanced to step: {rollout.current_step.name}")
    print(f"Progress: {rollout.progress_percent:.0f}%")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically.&amp;nbsp;&lt;STRONG&gt;This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic.&lt;/STRONG&gt;&lt;/P&gt;
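&lt;P&gt;A gate of this kind reduces to comparing observed metrics against thresholds. The sketch below mirrors the canary criteria from the rollout example as plain dictionaries. It assumes that a criterion without an explicit comparator means "greater than or equal", which the example implies but does not state; gate_passes is a hypothetical helper.&lt;/P&gt;

```python
import operator

COMPARATORS = {"gte": operator.ge, "lte": operator.le}


def gate_passes(criteria, observed):
    """Every criterion must hold against the observed metrics for the step to promote."""
    for criterion in criteria:
        # Assumption: a missing comparator defaults to "gte"
        compare = COMPARATORS[criterion.get("comparator", "gte")]
        if not compare(observed[criterion["metric"]], criterion["threshold"]):
            return False
    return True


canary_criteria = [
    {"metric": "safety_sli", "threshold": 0.995},
    {"metric": "performance_sli", "threshold": 0.90},
    {"metric": "error_budget_consumed", "threshold": 0.10, "comparator": "lte"},
]

good_canary = {"safety_sli": 0.998, "performance_sli": 0.93, "error_budget_consumed": 0.04}
bad_canary = {"safety_sli": 0.991, "performance_sli": 0.95, "error_budget_consumed": 0.04}

promote_good = gate_passes(canary_criteria, good_canary)
promote_bad = gate_passes(canary_criteria, bad_canary)
```

&lt;P&gt;The second canary performs better on the performance SLI but misses the safety threshold, so it does not promote: all criteria must pass, and a strong metric cannot offset a weak one.&lt;/P&gt;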
&lt;H2&gt;Health Checks and Backpressure&lt;/H2&gt;
&lt;P&gt;Traditional health checks answer: is the service alive? For agents, alive is not enough. A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Agent health check covering multiple dimensions
health = await agent_health_check(
    agent_id="analyst-agent-001",
    dimensions=[
        "liveness",            # Is the agent process running?
        "policy_compliance",   # Is safety SLI above threshold?
        "trust_score",         # Is trust score above Ring floor?
        "resource_budget",     # Is token/API spend within limits?
        "tool_availability",   # Are the tools the agent needs reachable?
    ],
)

# health.status: "healthy" | "degraded" | "unhealthy"
# health.dimensions: per-dimension pass/fail with values
# health.recommended_action: "none" | "restrict" | "suspend" | "terminate"&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, and drain in-flight tasks gracefully before the situation escalates.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Backpressure configuration
backpressure_config = {
    "backpressure_threshold": 0.80,    # Engage when resource utilization &amp;gt; 80%
    "max_concurrent": 5,               # Hard cap on simultaneous agent tasks
    "priority_shedding": True,         # Drop low-priority tasks first
    "drain_timeout_seconds": 30,       # Allow in-flight tasks to complete
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The ordering matters: backpressure first, then circuit breaker, then suspension.&lt;/STRONG&gt; Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services.&lt;/P&gt;
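&lt;P&gt;A rough sketch of that graduated response, written as standalone Python rather than against the agent-sre API. The function names graduated_response and shed_low_priority are hypothetical, and the task dictionaries with a "priority" key are an assumption made for illustration. The thresholds reuse the values from the configuration examples in this post (0.80 utilization, 5 failures, 5 concurrent tasks).&lt;/P&gt;

```python
def graduated_response(utilization, consecutive_failures, health_status,
                       backpressure_threshold=0.80, failure_threshold=5):
    """Pick the mildest response that matches the current signals."""
    # Most severe signal first: an unhealthy agent is suspended.
    if health_status == "unhealthy":
        return "suspend"
    # Repeated failures open the circuit breaker.
    if consecutive_failures >= failure_threshold:
        return "open_circuit_breaker"
    # High utilization alone only triggers backpressure.
    if utilization > backpressure_threshold:
        return "apply_backpressure"
    return "none"


def shed_low_priority(tasks, max_concurrent=5):
    """Keep the highest-priority tasks up to the cap and reject the rest."""
    ranked = sorted(tasks, key=lambda t: t["priority"], reverse=True)
    return ranked[:max_concurrent], ranked[max_concurrent:]


tasks = [{"id": i, "priority": i} for i in range(1, 8)]
kept, rejected = shed_low_priority(tasks)

print(graduated_response(0.85, 0, "healthy"))   # apply_backpressure
print(graduated_response(0.85, 5, "degraded"))  # open_circuit_breaker
print(graduated_response(0.50, 0, "unhealthy")) # suspend
print([t["id"] for t in rejected])              # [2, 1]
```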
&lt;H2&gt;Observability: Governance Metrics Flow Into Your Existing Stack&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;Agent SRE does not ask you to adopt a new observability platform.&lt;/STRONG&gt; Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre.tracing.exporters import configure_exporters

configure_exporters(
    backends=[
        {"type": "prometheus", "endpoint": "http://prometheus:9090"},
        {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"},
    ],
    include_metrics=[
        "slo.safety_sli",               # Per-agent safety compliance rate
        "slo.error_budget_remaining",    # Remaining error budget, as a percentage
        "slo.burn_rate",                 # Current burn rate vs sustainable
        "circuit_breaker.state",         # CLOSED / OPEN / HALF_OPEN
        "circuit_breaker.failure_count",
        "trust_score.current",           # Agent trust score (0-1000)
        "trust_score.ring",              # Current execution ring
        "chaos.experiments_run",         # Chaos experiment telemetry
        "health.status",                 # Aggregate health status
        "backpressure.load",             # Current load vs threshold
    ],
)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Key governance metrics available in your existing dashboards:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 99.3519%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Metric&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Tells You&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Alert Condition&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;slo.safety_sli&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Fraction of agent actions within policy&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt; 0.99&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;slo.burn_rate&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Rate at which error budget is consumed&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;gt; 2.0 (warn), &amp;gt; 5.0 (page)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;slo.error_budget_remaining&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Budget left for the SLO window&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt; 20%&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;circuit_breaker.state&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Current breaker state per agent&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;OPEN or HALF_OPEN&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;trust_score.ring&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Execution ring (privilege level)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Ring 3 (untrusted)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;health.status&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Aggregate health across all dimensions&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;degraded or unhealthy&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
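&lt;P&gt;To make the burn-rate alert conditions concrete, here is a small worked example. It uses the conventional SRE definition of burn rate (observed error rate divided by the error rate the SLO allows); whether agent-sre computes slo.burn_rate exactly this way is an assumption, and the function names are hypothetical. The 0.99 target and the 2.0 and 5.0 alert levels come from the table above.&lt;/P&gt;

```python
def burn_rate(safety_sli, target=0.99):
    """Observed error rate relative to the error rate the SLO allows."""
    allowed_error = 1.0 - target
    observed_error = 1.0 - safety_sli
    return observed_error / allowed_error


def classify(rate, warn_above=2.0, page_above=5.0):
    """Map a burn rate onto the alert levels from the table."""
    if rate > page_above:
        return "page"
    if rate > warn_above:
        return "warn"
    return "ok"


for sli in (0.995, 0.97, 0.94):
    rate = burn_rate(sli)
    print(f"safety_sli={sli} burn_rate={rate:.1f} alert={classify(rate)}")
```

&lt;P&gt;A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO window. At 3.0 it would be gone in a third of the window, which is why a sustained value above 2.0 is worth a warning well before the budget is exhausted.&lt;/P&gt;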
&lt;P&gt;If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack.&lt;/P&gt;
&lt;H2&gt;The SRE Mental Model for Agents: Four New Concepts&lt;/H2&gt;
&lt;P&gt;Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 95.6481%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Traditional SRE&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Agent SRE Equivalent&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What Changes&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Latency SLI&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Safety SLI&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Correctness of &lt;EM&gt;action&lt;/EM&gt;, not speed of &lt;EM&gt;response&lt;/EM&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Error budget&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Autonomy budget&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Burns on policy violations, not just errors&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Circuit breaker&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Behavioral circuit breaker&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Opens on wrong &lt;EM&gt;behavior&lt;/EM&gt;, not just failure codes&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Canary deployment&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Capability rollout&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Rolls out &lt;EM&gt;scope&lt;/EM&gt;, not just code&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The governance insight is that error budgets work in both directions for agents. Within an SLO window, a service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when the safety SLI degrades. The error budget becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions.&lt;/P&gt;
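&lt;P&gt;One way to picture autonomy as a two-way budget is a ring adjustment driven by the safety SLI. The sketch below is illustrative only: adjust_ring is a hypothetical function, the numbering (Ring 0 most privileged, Ring 3 untrusted) is an assumption based on the Ring 3 entry in the metrics table, and the 0.99 and 0.95 thresholds are borrowed from the rollout example rather than from any documented promotion rule.&lt;/P&gt;

```python
def adjust_ring(ring, safety_sli, promote_at=0.99, demote_at=0.95):
    """Move one ring toward more or less privilege based on the safety SLI."""
    # Weak evidence: contract autonomy by moving toward Ring 3 (untrusted).
    if demote_at >= safety_sli:
        return min(3, ring + 1)
    # Strong evidence: expand autonomy by moving toward Ring 0.
    if safety_sli >= promote_at:
        return max(0, ring - 1)
    # In between: keep the current level of autonomy.
    return ring


print(adjust_ring(2, 0.995))  # 1
print(adjust_ring(2, 0.970))  # 2
print(adjust_ring(2, 0.940))  # 3
```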
&lt;H2&gt;Getting Started with Agent SRE&lt;/H2&gt;
&lt;LI-CODE lang="powershell"&gt;pip install agent-sre&lt;/LI-CODE&gt;
&lt;P&gt;A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agent_sre import SLO, ErrorBudget
from agent_sre.slo.indicators import TaskSuccessRate
from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker

# Step 1: Define your safety SLO
slo = SLO(
    name="production-safety",
    indicators=[TaskSuccessRate(target=0.99, window="24h")],
    error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0),
)

# Step 2: Configure a circuit breaker
breaker_config = CircuitBreakerConfig(
    failure_threshold=5,
    recovery_timeout_seconds=60,
    half_open_max_calls=3,
)
breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config)

# Step 3: Wire into your existing agent loop
async def governed_agent_loop(agent, task):
    # Check health first (agent_is_healthy is a placeholder for your own
    # wrapper around the health check shown earlier)
    if not await agent_is_healthy(agent.id):
        return {"error": "agent suspended", "reason": "health check failed"}

    # Run within circuit breaker protection
    async with breaker:
        result = await agent.run(task)
        slo.record_event(good=result.policy_compliant)
        return result&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines.&lt;/P&gt;
&lt;H2&gt;Why This Matters&lt;/H2&gt;
&lt;P&gt;Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down.&lt;/P&gt;
&lt;P&gt;Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it.&lt;/STRONG&gt; Agent SRE is that infrastructure.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;GitHub:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit" target="_blank"&gt;github.com/microsoft/agent-governance-toolkit&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Install:&lt;/STRONG&gt; pip install agent-sre&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tutorials:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://microsoft.github.io/agent-governance-toolkit/tutorials/" target="_blank"&gt;40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Architecture reference:&lt;/STRONG&gt; &lt;A class="lia-external-url" href="https://microsoft.github.io/agent-governance-toolkit/ARCHITECTURE/" target="_blank"&gt;ARCHITECTURE.md&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;OWASP compliance mapping:&lt;/STRONG&gt; OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Part 1&lt;/STRONG&gt; -- &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/agent-governance-toolkit-architecture-deep-dive-policy-engines-trust-and-sre-for/4510105" data-lia-auto-title="Runtime governance: Policy engines, trust, and SRE overview" data-lia-auto-title-active="0" target="_blank"&gt;Runtime governance: Policy engines, trust, and SRE overview&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Part 2&lt;/STRONG&gt; -- &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/shift-left-governance-for-ai-agents-how-the-agent-governance-toolkit-helps-you-c/4516481" data-lia-auto-title="Shift-left governance: Catching violations before production" data-lia-auto-title-active="0" target="_blank"&gt;Shift-left governance: Catching violations before production&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Part 3&lt;/STRONG&gt; -- &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/after-the-agent-acts-proving-what-happened-and-who-authorized-it/4519826" data-lia-auto-title="Post-hoc accountability: After the agent acts" data-lia-auto-title-active="0" target="_blank"&gt;Post-hoc accountability: After the agent acts&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The agent-sre package is currently in public preview; APIs may change before general availability.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Questions about Agent SRE in your environment? Open an issue at &lt;A class="lia-external-url" href="http://aka.ms/agent-governance-toolkit" target="_blank"&gt;aka.ms/agent-governance-toolkit&lt;/A&gt; or start a discussion in the comments below.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 19 May 2026 23:12:29 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/applying-site-reliability-engineering-to-autonomous-ai-agents/ba-p/4521357</guid>
      <dc:creator>mosiddi</dc:creator>
      <dc:date>2026-05-19T23:12:29Z</dc:date>
    </item>
    <item>
      <title>Agentic AI for Linux Operations on Azure: The Prompts</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/agentic-ai-for-linux-operations-on-azure-the-prompts/ba-p/4520924</link>
      <description>&lt;H1&gt;Try This Yourself: Agentic AI for Linux Operations on Azure&lt;/H1&gt;
&lt;P&gt;At Red Hat Summit 2026, I handed GitHub Copilot CLI a terminal and asked it to deploy a full-stack application to RHEL 10 on Azure. Live. From a single prompt. No scripts, no runbooks, no pre-baked automation. The audience watched every command happen in real time and then played the app on their phones.&lt;/P&gt;
&lt;P&gt;This post gives you the prompts so you can try it yourself. Copy them, paste them into Copilot CLI, and watch what happens. The only things you need to change are marked with&amp;nbsp;[EDIT].&lt;/P&gt;
&lt;P&gt;When you're done, you'll have a working Conference Bingo game running on Azure that you can open in your browser and play. The same app that people played live at Summit.&lt;/P&gt;
&lt;H2&gt;What You Need&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure subscription&lt;/STRONG&gt;&amp;nbsp;— any subscription where you can create VMs (a free trial or Visual Studio subscription works)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;GitHub Copilot CLI&lt;/STRONG&gt;&amp;nbsp;— see&amp;nbsp;&lt;A href="https://docs.github.com/en/copilot/how-tos/copilot-cli/set-up-copilot-cli/install-copilot-cli" target="_blank" rel="noopener"&gt;Installing Copilot CLI&lt;/A&gt;&amp;nbsp;for all platforms
&lt;UL&gt;
&lt;LI&gt;macOS/Linux:&amp;nbsp;brew install copilot-cli&amp;nbsp;or&amp;nbsp;curl -fsSL https://gh.io/copilot-install | bash&lt;/LI&gt;
&lt;LI&gt;Windows:&amp;nbsp;winget install GitHub.Copilot&amp;nbsp;or use the install script in WSL&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;GitHub Copilot subscription — Individual, Business, or Enterprise (&lt;A class="lia-external-url" href="https://github.com/features/copilot" target="_blank"&gt;https://github.com/features/copilot&lt;/A&gt;)&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;SSH key pair&lt;/STRONG&gt;&amp;nbsp;at&amp;nbsp;~/.ssh/id_rsa&amp;nbsp;— generate with&amp;nbsp;ssh-keygen&amp;nbsp;if you don't have one&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure CLI&lt;/STRONG&gt;&amp;nbsp;authenticated — run&amp;nbsp;az login&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;A Linux machine or WSL&lt;/STRONG&gt;&amp;nbsp;with Ansible installed (for Prompt 2 only)&lt;/LI&gt;
&lt;LI&gt;~30 minutes total&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Before You Start&lt;/H2&gt;
&lt;LI-CODE lang=""&gt;az login 
az account set --subscription "[EDIT] Your Subscription Name"&lt;/LI-CODE&gt;
&lt;P&gt;That's the only setup. Everything else is in the prompts.&lt;/P&gt;
&lt;H2&gt;Choose Your Linux Distribution&lt;/H2&gt;
&lt;P&gt;These prompts work with any Azure-endorsed Linux distribution. Pick one and use its image URN in Prompt 0:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Distribution&lt;/th&gt;&lt;th&gt;Image URN&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;RHEL 10&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;RedHat:RHEL:10-lvm-gen2:latest&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;RHEL 9&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;RedHat:RHEL:9-lvm-gen2:latest&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Ubuntu 24.04&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Canonical:ubuntu-24_04-lts:server:latest&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Azure Linux&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;EM&gt;Coming soon — check&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-machines/linux/endorsed-distros" target="_blank" rel="noopener"&gt;endorsed distros&lt;/A&gt;&amp;nbsp;for availability&lt;/EM&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;SUSE 15 SP6&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;SUSE:sles-15-sp6:gen2:latest&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;AlmaLinux 9&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;
&lt;P&gt;almalinux:almalinux-x86_64:9-gen2:latest&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Rocky Linux 9&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;
&lt;P&gt;ciq:rlc-plus:rocky9:latest&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Oracle Linux 10&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Oracle:Oracle-Linux:ol10-lvm-gen2:latest&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Debian 12&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Debian:debian-12:12-gen2:latest&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Copilot adapts its package management (dnf&amp;nbsp;vs&amp;nbsp;apt&amp;nbsp;vs&amp;nbsp;zypper), firewall (firewalld&amp;nbsp;vs&amp;nbsp;ufw), and security configuration (SELinux vs AppArmor) to the distro automatically. That's the point.&lt;/P&gt;
&lt;P&gt;For the full list, see&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-machines/linux/endorsed-distros" target="_blank" rel="noopener"&gt;Azure endorsed Linux distributions&lt;/A&gt;.&lt;/P&gt;
&lt;H2&gt;Prompt 0: Create the Infrastructure&lt;/H2&gt;
&lt;P&gt;This creates two blank Linux VMs, networking, and firewall rules. Nothing gets installed. That's Prompt 1's job.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy and paste into Copilot CLI:&lt;/STRONG&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Create a resource group called "copilot-demo" in eastus with a vnet (10.0.0.0/16, one subnet 10.0.1.0/24). Provision two Linux VMs — "demo-db" and "demo-app" — using image URN &lt;STRONG&gt;[EDIT] RedHat:RHEL:10-lvm-gen2:latest&lt;/STRONG&gt;, my ~/.ssh/id_rsa key, and Standard_B2s size. Add NSG rules on both VMs for port 22 (SSH), and on demo-app for ports 80 and 443. Don't SSH into anything — just create the resources and tell me the IPs.&lt;/P&gt;
&lt;P&gt;When you're done, create a file at ~/.copilot/demo-env.md with the VM public and private IPs, resource group name, SSH user (azureuser), and SSH key path. Include a note that the app uses nginx as the reverse proxy (not Apache). Include a 'Demo Behavior Instructions' section with these rules: run SSH/SCP as direct commands (not Invoke-Expression), let output stream visibly (no tail/head), use SCP + bash scripts for complex quoting, no approval prompts, and hardcode SSH flags inline (-o StrictHostKeyChecking=no).&lt;/P&gt;
&lt;P&gt;After writing demo-env.md, verify it by reading it back and confirming the IPs match the VMs you just created. Run "az vm list-ip-addresses --resource-group copilot-demo -o table" and compare. If they don't match, fix it immediately. This file is the source of truth for every subsequent prompt.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;What to expect:&lt;/STRONG&gt;&amp;nbsp;Copilot creates the resource group, VNet, subnet, two VMs, and NSG rules. It writes an environment file that subsequent prompts reference. ~5 minutes.&lt;/P&gt;
&lt;H2&gt;Prompt 1: Deploy the Application&lt;/H2&gt;
&lt;P&gt;This is the big one. One prompt deploys PostgreSQL, Nginx, a Flask app, firewall rules, security configuration, and TLS — all from scratch.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy and paste into Copilot CLI:&lt;/STRONG&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Read ~/.copilot/demo-env.md for the environment, then:&lt;/P&gt;
&lt;P&gt;Configure and deploy the conference bingo game from https://github.com/karlabbott/conference-bingo to the demo-app VM. I have two fresh Linux VMs already running in the "copilot-demo" resource group: demo-db for PostgreSQL and demo-app for the app, on the same vnet. SSH key is ~/.ssh/id_rsa, user is azureuser.&lt;/P&gt;
&lt;P&gt;Deploy the app to /srv/conference-bingo to avoid SELinux home directory issues. Use nginx as the reverse proxy (as specified in the README), not the Apache configs in the deploy/ directory. Run commands individually over SSH. Configure the firewall to allow HTTP and HTTPS. If SELinux is enforcing, configure it appropriately. SCP a .sql file for PostgreSQL setup rather than inlining SQL through SSH. Install certbot via pip if you have a domain, otherwise use a self-signed certificate. Write secrets to ~/.config.env and copy to /etc/bingo.env for the systemd service. Use &lt;STRONG&gt;[EDIT] your-email@example.com&lt;/STRONG&gt; for certs.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;What to expect:&lt;/STRONG&gt;&amp;nbsp;Copilot SSHs into both VMs and handles everything — packages, database, app deployment, web server, security, TLS. ~10-15 minutes.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What to watch for:&lt;/STRONG&gt;&amp;nbsp;How Copilot adapts to your distro. On RHEL, it uses&amp;nbsp;dnf, sets SELinux booleans like&amp;nbsp;httpd_can_network_connect, runs&amp;nbsp;initdb&amp;nbsp;for PostgreSQL, and configures&amp;nbsp;firewalld. On Ubuntu, it uses&amp;nbsp;apt, skips&amp;nbsp;initdb, and sets up&amp;nbsp;ufw. Same prompt, different execution path. When something fails, watch it read the error and adapt.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;When it finishes:&lt;/STRONG&gt; Open&amp;nbsp;https://&amp;lt;demo-app-public-ip&amp;gt;&amp;nbsp;in your browser (accept the self-signed certificate warning if you didn't use a domain). You should see Conference Bingo running — enter your name and play. This is the same app people played live on their phones at Red Hat Summit.&lt;/P&gt;
&lt;H2&gt;Prompt 2: Add Observability with Ansible&lt;/H2&gt;
&lt;P&gt;This demonstrates the "explore with Copilot, codify with Ansible" pattern. The monitoring stack is an Ansible playbook that deploys Azure Monitor Agent, Log Analytics, Data Collection Rules, and a Managed Grafana dashboard.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Prerequisites:&lt;/STRONG&gt; Ansible installed on Linux or WSL. On Windows, use WSL and prefix commands with&amp;nbsp;export PATH=$HOME/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin.&amp;nbsp;&lt;EM&gt;(Note: You may have to adjust this prompt to tell GitHub Copilot where your Ansible is installed.)&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Copy and paste into Copilot CLI:&lt;/STRONG&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Read ~/.copilot/demo-env.md for the environment, then:&lt;/P&gt;
&lt;P&gt;Clone https://github.com/karlabbott/wordblitz-monitoring-ansible, copy group_vars/all.yml.example to group_vars/all.yml, and fill it in using the subscription ID from "az account show", resource group copilot-demo, location eastus, the VM names and IPs from demo-env.md, and ssh_user azureuser. Use "demo-law" for law_name and "demo-grafana" for grafana_name.&lt;/P&gt;
&lt;P&gt;Install the azure.azcollection Ansible collection and its pip requirements, then run the playbook with:&lt;/P&gt;
&lt;P&gt;ANSIBLE_AZURE_AUTH_SOURCE=cli ansible-playbook -i localhost, site.yml&lt;/P&gt;
&lt;P&gt;Print the Grafana dashboard URL when done and update demo-env.md with the Grafana URL and Log Analytics Workspace resource ID.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;What to expect:&lt;/STRONG&gt;&amp;nbsp;The playbook creates Azure monitoring resources, installs AMA on both VMs, configures data collection, deploys a Grafana dashboard, and — importantly — deploys a script called&amp;nbsp;turbo.sh&amp;nbsp;to the database VM that creates a real performance problem for Prompt 3. ~8-10 minutes.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What is turbo.sh?&lt;/STRONG&gt;&amp;nbsp;The playbook deploys this to simulate a production incident:&lt;/P&gt;
&lt;LI-CODE lang=""&gt; #!/bin/bash
 # Observability performance optimizations: stress-tests PostgreSQL to validate
 # monitoring pipeline throughput under sustained high-concurrency workloads.
 # Stop: sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%turbo_perf%';"
 
 # Phase 1: 8 CPU-burner loops (cross joins)
 for i in $(seq 1 8); do
   while true; do
     sudo -u postgres psql -d conference_bingo -c \
       "/* turbo_perf */ SELECT count(*) FROM bingo_squares a CROSS JOIN bingo_squares b CROSS JOIN bingo_squares c CROSS JOIN bingo_squares d CROSS JOIN bingo_squares e;" &amp;gt; 
/dev/null 2&amp;gt;&amp;amp;1
   done &amp;amp;
 done
 
 # Phase 2: 25 connection hogs that sleep in a transaction
 for i in $(seq 1 25); do
   while true; do
     sudo -u postgres psql -d conference_bingo -c \
       "/* turbo_perf */ SELECT pg_sleep(5);" &amp;gt; /dev/null 2&amp;gt;&amp;amp;1
   done &amp;amp;
 done
 
 echo "Turbo perf test started: 8 cross-join loops + 25 connection workers"
 echo "Observability pipeline should show load within seconds"&lt;/LI-CODE&gt;
&lt;P&gt;It fires 8 parallel cross-join loops that saturate every CPU core on the database VM, plus 25 connection hogs that use up PostgreSQL's available connections. The turbo Ansible role further reduces max_connections to 30 to make the problem worse. The result: the app slows to a crawl. Try playing bingo now — you'll feel it.&lt;/P&gt;
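&lt;P&gt;The arithmetic behind that slowdown is easy to check. The sketch below uses the worker counts from turbo.sh and the max_connections value of 30 mentioned above. The row count of 25 for bingo_squares is purely a hypothetical figure (a 5x5 card) chosen to show how quickly a five-way cross join grows; the real table size may differ.&lt;/P&gt;

```python
cpu_loops = 8          # Phase 1 workers in turbo.sh
connection_hogs = 25   # Phase 2 workers in turbo.sh
max_connections = 30   # value the turbo role sets, per the text above

# Each worker runs psql in a loop, so each can hold one session at a time.
peak_sessions = cpu_loops + connection_hogs
shortfall = peak_sessions - max_connections
print(peak_sessions, shortfall)  # 33 3


def cross_join_rows(table_rows, tables=5):
    """Rows produced by cross joining a table with itself `tables` times."""
    return table_rows ** tables


# Hypothetical table size of 25 rows.
print(cross_join_rows(25))  # 9765625
```

&lt;P&gt;With up to 33 sessions competing for 30 slots, the load script alone can exceed the connection limit at peak, which leaves little or no room for the application's own connections.&lt;/P&gt;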
&lt;P&gt;&lt;STRONG&gt;Why Ansible matters here:&lt;/STRONG&gt;&amp;nbsp;Agents are non-deterministic — the same prompt might take different steps each time. That's fine for exploration. But when you need to reproduce this in staging, then production, then for the next team, you need determinism. The playbook is idempotent, repeatable, auditable. It's in git, it's reviewed in PRs, and it IS the documentation. You explore with Copilot, then codify with Ansible.&lt;/P&gt;
&lt;H2&gt;Prompt 3: Ask Copilot What's Wrong&lt;/H2&gt;
&lt;P&gt;The turbo script is already running from Prompt 2. Your app should be slow. Now ask Copilot to figure out why — from a symptom alone:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;My app feels really slow. Can you tell me why? Let's review before making any changes.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;That's it. One sentence plus a guardrail.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What to expect:&lt;/STRONG&gt;&amp;nbsp;Copilot SSHs in, checks system load, examines running processes, finds the cross-join queries, reads&amp;nbsp;turbo.sh, reverse-engineers the attack, explains the root cause, and offers to kill the processes. ~2-3 minutes.&lt;/P&gt;
&lt;H2&gt;Prompt 4: Generate an Incident Postmortem&lt;/H2&gt;
&lt;P&gt;After fixing the issue, ask Copilot to document what happened — from the same conversation:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Write an incident postmortem for what just happened — root cause, impact, how you diagnosed it, how you resolved it, and a recommendation to prevent it from happening again. Save it as a Word document at ~/Desktop/incident-postmortem.docx using python-docx, and open it.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;STRONG&gt;What to expect:&lt;/STRONG&gt;&amp;nbsp;A formatted Word document with root cause analysis, timeline, remediation steps, and prevention recommendations. The full loop: build, monitor, break, fix, document — one session. ~30 seconds.&lt;/P&gt;
&lt;H2&gt;Cleanup&lt;/H2&gt;
&lt;LI-CODE lang=""&gt;az group delete --name copilot-demo --yes --no-wait&lt;/LI-CODE&gt;
&lt;H2&gt;What I Learned Doing This Live&lt;/H2&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;A well-crafted prompt replaces a 50-step runbook.&lt;/STRONG&gt;&amp;nbsp;Your intent is the source of truth. The agent figures out the steps.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Explore with Copilot, codify with Ansible.&lt;/STRONG&gt;&amp;nbsp;Copilot gets you to working fast. Ansible keeps it working forever.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Understanding comes before abstraction.&lt;/STRONG&gt;&amp;nbsp;Don't start with the playbook. Start with the exploration. The playbook comes after.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The danger with AI isn't that machines think. It's that we stop thinking because the output looks fine.&lt;/STRONG&gt;&amp;nbsp;Always review. Understand the blast radius. Start in non-production.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;AI removes the scaffolding. What remains is judgment.&lt;/STRONG&gt;&amp;nbsp;Technical correctness and the instinct to know when something is wrong — that's what the tools cannot replace. And that's what made me stop worrying about being replaced by them.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Conference Bingo App:&lt;/STRONG&gt;&amp;nbsp;&lt;A class="lia-external-url" href="https://github.com/karlabbott/conference-bingo" target="_blank" rel="noopener"&gt;https://github.com/karlabbott/conference-bingo&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Monitoring Playbook:&lt;/STRONG&gt;&amp;nbsp;&lt;A class="lia-external-url" href="https://github.com/karlabbott/wordblitz-monitoring-ansible" target="_blank" rel="noopener"&gt;https://github.com/karlabbott/wordblitz-monitoring-ansible&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Interactive Walkthrough:&lt;/STRONG&gt;&amp;nbsp;&lt;A class="lia-external-url" href="https://summit.99b.org" target="_blank" rel="noopener"&gt;https://summit.99b.org&lt;/A&gt;&amp;nbsp;— the full talk with audio narration and demo videos&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;GitHub Copilot CLI:&lt;/STRONG&gt;&amp;nbsp;&lt;A class="lia-external-url" href="https://docs.github.com/en/copilot/how-tos/copilot-cli/set-up-copilot-cli/install-copilot-cli" target="_blank" rel="noopener"&gt;https://docs.github.com/en/copilot/how-tos/copilot-cli/set-up-copilot-cli/install-copilot-cli&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure endorsed Linux distributions:&lt;/STRONG&gt;&amp;nbsp;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-machines/linux/endorsed-distros" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/virtual-machines/linux/endorsed-distros&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 19 May 2026 18:25:43 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/agentic-ai-for-linux-operations-on-azure-the-prompts/ba-p/4520924</guid>
      <dc:creator>abbottkarl</dc:creator>
      <dc:date>2026-05-19T18:25:43Z</dc:date>
    </item>
    <item>
      <title>After the Agent Acts: Proving What Happened and Who Authorized It</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/after-the-agent-acts-proving-what-happened-and-who-authorized-it/ba-p/4519826</link>
      <description>&lt;P&gt;In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/agent-governance-toolkit-architecture-deep-dive-policy-engines-trust-and-sre-for/4510105" target="_blank" rel="noopener" data-lia-auto-title="part one" data-lia-auto-title-active="0"&gt;part one&lt;/A&gt; of this series, we covered AGT's runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/shift-left-governance-for-ai-agents-how-the-agent-governance-toolkit-helps-you-c/4516481" target="_blank" rel="noopener" data-lia-auto-title="part two" data-lia-auto-title-active="0"&gt;part two&lt;/A&gt;, we moved earlier in the lifecycle: shift-left governance, CI/CD gates, attestation workflows, and supply chain integrity.&lt;/P&gt;
&lt;P&gt;Both posts focused on governance that happens &lt;STRONG&gt;around the moment of action&lt;/STRONG&gt;, before it, during it, or right after it. That coverage is essential. But after those posts went live, a different pattern emerged in conversations with teams deploying agents in production. The question was more pointed:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;"An agent executed a financial transfer last Tuesday. A compliance officer is asking us to show who authorized it, through what chain, and exactly what scope it was granted. We have logs. But can we prove they weren't altered?"&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;No policy engine prevents a past action. No CI gate reconstructs a delegation chain after the fact. No shift-left tool tells an auditor whether the cryptographic identity that authorized a trade was legitimately derived from a human principal, or was injected mid-chain.&lt;/P&gt;
&lt;P&gt;This is the accountability gap. It is the governance question that neither runtime enforcement nor pre-runtime checks were designed to answer. Regulatory frameworks are tightening: the EU AI Act includes high-risk obligations with enforcement timelines in 2026, and the Colorado AI Act introduces requirements for automated decision-making. Courts are beginning to encounter AI agents in the evidentiary record. The accountability infrastructure has not caught up.&lt;/P&gt;
&lt;P&gt;This post covers what post-hoc accountability means for autonomous agents, what the Agent Governance Toolkit has to help address it, and three value propositions that are real but not yet visible in how governance tooling is typically described.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note:&lt;/STRONG&gt; The policy files, workflow configurations, and code samples in this post are illustrative examples designed to show the concepts. For working implementations, see the &lt;A class="lia-external-url" href="https://microsoft.github.io/agent-governance-toolkit/quickstart/" target="_blank" rel="noopener"&gt;QUICKSTART.md&lt;/A&gt; in the repository.&lt;/P&gt;
&lt;H2&gt;The Accountability Gap in Multi-Agent Systems&lt;/H2&gt;
&lt;P&gt;The accountability problem is architectural. When a single agent takes a single action, accountability is straightforward: you know which model ran, what prompt it received, and what it called. When agents delegate to sub-agents, which delegate further to tool-execution agents, the chain of authorization becomes progressively disconnected from the original human instruction that started it.&lt;/P&gt;
&lt;P&gt;Consider this delegation topology, common in any production orchestration scenario:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;Human Principal
  └── Orchestrator Agent (did:mesh:orchestrator-001)
        └── Data Analyst Agent (did:mesh:analyst-001)
              └── File Write Tool (write /reports/q3-summary.csv)&lt;/LI-CODE&gt;
&lt;P&gt;By the time the file write tool fires, three delegation hops have occurred. The tool has no reliable way to know whether the human principal actually authorized file writes, what scope they granted to the orchestrator, or whether the analyst agent's instructions arrived through a legitimate delegation or were inserted by a prompt injection attack.&lt;/P&gt;
&lt;P&gt;This gap has three concrete consequences:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Consequence&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Operational Impact&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Post-hoc audits cannot reconstruct authorization&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Incident investigations are limited to "the agent did this," not "here is who authorized this, through what chain, at what time, with what scope"&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agents cannot distinguish legitimate delegation from injection&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;A prompt injection attack that inserts itself into a delegation chain is indistinguishable from a real orchestrator instruction without cryptographic verification&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Accountability cannot be attributed to a human authorization event&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;When a regulator asks "who is responsible for this action," the answer is a shrug and a log file&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;AGT already has the technical foundations designed to help close all three. The gap is not capability; it is visibility.&lt;/P&gt;
&lt;H2&gt;What AGT Has: The Cryptographic Accountability Stack&lt;/H2&gt;
&lt;P&gt;AGT's accountability infrastructure spans three components that work together: cryptographic agent identity, delegation chains, and tamper-evident audit logs.&lt;/P&gt;
&lt;H3&gt;1. Ed25519 Agent Identity with Lifecycle Management&lt;/H3&gt;
&lt;P&gt;Every agent in an AGT-governed system carries a cryptographic identity: a verifiable Ed25519 keypair with a W3C DID Document that can be exported, shared, and verified by any participant in the system.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agentmesh import AgentIdentity, IdentityRegistry

# Create a verifiable agent identity
identity = AgentIdentity.create(
    name="data-analyst",
    sponsor="operator@contoso.com",
    capabilities=["data.read", "report.write"],
    organization="data-team",
    description="Q3 close data analyst agent"
)

# Export as W3C DID Document for cross-system verification
did_document = identity.to_did_document()

# Register in the shared identity registry
registry = IdentityRegistry()
registry.register(identity)
&lt;/LI-CODE&gt;
&lt;P&gt;Identity lifecycle states (active, suspended, revoked) are tracked and cascaded. When an orchestrator identity is revoked, every downstream agent delegated from it is also invalidated. This cascade revocation lets you kill a compromised delegation chain from its root rather than hunting down sub-agents individually.&lt;/P&gt;
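&lt;P&gt;The cascade can be pictured with a small self-contained sketch. It does not use the agentmesh API: the ToyRegistry class, its method names, and the extra DIDs are hypothetical and exist only to illustrate how revoking a root can invalidate everything delegated from it.&lt;/P&gt;

```python
from collections import defaultdict


class ToyRegistry:
    """Hypothetical stand-in for an identity registry (not the agentmesh API)."""

    def __init__(self):
        self.status = {}                    # maps a DID to its lifecycle state
        self.children = defaultdict(list)   # maps a parent DID to the DIDs it delegated to

    def register(self, did, parent=None):
        self.status[did] = "active"
        if parent is not None:
            self.children[parent].append(did)

    def revoke(self, did):
        """Revoke an identity and every identity delegated from it."""
        pending = [did]
        while pending:
            current = pending.pop()
            self.status[current] = "revoked"
            pending.extend(self.children[current])


registry = ToyRegistry()
registry.register("did:mesh:orchestrator-001")
registry.register("did:mesh:analyst-001", parent="did:mesh:orchestrator-001")
registry.register("did:mesh:writer-001", parent="did:mesh:analyst-001")
registry.register("did:mesh:unrelated-001")

# Revoking the root takes the whole delegated subtree with it.
registry.revoke("did:mesh:orchestrator-001")
revoked = sorted(d for d, s in registry.status.items() if s == "revoked")
print(revoked)
```

&lt;P&gt;In this toy model the unrelated agent stays active, which is the property that makes root-level revocation safe to use during an incident.&lt;/P&gt;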
&lt;H3&gt;2. Delegation Chains with Scope Inheritance&lt;/H3&gt;
&lt;P&gt;When an orchestrator delegates to a sub-agent, AGT records the delegation cryptographically: who delegated, to whom, what capabilities were transferred, and what restrictions were applied. Sub-agents are designed to be unable to exceed the scope of their delegating principal.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agentmesh import ScopeChain, DelegationLink

# orchestrator_identity and analyst_identity are AgentIdentity objects
# created as in the previous example.

# Create a scope chain rooted in a human sponsor
chain, root_link = ScopeChain.create_root(
    sponsor_email="operator@contoso.com",
    root_agent_did=str(orchestrator_identity.did),
    capabilities=["data.read", "report.write", "data.delete"],
    sponsor_verified=True,
)

# Orchestrator delegates narrowed scope to analyst agent
link = DelegationLink(
    link_id="link-analyst-001",
    depth=1,
    parent_did=str(orchestrator_identity.did),
    child_did=str(analyst_identity.did),
    parent_capabilities=["data.read", "report.write", "data.delete"],
    delegated_capabilities=["data.read", "report.write"],  # narrowed: no delete
    parent_signature=orchestrator_identity.sign(
        f"{orchestrator_identity.did}:{analyst_identity.did}:data.read,report.write".encode()
    ),
    link_hash="",  # placeholder; computed below once the link is built
    previous_link_hash=root_link.link_hash,
)
link.link_hash = link.compute_hash()
chain.add_link(link)

# Verify the entire chain: scope narrowing + hash integrity + signatures
valid, reason = chain.verify()
if not valid:
    raise ValueError(f"Chain verification failed: {reason}")&lt;/LI-CODE&gt;
&lt;P&gt;The scope chain carries the human authorization context: the root sponsor email, when the chain was created, and what capabilities were granted at the top. Every downstream agent can trace any capability back through the chain using &lt;STRONG&gt;chain.trace_capability("data.read")&lt;/STRONG&gt;. A file write tool executing three hops from the human principal can verify that the original sponsor authorized file writes in this scope. This is the mechanism designed to help close the prompt injection gap: an injected instruction cannot produce a valid signed delegation link from a legitimate orchestrator identity.&lt;/P&gt;
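&lt;P&gt;The scope-narrowing rule and the capability trace can be sketched without any cryptography. The ToyLink class and both helper functions are hypothetical names for illustration, not the agentmesh implementation, and the sketch deliberately leaves out the signature and hash checks that a real chain verification would also perform.&lt;/P&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLink:
    """Hypothetical delegation record; field names are illustrative only."""
    parent: str
    child: str
    parent_capabilities: frozenset
    delegated_capabilities: frozenset


def verify_narrowing(links):
    """Return (True, "") if every hop delegates a subset of what its parent held."""
    for hop, link in enumerate(links):
        extra = link.delegated_capabilities - link.parent_capabilities
        if extra:
            return False, f"hop {hop}: {link.child} gained {sorted(extra)}"
        # Each hop must also start from what the previous hop actually delegated.
        if hop and link.parent_capabilities != links[hop - 1].delegated_capabilities:
            return False, f"hop {hop}: parent scope does not match previous hop"
    return True, ""


def trace_capability(links, capability):
    """List the holders of a capability from the root sponsor down the chain."""
    holders = []
    for link in links:
        if capability not in link.delegated_capabilities:
            break
        holders = holders or [link.parent]
        holders.append(link.child)
    return holders


full = frozenset({"data.read", "report.write", "data.delete"})
chain = [
    ToyLink("operator@contoso.com", "did:mesh:orchestrator-001", full, full),
    ToyLink("did:mesh:orchestrator-001", "did:mesh:analyst-001",
            full, frozenset({"data.read", "report.write"})),
]
print(verify_narrowing(chain))
print(trace_capability(chain, "data.delete"))
```

&lt;P&gt;Tracing data.delete stops at the orchestrator because the analyst never received it, while a link that tries to hand out a capability its parent does not hold fails the narrowing check.&lt;/P&gt;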
&lt;H3&gt;3. Tamper-Evident Audit Logs&lt;/H3&gt;
&lt;P&gt;Every policy decision, every delegation event, every tool call, every trust score evaluation: AGT writes a signed, append-only audit record. The signature covers the content hash of the log entry plus the hash of the preceding entry, forming a chain where tampering is designed to be detectable.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from agentmesh import PolicyEngine, AuditLog

# Create the audit log (with optional external sink for production)
audit_log = AuditLog()

# Log a governance decision
entry = audit_log.log(
    event_type="policy_decision",
    agent_did=str(analyst_identity.did),
    action="report.write",
    resource="/reports/q3-summary.csv",
    data={"task_id": "q3-close-2026"},
    outcome="success",
    policy_decision="allow",
)

# Verify the audit chain has not been tampered with
valid, reason = audit_log.verify_chain()
# valid == True: all hashes and chain links are intact

# Query audit trail for a specific agent
trail = audit_log.get_entries_for_agent(str(analyst_identity.did))&lt;/LI-CODE&gt;
&lt;P&gt;The audit trail for a single task session includes the complete delegation chain, from human authorization event at the top to tool execution at the bottom, with cryptographic signatures at every step.&lt;/P&gt;
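&lt;P&gt;To make the hash-chaining idea concrete, here is a minimal standard-library sketch. It is not AGT's implementation: the function names and record layout are hypothetical, and an HMAC with a shared demo key stands in for the Ed25519 signatures described above, purely so the example runs without third-party packages.&lt;/P&gt;

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical; a real log would use asymmetric keys
GENESIS = "0" * 64


def _digest(entry, previous_hash):
    """Hash the canonical JSON of an entry together with the previous entry's hash."""
    payload = json.dumps(entry, sort_keys=True).encode() + previous_hash.encode()
    return hashlib.sha256(payload).hexdigest()


def append(log, entry):
    """Append an entry whose hash and signature cover the preceding entry's hash."""
    previous_hash = log[-1]["hash"] if log else GENESIS
    entry_hash = _digest(entry, previous_hash)
    signature = hmac.new(SIGNING_KEY, entry_hash.encode(), hashlib.sha256).hexdigest()
    log.append({"entry": entry, "prev": previous_hash, "hash": entry_hash, "sig": signature})


def verify_chain(log):
    """Recompute every hash and signature; report the first entry that fails."""
    previous_hash = GENESIS
    for index, record in enumerate(log):
        expected = _digest(record["entry"], previous_hash)
        expected_sig = hmac.new(SIGNING_KEY, expected.encode(), hashlib.sha256).hexdigest()
        if record["prev"] != previous_hash or record["hash"] != expected:
            return False, f"hash mismatch at entry {index}"
        if not hmac.compare_digest(record["sig"], expected_sig):
            return False, f"bad signature at entry {index}"
        previous_hash = record["hash"]
    return True, "ok"


log = []
append(log, {"agent": "did:mesh:analyst-001", "action": "data.read", "decision": "allow"})
append(log, {"agent": "did:mesh:analyst-001", "action": "report.write", "decision": "allow"})
intact = verify_chain(log)

log[0]["entry"]["decision"] = "deny"   # simulate after-the-fact tampering
tampered = verify_chain(log)
print(intact, tampered)
```

&lt;P&gt;Because each hash covers the previous one, editing an early entry invalidates that entry and everything after it unless the attacker can also re-sign the whole chain.&lt;/P&gt;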
&lt;H2&gt;Validating a Compliance Evidence Package&lt;/H2&gt;
&lt;P&gt;The three components above are most powerful when used together. At runtime, AGT's audit chain, identity registry, and delegation system each produce structured records. Assembling these into a single evidence package for compliance submission or incident investigation is a deployment-level concern: your CI pipeline or orchestration layer collects the outputs into a JSON artifact.&lt;/P&gt;
&lt;P&gt;Once assembled, AGT's &lt;STRONG&gt;agt verify --evidence&lt;/STRONG&gt; flag validates the package: checking that signatures are intact, delegation chains are complete, and audit entries have not been tampered with.&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;# Validate a runtime evidence package
agt verify --evidence ./agt-evidence.json

# Strict mode: fail if evidence is missing, incomplete, or signatures don't verify
agt verify --evidence ./agt-evidence.json --strict&lt;/LI-CODE&gt;
&lt;P&gt;Future direction: A built-in &lt;STRONG&gt;agt evidence collect&lt;/STRONG&gt; command to automate evidence assembly is on the backlog.&lt;/P&gt;
&lt;P&gt;The evidence package helps answer the audit questions directly:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Auditor Question&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Where It Lives in the Evidence Package&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Which agent executed this action?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;identity.agent_id with Ed25519 public key&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Who authorized it?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;delegation_chain[0].human_principal with timestamp&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;What scope was granted?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;delegation_chain[*].granted_capabilities at each hop&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Was the delegation legitimate?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;delegation_chain[*].signature, verifiable against issuer's public key&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Was the audit log altered?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;audit_trail.chain_valid: true/false with entry-level hash verification&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;What policy governed the action?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;policy_decision.rule_name with the policy YAML snapshot at decision time&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;This is the difference between "we have logs" and "here is a verifiable chain of custody backed by cryptographic signatures."&lt;/P&gt;
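&lt;P&gt;Since assembly is left to the deployment, a collection step might look roughly like the sketch below. The top-level keys follow the field names in the table above, but the exact schema that agt verify expects is an assumption here, as are the assemble_evidence helper and the placeholder values; check the repository documentation before relying on any of it.&lt;/P&gt;

```python
import json

REQUIRED = ("identity", "delegation_chain", "audit_trail", "policy_decision")


def assemble_evidence(identity, delegation_chain, audit_trail, policy_decision):
    """Bundle runtime records into one JSON-serializable evidence package."""
    package = {
        "identity": identity,
        "delegation_chain": delegation_chain,
        "audit_trail": audit_trail,
        "policy_decision": policy_decision,
    }
    # Refuse to emit a package with an empty section rather than fail later in review.
    missing = [key for key in REQUIRED if not package[key]]
    if missing:
        raise ValueError(f"evidence package is missing: {missing}")
    return package


package = assemble_evidence(
    identity={"agent_id": "did:mesh:analyst-001", "public_key": "PLACEHOLDER"},
    delegation_chain=[
        {
            "human_principal": "operator@contoso.com",
            "granted_capabilities": ["data.read", "report.write"],
            "signature": "PLACEHOLDER",
        },
    ],
    audit_trail={"chain_valid": True, "entries": 2},
    policy_decision={"rule_name": "allow-report-write"},
)
evidence_json = json.dumps(package, indent=2)
print(evidence_json)
```

&lt;P&gt;In a pipeline, the resulting string would be written to a file such as ./agt-evidence.json and then passed to the verify command shown earlier.&lt;/P&gt;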
&lt;H2&gt;The Governance Dial: Enabling Autonomy, Not Just Blocking Risk&lt;/H2&gt;
&lt;P&gt;There is a framing problem in how agent governance is typically described. It is presented almost entirely as a constraint: what agents cannot do, what gets blocked, what violations get caught. This framing is accurate but incomplete.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Governance is the mechanism that helps you safely expand what your agents can do.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Without governance evidence, every expansion of agent autonomy is a leap of faith. With it, expansions are decisions with a measured risk profile:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Scenario&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Without Governance Evidence&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;With AGT Accountability Stack&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Expand agent to write to production databases&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Requires human approval on every write indefinitely&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pilot with a human in the loop for 500 writes; audit trail shows 0 violations; graduate to autonomous&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Deploy agent in a regulated data environment&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Blocked by legal until "we can prove it"&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Evidence package helps satisfy audit requirement; deployment proceeds&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Respond to a security incident involving an agent&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Manually reconstruct what happened from scattered logs&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pull the task session's evidence package; full chain of custody in minutes&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The governance layer is the dial between supervised and autonomous operation. Audit evidence is what helps justify turning the dial further in the autonomous direction.&lt;/P&gt;
&lt;H2&gt;Blast Radius: The Governance Assurance You're Not Advertising&lt;/H2&gt;
&lt;P&gt;The sandboxing and privilege ring system in AGT is typically described in security terms: isolation, privilege reduction, process-level enforcement. But there is a more concrete operational value: &lt;STRONG&gt;blast radius definition before an incident occurs&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;The question every operations team needs to answer before deploying an autonomous agent at scale is:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;"When this agent goes wrong (not if, but when), what is the worst-case outcome?"&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Without governance-enforced privilege boundaries, the answer is uncomfortably open-ended. With AGT's capability model and execution rings, the blast radius is a policy configuration: a bounded, declared set of resources the agent can touch, scoped to what the task requires.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# policies/financial-agent.yaml
apiVersion: governance.toolkit/v1
version: "1.0"
name: financial-agent-policy
default_action: deny

rules:
  - name: allow-report-write
    condition: "tool_name == 'report.write' and path.startswith('/data/reports/')"
    action: allow
    priority: 10
  - name: allow-data-read
    condition: "tool_name == 'data.read' and path.startswith('/data/processed/')"
    action: allow
    priority: 10&lt;/LI-CODE&gt;
&lt;P&gt;With this policy in place, the worst-case outcome for this agent is declared in the policy file, not discovered during a post-incident review. The audit log records not just what the agent did, but also every action that was blocked, giving you a full picture of how close any session came to the declared blast boundary.&lt;/P&gt;
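&lt;P&gt;A default-deny evaluation of those two rules can be sketched in a few lines. This is an illustration rather than AGT's policy engine: the conditions are hand-translated into Python predicates, the evaluate function is hypothetical, and treating a higher priority number as "checked first" is an assumption.&lt;/P&gt;

```python
RULES = [
    # (name, priority, predicate, action); predicates mirror the YAML conditions above
    ("allow-report-write", 10,
     lambda tool, path: tool == "report.write" and path.startswith("/data/reports/"),
     "allow"),
    ("allow-data-read", 10,
     lambda tool, path: tool == "data.read" and path.startswith("/data/processed/"),
     "allow"),
]
DEFAULT_ACTION = "deny"


def evaluate(tool_name, path):
    """Return (action, rule_name); fall back to the default when no rule matches."""
    for name, _priority, predicate, action in sorted(RULES, key=lambda rule: -rule[1]):
        if predicate(tool_name, path):
            return action, name
    return DEFAULT_ACTION, None


requests = [
    ("report.write", "/data/reports/q3-summary.csv"),
    ("data.read", "/data/processed/ledger.parquet"),
    ("data.delete", "/data/processed/ledger.parquet"),
    ("report.write", "/etc/passwd"),
]
# Record every decision, including denials, the way an audit log would.
audit = [(tool, path, *evaluate(tool, path)) for tool, path in requests]
blocked = [row for row in audit if row[2] == "deny"]
for row in audit:
    print(row)
```

&lt;P&gt;The two denied requests are the useful part: they show attempts that landed outside the declared boundary without anyone having to anticipate them with an explicit deny rule.&lt;/P&gt;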
&lt;H2&gt;Regulatory Alignment&lt;/H2&gt;
&lt;P&gt;The OWASP-COMPLIANCE.md in the AGT repository maps the toolkit's controls to each of the 10 OWASP Agentic AI risks. The table below shows how those controls align with specific regulatory frameworks:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Regulatory Requirement&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Relevant Framework&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;AGT Control&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Technical documentation for high-risk AI&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;EU AI Act, Art. 9-11&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Evidence package, policy audit trail, OWASP attestation&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Logging for automated decisions&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;EU AI Act, Art. 12&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Tamper-evident audit log with entry-level signatures&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Human oversight mechanisms&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;EU AI Act, Art. 14&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Circuit breakers, privilege rings, delegation scope limits&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Algorithmic impact assessment&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Colorado AI Act&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Policy snapshot at decision time, signed governance evidence&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Audit trail for automated decisions&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;HIPAA, SOC 2 Type II&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Immutable audit log with W3C DID-based agent identity&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Non-repudiation of agent actions&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Financial services (MiFID II, SEC)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Ed25519-signed audit entries, delegation chain with human auth context&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;STRONG&gt;Note:&lt;/STRONG&gt; The Agent Governance Toolkit does not guarantee compliance with any specific regulatory framework. The mappings above show how the toolkit's controls align with common requirements. Consult legal counsel for your specific obligations.&lt;/P&gt;
&lt;H2&gt;Putting It Together&lt;/H2&gt;
&lt;P&gt;The three posts in this series cover three distinct layers of the governance lifecycle:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Layer&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Timing&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Primary Value&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Post&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Shift-left governance&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Before production&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Catch policy violations at commit, PR, and CI time&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/shift-left-governance-for-ai-agents-how-the-agent-governance-toolkit-helps-you-c/4516481" target="_blank" rel="noopener" data-lia-auto-title="Part 2" data-lia-auto-title-active="0"&gt;Part 2&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Runtime governance&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;At the moment of action&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Deterministic policy enforcement, zero-trust identity, sandboxing&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/agent-governance-toolkit-architecture-deep-dive-policy-engines-trust-and-sre-for/4510105" target="_blank" rel="noopener" data-lia-auto-title="Part 1" data-lia-auto-title-active="0"&gt;Part 1&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Post-hoc accountability&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;After the action&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cryptographic chain of custody, blast radius evidence, regulatory proof&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;This post&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;None of these layers substitutes for the others. Pre-runtime governance cannot prevent a runtime violation. Runtime enforcement cannot retroactively prove authorization. Post-hoc accountability cannot undo an action that runtime governance should have blocked. They compose.&lt;/P&gt;
&lt;H2&gt;Getting Started&lt;/H2&gt;
&lt;P&gt;If you already have the AGT policy engine in place, the path to full accountability coverage is incremental:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Add agent identity&lt;/STRONG&gt; - Create identities for each agent and register them. Export DID documents for cross-service verification.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Record delegation tokens&lt;/STRONG&gt; - At each orchestrator-to-agent delegation boundary, create and sign a delegation link. Pass tokens as context to the policy engine.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Configure a tamper-evident audit backend&lt;/STRONG&gt; - Configure the audit chain with a signing key and chain verification. For production, use an immutable backend: Azure Blob with WORM retention, S3 Object Lock, or equivalent.&lt;/LI&gt;
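&lt;LI&gt;&lt;STRONG&gt;Assemble and validate your first evidence package&lt;/STRONG&gt; - Collect the identity, delegation, and audit records into a JSON artifact, then validate it:&lt;/LI&gt;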
&lt;/OL&gt;
&lt;LI-CODE lang="powershell"&gt;agt verify --evidence ./agt-evidence.json --strict&lt;/LI-CODE&gt;
&lt;OL start="5"&gt;
&lt;LI&gt;&lt;STRONG&gt;Add evidence verification to your CI/CD release gate&lt;/STRONG&gt;:&lt;/LI&gt;
&lt;/OL&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/release.yml
- name: Governance Evidence Gate
  uses: microsoft/agent-governance-toolkit/action@&amp;lt;sha&amp;gt; #v3.5.0
  with:
    command: governance-verify
    evidence-path: ./agt-evidence.json
    strict: true
    fail-on-missing-chain: true&lt;/LI-CODE&gt;
&lt;H2&gt;Conclusion&lt;/H2&gt;
&lt;P&gt;Runtime governance and shift-left governance answer the question:&amp;nbsp;&lt;STRONG&gt;did we apply the right controls?&lt;/STRONG&gt; Post-hoc accountability answers the question: &lt;STRONG&gt;can we prove it?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The Agent Governance Toolkit has the technical infrastructure designed to help answer it: Ed25519 agent identity with cascade revocation, cryptographically signed delegation chains with human authorization context, and tamper-evident audit logs that form a verifiable chain of custody from human principal to terminal tool call.&lt;/P&gt;
&lt;P&gt;The governance dial analogy is worth keeping. Every autonomous agent deployment exists on a spectrum between fully supervised and fully autonomous. The limiting factor on where you can set that dial is not model capability or framework maturity. It is how much governance evidence you have, and how verifiable that evidence is.&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;GitHub&lt;/STRONG&gt;: &lt;A href="https://github.com/microsoft/agent-governance-toolkit" target="_blank"&gt;microsoft/agent-governance-toolkit&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Quickstart&lt;/STRONG&gt;: &lt;A href="https://microsoft.github.io/agent-governance-toolkit/quickstart/" target="_blank"&gt;Quick Start - Agent Governance Toolkit&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;OWASP Compliance Mapping&lt;/STRONG&gt;: &lt;A href="https://microsoft.github.io/agent-governance-toolkit/security/owasp-compliance/" target="_blank"&gt;OWASP Compliance - Agent Governance Toolkit&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;PyPI&lt;/STRONG&gt;: pip install agent-governance-toolkit[full]&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;npm&lt;/STRONG&gt;: npm install @microsoft/agent-governance-sdk&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;NuGet&lt;/STRONG&gt;: dotnet add package Microsoft.AgentGovernance&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;Have questions about deploying AGT in your environment? Open an issue at aka.ms/agent-governance-toolkit or join the conversation in the comments below.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Thu, 14 May 2026 22:28:12 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/after-the-agent-acts-proving-what-happened-and-who-authorized-it/ba-p/4519826</guid>
      <dc:creator>mosiddi</dc:creator>
      <dc:date>2026-05-14T22:28:12Z</dc:date>
    </item>
    <item>
      <title>Decoupling Memory from Startup Time in AKS Sandbox Pods</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/decoupling-memory-from-startup-time-in-aks-sandbox-pods/ba-p/4516307</link>
      <description>&lt;P&gt;&lt;EM&gt;What if a 96 GB sandboxed pod could start as fast as a 2 GB one?&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Before recent improvements in AKS Pod Sandboxing, large-memory pods could take over a minute longer to start than smaller ones. For customers running latency-sensitive, autoscaling, AI/ML, or bursty workloads, that startup delay directly impacted scale-out responsiveness, job completion time, and overall cluster efficiency.&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/use-pod-sandboxing" target="_blank" rel="noopener"&gt;AKS Pod Sandboxing&lt;/A&gt; provides strong workload isolation by running pods inside lightweight virtual machines. This model is especially valuable for security-sensitive, untrusted, or multi-tenant workloads, but it came with a tradeoff: memory size directly impacted startup latency.&lt;/P&gt;
&lt;P&gt;With recent updates to the &lt;A class="lia-external-url" href="https://en.wikipedia.org/wiki/Azure_Linux" target="_blank" rel="noopener"&gt;Azure Linux&lt;/A&gt; kernel used by AKS on Microsoft Hypervisor (MSHV), AKS has significantly improved startup time for large-memory sandboxed pods. This article explains what changed, why it matters, and what AKS customers should expect in practice.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;The Problem: Large-Memory Pod Startup Was Expensive&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;Before this change, &lt;A class="lia-external-url" href="https://katacontainers.io/" target="_blank" rel="noopener"&gt;Kata-based pod sandboxes&lt;/A&gt; on AKS using the Microsoft Hypervisor (MSHV) followed an &lt;EM&gt;eager memory allocation&lt;/EM&gt; model:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;When a pod sandbox VM was created, all memory specified in the pod resource request was committed up front on the host.&lt;/LI&gt;
&lt;LI&gt;For example, a pod requesting 32 GB, 64 GB, or 96 GB of memory forced the host to allocate and pin those virtual memory pages in physical memory before the VM could boot.&lt;/LI&gt;
&lt;LI&gt;As a result, sandbox startup time scaled linearly with memory size.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Measurements showed startup time growing by roughly 20 seconds for every additional 32 GB of memory:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Pod Sandbox Memory&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;E2E Startup Time (Before)&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;32 GB&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~21 seconds&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;64 GB&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~41 seconds&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;96 GB&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~62 seconds&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;This led to:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Slower startup and scale-out for memory-heavy workloads.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Inefficient node utilization, because memory was reserved at startup but not yet in use.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;What Changed: Deferred Page Allocation in MSHV Host Kernel&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;With deferred page allocation, the kernel no longer commits all virtual machine memory at sandbox creation time.&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The pod sandbox VM boots with a small initial memory footprint.&lt;/LI&gt;
&lt;LI&gt;Host memory pages are committed lazily, only when the guest faults them.&lt;/LI&gt;
&lt;LI&gt;The total available memory remains bounded by the pod memory limit defined in the pod specification.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This matches how KVM-based systems already handle guest memory; the change brings the same behavior to MSHV in Azure Linux.&lt;/P&gt;
&lt;P&gt;In short: &lt;EM&gt;memory is provisioned on demand, not up front.&lt;/EM&gt;&lt;/P&gt;
&lt;img&gt;Guest Memory Allocation (Before &amp;amp; After)&lt;/img&gt;
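&lt;P&gt;The difference between the two models can be illustrated with a toy simulation. The sketch below is not the MSHV implementation: the GuestMemory class, its method names, and the 4 KiB page size are hypothetical and only model the moment at which host pages get committed.&lt;/P&gt;

```python
PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


class GuestMemory:
    """Toy model of guest memory backed by host pages (hypothetical)."""

    def __init__(self, size_bytes, eager):
        self.num_pages = size_bytes // PAGE_SIZE
        # Eager model: every page is committed before the VM boots.
        # Deferred model: nothing is committed until first touch.
        self.committed = set(range(self.num_pages)) if eager else set()
        self.faults = 0

    def touch(self, address):
        """Access an address, committing its page on first use."""
        page = address // PAGE_SIZE
        if page >= self.num_pages:
            raise ValueError("address beyond the configured memory limit")
        if page not in self.committed:
            self.faults += 1  # first-touch fault: host allocates the page
            self.committed.add(page)

    def committed_bytes(self):
        return len(self.committed) * PAGE_SIZE


size = 64 * 1024 * 1024  # 64 MiB stand-in for a large-memory sandbox
eager = GuestMemory(size, eager=True)
lazy = GuestMemory(size, eager=False)

# The workload touches only the first 1 MiB after boot, twice.
for _ in range(2):
    for addr in range(0, 1024 * 1024, PAGE_SIZE):
        eager.touch(addr)
        lazy.touch(addr)

print(eager.committed_bytes(), eager.faults)
print(lazy.committed_bytes(), lazy.faults)
```

&lt;P&gt;In this model the eager instance pays for all of its memory before it does any work, while the deferred instance commits only the pages it touches and faults once per page, no matter how often the page is accessed afterwards.&lt;/P&gt;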
&lt;H2&gt;Results&amp;nbsp;&lt;/H2&gt;
&lt;H3&gt;1. Pod Startup Time Is Now Effectively Constant&amp;nbsp;&lt;/H3&gt;
&lt;P&gt;The most visible benefit for AKS customers is dramatically improved pod startup time for large-memory pods.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With deferred page allocation enabled, startup time becomes approximately&amp;nbsp;O(1) with respect to memory size:&amp;nbsp;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Pod Sandbox Memory&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;E2E Startup Time (After)&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;32 GB&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~3 seconds&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;64 GB&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~3 seconds&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;96 GB&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~3.5 seconds&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;~7x faster startup for 32 GB pods&lt;/LI&gt;
&lt;LI&gt;~12x faster startup for 64 GB pods&lt;/LI&gt;
&lt;LI&gt;~17x faster startup for 96 GB pods&lt;/LI&gt;
&lt;/UL&gt;
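&lt;P&gt;For readers who want to reproduce the arithmetic, here is a small sketch that derives the speedups and the approximate per-GB cost of eager allocation from the rounded values in the two tables above. Only the table values come from the measurements; the variable names are ours.&lt;/P&gt;

```python
# Rounded end-to-end startup times (seconds) taken from the two tables above.
before = {32: 21.0, 64: 41.0, 96: 62.0}
after = {32: 3.0, 64: 3.0, 96: 3.5}

# Speedup per memory size = time before / time after.
speedups = {gb: before[gb] / after[gb] for gb in before}

# Average cost of eager allocation per additional GB, from the endpoints.
slope_before = (before[96] - before[32]) / (96 - 32)

for gb in sorted(speedups):
    print(f"{gb} GB: {speedups[gb]:.1f}x")
print(f"eager model: about {slope_before:.2f} s per GB")
```

&lt;P&gt;Because the table entries are rounded, the ratios computed this way can differ slightly from the figures quoted above. For 64 GB, for example, the rounded values give a ratio somewhat higher than ~12x, which presumably reflects the unrounded measurements.&lt;/P&gt;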
&lt;H3&gt;2. Higher Density and Better Memory Utilization&amp;nbsp;&lt;/H3&gt;
&lt;P&gt;Deferred page allocation also reduces wasted reserved memory at pod start. This allows AKS nodes to safely oversubscribe memory for cold pods, pack more sandboxed pods per node, and improve overall workload density and infrastructure efficiency.&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;Tradeoff: First-Touch Page Fault Cost&amp;nbsp;&lt;/H4&gt;
&lt;P data-start="207" data-end="491"&gt;Deferred page allocation introduces a first-touch cost: when a workload accesses a memory page for the first time, a page fault triggers host allocation. This cost is incurred once per page. After memory is populated, steady-state performance matches eager allocation in benchmarks.&lt;/P&gt;
&lt;P data-start="498" data-end="637"&gt;For most workloads, especially those that ramp memory gradually or benefit from faster startup, the improvement outweighs this one-time cost.&lt;/P&gt;
&lt;H2&gt;What AKS Pod Sandboxing Customers Need To&amp;nbsp;Do&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;Here's the &lt;EM&gt;good&lt;/EM&gt; part: &lt;STRONG&gt;No changes are required for workloads to benefit from this improvement&lt;/STRONG&gt;. However, customers are encouraged to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Specify realistic memory requests and limits.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;Take advantage of improved startup behavior for scale-out scenarios.&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Deferred page allocation is available in AKS Pod Sandboxing on AKS Azure Linux version 202603.18.1 or later, running kernel-mshv 6.6.121 or newer.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 May 2026 15:29:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/decoupling-memory-from-startup-time-in-aks-sandbox-pods/ba-p/4516307</guid>
      <dc:creator>RoaaSakr</dc:creator>
      <dc:date>2026-05-14T15:29:00Z</dc:date>
    </item>
    <item>
      <title>Inspektor Gadget Completes Its First Independent Security Audit</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/inspektor-gadget-completes-its-first-independent-security-audit/ba-p/4517895</link>
      <description>&lt;P data-line="4"&gt;One thing I've learned working with open source software over the years is that the projects you can trust most are the ones willing to let someone else test and review them. That's what recently happened with Inspektor Gadget, the open source&amp;nbsp;&lt;A href="https://ebpf.foundation/" target="_blank" rel="noopener" data-href="https://ebpf.foundation/"&gt;eBPF&lt;/A&gt;-based tool for Kubernetes observability and Linux host inspection. Inspektor Gadget completed its first independent security audit, and the results tell a good story about the maturity of this CNCF project.&lt;/P&gt;
&lt;H2 data-line="6"&gt;What is Inspektor Gadget?&lt;/H2&gt;
&lt;P data-line="8"&gt;&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget"&gt;Inspektor Gadget&lt;/A&gt;&amp;nbsp;is a framework and toolkit that uses eBPF technology to collect and inspect data on Kubernetes clusters and Linux hosts. It manages the packaging, deployment, and execution of "gadgets," which are eBPF programs packaged as&amp;nbsp;&lt;A href="https://opencontainers.org/" target="_blank" rel="noopener" data-href="https://opencontainers.org/"&gt;OCI&lt;/A&gt;&amp;nbsp;images. OCI (the Open Container Initiative) is a Linux Foundation project that defines open industry standards for container image formats and runtimes, so the same image can be distributed and run across any compliant tool or registry. If you're running Kubernetes in production and need to understand what's happening inside your cluster, Inspektor Gadget gives you that visibility. Because eBPF programs are loaded into the kernel at runtime to observe syscalls, network activity, and file access safely, your applications keep running unchanged while you get the data you need.&lt;/P&gt;
&lt;P data-line="10"&gt;Microsoft engineers Francis Laniel and Mauricio Vasquez are core maintainers on the project, and Microsoft has been a steady contributor to this CNCF project for several years now.&lt;/P&gt;
&lt;H2 data-line="12"&gt;Why a security audit?&lt;/H2&gt;
&lt;P data-line="14"&gt;Any tool that runs with elevated privileges on your infrastructure needs to earn trust. Inspektor Gadget runs with root-level access on nodes to do its job, so an independent review of its security posture is the responsible thing to do. The Cloud Native Computing Foundation (CNCF) facilitated the engagement through the Open Source Technology Improvement Fund (OSTIF), a nonprofit dedicated to improving the security of open source software. Over the past ten years, OSTIF has managed security engagements that have uncovered more than 800 vulnerabilities across 120 open source projects.&lt;/P&gt;
&lt;P data-line="14"&gt;For Microsoft customers, that trust matters in a very practical way. Inspektor Gadget is incorporated into Microsoft Defender for Containers and AKS's Node Problem Detector, and it is also a common troubleshooting tool used by customers and support engineers when they need to understand what is happening inside a cluster. When a project sits this close to production infrastructure, an independent audit is more than a milestone for the maintainers. It gives customers, operators, and support teams a clearer view of the project's security posture and the fixes already available.&lt;/P&gt;
&lt;H2 data-line="16"&gt;Who did the audit?&lt;/H2&gt;
&lt;P data-line="18"&gt;OSTIF engaged Shielder, an Italian security firm, to perform the assessment, with two Shielder researchers working on the audit in early 2026. Their methodology combined collaborative threat modeling with the Inspektor Gadget maintainers, manual source code review, dynamic testing on dedicated lab environments, static analysis using tools like Semgrep and GoSec, and AI-assisted code review for broader coverage. They set up three separate test environments: a local Linux host deployment, a remote daemon deployment, and a Kubernetes deployment on minikube.&lt;/P&gt;
&lt;H2 data-line="20"&gt;What did they find?&lt;/H2&gt;
&lt;P data-line="22"&gt;The audit identified three vulnerabilities. None were rated Critical or High severity.&lt;/P&gt;
&lt;P data-line="24"&gt;&lt;STRONG&gt;Two Medium severity findings:&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL data-line="26"&gt;
&lt;LI data-line="26"&gt;&lt;STRONG&gt;Command Injection in ig image build&lt;/STRONG&gt;&amp;nbsp;(&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories"&gt;CVE-2026-24905&lt;/A&gt;): The image build process used Makefiles that embedded user-controlled input without proper escaping, creating a command injection vector. This would matter most in CI/CD scenarios building untrusted gadgets. Fixed in release v0.48.1.&lt;/LI&gt;
&lt;LI data-line="28"&gt;&lt;STRONG&gt;Denial of Service via Event Flooding&lt;/STRONG&gt;: A malicious container could flood the eBPF ring buffer (which was hard-coded to 256 KB), causing the system to silently drop events from other containers. If you're using Inspektor Gadget for security monitoring, this could let an attacker hide their activity. Fixed in release v0.50.1.&lt;/LI&gt;
&lt;/OL&gt;
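&lt;P&gt;To make the event flooding finding concrete, here is a toy model of a shared, fixed-size buffer that drops new events when it is full. It is not Inspektor Gadget code and does not use eBPF; the simulate function, its parameters, and all numbers are illustrative assumptions.&lt;/P&gt;

```python
from collections import deque


def simulate(capacity, drain_per_tick, noisy_per_tick, ticks):
    """Count dropped events per producer for a shared bounded buffer.

    Toy model: when the buffer is full, new events are dropped, which is
    the behaviour the finding describes. All numbers are illustrative.
    """
    buffer = deque()
    dropped = {"noisy": 0, "quiet": 0}

    def emit(source):
        if len(buffer) >= capacity:
            dropped[source] += 1  # silently lost, consumer never sees it
        else:
            buffer.append(source)

    for _ in range(ticks):
        # The flooding container writes first and fills the buffer.
        for _ in range(noisy_per_tick):
            emit("noisy")
        # A well-behaved container emits a single event per tick.
        emit("quiet")
        # The userspace consumer drains a bounded number of events.
        for _ in range(min(drain_per_tick, len(buffer))):
            buffer.popleft()
    return dropped


calm = simulate(capacity=64, drain_per_tick=16, noisy_per_tick=4, ticks=100)
flood = simulate(capacity=64, drain_per_tick=16, noisy_per_tick=200, ticks=100)
print(calm)
print(flood)
```

&lt;P&gt;In the calm run nothing is lost. In the flooding run every event from the quiet producer is dropped, which is exactly the situation an attacker would want when trying to hide activity from a monitoring tool.&lt;/P&gt;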
&lt;P data-line="30"&gt;&lt;STRONG&gt;One Low severity finding:&lt;/STRONG&gt;&lt;/P&gt;
&lt;OL data-line="32"&gt;
&lt;LI data-line="32"&gt;&lt;STRONG&gt;Unsanitized ANSI Escape Sequences in columns output mode&lt;/STRONG&gt;&amp;nbsp;(&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories"&gt;CVE-2026-25996&lt;/A&gt;): When displaying events in the terminal, Inspektor Gadget didn't sanitize ANSI escape sequences, which could allow a compromised container to inject terminal escape codes into the operator's display. Fixed in release v0.49.1.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-line="34"&gt;All three vulnerabilities now have patches available.&lt;/P&gt;
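&lt;P&gt;The general defense against the ANSI escape finding is to strip escape sequences from untrusted strings before printing them. The sketch below shows one way to do that in Python. It is not the fix that shipped in Inspektor Gadget (which is written in Go); the sanitize_for_terminal function and both regular expressions are our own assumptions.&lt;/P&gt;

```python
import re

# Matches CSI sequences (ESC [ ... final byte) and OSC sequences
# (ESC ] ... terminated by BEL or ESC \).
ANSI_SEQUENCE = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"  # CSI, e.g. colours and cursor moves
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"  # OSC, e.g. window title changes
)

# Any remaining control characters except tab and newline.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def sanitize_for_terminal(text):
    """Strip escape sequences and stray control characters from text."""
    without_sequences = ANSI_SEQUENCE.sub("", text)
    return CONTROL_CHARS.sub("", without_sequences)


# A process name crafted to clear the screen and change the window title.
event = "cat \x1b[2J\x1b[31m/etc/passwd\x1b[0m\x1b]0;owned\x07"
print(sanitize_for_terminal(event))
```

&lt;P&gt;The second pattern is a safety net: any escape byte that does not form a complete sequence is removed as well, so a truncated sequence cannot reach the terminal.&lt;/P&gt;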
&lt;H2 data-line="36"&gt;Hardening recommendations&lt;/H2&gt;
&lt;P data-line="38"&gt;Beyond the specific vulnerabilities, Shielder provided six hardening recommendations. These are the kinds of findings that don't represent immediate exploits but point to areas where the project can reduce its attack surface over time:&lt;/P&gt;
&lt;UL data-line="40"&gt;
&lt;LI data-line="40"&gt;&lt;STRONG&gt;Enforce TLS by default on TCP listeners.&lt;/STRONG&gt;&amp;nbsp;When the daemon starts a TCP listener without TLS, it just logs a warning and continues in plaintext. The recommendation is to require an explicit opt-out flag instead.&lt;/LI&gt;
&lt;LI data-line="41"&gt;&lt;STRONG&gt;Pin and verify external dependencies in CI/CD.&lt;/STRONG&gt;&amp;nbsp;Several build dependencies were downloaded without hash or signature verification. The team has already landed fixes or has pull requests open for most of these.&lt;/LI&gt;
&lt;LI data-line="42"&gt;&lt;STRONG&gt;Implement a Kubernetes namespace blocklist&lt;/STRONG&gt;&amp;nbsp;to prevent unintended tracing on sensitive namespaces like kube-system.&lt;/LI&gt;
&lt;LI data-line="43"&gt;&lt;STRONG&gt;Restrict remote clients from enabling host-level tracing&lt;/STRONG&gt;&amp;nbsp;through the daemon, or at minimum document the risk.&lt;/LI&gt;
&lt;LI data-line="44"&gt;&lt;STRONG&gt;Automate third-party vulnerability scanning&lt;/STRONG&gt;&amp;nbsp;for project dependencies.&lt;/LI&gt;
&lt;LI data-line="45"&gt;&lt;STRONG&gt;Reduce RBAC permissions&lt;/STRONG&gt;&amp;nbsp;on the DaemonSet pod, specifically the nodes/proxy GET permission which could be leveraged for privilege escalation if the service account token is compromised.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="47"&gt;The Inspektor Gadget team is working through these systematically. Some are already addressed while others will take more time, particularly the RBAC work and the namespace blocklist implementation.&lt;/P&gt;
&lt;H2 data-line="49"&gt;Gadget bypass testing&lt;/H2&gt;
&lt;P data-line="51"&gt;One part of the audit I found particularly valuable was the gadget bypass testing. The researchers looked at whether a compromised container could perform operations that Inspektor Gadget is supposed to trace without triggering any events. They found six bypass scenarios, ranging from using newer Linux syscalls that certain gadgets don't hook (like openat2 instead of openat) to evasion through io_uring and statically linked libraries.&lt;/P&gt;
&lt;P data-line="53"&gt;These findings are characteristic of the cat-and-mouse nature of kernel-level tracing. As Linux evolves, the set of syscalls and mechanisms grows, and tracing tools need to keep up. The Inspektor Gadget team has already fixed some of these and is documenting the inherent limitations that come with the design of eBPF-based tracing.&lt;/P&gt;
&lt;H2 data-line="55"&gt;What this means&lt;/H2&gt;
&lt;P data-line="57"&gt;For organizations using Inspektor Gadget in production, the actionable step is to update to v0.50.1 or later, which includes fixes for all three reported vulnerabilities. Other than that, the Shielder team's own summary states that "the overall security posture of Inspektor Gadget is adequately mature from both a secure coding and design point of view."&lt;/P&gt;
&lt;P data-line="59"&gt;For the open source community, this audit is an example of how the CNCF ecosystem works at its best. A project reaches a level of adoption where independent security review becomes necessary, OSTIF and CNCF coordinate the engagement, qualified researchers do the work, maintainers fix the issues, and everything gets published so users can make informed decisions. That's the open source process working as it should.&lt;/P&gt;
&lt;H2 data-line="61"&gt;Resources&lt;/H2&gt;
&lt;UL data-line="63"&gt;
&lt;LI data-line="63"&gt;&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget"&gt;Inspektor Gadget on GitHub&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="64"&gt;&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget/releases/tag/v0.50.1" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget/releases/tag/v0.50.1"&gt;Inspektor Gadget release v0.50.1&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="65"&gt;&lt;A href="https://ostif.org/" target="_blank" rel="noopener" data-href="https://ostif.org"&gt;OSTIF (Open Source Technology Improvement Fund)&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="66"&gt;&lt;A href="https://www.shielder.com/" target="_blank" rel="noopener" data-href="https://www.shielder.com/"&gt;Shielder&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-line="68"&gt;Audit announcement and resources&lt;/H2&gt;
&lt;UL data-line="70"&gt;
&lt;LI data-line="70"&gt;&lt;A href="https://github.com/ShielderSec/public-reports/blob/main/2026/%5BOSTIF%5D%20Inspektor%20Gadget%20-%20Report%20v1.2.pdf" target="_blank" rel="noopener" data-href="https://github.com/ShielderSec/public-reports/blob/main/2026/%5BOSTIF%5D%20Inspektor%20Gadget%20-%20Report%20v1.2.pdf"&gt;Full Report - Downloadable PDF&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="71"&gt;&lt;A href="https://inspektor-gadget.io/blog/2026/04/inspektor-gadget-security-audit" target="_blank" rel="noopener" data-href="https://inspektor-gadget.io/blog/2026/04/inspektor-gadget-security-audit"&gt;Blog post - Inspektor Gadget&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="72"&gt;&lt;A href="https://ostif.org/inspektor-gadget-audit-complete/" target="_blank" rel="noopener" data-href="https://ostif.org/inspektor-gadget-audit-complete/"&gt;Blog post - OSTIF&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="73"&gt;&lt;A href="https://www.shielder.com/blog/2026/04/inspektor-gadget-security-audit/" target="_blank" rel="noopener" data-href="https://www.shielder.com/blog/2026/04/inspektor-gadget-security-audit/"&gt;Blog post - Shielder&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-line="75"&gt;CVEs&lt;/H2&gt;
&lt;UL data-line="77"&gt;
&lt;LI data-line="77"&gt;&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories"&gt;CVE-2026-24905: Command Injection in ig image build&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="78"&gt;&lt;A href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories" target="_blank" rel="noopener" data-href="https://github.com/inspektor-gadget/inspektor-gadget/security/advisories"&gt;CVE-2026-25996: Unsanitized ANSI Escape Sequences&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Wed, 27 May 2026 19:04:24 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/inspektor-gadget-completes-its-first-independent-security-audit/ba-p/4517895</guid>
      <dc:creator>Brian Benz</dc:creator>
      <dc:date>2026-05-27T19:04:24Z</dc:date>
    </item>
    <item>
      <title>Shift-Left Governance for AI Agents: How the Agent Governance Toolkit Helps You Catch Violations</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/shift-left-governance-for-ai-agents-how-the-agent-governance/ba-p/4516481</link>
      <description>&lt;P&gt;In &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/agent-governance-toolkit-architecture-deep-dive-policy-engines-trust-and-sre-for/4510105" data-lia-auto-title="part one of this series" data-lia-auto-title-active="0" target="_blank"&gt;part one of this series&lt;/A&gt;, we covered runtime governance in the Agent Governance Toolkit (AGT): the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping.&lt;/P&gt;
&lt;P&gt;That post focused on what happens when an agent &lt;STRONG&gt;acts&lt;/STRONG&gt;: policy evaluation at the moment a tool call fires, trust scoring when agents communicate, audit logging when decisions are made. Runtime governance is essential. But it is the last line of defense.&lt;/P&gt;
&lt;P&gt;After that post went live, a pattern emerged in conversations with teams adopting AGT. The same question kept coming up: runtime checks are useful, &lt;STRONG&gt;but what about everything before production&lt;/STRONG&gt;? We realized runtime governance was only half the story. So we went back and built tooling for every stage of your software development lifecycle, from the moment a developer saves a file to the moment an artifact ships to users.&lt;/P&gt;
&lt;H1&gt;Why Runtime Governance Is Not Enough&lt;/H1&gt;
&lt;P&gt;AI agents are a new class of workload. They reason about what to do, select tools, call APIs, read databases, and spawn sub-processes, often in loops that run without direct human oversight. The &lt;A class="lia-external-url" href="https://aka.ms/agt-owasp" target="_blank"&gt;OWASP Agentic AI Top 10&lt;/A&gt; (published December 2025) identifies risks like excessive agency, insecure tool use, privilege escalation, and supply chain compromise. These risks span the entire lifecycle, not just runtime.&lt;/P&gt;
&lt;P&gt;Consider a few scenarios that runtime governance alone cannot prevent:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A developer commits a policy YAML file with a typo that silently disables all deny rules. The agent runs unprotected until someone notices.&lt;/LI&gt;
&lt;LI&gt;A dependency update introduces a package with a known critical CVE. The agent starts using a vulnerable library before any security team reviews it.&lt;/LI&gt;
&lt;LI&gt;A contributor adds a raw cryptographic import to an application module, bypassing the security-audited signing library. The code compiles and ships.&lt;/LI&gt;
&lt;LI&gt;A GitHub Actions workflow uses an expression injection pattern that allows an attacker to execute arbitrary code in CI.&lt;/LI&gt;
&lt;LI&gt;A release ships without a Software Bill of Materials (SBOM), making it impossible to trace which components are affected when the next Log4j-style vulnerability drops.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Each of these is a governance failure, but none of them happens at runtime. &lt;STRONG&gt;They happen at commit time, at PR review time, at build time, or at release time&lt;/STRONG&gt;. A comprehensive governance strategy needs coverage at every stage.&lt;/P&gt;
&lt;H1&gt;Four Stages of Pre-Runtime Governance&lt;/H1&gt;
&lt;P&gt;Governance violations can enter a codebase at four distinct stages of the development lifecycle. Each stage has a different class of risk, and each needs a different kind of check:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 99.1667%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Stage&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;When It Runs&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Catches&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;AGT Tooling&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Commit-time&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Before code leaves the developer machine&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Malformed policies, schema violations, secrets, stub code, unauthorized crypto&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pre-commit hooks, quality gates&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;PR-time&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;When a pull request is opened or updated&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Vulnerable dependencies, missing attestation, secrets in history, unpinned versions&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;GitHub Actions (attestation, dependency review, secret scanning, supply chain checks)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;CI/Build-time&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;On every push and pull request to main&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Compliance violations, binary security issues, dependency confusion, workflow injection&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Governance Verify action, Security Scan action, CodeQL, BinSkim, policy validation&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Release-time&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Before artifacts are published&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Missing provenance, unsigned artifacts, incomplete SBOMs&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;SBOM generation, Sigstore signing, build attestation, OpenSSF Scorecard&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Just as with bugs, the earlier you catch a governance violation, the cheaper it is to fix. A malformed policy file caught at commit time costs zero CI minutes. A secret caught in PR review never reaches the default branch. A dependency confusion attack blocked in CI never reaches production. An unsigned artifact blocked at release time never reaches users.&lt;/P&gt;
&lt;H1&gt;Stage 1: Commit-Time Governance with Pre-Commit Hooks&lt;/H1&gt;
&lt;P&gt;The fastest governance feedback loop is local. Within the AGT project, we’ve implemented three pre-commit hooks that run automatically against the staged files whenever a developer commits, validating governance artifacts before they ever leave the developer's machine.&lt;/P&gt;
&lt;H2&gt;Built-In Hooks&lt;/H2&gt;
&lt;P&gt;The toolkit's &lt;EM&gt;&lt;STRONG&gt;.pre-commit-hooks.yaml&lt;/STRONG&gt;&lt;/EM&gt; defines three hooks that any repository can adopt:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 97.3148%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Hook ID&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Validates&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;File Pattern&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;validate-policy&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;YAML/JSON policy files against the AGT policy schema, checking for required fields, valid operators, and structural correctness&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Files matching *polic*.yaml, *polic*.yml, *polic*.json&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;validate-plugin-manifest&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Plugin manifest files for required fields and schema compliance&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Files matching plugin.json, plugin.yaml, plugin.yml&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;evaluate-plugin-policy&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Plugin manifests against a governance policy file, evaluating whether the plugin would be allowed under the organization's rules&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Files matching plugin.json, plugin.yaml, plugin.yml&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;To adopt these hooks, add AGT as a pre-commit hook source:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .pre-commit-config.yaml
repos:
  - repo: https://github.com/microsoft/agent-governance-toolkit
    rev: main  # pin to a release tag in production
    hooks:
      - id: validate-policy
      - id: validate-plugin-manifest
      - id: evaluate-plugin-policy
        args: ['--policy', 'policies/marketplace-policy.yaml']&lt;/LI-CODE&gt;
&lt;P&gt;Then install and run:&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;pip install pre-commit
pre-commit install
pre-commit run --all-files&lt;/LI-CODE&gt;
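&lt;P&gt;To get a feel for which staged files each hook would pick up, the file patterns from the table above can be tried out in a few lines of Python. This is only a sketch: matching on the file name with fnmatch is our assumption, the hooks_for helper is hypothetical, and the hook definitions in the repository may express their patterns differently.&lt;/P&gt;

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Glob patterns copied from the table above; matching on the file name
# with fnmatch is an assumption made for this illustration.
HOOK_PATTERNS = {
    "validate-policy": ["*polic*.yaml", "*polic*.yml", "*polic*.json"],
    "validate-plugin-manifest": ["plugin.json", "plugin.yaml", "plugin.yml"],
    "evaluate-plugin-policy": ["plugin.json", "plugin.yaml", "plugin.yml"],
}


def hooks_for(path):
    """Return the hook IDs whose patterns match the file name of path."""
    name = PurePosixPath(path).name
    return sorted(
        hook
        for hook, patterns in HOOK_PATTERNS.items()
        if any(fnmatch(name, pattern) for pattern in patterns)
    )


for staged in [
    "policies/marketplace-policy.yaml",
    "plugins/search/plugin.json",
    "src/agent.py",
]:
    print(staged, hooks_for(staged))
```

&lt;P&gt;A policy file triggers only the policy validator, a plugin manifest triggers both plugin hooks, and ordinary source files are left alone.&lt;/P&gt;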
&lt;H2&gt;Extended Quality Gates&lt;/H2&gt;
&lt;P&gt;Beyond schema validation, we built a pre-commit rollout template (&lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/docs/operations/pre-commit-hook-template.md" target="_blank"&gt;see the full example in the repository&lt;/A&gt;) with additional governance-specific quality gates designed to help prevent common security anti-patterns from entering the codebase:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Policy validation (agt-validate): &lt;/STRONG&gt;Runs the full AGT policy CLI in strict mode, catching not just schema errors but semantic issues like conflicting rules.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Health check (agt-doctor):&lt;/STRONG&gt; Runs on pre-push (before code leaves the machine entirely), performing a broader health check of the governance configuration.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Plugin metadata check (agency-json-required): &lt;/STRONG&gt;Ensures every plugin directory contains the required agency.json metadata file.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Stub detection (no-stubs): &lt;/STRONG&gt;Blocks TODO, FIXME, HACK, and raise NotImplementedError markers in staged production code. Test files are excluded.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Unauthorized crypto detection (no-custom-crypto): &lt;/STRONG&gt;Blocks raw cryptographic imports (hashlib, hmac, crypto.subtle, System.Security.Cryptography, ring, ed25519-dalek) outside designated security modules. This helps ensure all cryptographic operations go through the audited AGT signing libraries.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Secret scanning (detect-secrets):&lt;/STRONG&gt; Integrates Yelp's detect-secrets for pattern-based secret detection on every commit.&lt;/LI&gt;
&lt;/UL&gt;
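&lt;P&gt;The stub and crypto gates are essentially line-oriented pattern checks. The sketch below shows the idea for Python sources only. It is not the AGT implementation: the regular expressions, the check_python_file helper, and the security/ and tests/ path prefixes are assumptions chosen for the example.&lt;/P&gt;

```python
import re

# Marker and module lists come from the bullets above; the regexes and
# the allowed-path prefix are assumptions for this sketch.
STUB_MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b|raise\s+NotImplementedError")
CRYPTO_IMPORT = re.compile(r"^\s*(?:import|from)\s+(hashlib|hmac)\b")
ALLOWED_CRYPTO_PREFIX = "security/"  # hypothetical designated module path


def check_python_file(path, source):
    """Return a list of (line number, rule, text) violations."""
    violations = []
    is_test = path.startswith("tests/")  # stub check skips test files
    crypto_allowed = path.startswith(ALLOWED_CRYPTO_PREFIX)
    for number, line in enumerate(source.splitlines(), start=1):
        if not is_test and STUB_MARKER.search(line):
            violations.append((number, "no-stubs", line.strip()))
        if not crypto_allowed and CRYPTO_IMPORT.search(line):
            violations.append((number, "no-custom-crypto", line.strip()))
    return violations


sample = """import hashlib

def sign(payload):
    # TODO: replace with the audited signing library
    raise NotImplementedError
"""

for violation in check_python_file("app/signing.py", sample):
    print(violation)
```

&lt;P&gt;A check like this is deliberately simple. It will not catch every way of reaching a cryptographic primitive, but it is fast enough to run on every commit and it catches the common case of a direct import.&lt;/P&gt;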
&lt;H2&gt;Phased Rollout for Teams&lt;/H2&gt;
&lt;P&gt;Adopting pre-commit hooks across a team requires a thoughtful rollout. The AGT documentation includes a &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/docs/operations/advisory-to-blocking-graduation.md" target="_blank"&gt;phased adoption guide&lt;/A&gt;:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Week 1: &lt;/STRONG&gt;Install hooks in permissive mode. Hooks warn on violations but do not block the commit. This lets developers see what would be caught without disrupting workflow.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Week 2:&lt;/STRONG&gt; Switch to strict mode for policy validation only. Policy files must pass schema validation to be committed.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Week 3: &lt;/STRONG&gt;Enable all hooks as blocking. Stubs, unauthorized crypto, and secrets are now blocked at commit time.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Week 4: &lt;/STRONG&gt;Graduate to full blocking mode and remove the permissive fallback.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This approach helps teams build confidence in the governance tooling before it becomes a hard gate.&lt;/P&gt;
&lt;H1&gt;Stage 2: PR-Time Gates&lt;/H1&gt;
&lt;P&gt;Pre-commit hooks catch issues on the developer's machine, but they can be bypassed (git commit --no-verify, direct edits in the GitHub web UI, hooks not installed). PR-time gates provide the second layer of defense, running in GitHub Actions on every pull request before merge is allowed.&lt;/P&gt;
&lt;H2&gt;Governance Attestation&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/tree/main/action/governance-attestation" target="_blank"&gt;Governance Attestation action&lt;/A&gt; validates that PR authors have completed a structured attestation checklist before their code can merge. The default checklist covers seven sections:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Security review&lt;/LI&gt;
&lt;LI&gt;Privacy review&lt;/LI&gt;
&lt;LI&gt;Legal review&lt;/LI&gt;
&lt;LI&gt;Responsible AI review&lt;/LI&gt;
&lt;LI&gt;Accessibility review&lt;/LI&gt;
&lt;LI&gt;Release Readiness / Safe Deployment&lt;/LI&gt;
&lt;LI&gt;Org-specific Launch Gates&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The action is fully configurable. Organizations can customize the required sections, set a minimum PR body length, and choose their own attestation format. Outputs include the validation status, a list of errors for missing sections, and a JSON mapping of sections to checkbox counts.&lt;/P&gt;
&lt;P&gt;Here is an example workflow:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/pr-governance.yml
name: PR Governance
on:
  pull_request:
    types: [opened, edited, synchronize]

jobs:
  attestation:
    runs-on: ubuntu-latest
    steps:
      - uses: microsoft/agent-governance-toolkit/action/governance-attestation@main
        with:
          required-sections: |
            1) Security review
            2) Privacy review
            3) Responsible AI review&lt;/LI-CODE&gt;
&lt;H2&gt;Dependency Review&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/dependency-review.yml" target="_blank"&gt;dependency review workflow&lt;/A&gt; helps block PRs that introduce dependencies with known CVEs or disallowed licenses. It uses the GitHub dependency-review-action with a curated license allowlist:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;- uses: actions/dependency-review-action@v4
  with:
    fail-on-severity: moderate
    comment-summary-in-pr: always
    allow-licenses: &amp;gt;
      MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC,
      PSF-2.0, Python-2.0, 0BSD, Unlicense, CC0-1.0,
      CC-BY-4.0, Zlib, BSL-1.0, MPL-2.0&lt;/LI-CODE&gt;
&lt;P&gt;This runs on every PR that touches dependency manifests (package.json, Cargo.toml, pyproject.toml, requirements.txt). With fail-on-severity set to moderate, a dependency with a known vulnerability of moderate or higher severity fails the check, as does a dependency whose license is not on the allowlist.&lt;/P&gt;
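&lt;P&gt;As a rough sketch, the dependency review step might sit in a complete workflow like the one below. The file name, job name, path list, and permissions block are assumptions for illustration and are not copied from the AGT workflow. The license allowlist is omitted for brevity:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/dependency-review.yml (illustrative file name)
name: Dependency Review
on:
  pull_request:
    # Only run when a dependency manifest changes
    paths:
      - '**/package.json'
      - '**/Cargo.toml'
      - '**/pyproject.toml'
      - '**/requirements.txt'

permissions:
  contents: read
  pull-requests: write  # lets the action post its summary comment

jobs:
  dependency-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/dependency-review-action@v4
        with:
          fail-on-severity: moderate
          comment-summary-in-pr: always&lt;/LI-CODE&gt;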
&lt;H2&gt;Secret Scanning&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/secret-scanning.yml" target="_blank"&gt;secret scanning workflow&lt;/A&gt; runs on every PR to the main branch and on a weekly schedule. It combines two complementary approaches:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Gitleaks: Pattern-based secret detection across the full git history, catching API keys, tokens, and credentials that may have been committed at any point.&lt;/LI&gt;
&lt;LI&gt;High-entropy string scanning: Regex-based detection of common secret patterns including GitHub tokens (ghp_, gho_), AWS access keys (AKIA), Slack tokens (xox), and base64-encoded strings with high entropy.&lt;/LI&gt;
&lt;/UL&gt;
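&lt;P&gt;A minimal sketch of the Gitleaks half of such a workflow is shown below. It assumes the community gitleaks/gitleaks-action is used and picks an arbitrary weekly cron time; the AGT workflow may invoke Gitleaks differently, and the action's documentation should be checked for any license requirements that apply to your account type:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/secret-scanning.yml (illustrative sketch)
name: Secret Scanning
on:
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 6 * * 1'  # weekly, Mondays 06:00 UTC (assumed schedule)

jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # fetch full history so older commits can be scanned
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}&lt;/LI-CODE&gt;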
&lt;H2&gt;Supply Chain Integrity&lt;/H2&gt;
&lt;P&gt;A dedicated &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/supply-chain-check.yml" target="_blank"&gt;supply chain check workflow&lt;/A&gt; triggers when dependency manifest files change. It enforces two rules that help prevent supply chain attacks:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Exact version pinning: No ^ or ~ version ranges in package.json files. This prevents unexpected minor/patch version updates that could introduce compromised code.&lt;/LI&gt;
&lt;LI&gt;Lockfile presence: Every package directory with dependencies must have a corresponding lockfile (package-lock.json, pnpm-lock.yaml, or yarn.lock). Lockfiles help ensure reproducible builds with verified integrity hashes.&lt;/LI&gt;
&lt;/UL&gt;
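&lt;P&gt;The two rules above can be approximated with a few lines of shell. The sketch below is a simplified heuristic written for this post, not the AGT implementation: the job and step names are hypothetical, it treats any quoted value starting with ^ or ~ in package.json as a range, and it expects a lockfile beside every package.json (which may be too strict for workspace setups with a single root lockfile):&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/supply-chain-check.yml (illustrative sketch)
name: Supply Chain Check
on:
  pull_request:
    paths:
      - '**/package.json'
      - '**/package-lock.json'
      - '**/pnpm-lock.yaml'
      - '**/yarn.lock'

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Reject ^ and ~ version ranges
        run: |
          # A quote followed by ^ or ~ marks a range such as "^4.17.21"
          if grep -rnE --include=package.json --exclude-dir=node_modules '"[~^]' .; then
            echo "Found version ranges; pin exact versions instead."
            exit 1
          fi
      - name: Require a lockfile next to every package.json
        run: |
          find . -name package.json -not -path '*/node_modules/*' | while read -r manifest; do
            dir=$(dirname "$manifest")
            [ -f "$dir/package-lock.json" ] || [ -f "$dir/pnpm-lock.yaml" ] || [ -f "$dir/yarn.lock" ] || {
              echo "No lockfile found in $dir"
              exit 1
            }
          done&lt;/LI-CODE&gt;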
&lt;H2&gt;Quality Gates&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/quality-gates.yml" target="_blank"&gt;quality gates workflow&lt;/A&gt; mirrors the pre-commit hooks at the PR level, providing defense in depth. It runs four checks on every pull request:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 78.7963%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Gate&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Purpose&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;No Stubs/TODOs&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Blocks TODO, FIXME, HACK markers in production code (test files excluded)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;No Unauthorized Crypto&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Blocks raw cryptographic imports outside designated security modules&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Security Audit Required&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Changes to security-sensitive paths require accompanying audit documentation&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Dependency Audit Trail&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Vendored patches must have an audit trail explaining the patch and its provenance&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;These gates catch anything that bypasses pre-commit hooks: force-pushed commits, direct GitHub web edits, and commits from contributors who have not installed the hooks.&lt;/P&gt;
&lt;H1&gt;Stage 3: CI/Build-Time Governance&lt;/H1&gt;
&lt;P&gt;Once a PR passes the gate workflows, the main CI pipeline and specialized workflows perform deeper, more computationally intensive analysis.&lt;/P&gt;
&lt;H2&gt;The Governance Verify Action&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/tree/main/action" target="_blank"&gt;Governance Verify action&lt;/A&gt; is the primary CI-time governance check. It is a GitHub Actions composite action that installs the toolkit and runs the compliance CLI against your repository. It supports four modes:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 92.3148%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Command&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Does&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;governance-verify&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Runs the full compliance verification suite, checking governance controls and reporting how many pass&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;marketplace-verify&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Validates a plugin manifest against marketplace requirements (required fields, signing, metadata)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;policy-evaluate&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Evaluates a specific policy file against a JSON context, returning the allow/deny decision with the matched rule&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;all&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Runs governance-verify, then marketplace-verify and policy-evaluate if the corresponding paths are provided&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is an example:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/governance-ci.yml
name: Governance CI
on: [push, pull_request]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: all
          policy-path: policies/
          manifest-path: plugin.json
          output-format: json
          fail-on-warning: 'true'&lt;/LI-CODE&gt;
&lt;P&gt;The action outputs structured data including controls-passed, controls-total, violations count, and full command output in JSON format. This makes it straightforward to integrate with dashboards, Slack notifications, or downstream decision logic.&lt;/P&gt;
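&lt;P&gt;For instance, a later step could read those outputs through the step's id. The fragment below is a sketch that assumes the output names match the ones listed above (controls-passed and controls-total); the step id and the reporting step are hypothetical:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;    steps:
      - uses: actions/checkout@v4
      - id: verify
        uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: governance-verify
      - name: Report control counts
        if: always()  # report even when the verify step fails
        env:
          PASSED: ${{ steps.verify.outputs.controls-passed }}
          TOTAL: ${{ steps.verify.outputs.controls-total }}
        run: |
          echo "Governance controls passed: $PASSED of $TOTAL"&lt;/LI-CODE&gt;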
&lt;H2&gt;The Security Scan Action&lt;/H2&gt;
&lt;P&gt;A &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/tree/main/action/security-scan" target="_blank"&gt;separate security scan action&lt;/A&gt; scans directories for secrets, CVEs, and dangerous code patterns. Unlike the PR-time secret scanning (which focuses on git history), this action performs deep content analysis of the current codebase:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;- uses: microsoft/agent-governance-toolkit/action/security-scan@main
  with:
    paths: 'plugins/ scripts/'
    min-severity: high
    exemptions-file: .security-exemptions.json&lt;/LI-CODE&gt;
&lt;P&gt;The action supports configurable severity thresholds (critical, high, medium, low), an exemptions file for acknowledged findings, and structured JSON output with findings-count, blocking-count, and detailed findings.&lt;/P&gt;
&lt;H2&gt;Policy Validation Workflow&lt;/H2&gt;
&lt;P&gt;A &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/policy-validation.yml" target="_blank"&gt;dedicated policy validation workflow&lt;/A&gt; triggers whenever YAML files or the policy engine source code change. It runs two jobs in sequence:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Validate policies: Discovers all policy files matching the *policy* naming convention, then validates each file using the AGT policy CLI.&lt;/LI&gt;
&lt;LI&gt;Test policies: Runs the policy CLI unit tests to verify that policy evaluation behavior is correct after the changes.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This ensures that policy file edits do not break the policy engine and that policy semantics are preserved.&lt;/P&gt;
&lt;H2&gt;CodeQL and Static Analysis&lt;/H2&gt;
&lt;P&gt;AGT uses &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/codeql.yml" target="_blank"&gt;GitHub's CodeQL&lt;/A&gt; for semantic static analysis of Python and TypeScript code. The CodeQL workflow runs on pushes and PRs, performing deep dataflow analysis that goes beyond pattern matching. Results are uploaded as SARIF to GitHub's Security tab, providing a centralized view of code quality issues.&lt;/P&gt;
&lt;H2&gt;Dependency Confusion Scanning&lt;/H2&gt;
&lt;P&gt;A dedicated CI job runs a dependency confusion scanner on every build. This is a targeted defense against a specific supply chain attack vector where an attacker registers a public package with the same name as an internal package. The scanner checks that:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Internal package names do not collide with public PyPI or npm packages&lt;/LI&gt;
&lt;LI&gt;Notebook pip install commands only reference packages that are registered and expected&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Workflow Security Auditing&lt;/H2&gt;
&lt;P&gt;When GitHub Actions workflow files change, a workflow security job scans for common CI/CD security issues:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Expression injection: Detects patterns like ${{ github.event.pull_request.title }} used directly in run: blocks, which can allow arbitrary code execution.&lt;/LI&gt;
&lt;LI&gt;Overly permissive permissions: Flags workflows that request more permissions than necessary.&lt;/LI&gt;
&lt;LI&gt;Unpinned action references: Detects actions referenced by branch name instead of commit SHA, which is a supply chain risk.&lt;/LI&gt;
&lt;/UL&gt;
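&lt;P&gt;To make the first two issues concrete, the fragment below contrasts a risky interpolation with a safer form that passes the value through an environment variable, and requests a minimal permission set. The job and variable names are hypothetical:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;permissions:
  contents: read  # request only what the job needs

jobs:
  show-title:
    runs-on: ubuntu-latest
    steps:
      # Risky: the title is substituted into the script text before the shell
      # runs, so a crafted PR title can inject commands.
      # - run: echo "${{ github.event.pull_request.title }}"

      # Safer: pass the value through an environment variable and quote it.
      - env:
          PR_TITLE: ${{ github.event.pull_request.title }}
        run: |
          echo "PR title is $PR_TITLE"&lt;/LI-CODE&gt;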
&lt;H2&gt;.NET Binary Analysis with BinSkim&lt;/H2&gt;
&lt;P&gt;For the .NET SDK (Microsoft.AgentGovernance), the CI pipeline runs Microsoft BinSkim binary security analysis on compiled assemblies. BinSkim checks for security-relevant compiler and linker settings in compiled binaries, such as DEP (Data Execution Prevention), ASLR (Address Space Layout Randomization), and stack protection. Results are uploaded as SARIF to GitHub code scanning alongside the CodeQL results.&lt;/P&gt;
&lt;H2&gt;The ci-complete Gate Pattern&lt;/H2&gt;
&lt;P&gt;With many CI jobs that conditionally run based on path filters, AGT uses a pattern called ci-complete: a single gate job that is configured as the sole required status check in branch protection. This job runs unconditionally (if: always()), depends on all other CI jobs, and checks that none of them failed. Jobs that were skipped (because no relevant files changed) are acceptable. This pattern lets branch protection work with conditional CI jobs, avoiding the common problem where a required check that is skipped never reports a passing status and leaves the pull request blocked.&lt;/P&gt;
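&lt;P&gt;A minimal sketch of such a gate job follows. The lint and test jobs are placeholders standing in for the real CI jobs, and the exact condition used in the AGT repository may differ:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint placeholder"
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "test placeholder"

  ci-complete:
    if: always()          # run even when upstream jobs fail or are skipped
    needs: [lint, test]   # list every CI job here
    runs-on: ubuntu-latest
    steps:
      - name: Fail when any upstream job failed or was cancelled
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')
        run: exit 1
      - run: echo "All applicable checks passed or were skipped"&lt;/LI-CODE&gt;
&lt;P&gt;Because only ci-complete is marked as required, adding or removing a CI job means updating the needs list rather than the branch protection settings.&lt;/P&gt;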
&lt;H1&gt;Language-Specific Compile-Time Enforcement&lt;/H1&gt;
&lt;P&gt;Beyond the language-agnostic CI checks, each AGT SDK uses its language's native compiler and tooling to enforce governance standards at compile time.&lt;/P&gt;
&lt;H2&gt;.NET: The Strictest Compile-Time Checks&lt;/H2&gt;
&lt;P&gt;The .NET SDK (Microsoft.AgentGovernance) enforces compile-time governance through MSBuild properties in Directory.Build.props and Directory.Build.targets, which apply automatically to every project in the SDK:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 99.9074%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Feature&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;MSBuild Property&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Effect&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Nullable reference types&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt;Nullable&amp;gt;enable&amp;lt;/Nullable&amp;gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;The compiler warns on every possible null dereference, helping prevent NullReferenceException at compile time&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Warnings as errors&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt;TreatWarningsAsErrors&amp;gt;true&amp;lt;/TreatWarningsAsErrors&amp;gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;All compiler warnings become build errors for packable projects; no warnings can be shipped to consumers&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Strong-name signing&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt;SignAssembly&amp;gt;true&amp;lt;/SignAssembly&amp;gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Assemblies are signed with a strong-name key (AgentGovernance.snk), enabling identity verification&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Deterministic builds&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt;ContinuousIntegrationBuild&amp;gt;true&amp;lt;/ContinuousIntegrationBuild&amp;gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Identical source code produces bit-for-bit identical binaries in CI, enabling build verification&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;SourceLink&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Microsoft.SourceLink.GitHub package&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Users can step into AGT source code when debugging, supporting transparency and auditability&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Symbol packages&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&amp;lt;IncludeSymbols&amp;gt;true&amp;lt;/IncludeSymbols&amp;gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;.snupkg symbol packages are published alongside NuGet packages for debugging support&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;TypeScript: Strict Compilation and Linting&lt;/H2&gt;
&lt;P&gt;The TypeScript SDK (@microsoft/agentmesh-sdk) uses strict compiler settings and ESLint for build-time governance:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Strict mode ("strict": true in tsconfig.json) &lt;/STRONG&gt;enables all strict type-checking options, including noImplicitAny, strictNullChecks, strictFunctionTypes, and strictBindCallApply.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Consistent file naming (forceConsistentCasingInFileNames) &lt;/STRONG&gt;prevents cross-platform issues where imports work on case-insensitive file systems (Windows, macOS) but fail on case-sensitive ones (Linux CI).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Declaration generation (declaration: true with declarationMap: true) &lt;/STRONG&gt;produces .d.ts files for consumers, enabling downstream type checking.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;ESLint with @typescript-eslint &lt;/STRONG&gt;provides static analysis during the build process, catching issues beyond what the TypeScript compiler checks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Python: Type Safety and Fast Linting&lt;/H2&gt;
&lt;P&gt;Python packages in AGT use typed package markers and static analysis tooling configured in pyproject.toml:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;py.typed marker: &lt;/STRONG&gt;Each package includes a py.typed file, signalling to type checkers (mypy, pyright, Pylance) that the package supports type checking. Consumers get type errors if they misuse the AGT API.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;mypy: &lt;/STRONG&gt;Configured as a dev dependency with project-specific settings in pyproject.toml. Provides static type checking that catches type mismatches before runtime.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;ruff: &lt;/STRONG&gt;A fast Python linter written in Rust, configured in pyproject.toml and enforced in CI. Ruff checks for hundreds of code quality rules at build time.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H1&gt;Stage 4: Release-Time Gates&lt;/H1&gt;
&lt;P&gt;Before artifacts reach users, the release pipeline adds a final layer of verification. These gates help ensure that what ships is exactly what was built, is signed by the expected publisher, and has a complete inventory of its components.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="width: 97.037%;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Gate&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Tool&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Produces&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/sbom.yml" target="_blank"&gt;&lt;STRONG&gt;SBOM generation&lt;/STRONG&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Anchore/Syft&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;SPDX and CycloneDX software bills of materials listing every component, dependency, and license&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/publish.yml" target="_blank"&gt;&lt;STRONG&gt;Python signing&lt;/STRONG&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Sigstore&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cryptographic signature using OpenID Connect identity, verifiable without manual key distribution&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/pipelines/esrp-publish.yml" target="_blank"&gt;&lt;STRONG&gt;.NET signing&lt;/STRONG&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Release pipeline&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Microsoft Authenticode and NuGet signing through the release pipeline&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/publish-containers.yml" target="_blank"&gt;&lt;STRONG&gt;Build provenance&lt;/STRONG&gt;&lt;/A&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;actions/attest-build-provenance&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;SLSA provenance attestation linking the artifact to its source commit and build environment&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;SBOM attestation&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;actions/attest-sbom&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Binds the SBOM to the specific release artifact, creating a verifiable link between the inventory and the binary&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
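&lt;P&gt;As a hedged illustration of how the SBOM and attestation gates can be chained, the sketch below generates an SPDX SBOM and attaches both attestations to a placeholder artifact. The workflow name, trigger, build command, and file paths are assumptions; the AGT release workflows linked in the table are the authoritative reference:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/release-attest.yml (illustrative sketch)
name: Release Attestations
on:
  release:
    types: [published]

permissions:
  contents: read
  id-token: write       # OIDC token used to sign the attestations
  attestations: write   # allows the attestations to be stored

jobs:
  attest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the artifact (placeholder, replace with your real build)
        run: |
          mkdir -p dist
          printf 'placeholder build output\n' | tee dist/app.txt
      - uses: anchore/sbom-action@v0
        with:
          path: .
          format: spdx-json
          output-file: sbom.spdx.json
      - uses: actions/attest-build-provenance@v1
        with:
          subject-path: dist/app.txt
      - uses: actions/attest-sbom@v1
        with:
          subject-path: dist/app.txt
          sbom-path: sbom.spdx.json&lt;/LI-CODE&gt;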
&lt;P&gt;Additionally, the &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/.github/workflows/scorecard.yml" target="_blank"&gt;OpenSSF Scorecard&lt;/A&gt; runs on a schedule, providing an automated security posture assessment that covers branch protection, dependency management, CI/CD practices, and more. The score is published to the OpenSSF Scorecard website, giving consumers a transparent view of the project's security practices.&lt;/P&gt;
&lt;H1&gt;How It All Fits Together: Defense in Depth&lt;/H1&gt;
&lt;P&gt;This approach follows a defense-in-depth principle: &lt;STRONG&gt;every check exists at multiple layers, so that bypassing one layer does not compromise the whole system.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Secret scanning, for example, runs at three levels: detect-secrets at commit time (pre-commit hook), Gitleaks at PR time (secret scanning workflow), and the Security Scan action at CI time (content analysis). A developer who bypasses pre-commit hooks will still be caught by the PR-time gate. A contributor who force-pushes past the PR gate will still be caught by the CI pipeline.&lt;/P&gt;
&lt;P&gt;Similarly, policy validation runs at commit time (validate-policy hook), at PR time (quality gates), and at CI time (policy validation workflow). Each layer adds depth: the commit-time hook catches schema errors, while the CI pipeline catches semantic issues and runs regression tests.&lt;/P&gt;
&lt;P&gt;The ci-complete gate job ties everything together. By depending on every CI job and serving as the single required status check, it ensures that no code merges to the main branch unless every applicable check has passed.&lt;/P&gt;
&lt;H1&gt;Getting Started&lt;/H1&gt;
&lt;P&gt;You can adopt AGT's shift-left governance incrementally. Here are three starting points, from lowest to highest effort:&lt;/P&gt;
&lt;H2&gt;1. Add the Governance Verify Action (5 minutes)&lt;/H2&gt;
&lt;P&gt;Add a single &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/tree/main/action" target="_blank"&gt;GitHub Actions workflow&lt;/A&gt; that runs the compliance check on every PR:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# .github/workflows/governance.yml
name: Governance
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: governance-verify&lt;/LI-CODE&gt;
&lt;H2&gt;2. Enable Pre-Commit Hooks (15 minutes)&lt;/H2&gt;
&lt;P&gt;Add a &lt;A class="lia-external-url" href="https://github.com/microsoft/agent-governance-toolkit/blob/main/docs/operations/pre-commit-hook-template.md" target="_blank"&gt;.pre-commit-config.yaml&lt;/A&gt; referencing AGT's hooks, install them, and run against all existing files to establish a baseline. Start in permissive mode and graduate to strict over four weeks.&lt;/P&gt;
&lt;H2&gt;3. Full Pipeline Integration (1-2 hours)&lt;/H2&gt;
&lt;P&gt;Add the complete set of PR-time gates (attestation, dependency review, secret scanning, supply chain checks, quality gates), configure the Security Scan action for your plugin directories, and enable SBOM generation and signing in your release workflow. The AGT repository itself serves as a reference implementation: every workflow described in this post is running in production at aka.ms/agent-governance-toolkit.&lt;/P&gt;
&lt;H1&gt;Important Notes&lt;/H1&gt;
&lt;P&gt;&lt;EM&gt;The policy files, workflow configurations, and code samples in this post are illustrative examples. Your organization's governance requirements may differ. Review and customize all configurations before deploying to production. The Agent Governance Toolkit is designed to help organizations implement governance controls for AI agents; it does not guarantee compliance with any specific regulatory framework. Always consult your organization's security and legal teams when defining governance policies.&lt;/EM&gt;&lt;/P&gt;
&lt;H1&gt;What Comes Next&lt;/H1&gt;
&lt;P&gt;Pre-runtime governance is one piece of the puzzle. Combined with the runtime governance capabilities covered in part one of this series (policy engines, zero-trust identity, execution sandboxing, audit logging), it provides coverage across the full lifecycle.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The project continues to grow.&lt;/STRONG&gt; Since the initial release, we’ve added a multi-stage policy pipeline (pre_input, pre_tool, post_tool, pre_output stages), approval workflows with human-in-the-loop gates, DLP attribute ratchets for monotonic session state, and OpenTelemetry instrumentation for governance operations. Over 45 step-by-step tutorials are available in the documentation.&lt;/P&gt;
&lt;P&gt;Everything described in this post is available today in the public GitHub repository. The full source, documentation, tutorials, and examples are at &lt;A class="lia-external-url" href="http://aka.ms/agent-governance-toolkit" target="_blank"&gt;aka.ms/agent-governance-toolkit&lt;/A&gt;, open source under the MIT license. We welcome contributions, feedback, and issue reports from the community.&lt;/P&gt;</description>
      <pubDate>Fri, 01 May 2026 16:48:09 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/shift-left-governance-for-ai-agents-how-the-agent-governance/ba-p/4516481</guid>
      <dc:creator>mosiddi</dc:creator>
      <dc:date>2026-05-01T16:48:09Z</dc:date>
    </item>
    <item>
      <title>Project Pavilion Presence at KubeCon EU 2026</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/project-pavilion-presence-at-kubecon-eu-2026/ba-p/4515518</link>
      <description>&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;KubeCon + CloudNativeCon Europe 2026 took place from 23 to 26 March at RAI Amsterdam, and it was a strong one. The themes running through the week reflected where the cloud native community is right now: AI moving from experimentation into production, platform engineering continuing to mature, and security and sovereignty top of mind for organizations across Europe. Microsoft was there throughout, and once again supported a range of open source projects in the Project Pavilion.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The Project Pavilion is a dedicated, vendor-neutral space on the show floor reserved for CNCF projects. It is where the work gets talked about honestly. Maintainers and contributors meet directly with end users, share what they are building, get real feedback on what is and is not working, and have the kinds of technical conversations that are hard to have anywhere else. For open source communities, it is one of the most valuable parts of the event.&lt;/SPAN&gt; &lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Why Our Presence Matters&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Microsoft's products and services are built on and alongside many of the technologies represented in the pavilion, and the health of these communities matters to us directly. Showing up means our teams hear firsthand what is working, what is missing, and where these projects need to go next. It also means we get to contribute as community members, not just as a company name on a sponsor board. That distinction matters to us, and to the communities we are part of.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2 aria-level="2"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Microsoft-Supported Pavilion Projects&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H2&gt;
&lt;H3&gt;Confidential Containers&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representative: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;Jeremi Piotrowski&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/confidential-containers/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Confidential&lt;/SPAN&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt; Containers&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth gave attendees a chance to learn more about the project and its approach to protecting workloads using hardware-based trusted execution environments. Jeremi was on hand throughout the kiosk hours, fielding questions from interested users and developers exploring confidential computing in Kubernetes environments. Conversations touched on use cases around data privacy, regulated workloads, and the role Confidential Containers plays in the broader cloud-native security landscape.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Drasi&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representative: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;Daniel Gerlag and Nandita Valsan&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/drasi/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Drasi&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="none"&gt; team &lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;had a busy time in the pavilion, engaging around 40 attendees across two kiosk shifts in focused technical conversations. Most visitors were developers and platform engineers curious about change-driven architectures and real-time data processing. There was strong positive feedback on the newly introduced Drasi Server modes and embeddable library, which complement Drasi for Kubernetes. The team came away with useful validation of current design decisions and good input for the roadmap ahead.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Envoy&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representative: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;Mikhail Krinkin&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/envoy/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Envoy&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth was staffed for the full duration of KubeCon EU by maintainers from Microsoft, Google, Isovalent, and Tetrate, reflecting the broad and healthy contributor base behind the project. The biggest topic at the booth was migration from ingress-nginx to Gateway API implementations. The archival of ingress-nginx pushed a lot of users into making changes they were not quite ready for, and questions ranged from technical specifics like HTTP default differences between Envoy and Nginx, to more foundational questions about what Envoy and Gateway API actually are. The team had anticipated this and invested in the &lt;/SPAN&gt;&lt;A href="https://github.com/kubernetes-sigs/ingress2gateway" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;ingress2gateway project&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; to give users a clear migration path. Extensibility was another frequent conversation topic, with dynamic modules increasingly becoming the go-to answer for user-specific requirements. Starting with the 1.38 release of Envoy, dynamic modules will have a backward compatible ABI, a sign of real production readiness for that feature.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Flatcar&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="none"&gt;Representatives: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-contrast="auto"&gt;Thilo Fromm and Mathieu Tortuyaux&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/flatcar-container-linux/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Flatcar&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth had great energy, with maintainers from Microsoft, STACKIT, and CloudBase joining for conversations throughout the pavilion hours. Operational sovereignty came up again and again as a theme, with users and consulting partners sharing how they are building their Kubernetes offerings on Flatcar because of how reliable and secure it is.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;There were a lot of meaningful conversations. Lambda.ai currently runs Flatcar on their control plane and is looking at extending it to worker and customer clusters, with interest in contributing to the project. ReeVo has built their hosted Kubernetes distro on Flatcar across multi-cloud and bare metal environments and is planning to move hundreds of customer clusters over soon. Users from ClearScore, Avassa, Recorded Future, and several other organizations also stopped by with positive feedback on the project's robustness and security. STACKIT uses Flatcar as the default OS for their hosted Kubernetes offering and sponsors a full-time maintainer for the project. The team also connected with TAG Infrastructure to talk through Flatcar's CNCF graduation progress.&lt;/SPAN&gt; &lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Headlamp&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt;Representatives:&lt;/STRONG&gt; &lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;René Dudfield and Santhosh Nagaraj S&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/headlamp/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Headlamp&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth was a busy one, with users, contributors, and partner projects all stopping by throughout the pavilion hours. Conversations covered real-world deployments, federation challenges, multi-tenant namespace visibility, and feature requests like multi-CR data aggregation. There was notable interest from consultancies deploying Headlamp across hundreds of customer clusters, as well as from companies already running it at cloud scale. Several CNCF projects expressed interest in building UIs for their own projects inside Headlamp, with a few even getting started right there at the conference. The team also heard from users getting budget approved to migrate from the deprecated Kubernetes Dashboard, which is a good sign for the project's growing momentum. Demand for air-gapped AI agent support and deeper Azure and AKS integrations for internal developer platforms came up as clear areas to watch.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Hyperlight&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representative: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;Ralph Squillace&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/hyperlight/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Hyperlight&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth ran as a half-day session on Tuesday, in line with the project's current Sandbox status, but the corner location in the project area made a real difference in visibility. Ralph was fielding questions from the moment the doors opened, with a steady stream of visitors right up until the shift ended. Live and recorded demos were central to the conversations, helping attendees quickly grasp what Hyperlight does and how it fits into their environments. One standout visit came from an engineer at SAP who spent nearly an hour at the booth, pushing the conversation from fundamentals and &lt;/SPAN&gt;&lt;A href="https://github.com/hyperlight-dev/hyperlight-wasm" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;embedding examples&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; all the way through to agentic protection scenarios in Kubernetes. That conversation continued beyond KubeCon and turned into a scheduled meeting to explore a proof of concept, a good example of the kind of follow-on engagement the pavilion can generate.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Inspektor Gadget&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representatives: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;Michael Friese and Qasim Sarfraz&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/inspektor-gadget/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Inspektor Gadget&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth had a lot of great energy, drawing in contributors, new users, and people just discovering the project for the first time. There was genuine excitement around &lt;/SPAN&gt;&lt;A href="https://github.com/inspektor-gadget/ig-desktop" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Inspektor Gadget Desktop&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; and its visual troubleshooting experience for Kubernetes and Linux environments. The &lt;/SPAN&gt;&lt;A href="https://inspektor-gadget.io/blog/2026/03/inspektor-gadget-holmesgpt/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;integration with HolmesGPT&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;, which was also &lt;/SPAN&gt;&lt;A href="https://www.youtube.com/watch?v=Apha61UYfLY" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;featured in the keynote&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;, came up frequently and was one of the main talking points throughout the event. A theme that surfaced consistently in conversations with platform engineers was multi-tenancy, with teams looking for ways to safely give developers ad-hoc access to troubleshoot issues independently while keeping overall control at the platform level. It was a good set of conversations that reflected both the project's maturity and the growing demand for a flexible observability framework.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Istio&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representatives:&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;&lt;STRONG&gt; &lt;/STRONG&gt;Mitch Connors, Mikhail Krinkin, Jackie Maertens, and Mike Morris&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/istio/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Istio&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth had steady traffic throughout the conference, with a noticeable shift in who was stopping by. More visitors came from teams with existing sidecar-based production deployments looking for guidance on moving to &lt;/SPAN&gt;&lt;A href="https://istio.io/latest/docs/ambient/overview/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;ambient mode&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;, which is a change from previous years when ambient interest was mostly coming from greenfield users. The motivation to make the move was often tied to cost optimization and performance, with teams having read case studies and feeling more confident about the direction. That said, the increased interest also surfaced some real gaps, including requests for clearer migration guidance, more clarity around architectural differences like mTLS egress workflows, and better support for VM-based workloads. The team is planning to prioritize migration guidance over the coming months. The updated Istio Day format, with a half day of sessions at the Cloud Native Theater stage, also drew a strong crowd with standing room only throughout.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Notary Project&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representatives: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Toddy Mladenov and Flora Taagen&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/notary-project/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Notary Project&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; kiosk drew a wide range of visitors, from people learning about container image signing for the first time to experienced engineers asking detailed questions about what is coming next on the roadmap. A major highlight of the week was the project's conference talk on per-layer dm-verity signing, which drew a packed room and over 660 online sign-ups, one of the stronger turnouts for a project-level session at the event. The talk walked through how the new capability moves container security beyond pull-time verification to continuous runtime protection, backed by dm-verity, which generated a lengthy Q&amp;amp;A and a lot of enthusiasm from the audience. The team also sees a real opportunity ahead as AI workloads push organizations to think harder about the integrity of models, datasets, and container images, and the interest at the booth reinforced that Notary Project is well positioned to play a big role in securing those workflows.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;ORAS&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representative: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Toddy Mladenov&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/oras/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;ORAS&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; kiosk was staffed by maintainers from Microsoft, NVIDIA, and Red Hat, a good reflection of the healthy multi-vendor community the project has built. Attendees engaged with maintainers on ORAS use cases and adoption, with conversations ranging from how artifacts are tagged and packaged to how ORAS fits into broader multi-cloud workflows. One practical takeaway from maintainer conversations was around leveraging the &lt;/SPAN&gt;&lt;A href="https://github.com/oras-project/oras-go" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;ORAS SDK&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; more often as a substitute for CLI operations when working with container registries, which helps teams build simpler and more robust tooling.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H3&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Heading 3 Char"&gt;Radius&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="none"&gt;Representatives: &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="none"&gt;Sylvain Niles and Will Tsai&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;/SPAN&gt;&lt;A href="https://www.cncf.io/projects/radius/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Radius&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; booth, supported by the &lt;/SPAN&gt;&lt;A href="https://www.azureincubations.io/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Microsoft Azure Incubations&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; team, attracted a good mix of enterprise platform teams, prospective adopters, and fellow open source maintainers throughout the pavilion hours. There was strong interest in the extensible &lt;/SPAN&gt;&lt;A href="https://docs.radapp.io/guides/author-apps/custom-types/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Radius Resource Types&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; feature and how it helps teams abstract infrastructure complexity and move workloads across different environments. Conversations also surfaced useful feedback on where the project should focus next, including agent-driven infrastructure workflows and using the Radius application graph to improve observability and operational visibility for cloud-native applications.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2 aria-level="2"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-parastyle="heading 2"&gt;Conclusion&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:false,&amp;quot;134245529&amp;quot;:false,&amp;quot;201341983&amp;quot;:0,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:276}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;KubeCon EU 2026 was a good reminder of why this community continues to grow. The conversations in the Project Pavilion were substantive, the feedback was honest, and the connections made there will carry forward into the work. Microsoft will be back for KubeCon NA in Salt Lake City this November, and we are already looking forward to it.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;If you are interested in getting involved with any of these projects, the best starting point is each project's community directly. You are also welcome to reach out to Lexi Nadolski at &lt;/SPAN&gt;&lt;A href="mailto:lexinadolski@microsoft.com" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;lexinadolski@microsoft.com&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; with any questions.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134233117&amp;quot;:false,&amp;quot;134233118&amp;quot;:false,&amp;quot;335551550&amp;quot;:0,&amp;quot;335551620&amp;quot;:0,&amp;quot;335559738&amp;quot;:240,&amp;quot;335559739&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 28 Apr 2026 12:50:16 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/project-pavilion-presence-at-kubecon-eu-2026/ba-p/4515518</guid>
      <dc:creator>lexinadolski</dc:creator>
      <dc:date>2026-04-28T12:50:16Z</dc:date>
    </item>
    <item>
      <title>Getting Started with the SUSE Multi-Linux Manager MCP Server and GitHub Copilot</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/getting-started-with-the-suse-multi-linux-manager-mcp-server-and/ba-p/4513494</link>
      <description>&lt;P&gt;Enterprise Linux environments are heterogeneous. That's not a problem statement - it's just the truth. SUSE, Ubuntu, RHEL, and their downstream variants coexist in every data center I've seen, and increasingly across Azure subscriptions too. AI assistants like GitHub Copilot can already connect to these machines, run commands, troubleshoot issues, and apply patches one box at a time. But if you're managing a fleet of hundreds or thousands of systems across distributions, the gap isn't whether AI can touch your infrastructure. It's whether it can work through the centralized management tooling where your inventory, patch orchestration, RBAC, and audit trails actually live.&lt;/P&gt;
&lt;P&gt;SUSE just took a meaningful step to close that gap. Their Multi-Linux Manager MCP Server, built on the open source Uyuni project, gives AI agents like GitHub Copilot a structured, authenticated interface to your existing management platform. Not the individual boxes. The management plane where your centralized inventory, CVE auditing, cross-distribution patch scheduling, and RBAC already live. Not a rip-and-replace. Not a new console to learn. A way to talk to the infrastructure management you've already built.&lt;/P&gt;
&lt;P&gt;This post walks through what the MCP server does, why it matters in an Azure context, and how to get it wired up with GitHub Copilot so you can start working with it today.&lt;/P&gt;
&lt;P&gt;The Model Context Protocol (MCP) is an open standard that defines how AI models connect to external tools and data sources. Think of it as the USB-C of AI integrations - a common interface so that different clients (GitHub Copilot, Claude Desktop, Gemini CLI) can talk to different servers (Azure, SUSE, databases, APIs) without bespoke glue code for every combination.&lt;/P&gt;
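&lt;P&gt;To make the "common interface" concrete: MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with the tools/call method. The sketch below only builds such a request as a Python dictionary. The tool name list_systems is a hypothetical placeholder, not a confirmed tool of the SUSE server, and transport, session setup, and authentication are left out.&lt;/P&gt;
&lt;PRE&gt;
```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "list_systems" is a hypothetical tool name used only for illustration.
request = build_tool_call(1, "list_systems", {})
print(json.dumps(request, indent=2))
```
&lt;/PRE&gt;
&lt;P&gt;Because every client and server agree on this envelope, swapping GitHub Copilot for another MCP client should not require changes on the server side.&lt;/P&gt;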
&lt;H3&gt;Why This Matters for Azure Customers&lt;/H3&gt;
&lt;P&gt;If you are running Linux workloads on Azure - whether for SAP, HPC, or traditional enterprise applications - the Multi-Linux Manager MCP server provides a conversational interface for your infrastructure without requiring you to change tools.&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-level="1"&gt;Management-plane depth, not just infrastructure inventory. Azure and Copilot already give you fleet-wide visibility into your VMs. The SUSE MCP server adds the layer underneath: patch scheduling state, erratum tracking, cross-distribution CVE audits, and system group management that lives in your Multi-Linux Manager instance.&lt;/LI&gt;
&lt;LI aria-level="1"&gt;A single pane of glass. Pair this with the Azure MCP Server and your AI assistant can move between Azure resource operations and OS-level fleet management in one conversation, across the distributions Multi-Linux Manager supports, without switching tools or contexts.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;What You Can Actually Do With It&lt;/H3&gt;
&lt;P&gt;The MCP server exposes over 20 practical tools for day-to-day infrastructure operations. Instead of relying on a generic knowledge base, Copilot queries your actual infrastructure.&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-level="1"&gt;Inventory and Inspection: You can list active systems across your fleet or pull detailed event histories for specific machines.&lt;/LI&gt;
&lt;LI aria-level="1"&gt;Patch Management and CVE Response: Copilot can rapidly audit all systems for pending updates or identify specific machines vulnerable to a new CVE.&lt;/LI&gt;
&lt;LI aria-level="1"&gt;Operational Actions: You can list system groups, register new systems, or schedule server reboots.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;The Security Model: Human-in-the-Loop&lt;/H3&gt;
&lt;P&gt;Letting an AI agent touch production infrastructure raises the obvious question: what keeps it from doing something destructive? SUSE has been deliberate about this by designing the MCP server with a default "human-in-the-loop" security model.&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-level="1"&gt;Read-Only by Default: The server ships with all write actions disabled (UYUNI_MCP_WRITE_TOOLS_ENABLED=false).&lt;/LI&gt;
&lt;LI aria-level="1"&gt;Explicit Confirmation: If you enable write tools, Copilot is required to ask for your explicit confirmation before executing state-changing actions like applying patches or scheduling reboots.&lt;/LI&gt;
&lt;LI aria-level="1"&gt;Enterprise Authentication: The server supports OAuth 2.0, ensuring the AI agent authenticates through your identity provider.&lt;/LI&gt;
&lt;LI aria-level="1"&gt;Layered Governance: Combined with Multi-Linux Manager’s role-based access control (RBAC) and the principle of least privilege for the service account, you get layered governance without bolting on a separate approval system.&lt;/LI&gt;
&lt;/UL&gt;
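&lt;P&gt;The first two layers can be pictured as a small gate in front of every tool call. The following is a minimal sketch of that idea, not SUSE's implementation: the tool names are hypothetical, and how the real server parses UYUNI_MCP_WRITE_TOOLS_ENABLED is an assumption.&lt;/P&gt;
&lt;PRE&gt;
```python
import os

# Hypothetical tool names; the real server's tool list may differ.
WRITE_TOOLS = {"schedule_reboot", "apply_patches"}


def write_tools_enabled(env=None):
    """Treat anything other than an explicit 'true' as disabled (read-only default)."""
    env = os.environ if env is None else env
    value = env.get("UYUNI_MCP_WRITE_TOOLS_ENABLED", "false")
    return value.strip().lower() == "true"


def authorize(tool_name, confirmed, env=None):
    """Return True only if the tool call may proceed."""
    if tool_name not in WRITE_TOOLS:
        return True  # read-only tools are always allowed
    if not write_tools_enabled(env):
        return False  # write tools stay off unless explicitly enabled
    return confirmed  # enabled write tools still need a human "yes"
```
&lt;/PRE&gt;
&lt;P&gt;With an empty environment, authorize("schedule_reboot", True, {}) is refused. With the variable set to "true", the same call proceeds only when the operator has confirmed it.&lt;/P&gt;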
&lt;P&gt;AI-assisted operations that bypass human judgment won't get adopted in enterprises. AI-assisted operations that make the human faster while keeping them in control: that's the model that actually ships.&lt;/P&gt;
&lt;H3&gt;Architecture on Azure&lt;/H3&gt;
&lt;P&gt;Here's the topology we're working with:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;SUSE Multi-Linux Manager - Running on an Azure VM, managing your Linux fleet across distributions. This is the control plane for your systems - inventory, patching, configuration. Available on Azure Marketplace.&lt;/LI&gt;
&lt;LI&gt;MCP Server - Runs as a container (Docker/Podman), either locally alongside your dev environment or as a standalone HTTP service. The MCP Server container is available in &lt;A href="https://registry.suse.com/repositories/suse-agentic-mcp-multi-linux-manager" target="_blank"&gt;SUSE Registry&lt;/A&gt; and is backed by a secure, trusted software supply chain.&lt;/LI&gt;
&lt;LI&gt;GitHub Copilot - In VS Code or the CLI. Configured to use the MCP server as a tool source. Sends natural language requests, receives structured responses from your infrastructure.&lt;/LI&gt;
&lt;LI&gt;Your Linux fleet on Azure - Whatever Multi-Linux Manager manages for you. The MCP server doesn't care about the distribution mix; that's the whole point of Multi-Linux Manager.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2&gt;Getting Started: Step by Step&lt;/H2&gt;
&lt;H3&gt;Prerequisites&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;A running SUSE Multi-Linux Manager instance managing your Linux estate&lt;/LI&gt;
&lt;LI&gt;Docker or Podman installed on your workstation (for local deployment) or network access to a remote MCP server instance&lt;/LI&gt;
&lt;LI&gt;GitHub Copilot with agent mode enabled (VS Code or CLI)&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Step 1: Stand up the MCP Server&lt;/H3&gt;
&lt;P&gt;For local deployment, pull the container and point it at your Multi-Linux Manager instance following the project documentation. For remote/team deployments, your administrator can run the server as a standalone HTTP service with OAuth 2.0.&lt;/P&gt;
&lt;H3&gt;Step 2: Configure GitHub Copilot&lt;/H3&gt;
&lt;P&gt;In VS Code, open the Command Palette and type GitHub Copilot: Configure MCP Servers. Add your server to the config:&lt;/P&gt;
&lt;P&gt;{&lt;BR /&gt;&amp;nbsp; "mcpServers": {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; "suse-multi-linux-manager": {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; "type": "http",&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; "url": "https://your-mcp-server.example.com/mcp"&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; }&lt;BR /&gt;}&lt;/P&gt;
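&lt;P&gt;Before restarting the editor, it can help to sanity-check the file. The sketch below parses the same JSON and flags obvious mistakes. The specific checks (an "http" type and an https URL) are assumptions chosen for this example, not requirements documented by GitHub Copilot or SUSE.&lt;/P&gt;
&lt;PRE&gt;
```python
import json
from urllib.parse import urlparse

CONFIG_TEXT = """
{
  "mcpServers": {
    "suse-multi-linux-manager": {
      "type": "http",
      "url": "https://your-mcp-server.example.com/mcp"
    }
  }
}
"""


def check_config(text):
    """Return a list of problems found in the MCP server entries."""
    problems = []
    servers = json.loads(text).get("mcpServers", {})
    if not servers:
        problems.append("no servers defined under 'mcpServers'")
    for name, entry in servers.items():
        url = urlparse(entry.get("url", ""))
        if entry.get("type") != "http":
            problems.append(f"{name}: expected type 'http'")
        if url.scheme != "https" or not url.netloc:
            problems.append(f"{name}: url should be an https endpoint")
    return problems


print(check_config(CONFIG_TEXT) or "config looks sane")
```
&lt;/PRE&gt;
&lt;P&gt;A malformed file fails early in json.loads, which is usually easier to debug than a server that silently never shows up in the tool list.&lt;/P&gt;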
&lt;H3&gt;Step 3: Verify the Connection&lt;/H3&gt;
&lt;P&gt;Open GitHub Copilot and try a read-only query:&lt;/P&gt;
&lt;P&gt;"List all active systems managed by my SUSE Multi-Linux Manager."&lt;/P&gt;
&lt;P&gt;If your fleet inventory appears, you're connected.&lt;/P&gt;
&lt;H3&gt;Step 4: Start Operating&lt;/H3&gt;
&lt;P&gt;"Are any of my systems affected by CVE-2026-XXXX?"&lt;/P&gt;
&lt;P&gt;"Show me all systems that have pending but unscheduled security patches."&lt;/P&gt;
&lt;P&gt;"Which systems need a reboot?"&lt;/P&gt;
&lt;H2&gt;Getting Involved&lt;/H2&gt;
&lt;P&gt;The SUSE Multi-Linux Manager MCP server is open source under the Apache 2.0 license, built on the Uyuni project. The current v0.5 is a tech preview. Feedback goes to uyuni-project/uyuni#10562, bugs to GitHub Issues.&lt;/P&gt;
&lt;P&gt;The gap in AI-assisted Linux operations was never whether AI could reach your infrastructure. It was whether it could work through the management tooling where your fleet-scale decisions actually get made. SUSE built the bridge to that layer. GitHub Copilot is the conversational interface. Your fleet is already there. Go connect them.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/getting-started-with-the-suse-multi-linux-manager-mcp-server-and/ba-p/4513494</guid>
      <dc:creator>abbottkarl</dc:creator>
      <dc:date>2026-04-22T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Dissecting LLM Container Cold-Start: Where the Time Actually Goes</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/dissecting-llm-container-cold-start-where-the-time-actually-goes/ba-p/4508831</link>
      <description>&lt;H1&gt;Dissecting LLM Container Cold-Start: Where the Time Actually Goes&lt;/H1&gt;
&lt;P&gt;Cold-start latency determines whether GPU clusters can scale to zero, how fast they can autoscale, and whether bursty or low-QPS workloads are economically viable. Most optimization effort targets the container pull path – faster registries, lazy-pull snapshotters, different compression formats. But “cold-start” is actually a composite of pull, runtime startup, and model initialization, and the dominant phase varies dramatically by inference engine. An optimization that cuts time-to-first-token for one engine can be irrelevant for another, even on identical infrastructure.&lt;/P&gt;
&lt;H2&gt;What we measured&lt;/H2&gt;
&lt;P&gt;We decomposed cold-start for two architecturally different engines – vLLM (Python/CUDA, heavy JIT compilation) and llama.cpp (C++, minimal runtime) – running Llama 3.1 8B on A100 GPUs. Every run starts from a completely clean slate: containerd stopped, all state wiped, kernel page caches dropped. No warm starts, no pre-pulling, no caching.&lt;/P&gt;
&lt;P&gt;We break time-to-first-token (TTFT) into three phases: &lt;STRONG&gt;pull&lt;/STRONG&gt; (download + decompression + snapshot creation), &lt;STRONG&gt;startup&lt;/STRONG&gt; (container start → server ready), and &lt;STRONG&gt;first inference&lt;/STRONG&gt; (first API response, including model weight loading for engines that defer it). We tested across three snapshotters (overlayfs, EROFS, Nydus) with gzip and uncompressed images, pulling from same-region Azure Container Registry.&lt;/P&gt;
&lt;H2&gt;Setup&lt;/H2&gt;
&lt;P&gt;All experiments ran on an NVIDIA A100 80GB (Azure NC24ads_A100_v4), pulling from same-region Azure Container Registry. Images were built with &lt;A href="https://github.com/kaito-project/aikit" target="_blank"&gt;AIKit&lt;/A&gt;, which produces &lt;A href="https://github.com/modelpack/model-spec" target="_blank"&gt;ModelPack&lt;/A&gt;-compliant OCI artifacts with uncompressed model weight layers, Cosign signatures, SBOMs, and provenance attestations. These are supply chain properties you lose when model weights live on a shared drive.&lt;/P&gt;
&lt;H2&gt;vLLM: startup dominates, pull barely matters&lt;/H2&gt;
&lt;P&gt;vLLM loads model weights, runs torch.compile, captures CUDA graphs for multiple batch shapes, allocates KV cache, and warms up, all before serving the first request. This takes ~176 seconds regardless of how fast the image arrived.&lt;/P&gt;
&lt;P&gt;The breakdown makes the bottleneck obvious: the green bar (startup) is nearly constant across all four variants, swamping any pull-time differences.&lt;/P&gt;
&lt;P&gt;Figure 1: vLLM cold-start breakdown. Startup (green, ~176s) dominates regardless of snapshotter.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table style="height: 195px;"&gt;&lt;thead&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Method&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Pull&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;Startup&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;1st Inference&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;TTFT&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;overlayfs (gzip)&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;140.8s ±5.5&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;176.0s ±3.2&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;0.16s&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;317.2s ±2.2&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;overlayfs (uncomp.)&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;129.9s ±3.3&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;180.8s ±12.2&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;0.16s&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;310.9s ±8.9&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;EROFS (gzip)&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;158.9s ±8.8&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;175.3s ±0.8&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;0.16s&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;334.4s ±8.7&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 39px;"&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;EROFS (uncomp.)&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;166.3s ±21.1&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;177.3s ±12.8&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;0.16s&lt;/P&gt;
&lt;/td&gt;&lt;td style="height: 39px;"&gt;
&lt;P&gt;343.8s ±8.2&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;EM&gt;Llama 3.1 8B, ~14 GB image, n=2–3 per variant. ± = sample standard deviation. Three of twelve runs hit intermittent NVIDIA container runtime crashes (exit code 120, unrelated to snapshotters) and were excluded. We excluded Nydus because FUSE-streaming the 14 GB Python/CUDA stack caused startup to exceed 900s. Note: the EROFS uncompressed pull time (166.3s ±21.1) is slower than EROFS gzip, with a standard deviation that swallows the effect — this cell is essentially noise at n=2. Steady-state inference: ~0.134s across all snapshotters.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;44% pull, 56% startup.&lt;/STRONG&gt; Dropping gzip saves 6 seconds of end-to-end TTFT on a 317-second cold start (&lt;STRONG&gt;1.02x&lt;/STRONG&gt;). If your engine is vLLM, optimizing the pull pipeline is the wrong lever.&lt;/P&gt;
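&lt;P&gt;The split above follows directly from the table. As a minimal sketch (the function name phase_shares is ours, not something from the benchmark scripts), each phase’s share is its mean duration divided by the sum of the three phase means, shown here for the overlayfs (gzip) row:&lt;/P&gt;

```python
def phase_shares(pull_s, startup_s, first_inference_s):
    """Return each phase's fraction of the summed cold-start time."""
    total = pull_s + startup_s + first_inference_s
    return {
        "pull": pull_s / total,
        "startup": startup_s / total,
        "first_inference": first_inference_s / total,
    }


# Mean phase durations for vLLM on overlayfs (gzip), taken from the table above.
vllm_gzip = phase_shares(pull_s=140.8, startup_s=176.0, first_inference_s=0.16)

for phase, share in vllm_gzip.items():
    print(f"{phase}: {share:.1%}")
```

&lt;P&gt;The three means sum to slightly less than the reported 317.2s TTFT, presumably because each column is averaged and rounded separately; the shares do not change at this precision.&lt;/P&gt;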
&lt;H2&gt;llama.cpp: pull dominates, compression is the bottleneck&lt;/H2&gt;
&lt;P&gt;llama.cpp has the opposite profile. Its C++ runtime starts in 2–5 seconds, so the pull becomes the majority of cold-start. This is where filesystem and compression choices actually matter.&lt;/P&gt;
&lt;P&gt;Here the picture flips. Pull (blue) is the widest bar, and the gzip-to-uncompressed difference is visible at a glance:&lt;/P&gt;
&lt;P&gt;Figure 2: llama.cpp cold-start breakdown. Pull time (blue) dominates for gzip variants.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Method&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pull&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Startup&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;1st Inference&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;TTFT&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;overlayfs (gzip)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;88.3s ±0.2&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;5.3s ±0.5&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;45.1s ±1.4&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;138.8s ±0.8&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;overlayfs (uncomp.)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;56.3s ±3.1&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;2.0s ±0.0&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;44.2s ±0.1&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;102.4s ±3.1&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;EROFS (gzip)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;92.0s ±2.3&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;6.1s ±0.5&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;44.0s ±0.2&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;142.3s ±1.9&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;EROFS (uncomp.)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;58.8s ±0.6&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;2.0s ±0.0&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;44.0s ±0.1&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;104.8s ±0.5&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;EM&gt;Llama 3.1 8B Q4_K_M, 8.7 GB image uncompressed, n=3 per variant, 12/12 runs succeeded. First inference includes model weight loading into GPU VRAM (~43s) plus token generation (~1.5s). Steady-state inference: ~1.5s across all snapshotters.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;64% pull, 4% startup, 33% model loading.&lt;/STRONG&gt; Dropping gzip saves 36 seconds (&lt;STRONG&gt;1.35x&lt;/STRONG&gt;) with zero infrastructure changes.&lt;/P&gt;
&lt;H2&gt;Engine comparison&lt;/H2&gt;
&lt;P&gt;Placed side by side, the two engines tell opposite stories about the same infrastructure:&lt;/P&gt;
&lt;P&gt;Figure 3: Where cold-start time goes. vLLM is compute-bound; llama.cpp is pull-bound.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;vLLM&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;llama.cpp&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Time saved by dropping gzip&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;6s (2% of TTFT)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;36s (26% of TTFT)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Startup time&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;176–181s&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;2–5s&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Speedup from dropping gzip&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;1.02x&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;1.35x&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Same optimization, completely different impact. Before investing in pull optimization (compression changes, lazy-pull infrastructure, registry tuning), profile your engine’s startup. If startup dominates, the pull isn’t where the time goes.&lt;/P&gt;
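&lt;P&gt;The comparison can be reproduced from the overlayfs TTFT means in the two engine tables. This is a rough sketch: gzip_speedup is a hypothetical helper, and the inputs are the rounded table values rather than the raw runs.&lt;/P&gt;

```python
def gzip_speedup(ttft_gzip_s, ttft_uncompressed_s):
    """Return (seconds saved, speedup factor, fraction of gzip TTFT saved)."""
    saved = ttft_gzip_s - ttft_uncompressed_s
    return saved, ttft_gzip_s / ttft_uncompressed_s, saved / ttft_gzip_s


# Mean overlayfs TTFT values (gzip, uncompressed) from the two tables above.
engines = {
    "vLLM": (317.2, 310.9),
    "llama.cpp": (138.8, 102.4),
}

for name, (gzip_s, uncompressed_s) in engines.items():
    saved, factor, fraction = gzip_speedup(gzip_s, uncompressed_s)
    print(f"{name}: {saved:.1f}s saved, {factor:.3f}x, {fraction:.0%} of TTFT")
```

&lt;P&gt;Because the inputs are rounded, the llama.cpp factor lands a few thousandths above the 1.35x quoted in this post, which presumably comes from unrounded data.&lt;/P&gt;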
&lt;H2&gt;Why gzip hurts: model weights are incompressible&lt;/H2&gt;
&lt;P&gt;The llama.cpp AIKit image is 8.7 GB uncompressed, 6.6 GB with gzip (a modest 0.76x ratio). But this ratio hides what’s really happening:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Layer type&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Size&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;% of image&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Gzip ratio&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Model weights (GGUF)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;4.9 GB&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;56%&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~1.00x (quantized binary, no redundancy)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;CUDA + system layers&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~3.8 GB&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;44%&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;~0.46x (compresses well)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
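&lt;P&gt;As a quick cross-check, the per-layer figures in the table should combine into the whole-image ratio quoted earlier. A small sketch of that arithmetic, using only the sizes and approximate ratios from the table:&lt;/P&gt;

```python
# Layer sizes (GB, uncompressed) and approximate gzip ratios from the table above.
layers = [
    ("model weights (GGUF)", 4.9, 1.00),
    ("CUDA + system layers", 3.8, 0.46),
]

uncompressed_gb = sum(size for _, size, _ in layers)
compressed_gb = sum(size * ratio for _, size, ratio in layers)
blended_ratio = compressed_gb / uncompressed_gb

print(f"uncompressed: {uncompressed_gb:.1f} GB")
print(f"compressed:   {compressed_gb:.1f} GB")
print(f"blended gzip ratio: {blended_ratio:.2f}x")
```

&lt;P&gt;Nearly all of the size reduction comes from the smaller, compressible portion of the image, while the weight layer passes through gzip at essentially full size.&lt;/P&gt;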
&lt;P&gt;The GGUF file is already quantized to 4-bit precision. Gzip reads every byte, burns CPU, and produces output the same size as the input. You’re paying full decompression cost on 56% of the image for zero size reduction. (For vLLM’s larger 14 GB image, model weights are a smaller fraction and the compressible Python/CUDA stack dominates, which is why gzip’s overhead matters less there.)&lt;/P&gt;
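&lt;P&gt;The effect is easy to see with Python’s standard zlib module, which implements the same DEFLATE algorithm gzip uses. The sketch below is an illustration under a loud assumption: random bytes stand in for quantized weights and repeated text stands in for a compressible system layer. Real GGUF data is not purely random, so treat the numbers as directional only.&lt;/P&gt;

```python
import os
import zlib

SIZE = 1_000_000  # 1 MB per sample keeps the demo fast

# Stand-in for quantized weights: random bytes have no redundancy to exploit.
# This is an assumption for illustration; real GGUF data is not purely random.
dense = os.urandom(SIZE)

# Stand-in for a text-heavy system layer: highly repetitive content.
repetitive = (b"import torch\n" * (SIZE // 13 + 1))[:SIZE]

for label, payload in (("dense", dense), ("repetitive", repetitive)):
    compressed = zlib.compress(payload, 6)  # level 6 is zlib's default trade-off
    print(f"{label}: {len(compressed) / len(payload):.3f}x of original size")
```

&lt;P&gt;The dense sample comes out at roughly its original size, while the repetitive sample shrinks to a small fraction of it. In both cases every input byte still has to pass through the compressor and, at pull time, the decompressor.&lt;/P&gt;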
&lt;P&gt;&lt;STRONG&gt;Bottom line:&lt;/STRONG&gt; gzip is doing real work on less than half your image and producing zero savings on the rest. On a same-region link, dropping it costs about 2 GB of extra transfer (8.7 GB vs 6.6 GB) and still removes a bottleneck from every cold start.&lt;/P&gt;
&lt;H2&gt;The Nydus prefetch finding&lt;/H2&gt;
&lt;P&gt;If decompression is the bottleneck, what about skipping the full pull entirely?&lt;/P&gt;
&lt;P&gt;Nydus lazy-pull takes a fundamentally different approach: it fetches only manifest metadata during “pull” (~0.7s), then streams model data on-demand via FUSE as the container reads it. Nydus TTFT isn’t directly comparable to the full-pull methods above because the download cost shifts from the pull column to the inference column.&lt;/P&gt;
&lt;P&gt;With prefetch enabled, Nydus achieved 77.8s TTFT for llama.cpp. The critical detail is the prefetch_all flag — the difference between prefetch ON and OFF is &lt;STRONG&gt;2.87x&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;Figure 4: Nydus prefetch ON vs OFF. One config flag, 2.87x difference. Overlayfs baselines shown for context.&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Configuration&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;1st Inference&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;TTFT&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Nydus, prefetch ON&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;72.4s ±0.6&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;77.8s ±0.5&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Nydus, prefetch OFF&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;218.6s ±2.9&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;223.4s ±2.9&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;overlayfs uncompressed (baseline)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;44.2s ±0.1&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;102.4s ±3.1&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;overlayfs gzip&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;44.0s ±0.4&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;139.1s ±1.9&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&lt;EM&gt;n=3 per config, 9/9 runs succeeded. Nydus and overlayfs gzip baselines are from a separate test run (&lt;/EM&gt;&lt;A href="https://github.com/robert-cronin/erofs-repro-repo/blob/main/results/03-prefetch-config-20260401-030725.csv" target="_blank"&gt;&lt;EM&gt;03-prefetch-config-20260401-030725.csv&lt;/EM&gt;&lt;/A&gt;&lt;EM&gt;); overlayfs uncompressed is from the main llama.cpp run. The overlayfs gzip baselines are within noise across runs (139.1s vs 138.8s).&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;One flag in nydusd-config.json, &lt;STRONG&gt;2.87x difference&lt;/STRONG&gt; (prefetch ON vs OFF). Without prefetch, every model weight page fault fires an individual HTTP range request to the registry. With prefetch_all=true, Nydus streams the full blob in the background while the container starts, so chunks arrive ahead of the GPU’s read pattern. Note that with prefetch enabled, Nydus is effectively performing a full pull overlapped with container startup rather than true on-demand fetching — the win comes from the overlap, not from fetching less data.&lt;/P&gt;
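&lt;P&gt;A deployment check for this flag might look like the sketch below. The post only names the prefetch_all flag and the nydusd-config.json file; the surrounding fs_prefetch section and its enable key are assumptions about the config layout and should be verified against the Nydus documentation for your version. The helper name with_prefetch_all is hypothetical.&lt;/P&gt;

```python
import json

# Hypothetical fragment of nydusd-config.json. The "fs_prefetch" section and
# its "enable" key are assumptions; only prefetch_all is named in the post.
EXAMPLE_CONFIG = """
{
  "fs_prefetch": {
    "enable": true,
    "prefetch_all": false
  }
}
"""


def with_prefetch_all(config):
    """Return a copy of the config with prefetch switched on for the full blob."""
    updated = json.loads(json.dumps(config))  # cheap deep copy for JSON data
    section = updated.setdefault("fs_prefetch", {})
    section["enable"] = True
    section["prefetch_all"] = True
    return updated


original = json.loads(EXAMPLE_CONFIG)
patched = with_prefetch_all(original)
print(json.dumps(patched, indent=2))
```

&lt;P&gt;Returning a patched copy rather than mutating in place makes it easy to diff the two versions before writing anything back to a node.&lt;/P&gt;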
&lt;P&gt;Compared to overlayfs uncompressed (the post’s recommended baseline), Nydus prefetch is 1.32x faster (77.8s vs 102.4s). Compared to overlayfs gzip, 1.79x.&lt;/P&gt;
&lt;P&gt;Even with prefetch, Nydus first inference is ~28s slower than overlayfs (72s vs 44s) due to FUSE kernel-user roundtrips during model mmap. Nydus wins on total TTFT because it eliminates the blocking pull, but this overhead means its advantage shrinks on faster networks.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Bottom line:&lt;/STRONG&gt; Nydus lazy-pull cut llama.cpp cold-start by 1.32x against uncompressed overlayfs and 1.79x against gzip, but only with prefetch on. Treat prefetch_all=true as a hard requirement, not a tuning knob.&lt;/P&gt;
&lt;H2&gt;How to apply these findings&lt;/H2&gt;
&lt;H3&gt;Pick your optimization by engine type&lt;/H3&gt;
&lt;P&gt;The right optimization depends on where your engine spends its cold-start time. This table summarizes the tradeoffs:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Engine type&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Dominant phase&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Speedup from dropping gzip&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Nydus viable?&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Best optimization&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;What NOT to optimize&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;vLLM / TensorRT-LLM&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Startup (56%)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;1.02x&lt;/STRONG&gt; — negligible&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;No — FUSE + Python/CUDA stack exceeded 900s in our tests&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cache torch.compile artifacts and CUDA graphs&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pull pipeline (it’s about 44% of TTFT, but dropping gzip recovered only ~2%)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;llama.cpp / ONNX Runtime&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pull (64%)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;1.35x&lt;/STRONG&gt; — 36s saved&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Yes, with prefetch_all=true (77.8s TTFT vs 102.4s uncompressed baseline)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Drop gzip on weight layers; consider lazy-pull on slow links&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Startup (already 2–5s; no room to improve)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Large dense models (70B+)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Pull (projected)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;&amp;gt;1.35x&lt;/STRONG&gt; — scales with image size&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Yes, strongest case for lazy-pull&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Uncompressed or zstd; Nydus prefetch on bandwidth-constrained links&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;—&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
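&lt;P&gt;The table’s logic reduces to “find the dominant phase, then pick the matching lever.” A minimal sketch of that rule, using the overlayfs (gzip) phase means measured earlier; the names dominant_phase and FIRST_LEVER are ours, and the first_inference entry is a placeholder because the table does not cover that case:&lt;/P&gt;

```python
def dominant_phase(phases):
    """Return the name of the phase with the largest mean duration."""
    return max(phases, key=phases.get)


# Suggested first lever per dominant phase, paraphrasing the table above.
FIRST_LEVER = {
    "startup": "cache torch.compile artifacts and CUDA graphs",
    "pull": "drop gzip on weight layers; consider lazy-pull with prefetch",
    # Not covered by the table above; placeholder only.
    "first_inference": "profile model loading before touching the pull path",
}

# Mean overlayfs (gzip) phase durations in seconds from the earlier tables.
measured = {
    "vLLM": {"pull": 140.8, "startup": 176.0, "first_inference": 0.16},
    "llama.cpp": {"pull": 88.3, "startup": 5.3, "first_inference": 45.1},
}

for engine, phases in measured.items():
    phase = dominant_phase(phases)
    print(f"{engine}: {phase} dominates, so {FIRST_LEVER[phase]}")
```

&lt;P&gt;Feeding in your own per-phase timings is the point: the recommendation should come from a profile of your engine, not from which row of the table it resembles.&lt;/P&gt;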
&lt;H3&gt;Recommendations&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Profile your engine’s startup before touching the pull pipeline.&lt;/STRONG&gt; If CUDA compilation dominates (vLLM, TensorRT-LLM), no amount of pull optimization will help. Cache torch.compile artifacts and CUDA graphs instead.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Drop gzip on model weight layers.&lt;/STRONG&gt; For pull-bound engines (llama.cpp, ONNX Runtime), this is the single highest-ROI change: build with --output=type=image,compression=uncompressed, or use &lt;A href="https://github.com/kaito-project/aikit" target="_blank"&gt;AIKit&lt;/A&gt;, which defaults to uncompressed weight layers. Quantized model weights (GGUF, safetensors) are already dense binary — gzip burns CPU for negligible size reduction.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;If using Nydus, set prefetch_all=true.&lt;/STRONG&gt; Without it, every weight page fault triggers an individual HTTP range request and cold-start is &lt;STRONG&gt;2.87x slower&lt;/STRONG&gt;. This is a single flag in nydusd-config.json.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Package models as signed OCI artifacts, not volume mounts.&lt;/STRONG&gt; Three CNCF projects implement this pipeline end-to-end: &lt;A href="https://github.com/modelpack/model-spec" target="_blank"&gt;ModelPack&lt;/A&gt; defines the OCI artifact spec (model metadata, architecture, quantization format). &lt;A href="https://github.com/kaito-project/aikit" target="_blank"&gt;AIKit&lt;/A&gt; builds ModelPack-compliant images with Cosign signatures, SBOMs, and provenance attestations — supply chain guarantees you lose when weights live on a shared drive. &lt;A href="https://github.com/kaito-project/kaito" target="_blank"&gt;KAITO&lt;/A&gt; handles the Kubernetes deployment: GPU node provisioning, inference engine setup, and API exposure. Together they cover packaging → build → deploy, and they produce the exact image layout these benchmarks measured.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Why this matters: the cost of cold-start&lt;/H3&gt;
&lt;P&gt;On an A100 node (~$3–4/hr on major clouds), a 5-minute vLLM cold start burns ~$0.30 in idle GPU time per pod. That sounds small until you multiply it: a cluster that scales 50 pods to zero overnight and restarts them each morning wastes ~$15/day — over $5,000/year — on GPUs sitting idle during pull and CUDA compilation. More critically, cold-start latency determines whether scale-to-zero is feasible at all. If cold-start exceeds your SLO (say, 30s for an interactive app), you’re forced to keep warm replicas running 24/7, which can 2–3x your GPU spend.&lt;/P&gt;
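&lt;P&gt;The cost figures above can be reproduced with a few lines of arithmetic. This sketch assumes a $3.50/hr rate (the midpoint of the quoted range) and one cold start per pod per day; idle_cost_usd is a hypothetical helper, not part of any tooling mentioned here.&lt;/P&gt;

```python
def idle_cost_usd(hourly_rate_usd, cold_start_s, pods=1, starts_per_day=1, days=1):
    """GPU spend while pods are cold-starting and not yet serving requests."""
    hours_idle = cold_start_s / 3600 * pods * starts_per_day * days
    return hourly_rate_usd * hours_idle


RATE = 3.50  # assumed midpoint of the ~$3-4/hr A100 range quoted above

per_pod = idle_cost_usd(RATE, cold_start_s=300)
per_day = idle_cost_usd(RATE, cold_start_s=300, pods=50)
per_year = idle_cost_usd(RATE, cold_start_s=300, pods=50, days=365)

print(f"per pod: ${per_pod:.2f}, per day: ${per_day:.2f}, per year: ${per_year:,.0f}")
```

&lt;P&gt;Swapping in the llama.cpp cold-start times from the earlier tables shows how quickly the idle cost drops once cold-start falls from minutes toward one or two.&lt;/P&gt;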
&lt;H2&gt;What this doesn’t cover&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;zstd compression:&lt;/STRONG&gt; decompresses 5–10x faster than gzip; containerd supports it natively. The most obvious gap in this analysis.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Pre-pulling and caching:&lt;/STRONG&gt; production clusters pre-pull images and can cache CUDA compilation artifacts, substantially reducing restart times. We measure the cold case: scale-from-zero events and first-time deployments.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Volume-mounted weights:&lt;/STRONG&gt; skips the pull entirely, but loses supply chain properties (signing, scanning, provenance).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Larger models (70B+):&lt;/STRONG&gt; pull would dominate more, increasing the gzip penalty.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Sample size:&lt;/STRONG&gt; n=3 per llama.cpp (AIKit) variant, n=2–3 per vLLM variant. The gzip finding for llama.cpp is statistically significant (Welch’s t-test, p=0.0014, Cohen’s d=16.3; &lt;A href="https://github.com/robert-cronin/erofs-repro-repo/blob/main/results/verify-significance.py" target="_blank"&gt;verification script&lt;/A&gt;). Other comparisons are directional.&lt;/LI&gt;
&lt;/UL&gt;
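&lt;P&gt;For readers who want a feel for the significance claim without the raw data, Welch’s t statistic and Cohen’s d can be approximated from summary statistics alone. The sketch below assumes the comparison is overlayfs gzip vs overlayfs uncompressed TTFT for llama.cpp and uses the rounded table values, so it will not match the linked verification script exactly. It stops short of a p-value to stay within the standard library.&lt;/P&gt;

```python
import math


def welch_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_b = sd_a**2 / n_a, sd_b**2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
    return t, df


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d with an equal-n pooled standard deviation."""
    pooled_sd = math.sqrt((sd_a**2 + sd_b**2) / 2)
    return (mean_a - mean_b) / pooled_sd


# llama.cpp overlayfs TTFT: gzip 138.8s ±0.8 vs uncompressed 102.4s ±3.1, n=3 each.
t, df = welch_from_summary(138.8, 0.8, 3, 102.4, 3.1, 3)
d = cohens_d(138.8, 0.8, 102.4, 3.1)
print(f"t = {t:.1f}, df = {df:.2f}, d = {d:.1f}")
```

&lt;P&gt;With only three runs per variant the degrees of freedom are very low, which is why the other, smaller effects in this post are best read as directional.&lt;/P&gt;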
&lt;H2&gt;Reproduce it&lt;/H2&gt;
&lt;P&gt;Scripts and raw data: &lt;A href="https://github.com/robert-cronin/erofs-repro-repo" target="_blank"&gt;erofs-repro-repo&lt;/A&gt;. Data for this post: &lt;A href="https://github.com/robert-cronin/erofs-repro-repo/blob/main/results/02-aikit-five-way-20260401-004716.csv" target="_blank"&gt;02-aikit-five-way-20260401-004716.csv&lt;/A&gt; and &lt;A href="https://github.com/robert-cronin/erofs-repro-repo/blob/main/results/01-vllm-four-way-20260331-113848.csv" target="_blank"&gt;01-vllm-four-way-20260331-113848.csv&lt;/A&gt;. Full analysis: &lt;A href="https://github.com/robert-cronin/erofs-benchmarks/blob/main/docs/report/README.md" target="_blank"&gt;technical report&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Mon, 27 Apr 2026 11:36:05 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/dissecting-llm-container-cold-start-where-the-time-actually-goes/ba-p/4508831</guid>
      <dc:creator>robcronin</dc:creator>
      <dc:date>2026-04-27T11:36:05Z</dc:date>
    </item>
    <item>
      <title>Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/agent-governance-toolkit-architecture-deep-dive-policy-engines/ba-p/4510105</link>
      <description>&lt;P&gt;Last week, on the Microsoft Open Source Blog, we announced the &lt;A class="lia-external-url" href="https://aka.ms/agt-opensource-blog" target="_blank"&gt;Agent Governance Toolkit&lt;/A&gt;, an open-source project that brings runtime security governance to autonomous AI agents. In that announcement, we covered the&amp;nbsp;&lt;STRONG&gt;why&lt;/STRONG&gt;: AI agents are making autonomous decisions in production, and the security patterns that kept systems safe for decades need to be applied to this new class of workload.&lt;/P&gt;
&lt;P&gt;In this post, we'll go deeper into the&amp;nbsp;&lt;STRONG&gt;how&lt;/STRONG&gt;: the architecture, the implementation details, and what it takes to run governed agents in production.&lt;/P&gt;
&lt;H2&gt;The Problem: Production Infrastructure Meets Autonomous Agents&lt;/H2&gt;
&lt;P&gt;If you manage production infrastructure, you already know the playbook: least privilege, mandatory access controls, process isolation, audit logging, and circuit breakers for cascading failures. These patterns have kept production systems safe for decades.&lt;/P&gt;
&lt;P&gt;Now imagine a new class of workload arriving on your infrastructure: AI agents that autonomously execute code, call APIs, read databases, and spawn sub-processes. They reason about what to do, select tools, and act in loops. And in many current deployments, they do all of this without the security controls you'd demand of any other production workload.&lt;/P&gt;
&lt;P&gt;That gap is what led us to build the &lt;A class="lia-external-url" href="https://aka.ms/agent-governance-toolkit" target="_blank"&gt;Agent Governance Toolkit&lt;/A&gt;: an open-source project that applies proven security concepts from operating systems, service meshes, and SRE to the emerging world of autonomous AI agents.&lt;/P&gt;
&lt;P&gt;To frame this in familiar terms: most AI agent frameworks today are like running every process as root, with no access controls, no isolation, and no audit trail. The Agent Governance Toolkit is the kernel, the service mesh, and the SRE platform for AI agents.&lt;/P&gt;
&lt;P&gt;When an agent calls a tool, say, `DELETE FROM users WHERE created_at &amp;lt; NOW()`, there is typically no policy layer checking whether that action is within scope. There is no identity verification when one agent communicates with another. There is no resource limit preventing an agent from making 10,000 API calls in a minute. And there is no circuit breaker to contain cascading failures when things go wrong.&lt;/P&gt;
&lt;H2&gt;OWASP Agentic Security Initiative&lt;/H2&gt;
&lt;P&gt;In December 2025, &lt;A class="lia-external-url" href="https://aka.ms/agt-owasp" target="_blank"&gt;OWASP published the Agentic AI Top 10:&lt;/A&gt;&amp;nbsp;the first formal taxonomy of risks specific to autonomous AI agents. The list reads like a security engineer's nightmare: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents, and more.&lt;/P&gt;
&lt;P&gt;If you've ever hardened a production server, these risks will feel both familiar and urgent. The Agent Governance Toolkit is designed to help address all 10 of these risks through deterministic policy enforcement, cryptographic identity, execution isolation, and reliability engineering patterns.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Note&lt;/STRONG&gt;: The OWASP Agentic Security Initiative has since adopted the ASI 2026 taxonomy (ASI01–ASI10). The toolkit's copilot-governance package now uses these identifiers with backward compatibility for the original AT numbering.&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;Architecture: Nine Packages, One Governance Stack&lt;/H2&gt;
&lt;P&gt;The toolkit is structured as a v3.0.0 Public Preview monorepo with nine independently &lt;A class="lia-external-url" href="https://aka.ms/agt-install" target="_blank"&gt;installable packages:&lt;/A&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Package&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;What It Does&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent OS&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Stateless policy engine that intercepts agent actions before execution, using configurable pattern matching and semantic intent classification&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent Mesh&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cryptographic identity (DIDs with Ed25519), Inter-Agent Trust Protocol (IATP), and trust-gated communication between agents&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent Hypervisor&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Execution rings inspired by CPU privilege levels, saga orchestration for multi-step transactions, and shared session management&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent Runtime&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Runtime supervision with kill switches, dynamic resource allocation, and execution lifecycle management&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent SRE&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery: production reliability practices adapted for AI agents&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent Compliance&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Automated governance verification with compliance grading and regulatory framework mapping (EU AI Act, NIST AI RMF, HIPAA, SOC 2)&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent Lightning&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Reinforcement learning training governance with policy-enforced runners and reward shaping&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Agent Marketplace&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Plugin lifecycle management with Ed25519 signing, trust-tiered capability gating, and SBOM generation&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Integrations&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;20+ framework adapters for LangChain, CrewAI, AutoGen, Semantic Kernel, Google ADK, Microsoft Agent Framework, OpenAI Agents SDK, and more&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Agent OS: The Policy Engine&lt;/H2&gt;
&lt;P&gt;Agent OS intercepts agent tool calls before they execute:&lt;/P&gt;
&lt;P&gt;import asyncio&lt;BR /&gt;&lt;BR /&gt;from agent_os import StatelessKernel, ExecutionContext, Policy&lt;BR /&gt;&lt;BR /&gt;kernel = StatelessKernel()&lt;BR /&gt;ctx = ExecutionContext(&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; agent_id="analyst-1",&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; policies=[&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Policy.read_only(),&amp;nbsp; # No write operations&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Policy.rate_limit(100, "1m"),&amp;nbsp; # Max 100 calls/minute&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Policy.require_approval(&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; actions=["delete_*", "write_production_*"],&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; min_approvals=2,&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; approval_timeout_minutes=30,&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ),&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ],&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;async def main():&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; # kernel.execute is awaited, so it has to run inside a coroutine&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return await kernel.execute(&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; action="delete_user_record",&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; params={"user_id": 12345},&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; context=ctx,&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;result = asyncio.run(main())&lt;/P&gt;
&lt;P&gt;The policy engine works in two layers: configurable pattern matching (with sample rule sets for SQL injection, privilege escalation, and prompt injection that users customize for their environment) and a semantic intent classifier that helps detect dangerous goals regardless of phrasing. When an action is classified as `DESTRUCTIVE_DATA`, `DATA_EXFILTRATION`, or `PRIVILEGE_ESCALATION`, the engine blocks it, routes it for human approval, or downgrades the agent's trust level, depending on the configured policy.&lt;/P&gt;
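&lt;P&gt;To make the first layer concrete, here is a toy illustration of glob-style pattern matching on action names. It is not the toolkit's implementation: it uses Python's standard fnmatch module as a stand-in, and the names APPROVAL_PATTERNS and needs_approval are hypothetical.&lt;/P&gt;

```python
from fnmatch import fnmatchcase

# Hypothetical rule set; the action globs mirror the require_approval example above.
APPROVAL_PATTERNS = ["delete_*", "write_production_*"]


def needs_approval(action, patterns=APPROVAL_PATTERNS):
    """Return True if the action name matches any approval-gated glob pattern."""
    return any(fnmatchcase(action, pattern) for pattern in patterns)


for action in ("delete_user_record", "read_user_record", "write_production_config"):
    verdict = "route for approval" if needs_approval(action) else "allow"
    print(f"{action}: {verdict}")
```

&lt;P&gt;A check like this is deterministic and cheap, but it only sees action names. The semantic intent classifier, the second layer, is not modeled here.&lt;/P&gt;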
&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;Important&lt;/STRONG&gt;: All policy rules, detection patterns, and sensitivity thresholds are externalized to YAML configuration files. The toolkit ships with sample configurations in `examples/policies/` that must be reviewed and customized before production deployment. No built-in rule set should be considered exhaustive. Policy languages supported: YAML, OPA Rego, and Cedar.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The kernel is stateless by design: each request carries its own context. This means you can deploy it behind a load balancer, as a sidecar container in Kubernetes, or in a serverless function, with no shared state to manage. On AKS or any Kubernetes cluster, it fits naturally into existing deployment patterns. Helm charts are available for agent-os, agent-mesh, and agent-sre.&lt;/P&gt;
&lt;H2&gt;Agent Mesh: Zero-Trust Identity for Agents&lt;/H2&gt;
&lt;P&gt;In service mesh architectures, services prove their identity via mTLS certificates before communicating. Agent Mesh applies the same principle to AI agents using decentralized identifiers (DIDs) with Ed25519 cryptography and the Inter-Agent Trust Protocol (IATP):&lt;/P&gt;
&lt;P&gt;import asyncio&lt;BR /&gt;&lt;BR /&gt;from agentmesh import AgentIdentity, TrustBridge&lt;BR /&gt;&lt;BR /&gt;identity = AgentIdentity.create(&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; name="data-analyst",&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; sponsor="alice@company.com",&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; # Human accountability&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; capabilities=["read:data", "write:reports"],&lt;BR /&gt;)&lt;BR /&gt;# identity.did -&amp;gt; "did:mesh:data-analyst:a7f3b2..."&lt;BR /&gt;&lt;BR /&gt;bridge = TrustBridge()&lt;BR /&gt;&lt;BR /&gt;async def main():&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; # await must run inside a coroutine, so wrap the call&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return await bridge.verify_peer(&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; peer_id="did:mesh:other-agent",&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; required_trust_score=700,&amp;nbsp; # Must score &amp;gt;= 700/1000&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;BR /&gt;&lt;BR /&gt;verification = asyncio.run(main())&lt;/P&gt;
&lt;P&gt;A critical feature is&amp;nbsp;&lt;STRONG&gt;trust decay&lt;/STRONG&gt;: an agent's trust score decreases over time without positive signals. An agent trusted last week but silent since then gradually becomes untrusted, modeling the reality that trust requires ongoing demonstration, not a one-time grant.&lt;/P&gt;
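&lt;P&gt;To make the idea concrete, here is a minimal sketch of trust decay. It assumes an exponential half-life model; the function name, the seven-day half-life, and the formula itself are illustrative assumptions and are not taken from the toolkit's implementation.&lt;/P&gt;

```python
def decayed_trust_score(score: float, days_since_signal: float,
                        half_life_days: float = 7.0) -> float:
    """Hypothetical decay model: the score halves every half_life_days
    without a positive signal. Scores use the 0-1000 scale from the text."""
    elapsed = max(days_since_signal, 0.0)  # ignore negative elapsed time
    return score * 0.5 ** (elapsed / half_life_days)


REQUIRED_TRUST_SCORE = 700  # same threshold as the verify_peer example

fresh = decayed_trust_score(800, days_since_signal=0)
stale = decayed_trust_score(800, days_since_signal=7)

print(fresh >= REQUIRED_TRUST_SCORE)  # True: no time has passed
print(stale >= REQUIRED_TRUST_SCORE)  # False: 800 has halved to 400
```

&lt;P&gt;Under this assumed model, an agent that scored 800 a week ago would no longer pass a 700-point check unless it produced new positive signals in the meantime.&lt;/P&gt;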
&lt;P&gt;Delegation chains enforce &lt;STRONG&gt;scope narrowing&lt;/STRONG&gt;: a parent agent with read+write permissions can delegate read-only access to a child agent, but it can never grant permissions beyond its own.&lt;/P&gt;
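&lt;P&gt;Scope narrowing can be pictured as a subset check on capability sets. The sketch below is illustrative only: the &lt;EM&gt;delegate&lt;/EM&gt; helper is hypothetical, and the capability strings simply reuse the format from the identity example.&lt;/P&gt;

```python
def delegate(parent_caps: set[str], requested: set[str]) -> set[str]:
    """Hypothetical helper: grant a child only capabilities the parent holds."""
    escalation = requested - parent_caps
    if escalation:
        raise PermissionError(
            f"cannot delegate capabilities the parent lacks: {sorted(escalation)}"
        )
    return set(requested)


parent = {"read:data", "write:reports"}

child = delegate(parent, {"read:data"})  # narrowing is allowed
print(child)

try:
    delegate(parent, {"read:data", "admin:policies"})  # escalation attempt
except PermissionError as err:
    print(f"rejected: {err}")
```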
&lt;H2&gt;Agent Hypervisor: Execution Rings&lt;/H2&gt;
&lt;P&gt;CPU architectures use privilege rings (Ring 0 for kernel, Ring 3 for userspace) to isolate workloads. The Agent Hypervisor applies this model to AI agents:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Ring&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Trust Level&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;&lt;STRONG&gt;Capabilities&lt;/STRONG&gt;&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Ring 0 (Kernel)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Score ≥ 900&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Full system access, can modify policies&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Ring 1 (Supervisor)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Score ≥ 700&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Cross-agent coordination, elevated tool access&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Ring 2 (User)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Score ≥ 400&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Standard tool access within assigned scope&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;
&lt;P&gt;Ring 3 (Untrusted)&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Score &amp;lt; 400&lt;/P&gt;
&lt;/td&gt;&lt;td&gt;
&lt;P&gt;Read-only, sandboxed execution only&lt;/P&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;New and untrusted agents start in Ring 3 and earn their way up. This is the same principle of least privilege that production engineers apply to every other workload.&lt;/P&gt;
&lt;P&gt;Each ring enforces per-agent resource limits: maximum execution time, memory caps, CPU throttling, and request rate limits. If a Ring 2 agent attempts a Ring 1 operation, it gets blocked, just like a userspace process trying to access kernel memory.&lt;/P&gt;
&lt;P&gt;These ring definitions and their associated trust score thresholds are fully configurable via policy. Organizations can define custom ring structures, adjust the number of rings, set different trust score thresholds for transitions, and configure per-ring resource limits to match their security requirements.&lt;/P&gt;
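&lt;P&gt;As an illustration of the default mapping in the table above, the following sketch assigns a ring from a trust score and checks whether an operation is permitted. The function names and the list-based threshold structure are assumptions made for this example; in the toolkit these values come from policy configuration.&lt;/P&gt;

```python
# Default thresholds from the table, most privileged ring first.
# Each entry is (minimum trust score, ring number).
RING_THRESHOLDS = [(900, 0), (700, 1), (400, 2)]
UNTRUSTED_RING = 3


def ring_for_score(score: int) -> int:
    """Return the most privileged ring whose threshold the score meets."""
    for minimum, ring in RING_THRESHOLDS:
        if score >= minimum:
            return ring
    return UNTRUSTED_RING


def can_perform(agent_score: int, required_ring: int) -> bool:
    """Lower ring numbers are more privileged, as with CPU rings."""
    return required_ring >= ring_for_score(agent_score)


print(ring_for_score(650))   # 2: a Ring 2 (User) agent
print(can_perform(650, 1))   # False: a Ring 1 operation is blocked
print(can_perform(650, 2))   # True: within its own ring
```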
&lt;P&gt;The hypervisor also provides&amp;nbsp;&lt;STRONG&gt;saga orchestration&lt;/STRONG&gt;&amp;nbsp;for multi-step operations. When an agent executes a sequence (draft email → send → update CRM) and the final step fails, compensating actions for the completed steps fire in reverse order. Borrowed from distributed transaction patterns, this ensures multi-agent workflows maintain consistency even when individual steps fail.&lt;/P&gt;
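&lt;P&gt;The saga pattern itself is small enough to sketch. The example below is a generic illustration of compensating actions, not the hypervisor's API: &lt;EM&gt;run_saga&lt;/EM&gt; and the step names are hypothetical, and the steps only append to a log instead of sending real email.&lt;/P&gt;

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, most recent first, and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def update_crm():
    raise RuntimeError("CRM unavailable")  # simulate the final step failing


ok = run_saga([
    (lambda: log.append("draft email"), lambda: log.append("discard draft")),
    (lambda: log.append("send email"), lambda: log.append("send correction")),
    (update_crm, lambda: log.append("revert CRM")),
])

print(ok)
print(log)
```

&lt;P&gt;Because the CRM step never completed, its compensation is skipped; only the two finished steps are undone, newest first.&lt;/P&gt;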
&lt;H2&gt;Agent SRE: SLOs and Circuit Breakers for Agents&lt;/H2&gt;
&lt;P&gt;If you practice SRE, you measure services by SLOs and manage risk through error budgets. Agent SRE extends this to AI agents.&lt;/P&gt;
&lt;P&gt;When an agent's safety SLI drops below 99 percent, meaning more than 1 percent of its actions violate policy, the system automatically restricts the agent's capabilities until it recovers. This is the same error-budget model that SRE teams use for production services, applied to agent behavior.&lt;/P&gt;
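&lt;P&gt;The arithmetic behind that rule is simple. The sketch below uses the 99 percent target from the paragraph above; the function names are hypothetical, and a real deployment would compute the SLI over a rolling window, which is omitted here.&lt;/P&gt;

```python
MAX_VIOLATION_PERCENT = 1  # a 99 percent safety SLO leaves a 1 percent error budget


def safety_sli(total_actions: int, violations: int) -> float:
    """Fraction of actions that complied with policy."""
    if total_actions == 0:
        return 1.0  # no actions yet, so nothing has violated policy
    return (total_actions - violations) / total_actions


def should_restrict(total_actions: int, violations: int) -> bool:
    """True once more than 1 percent of actions violate policy.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return violations * 100 > total_actions * MAX_VIOLATION_PERCENT


print(should_restrict(1000, 10))  # False: exactly on budget
print(should_restrict(1000, 11))  # True: budget exhausted, restrict the agent
```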
&lt;P&gt;We also built nine chaos engineering fault injection templates, covering faults such as network delays, LLM provider failures, tool timeouts, trust score manipulation, memory corruption, and concurrent access races, because the only way to know if your agent system is resilient is to break it intentionally.&lt;/P&gt;
&lt;P&gt;Agent SRE integrates with your existing observability stack through adapters for Datadog, PagerDuty, Prometheus, OpenTelemetry, Langfuse, LangSmith, Arize, MLflow, and more. Message broker adapters support Kafka, Redis, NATS, Azure Service Bus, AWS SQS, and RabbitMQ.&lt;/P&gt;
&lt;H2&gt;Compliance and Observability&lt;/H2&gt;
&lt;P&gt;If your organization already maps to CIS Benchmarks, NIST AI RMF, or other frameworks for infrastructure compliance, the OWASP Agentic Top 10 is the equivalent standard for AI agent workloads. The toolkit's agent-compliance package provides automated governance grading against these frameworks.&lt;/P&gt;
&lt;P&gt;The toolkit is framework-agnostic, with 20+ adapters that hook into each framework's native extension points, so adding governance to an existing agent is typically a few lines of configuration, not a rewrite.&lt;/P&gt;
&lt;P&gt;The toolkit exports metrics to any OpenTelemetry-compatible platform, such as Prometheus, Grafana, Datadog, Arize, or Langfuse. If you're already running an observability stack for your infrastructure, agent governance metrics flow through the same pipeline.&lt;/P&gt;
&lt;P&gt;Key metrics include: policy decisions per second, trust score distributions, ring transitions, SLO burn rates, circuit breaker state, and governance workflow latency.&lt;/P&gt;
&lt;H2&gt;Getting Started&lt;/H2&gt;
&lt;P&gt;# Install all packages (quoted so shells such as zsh do not expand the brackets)&lt;BR /&gt;pip install "agent-governance-toolkit[full]"&lt;BR /&gt;&lt;BR /&gt;# Or individual packages&lt;BR /&gt;pip install agent-os-kernel agent-mesh agent-sre&lt;/P&gt;
&lt;P&gt;The toolkit is available across language ecosystems: Python, TypeScript (`@microsoft/agentmesh-sdk` on npm), Rust, Go, and .NET (`Microsoft.AgentGovernance` on NuGet).&lt;/P&gt;
&lt;H2&gt;Azure Integrations&lt;/H2&gt;
&lt;P&gt;While the toolkit is platform-agnostic, we've included integrations that offer a fast path to production on Azure:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure Kubernetes Service (AKS):&lt;/STRONG&gt; Deploy the policy engine as a sidecar container alongside your agents. Helm charts provide production-ready manifests for agent-os, agent-mesh, and agent-sre.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Azure AI Foundry Agent Service:&lt;/STRONG&gt; Use the built-in middleware integration for agents deployed through Azure AI Foundry.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;OpenClaw Sidecar:&lt;/STRONG&gt; One compelling deployment scenario is running&amp;nbsp;&lt;A class="lia-external-url" href="https://github.com/openclaw" target="_blank"&gt;OpenClaw&lt;/A&gt;, the open-source autonomous agent, inside a container with the Agent Governance Toolkit deployed as a sidecar. This gives you policy enforcement, identity verification, and SLO monitoring over OpenClaw's autonomous operations. On Azure Kubernetes Service (AKS), the deployment is a standard pod with two containers: OpenClaw as the primary workload and the governance toolkit as the sidecar, communicating over localhost. We have a reference architecture and&amp;nbsp;&lt;A class="lia-external-url" href="https://aka.ms/agt-helm" target="_blank"&gt;Helm chart available in the repository&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;The same sidecar pattern works with any containerized agent; OpenClaw is a particularly compelling example because of the interest in autonomous agent safety.&lt;/P&gt;
&lt;H2&gt;Tutorials and Resources&lt;/H2&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://aka.ms/agt-tutorials" target="_blank"&gt;34+ step-by-step tutorials&lt;/A&gt; covering policy engines, trust, compliance, MCP security, observability, and cross-platform SDK usage are available in the repository.&lt;/P&gt;
&lt;P&gt;git clone https://github.com/microsoft/agent-governance-toolkit&lt;BR /&gt;cd agent-governance-toolkit&lt;BR /&gt;pip install -e "packages/agent-os[dev]" -e "packages/agent-mesh[dev]" -e "packages/agent-sre[dev]"&lt;BR /&gt;&lt;BR /&gt;# Run the demo&lt;BR /&gt;python -m agent_os.demo&lt;/P&gt;
&lt;H2&gt;What's Next&lt;/H2&gt;
&lt;P&gt;AI agents are becoming autonomous decision-makers in production infrastructure, executing code, managing databases, and orchestrating services. The security patterns that kept production systems safe for decades (least privilege, mandatory access controls, process isolation, and audit logging) are exactly what these new workloads need. We built them. They're open source.&lt;/P&gt;
&lt;P&gt;We're building this in the open because agent security is too important for any single organization to solve alone:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Security research&lt;/STRONG&gt;: Adversarial testing, red-team results, and vulnerability reports strengthen the toolkit for everyone.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Community contributions&lt;/STRONG&gt;: Framework adapters, detection rules, and compliance mappings from the community expand coverage across ecosystems.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;We are committed to open governance. We're releasing this project under Microsoft today, and we aspire to move it into a foundation home, such as the AI and Data Foundation (AAIF), where it can benefit from cross-industry stewardship. We're actively engaging with foundation partners on this path.&lt;/P&gt;
&lt;P&gt;The Agent Governance Toolkit is open source under the MIT license. Contributions welcome at&amp;nbsp;&lt;A class="lia-external-url" href="https://aka.ms/agent-governance-toolkit" target="_blank"&gt;github.com/microsoft/agent-governance-toolkit&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Apr 2026 04:55:22 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/agent-governance-toolkit-architecture-deep-dive-policy-engines/ba-p/4510105</guid>
      <dc:creator>mosiddi</dc:creator>
      <dc:date>2026-04-10T04:55:22Z</dc:date>
    </item>
    <item>
      <title>DPDK 25.11 Performance on Azure for High-Speed Packet Workloads</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/dpdk-25-11-performance-on-azure-for-high-speed-packet-workloads/ba-p/4424905</link>
      <description>&lt;P&gt;At Microsoft Azure, performance is treated as an ongoing discipline grounded in careful engineering and real-world validation. As cloud workloads grow in scale and variety, customers depend on consistent, high-throughput networking. Technologies such as the Data Plane Development Kit (DPDK) play a key role in meeting these expectations.&lt;/P&gt;
&lt;P&gt;To support customers running advanced network functions, we’ve released our latest performance report based on DPDK 25.11. It is now available in the DPDK performance catalog (&lt;A href="https://fast.dpdk.org/doc/perf/DPDK_25_11_Microsoft_NIC_performance_report.pdf" target="_blank" rel="noopener"&gt;Microsoft Azure DPDK Performance Report&lt;/A&gt;). The report provides a clear view of how DPDK performs on Microsoft-developed Azure Boost within Azure infrastructure, with detailed insights into packet processing across a range of scenarios, from small packet sizes to multi-core scaling.&lt;/P&gt;
&lt;H4&gt;Why We Test DPDK on Azure&lt;/H4&gt;
&lt;P&gt;DPDK is widely used for high-performance packet processing in virtualized environments. It powers a range of workloads from customer-deployed virtual network functions to internal Azure network appliances.&lt;/P&gt;
&lt;P&gt;But simply enabling DPDK is not enough. To ensure optimal performance, we validate it under realistic conditions, including:&lt;/P&gt;
&lt;UL data-spread="false"&gt;
&lt;LI&gt;Azure VM configurations with Accelerated Networking&lt;/LI&gt;
&lt;LI&gt;NUMA-aware memory and CPU alignment&lt;/LI&gt;
&lt;LI&gt;Hugepage-backed memory allocation&lt;/LI&gt;
&lt;LI&gt;Multi-core PMD thread scaling&lt;/LI&gt;
&lt;LI&gt;Packet forwarding using real traffic generators&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This helps us understand how DPDK performs in actual cloud environments, not just idealized lab setups.&lt;/P&gt;
&lt;H4&gt;What the Report Covers&lt;/H4&gt;
&lt;P&gt;The DPDK 25.11 report includes performance benchmarks across different frame sizes, ranging from 64 bytes to 1518 bytes. It also evaluates CPU usage, queue configuration, and latency stability across various test conditions.&lt;/P&gt;
&lt;P&gt;Key Report Highlights:&lt;/P&gt;
&lt;UL data-spread="false"&gt;
&lt;LI&gt;Line-rate throughput is achievable at common frame sizes when vCPUs are pinned correctly and memory is properly configured&lt;/LI&gt;
&lt;LI&gt;Low jitter and consistent latency are observed across multi-queue and multi-core tests&lt;/LI&gt;
&lt;LI&gt;Performance scales nearly linearly with additional cores, especially for smaller packet sizes&lt;/LI&gt;
&lt;LI&gt;Queue and PMD thread alignment with the NUMA layout plays a critical role in maximizing efficiency&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;All tests were performed using Azure VM SKUs equipped with Microsoft NICs and configured for optimal isolation and performance.&lt;/P&gt;
&lt;H4&gt;Why We Shared This with the Community&lt;/H4&gt;
&lt;P&gt;Publishing this report reflects our commitment to open engineering and ecosystem collaboration. We believe performance transparency benefits everyone in the ecosystem, including developers, operators, and customers.&lt;/P&gt;
&lt;P&gt;Here are a few reasons why we share:&lt;/P&gt;
&lt;UL data-spread="false"&gt;
&lt;LI&gt;It helps customers plan and tune their workloads using validated performance envelopes&lt;/LI&gt;
&lt;LI&gt;It enables vendors and contributors to optimize drivers, firmware, and applications based on real-world data&lt;/LI&gt;
&lt;LI&gt;It encourages reproducibility and standardization in cloud DPDK benchmarking&lt;/LI&gt;
&lt;LI&gt;It creates a feedback loop between Azure, the DPDK community, and our partners&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Our goal is not just to test internally but to foster open dialogue and measurable improvement across platforms.&lt;/P&gt;
&lt;H4&gt;Recommendations for Running DPDK on Azure&lt;/H4&gt;
&lt;P&gt;Based on the test results, we offer the following best practices for customers deploying DPDK-based applications:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 73.3333%; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Area&lt;/th&gt;&lt;th&gt;Recommendation&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;VM Selection&lt;/td&gt;&lt;td&gt;Choose Accelerated Networking-enabled SKUs like D, Fsv2, or Eav4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;CPU Pinning&lt;/td&gt;&lt;td&gt;Use dedicated cores for PMD threads and align with NUMA topology&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Memory&lt;/td&gt;&lt;td&gt;Configure hugepages and allocate memory from the local NUMA node&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Queue Mapping&lt;/td&gt;&lt;td&gt;Match RX and TX queues to available vCPUs to avoid contention&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Packet Generator&lt;/td&gt;&lt;td&gt;Use pktgen-dpdk or testpmd with controlled traffic profiles&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 24.7602%" /&gt;&lt;col style="width: 75.2594%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;These settings can significantly improve consistency and peak throughput across many DPDK scenarios.&lt;/P&gt;
&lt;H4&gt;Get Involved and Reproduce the Results&lt;/H4&gt;
&lt;P&gt;We invite you to read the full report and try the configurations in your own environment. Whether you are running a firewall, a router, or a telemetry appliance, DPDK on Azure offers scalable performance with the right tuning.&lt;/P&gt;
&lt;P&gt;You can:&lt;/P&gt;
&lt;UL data-spread="false"&gt;
&lt;LI&gt;Download the &lt;A href="https://fast.dpdk.org/doc/perf/DPDK_25_11_Microsoft_NIC_performance_report.pdf" target="_blank" rel="noopener"&gt;Microsoft Azure DPDK Performance Report&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Replicate the test setup using Azure VMs and your preferred packet generator; see &lt;A href="https://github.com/mcgov/dpdk-perf" target="_blank" rel="noopener" data-tabster="{&amp;quot;restorer&amp;quot;:{&amp;quot;type&amp;quot;:1}}"&gt;github.com/mcgov/dpdk-perf&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Share your feedback through GitHub or community channels, or email&amp;nbsp;&lt;A href="mailto:dpdk@microsoft.com" target="_blank" rel="noopener" data-tabster="{&amp;quot;restorer&amp;quot;:{&amp;quot;type&amp;quot;:1}}"&gt;dpdk@microsoft.com&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Suggest improvements or contribute new scenarios to future performance reports&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Conclusion&lt;/H4&gt;
&lt;P&gt;DPDK is a powerful enabler of high-performance networking in the cloud. With this report, we aim to make Azure performance data open, useful, and actionable. It reflects our ongoing investment in validating and improving the underlying infrastructure that supports mission-critical workloads.&lt;/P&gt;
&lt;P&gt;We thank the DPDK community for ongoing collaboration. We look forward to continued engagement as we scale performance transparency in cloud-native environments.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Apr 2026 18:37:04 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/dpdk-25-11-performance-on-azure-for-high-speed-packet-workloads/ba-p/4424905</guid>
      <dc:creator>KashanK</dc:creator>
      <dc:date>2026-04-01T18:37:04Z</dc:date>
    </item>
    <item>
      <title>Run OpenClaw Agents on Azure Linux VMs (with Secure Defaults)</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/run-openclaw-agents-on-azure-linux-vms-with-secure-defaults/ba-p/4502944</link>
      <description>&lt;P data-source-line="7"&gt;Many teams want an enterprise-ready personal AI assistant, but they need it on infrastructure they control, with security boundaries they can explain to IT. That is exactly where OpenClaw fits on Azure.&lt;/P&gt;
&lt;P data-source-line="9"&gt;&lt;STRONG&gt;OpenClaw is a self-hosted, always-on personal agent runtime you run in your enterprise environment and Azure infrastructure.&lt;/STRONG&gt; Instead of relying only on a hosted chat app from a third-party provider, you can deploy, operate, and experiment with &lt;STRONG&gt;an agent on an Azure Linux VM you control&lt;/STRONG&gt; — &lt;STRONG&gt;using your existing GitHub Copilot licenses, Azure OpenAI deployments, or API plans from OpenAI, Anthropic Claude, Google Gemini, and other model providers you already subscribe to&lt;/STRONG&gt;. Once deployed on Azure, you can interact with an OpenClaw agent through familiar channels like Microsoft Teams, Slack, Telegram, WhatsApp, and many more!&lt;/P&gt;
&lt;P data-source-line="11"&gt;For Azure users, this gives you a practical middle ground: modern personal-agent workflows on familiar Azure infrastructure.&lt;/P&gt;
&lt;H2 data-source-line="13"&gt;What is OpenClaw, and how is it different from ChatGPT/Claude/chat apps?&lt;/H2&gt;
&lt;P data-source-line="15"&gt;OpenClaw is a self-hosted personal agent runtime that can be hosted on Azure compute infrastructure.&lt;/P&gt;
&lt;P data-source-line="17"&gt;How it differs:&lt;/P&gt;
&lt;UL data-source-line="19"&gt;
&lt;LI data-source-line="19"&gt;&lt;STRONG&gt;ChatGPT/Claude apps&lt;/STRONG&gt;&amp;nbsp;are primarily hosted chat experiences tied to one provider's models&lt;/LI&gt;
&lt;LI data-source-line="20"&gt;&lt;STRONG&gt;OpenClaw&lt;/STRONG&gt;&amp;nbsp;is an always-on runtime you operate yourself, backed by&amp;nbsp;&lt;STRONG&gt;your choice of model provider&lt;/STRONG&gt;&amp;nbsp;— GitHub Copilot, Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, and others&lt;/LI&gt;
&lt;LI data-source-line="21"&gt;&lt;STRONG&gt;OpenClaw&lt;/STRONG&gt; lets you keep the runtime boundary in your own Azure VM environment within your Azure enterprise subscription&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-source-line="23"&gt;In practice, OpenClaw is useful when you want a persistent assistant for operational and workflow tasks, with your own infrastructure as the control point. You bring whatever model provider and API plan you already have — OpenClaw connects to it.&lt;/P&gt;
&lt;H2 data-source-line="25"&gt;Why Azure Linux VMs?&lt;/H2&gt;
&lt;P data-source-line="27"&gt;Azure Linux VMs are a strong fit because they provide:&lt;/P&gt;
&lt;UL data-source-line="29"&gt;
&lt;LI data-source-line="29"&gt;&lt;STRONG&gt;A suitable host machine for the OpenClaw agent to run on&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI data-source-line="29"&gt;Enterprise-friendly infrastructure and identity workflows&lt;/LI&gt;
&lt;LI data-source-line="30"&gt;Repeatable provisioning via the Azure CLI&lt;/LI&gt;
&lt;LI data-source-line="31"&gt;Network hardening with NSG rules&lt;/LI&gt;
&lt;LI data-source-line="32"&gt;Managed SSH access through Azure Bastion instead of public SSH exposure&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-source-line="34"&gt;How to Set Up OpenClaw on an Azure Linux VM&lt;/H2&gt;
&lt;P data-source-line="36"&gt;This guide sets up an Azure Linux VM, applies NSG (Network Security Group) hardening, configures Azure Bastion for managed SSH access, and &lt;STRONG&gt;installs an always-on OpenClaw agent within the VM&lt;/STRONG&gt; that you can interact with through various messaging channels.&lt;/P&gt;
&lt;H3 data-source-line="38"&gt;What you'll do&lt;/H3&gt;
&lt;UL data-source-line="40"&gt;
&lt;LI data-source-line="40"&gt;Create Azure networking (VNet, subnets, NSG) and compute resources with the Azure CLI&lt;/LI&gt;
&lt;LI data-source-line="41"&gt;Apply Network Security Group rules so VM SSH is allowed only from Azure Bastion&lt;/LI&gt;
&lt;LI data-source-line="42"&gt;Use Azure Bastion for SSH access (no public IP on the VM)&lt;/LI&gt;
&lt;LI data-source-line="43"&gt;Install OpenClaw on the Azure VM&lt;/LI&gt;
&lt;LI data-source-line="44"&gt;Verify OpenClaw installation and configuration on the VM&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-source-line="46"&gt;What you need&lt;/H3&gt;
&lt;UL data-source-line="48"&gt;
&lt;LI data-source-line="48"&gt;An Azure subscription with permission to create compute and network resources&lt;/LI&gt;
&lt;LI data-source-line="49"&gt;Azure CLI installed (&lt;A href="https://learn.microsoft.com/cli/azure/install-azure-cli" target="_blank"&gt;install steps&lt;/A&gt;)&lt;/LI&gt;
&lt;LI data-source-line="50"&gt;An SSH key pair (the guide covers generating one if needed)&lt;/LI&gt;
&lt;LI data-source-line="51"&gt;~20–30 minutes&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-source-line="53"&gt;Configure deployment&lt;/H2&gt;
&lt;H3 data-source-line="55"&gt;Step 1: Sign in to Azure CLI&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;az login                     # Select a suitable Azure subscription during Azure login
az extension add -n ssh      # SSH extension is required for Azure Bastion SSH&lt;/LI-CODE&gt;
&lt;P data-source-line="62"&gt;The ssh extension is required for Azure Bastion native SSH tunneling.&lt;/P&gt;
&lt;H3 data-source-line="64"&gt;Step 2: Register required resource providers (one-time)&lt;/H3&gt;
&lt;P&gt;Register the Azure resource providers this deployment uses. Registration is needed only once per subscription:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az provider register --namespace Microsoft.Compute
az provider register --namespace Microsoft.Network&lt;/LI-CODE&gt;
&lt;P data-source-line="71"&gt;Verify registration. Wait until both show &lt;EM&gt;Registered&lt;/EM&gt;.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az provider show --namespace Microsoft.Compute --query registrationState -o tsv
az provider show --namespace Microsoft.Network --query registrationState -o tsv&lt;/LI-CODE&gt;
&lt;H3 data-source-line="78"&gt;Step 3: Set deployment variables&lt;/H3&gt;
&lt;P&gt;Set the deployment environment variables that will be needed throughout this guide.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;RG="rg-openclaw"
LOCATION="westus2"
VNET_NAME="vnet-openclaw"
VNET_PREFIX="10.40.0.0/16"
VM_SUBNET_NAME="snet-openclaw-vm"
VM_SUBNET_PREFIX="10.40.2.0/24"
BASTION_SUBNET_PREFIX="10.40.1.0/26"
NSG_NAME="nsg-openclaw-vm"
VM_NAME="vm-openclaw"
ADMIN_USERNAME="openclaw"
BASTION_NAME="bas-openclaw"
BASTION_PIP_NAME="pip-openclaw-bastion"&lt;/LI-CODE&gt;
&lt;P data-source-line="95"&gt;Adjust names and CIDR ranges to fit your environment. The Bastion subnet must be at least &lt;EM&gt;/26&lt;/EM&gt;.&lt;/P&gt;
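&lt;P&gt;If you change the CIDR ranges, a quick local check can catch layout mistakes before you create any resources. The sketch below is optional and separate from the CLI steps: it uses Python's standard &lt;EM&gt;ipaddress&lt;/EM&gt; module with the example values from this guide, and the &lt;EM&gt;validate_layout&lt;/EM&gt; helper is a hypothetical name.&lt;/P&gt;

```python
import ipaddress

# Example values from the deployment variables above.
vnet = ipaddress.ip_network("10.40.0.0/16")
vm_subnet = ipaddress.ip_network("10.40.2.0/24")
bastion_subnet = ipaddress.ip_network("10.40.1.0/26")


def validate_layout(vnet, vm_subnet, bastion_subnet):
    """Return a list of problems; an empty list means the layout looks sane."""
    problems = []
    for name, subnet in (("VM", vm_subnet), ("Bastion", bastion_subnet)):
        if not subnet.subnet_of(vnet):
            problems.append(f"{name} subnet {subnet} is outside {vnet}")
    if vm_subnet.overlaps(bastion_subnet):
        problems.append("VM and Bastion subnets overlap")
    # "At least /26" means a prefix length of 26 or shorter (64+ addresses).
    if bastion_subnet.prefixlen > 26:
        problems.append(f"Bastion subnet {bastion_subnet} is smaller than /26")
    return problems


print(validate_layout(vnet, vm_subnet, bastion_subnet))  # []
```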
&lt;H3 data-source-line="97"&gt;Step 4: Select SSH key&lt;/H3&gt;
&lt;P data-source-line="99"&gt;Use your existing public key if you have one:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)"&lt;/LI-CODE&gt;
&lt;P data-source-line="105"&gt;If you don't have an SSH key yet, generate one:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519 -C "you@example.com"
SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)"&lt;/LI-CODE&gt;
&lt;H3 data-source-line="112"&gt;Step 5: Select VM size and OS disk size&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;VM_SIZE="Standard_B2as_v2"
OS_DISK_SIZE_GB=64&lt;/LI-CODE&gt;
&lt;P data-source-line="119"&gt;Choose a VM size and OS disk size available in your subscription and region:&lt;/P&gt;
&lt;UL data-source-line="121"&gt;
&lt;LI data-source-line="121"&gt;Start smaller for light usage and scale up later&lt;/LI&gt;
&lt;LI data-source-line="122"&gt;Use more vCPU/RAM/disk for heavier automation, more channels, or larger model/tool workloads&lt;/LI&gt;
&lt;LI data-source-line="123"&gt;If a VM size is unavailable in your region or subscription quota, pick the closest available SKU&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-source-line="125"&gt;List VM sizes available in your target region:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az vm list-skus --location "${LOCATION}" --resource-type virtualMachines -o table&lt;/LI-CODE&gt;
&lt;P data-source-line="131"&gt;Check your current vCPU and disk usage/quota:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az vm list-usage --location "${LOCATION}" -o table&lt;/LI-CODE&gt;
&lt;H2 data-source-line="137"&gt;Deploy Azure resources&lt;/H2&gt;
&lt;H3 data-source-line="139"&gt;Step 1: Create the resource group&lt;/H3&gt;
&lt;P&gt;The Azure resource group will contain all of the Azure resources that the OpenClaw agent needs.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az group create -n "${RG}" -l "${LOCATION}"&lt;/LI-CODE&gt;
&lt;H3 data-source-line="145"&gt;Step 2: Create the network security group&lt;/H3&gt;
&lt;P data-source-line="147"&gt;Create the NSG and add rules so only the Bastion subnet can SSH into the VM.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az network nsg create \
  -g "${RG}" -n "${NSG_NAME}" -l "${LOCATION}"

# Allow SSH from the Bastion subnet only
az network nsg rule create \
  -g "${RG}" --nsg-name "${NSG_NAME}" \
  -n AllowSshFromBastionSubnet --priority 100 \
  --access Allow --direction Inbound --protocol Tcp \
  --source-address-prefixes "${BASTION_SUBNET_PREFIX}" \
  --destination-port-ranges 22

# Deny SSH from the public internet
az network nsg rule create \
  -g "${RG}" --nsg-name "${NSG_NAME}" \
  -n DenyInternetSsh --priority 110 \
  --access Deny --direction Inbound --protocol Tcp \
  --source-address-prefixes Internet \
  --destination-port-ranges 22

# Deny SSH from other VNet sources
az network nsg rule create \
  -g "${RG}" --nsg-name "${NSG_NAME}" \
  -n DenyVnetSsh --priority 120 \
  --access Deny --direction Inbound --protocol Tcp \
  --source-address-prefixes VirtualNetwork \
  --destination-port-ranges 22&lt;/LI-CODE&gt;
&lt;P data-source-line="178"&gt;The rules are evaluated by priority (lowest number first): Bastion traffic is allowed at 100, then all other SSH is blocked at 110 and 120.&lt;/P&gt;
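&lt;P&gt;To see how first-match evaluation plays out, the following sketch simulates the three rules locally. It is a deliberate simplification for illustration, not how Azure resolves service tags: here &lt;EM&gt;VirtualNetwork&lt;/EM&gt; is treated as the VNet address space only and &lt;EM&gt;Internet&lt;/EM&gt; as everything outside it, and the &lt;EM&gt;evaluate_ssh&lt;/EM&gt; helper is hypothetical.&lt;/P&gt;

```python
import ipaddress

VNET = ipaddress.ip_network("10.40.0.0/16")

# (priority, rule name, source, access) for inbound TCP port 22.
RULES = [
    (100, "AllowSshFromBastionSubnet", "10.40.1.0/26", "Allow"),
    (110, "DenyInternetSsh", "Internet", "Deny"),
    (120, "DenyVnetSsh", "VirtualNetwork", "Deny"),
]


def matches(source, ip):
    if source == "VirtualNetwork":  # simplification: VNet address space only
        return ip in VNET
    if source == "Internet":        # simplification: anything outside the VNet
        return ip not in VNET
    return ip in ipaddress.ip_network(source)


def evaluate_ssh(source_ip):
    """Return (rule name, access) for the first rule that matches."""
    ip = ipaddress.ip_address(source_ip)
    for _priority, name, source, access in sorted(RULES):
        if matches(source, ip):
            return name, access
    return "no custom rule matched", "falls through to default rules"


print(evaluate_ssh("10.40.1.10"))   # Bastion subnet: allowed at priority 100
print(evaluate_ssh("10.40.2.7"))    # another VNet host: denied at priority 120
print(evaluate_ssh("203.0.113.5"))  # public address: denied at priority 110
```

&lt;P&gt;The explicit VNet deny at priority 120 matters because Azure's default AllowVnetInBound rule (priority 65000) would otherwise permit SSH from other hosts in the virtual network.&lt;/P&gt;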
&lt;H3 data-source-line="180"&gt;Step 3: Create the virtual network and subnets&lt;/H3&gt;
&lt;P data-source-line="182"&gt;Create the VNet with the VM subnet (NSG attached), then add the Bastion subnet.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az network vnet create \
  -g "${RG}" -n "${VNET_NAME}" -l "${LOCATION}" \
  --address-prefixes "${VNET_PREFIX}" \
  --subnet-name "${VM_SUBNET_NAME}" \
  --subnet-prefixes "${VM_SUBNET_PREFIX}"

# Attach the NSG to the VM subnet
az network vnet subnet update \
  -g "${RG}" --vnet-name "${VNET_NAME}" \
  -n "${VM_SUBNET_NAME}" --nsg "${NSG_NAME}"

# AzureBastionSubnet — name is required by Azure
az network vnet subnet create \
  -g "${RG}" --vnet-name "${VNET_NAME}" \
  -n AzureBastionSubnet \
  --address-prefixes "${BASTION_SUBNET_PREFIX}"&lt;/LI-CODE&gt;
&lt;H3 data-source-line="203"&gt;Step 4: Create the Virtual Machine&lt;/H3&gt;
&lt;P data-source-line="205"&gt;Create the VM with no public IP. SSH access for OpenClaw configuration will be exclusively through Azure Bastion.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az vm create \
  -g "${RG}" -n "${VM_NAME}" -l "${LOCATION}" \
  --image "Canonical:ubuntu-24_04-lts:server:latest" \
  --size "${VM_SIZE}" \
  --os-disk-size-gb "${OS_DISK_SIZE_GB}" \
  --storage-sku StandardSSD_LRS \
  --admin-username "${ADMIN_USERNAME}" \
  --ssh-key-values "${SSH_PUB_KEY}" \
  --vnet-name "${VNET_NAME}" \
  --subnet "${VM_SUBNET_NAME}" \
  --public-ip-address "" \
  --nsg ""&lt;/LI-CODE&gt;
&lt;P data-source-line="222"&gt;&lt;EM&gt;--public-ip-address "" &lt;/EM&gt;prevents a public IP from being assigned.&lt;/P&gt;
&lt;P data-source-line="222"&gt;&lt;EM&gt;--nsg "" &lt;/EM&gt;skips creating a per-NIC NSG (the subnet-level NSG created earlier handles security).&lt;/P&gt;
&lt;P data-source-line="248"&gt;&lt;STRONG&gt;Reproducibility:&lt;/STRONG&gt; The command above uses &lt;EM&gt;latest&lt;/EM&gt; as the Ubuntu image version. To pin a specific version, list the available versions and replace &lt;EM&gt;latest&lt;/EM&gt; with one of them:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az vm image list \
  --publisher Canonical --offer ubuntu-24_04-lts \
  --sku server --all -o table&lt;/LI-CODE&gt;
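&lt;P&gt;One way to make the pinned version explicit is to build the image URN from a variable. This is a sketch under assumptions: the helper name image_urn is hypothetical, and the version string shown is a placeholder rather than a real image version. Copy an actual value from the listing above.&lt;/P&gt;

```shell
# Hypothetical helper: builds the image URN used in this guide from a version
# string, falling back to "latest" when no version is given.
image_urn() {
  local version="${1:-latest}"
  printf 'Canonical:ubuntu-24_04-lts:server:%s\n' "${version}"
}

# Placeholder only; replace with a version taken from "az vm image list".
IMAGE_VERSION="REPLACE_WITH_LISTED_VERSION"
echo "Pinned image: $(image_urn "${IMAGE_VERSION}")"

# The result would then be passed to the VM creation step, for example:
#   az vm create ... --image "$(image_urn "${IMAGE_VERSION}")" ...
```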
&lt;H3 data-source-line="232"&gt;Step 5: Create Azure Bastion&lt;/H3&gt;
&lt;P data-source-line="234"&gt;Azure Bastion provides secure, managed SSH access to the VM without exposing a public IP.&lt;/P&gt;
&lt;P data-source-line="234"&gt;The Bastion Standard SKU with tunneling enabled is required for the CLI-based &lt;EM&gt;az network bastion ssh&lt;/EM&gt; command.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az network public-ip create \
  -g "${RG}" -n "${BASTION_PIP_NAME}" -l "${LOCATION}" \
  --sku Standard --allocation-method Static

az network bastion create \
  -g "${RG}" -n "${BASTION_NAME}" -l "${LOCATION}" \
  --vnet-name "${VNET_NAME}" \
  --public-ip-address "${BASTION_PIP_NAME}" \
  --sku Standard --enable-tunneling true&lt;/LI-CODE&gt;
&lt;P data-source-line="248"&gt;Bastion provisioning typically takes 5–10 minutes but can take 15–30 minutes in some regions.&lt;/P&gt;
&lt;H3 data-source-line="248"&gt;Step 6: Verify Deployments&lt;/H3&gt;
&lt;P data-source-line="248"&gt;After all resources are deployed, your resource group should look like the following:&lt;/P&gt;
&lt;img /&gt;
&lt;H2 data-source-line="250"&gt;Install OpenClaw&lt;/H2&gt;
&lt;H3 data-source-line="252"&gt;Step 1: SSH into the VM through Azure Bastion&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;VM_ID="$(az vm show -g "${RG}" -n "${VM_NAME}" --query id -o tsv)"

az network bastion ssh \
  --name "${BASTION_NAME}" \
  --resource-group "${RG}" \
  --target-resource-id "${VM_ID}" \
  --auth-type ssh-key \
  --username "${ADMIN_USERNAME}" \
  --ssh-key ~/.ssh/id_ed25519&lt;/LI-CODE&gt;
&lt;H3 data-source-line="266"&gt;Step 2: Install OpenClaw (in the Bastion SSH shell)&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;curl -fsSL https://openclaw.ai/install.sh | bash&lt;/LI-CODE&gt;
&lt;P data-source-line="274"&gt;The installer installs Node LTS and dependencies if not already present, installs OpenClaw, and launches the &lt;STRONG&gt;OpenClaw onboarding wizard&lt;/STRONG&gt;. For more information, see the &lt;A class="lia-external-url" href="https://docs.openclaw.ai/install" target="_blank"&gt;open source OpenClaw install docs&lt;/A&gt;.&lt;/P&gt;
&lt;H4 data-source-line="274"&gt;OpenClaw Onboarding: Choosing an AI Model Provider&lt;/H4&gt;
&lt;P data-source-line="274"&gt;During OpenClaw onboarding, you'll choose the &lt;STRONG&gt;AI model provider for the OpenClaw agent&lt;/STRONG&gt;. This can be GitHub Copilot, Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, or another supported provider. See the &lt;A class="lia-external-url" href="https://docs.openclaw.ai/install" target="_blank"&gt;open source OpenClaw install docs&lt;/A&gt; for details on choosing an AI model provider when going through the onboarding wizard.&lt;/P&gt;
&lt;P data-source-line="284"&gt;Most enterprise Azure teams already have GitHub Copilot licenses. If that is your case, we recommend choosing the GitHub Copilot provider in the OpenClaw onboarding wizard. See the &lt;A class="lia-external-url" href="https://docs.openclaw.ai/providers/github-copilot" target="_blank"&gt;open source OpenClaw docs on configuring &lt;STRONG&gt;GitHub Copilot as the AI model provider&lt;/STRONG&gt;&lt;/A&gt;.&lt;/P&gt;
&lt;H4 data-source-line="284"&gt;OpenClaw Onboarding: Setting up Messaging Channels&lt;/H4&gt;
&lt;P&gt;During OpenClaw onboarding, there will be an optional step where you can set up various messaging channels to interact with your OpenClaw agent.&lt;/P&gt;
&lt;P&gt;For first-time users, we recommend Telegram because it is the easiest to set up. Other messaging channels, such as Microsoft Teams, Slack, and WhatsApp, can also be set up.&lt;/P&gt;
&lt;P&gt;To configure &lt;STRONG&gt;OpenClaw for messaging through chat channels&lt;/STRONG&gt;, see the &lt;A class="lia-external-url" href="https://docs.openclaw.ai/channels" target="_blank"&gt;open source OpenClaw chat channels docs&lt;/A&gt;.&lt;/P&gt;
&lt;H3 data-source-line="276"&gt;Step 3: Verify OpenClaw Configuration&lt;/H3&gt;
&lt;P data-source-line="278"&gt;To validate that everything was set up correctly, run the following commands within the same Bastion SSH session:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;openclaw status
openclaw gateway status&lt;/LI-CODE&gt;&lt;img /&gt;
&lt;P&gt;If any issues are reported, you can run the onboarding wizard again using the steps above. Alternatively, you can run the following command:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;openclaw doctor&lt;/LI-CODE&gt;
&lt;H2&gt;Message OpenClaw&lt;/H2&gt;
&lt;P&gt;Once you have configured the OpenClaw agent to be reachable via various messaging channels, you can verify that it is responsive by messaging it.&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;Enhancing OpenClaw for Use Cases&lt;/H2&gt;
&lt;P&gt;There you go! You now have a 24/7, always-on personal AI agent, living on its own Azure VM environment.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;For awesome OpenClaw use cases, check out the &lt;A class="lia-external-url" href="https://github.com/hesamsheikh/awesome-openclaw-usecases" target="_blank"&gt;awesome-openclaw-usecases repository&lt;/A&gt;.&lt;/LI&gt;
&lt;LI&gt;To enhance your OpenClaw agent with additional AI skills so that it can autonomously perform multi-step operations in any domain, check out the&amp;nbsp;&lt;A class="lia-external-url" href="https://github.com/VoltAgent/awesome-openclaw-skills" target="_blank"&gt;awesome-openclaw-skills repository&lt;/A&gt;.&lt;/LI&gt;
&lt;LI&gt;You can also check out &lt;A class="lia-external-url" href="https://clawhub.ai/" target="_blank"&gt;ClawHub&lt;/A&gt;&amp;nbsp;and &lt;A class="lia-external-url" href="https://clawskills.sh/" target="_blank"&gt;ClawSkills&lt;/A&gt;, two popular open source skills directories that can enhance your OpenClaw agent.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Cleanup&lt;/H2&gt;
&lt;P&gt;&lt;SPAN data-as="p"&gt;To delete all resources created by this guide:&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;az group delete -n "${RG}" --yes --no-wait&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN data-as="p"&gt;This removes the resource group and everything inside it (VM, VNet, NSG, Bastion, public IP). &lt;/SPAN&gt;&lt;SPAN data-as="p"&gt;This also deletes the OpenClaw agent running within the VM.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;If you'd like to dive deeper into deploying OpenClaw on Azure, please check out the &lt;A class="lia-external-url" href="https://docs.openclaw.ai/install/azure" target="_blank"&gt;open source OpenClaw on Azure docs&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Sun, 22 Mar 2026 16:34:46 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/run-openclaw-agents-on-azure-linux-vms-with-secure-defaults/ba-p/4502944</guid>
      <dc:creator>johnsonshi_msft</dc:creator>
      <dc:date>2026-03-22T16:34:46Z</dc:date>
    </item>
    <item>
      <title>How Netstar Streamlined Fleet Monitoring and Reduced Custom Integrations with Drasi</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/how-netstar-streamlined-fleet-monitoring-and-reduced-custom/ba-p/4499592</link>
      <description>&lt;P&gt;When a high-value container goes silent between waystations, logistics teams lose critical visibility, risking delays that can cascade into port congestion and missed connections. &lt;A href="https://www.netstar.co.za/" target="_blank"&gt;Netstar&lt;/A&gt;, a connected fleet solutions provider supporting customers like Maersk, faced this challenge as its operations scaled. Timely notifications of delays, arrivals, and status changes became critical to keeping cargo moving efficiently through port systems.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;BR /&gt;To address growing integration complexity and the need for real-time responsiveness, Netstar adopted &lt;A href="https://drasi.io/" target="_blank"&gt;Drasi&lt;/A&gt;. Drasi, built for change-driven solutions, provides continuously updated query results and automated reactions to data changes, enabling systems to detect and respond to critical changes as they happen. This shift to Drasi became foundational to how Netstar unified its fleet data, reduced engineering overhead, and improved monitoring workflows.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;The Fragmentation Challenge&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Growing operational complexity made an underlying challenge increasingly apparent. Tracking a container's journey from pickup to port terminal required reconciling data such as vehicle identifiers, waypoints, GPS location feeds, and IoT telemetry signals from siloed systems. With each new operational or business requirement, whether monitoring vehicle health or detecting route deviations, development teams found themselves repeatedly rebuilding similar patterns.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;"We were essentially rebuilding the same integration architecture for every use case," &lt;/EM&gt;explains Daniel Joubert, General Manager and technical lead at Netstar&lt;EM&gt;. "One week we'd build a dashboard for location tracking. The next week, we'd build another one for breakdown detection. The engineering overhead was unsustainable."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Batch-based processing compounded the issue. Critical signals such as missed health reports or route delays could surface long after they occurred, limiting Netstar’s ability to take timely action.&lt;/P&gt;
&lt;H4&gt;&lt;SPAN class="lia-text-color-15"&gt;&lt;STRONG&gt;Introducing Drasi for Change-driven Architecture&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;Rather than continue building point solutions, Netstar adopted Drasi as the backbone of its real-time data architecture. Drasi simplifies systems that must detect, evaluate, and react to data changes quickly and efficiently at scale, aligning directly with Netstar’s needs.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;A Unified, Continuously Updated View &lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Drasi connected directly to Netstar's existing data sources: Azure SQL databases for information such as vehicle identifiers and waypoints, and Azure Event Hubs for GPS location data and IoT telemetry. Drasi Continuous Queries joined this information into a single, always-current operational picture. Instead of multiple custom-built pipelines, Netstar gained a single source of truth for its fleet.&lt;/P&gt;
&lt;P&gt;Using Drasi Reactions, Netstar defined actions that trigger when specific events occur. When a truck fails to send a health signal within its expected window, or when a delay notification indicates potential supply chain disruption, the system responds immediately without human intervention, reducing the likelihood of missed events.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Improvements Enabled by Drasi&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Using the Drasi plugin for Grafana, Netstar consolidated results from Continuous Queries into one monitoring interface. Operators no longer reconciled conflicting views across separate tools; they now track vehicle health, location, alerts, and route deviations in real time from a single dashboard.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;"The transformation was remarkable," &lt;/EM&gt;says Dustyn Lightfoot, Solution Architect. &lt;EM&gt;"We were able to use a single Drasi instance to support multiple business use cases without building new infrastructure or writing additional code, for example, to stand up Blazor websites. More importantly, it eliminated the ongoing maintenance burden of managing dozens of custom pipelines."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Drasi’s flexibility also extended beyond fleet tracking. By attaching an additional data source and defining new Continuous Queries, the same Drasi instance now surfaces changes in customer billing status and legal contracts. This work required no new infrastructure, just connecting the source and writing queries (&lt;A href="https://drasi.io/reference/query-language/drasi-custom-functions/#drasi-delta-functions" target="_blank"&gt;leveraging Drasi’s custom Delta functions&lt;/A&gt;), providing business teams with up-to-date information without a separate integration effort.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Measurable Impact&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Netstar reports tangible improvements across engineering operations and real-time responsiveness:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Faster incident response&lt;/STRONG&gt;: Missing health signals now trigger alerts immediately rather than being discovered later through manual checks, improving the speed and reliability of operational response.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Improved logistics coordination&lt;/STRONG&gt;: Real-time visibility into container movement through waystations and toward port terminals has enabled Netstar and partners like Maersk to coordinate shipments more efficiently, with automated alerts keeping all stakeholders informed as conditions change.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Reduced development overhead&lt;/STRONG&gt;: Using Drasi has reduced the amount of custom development previously needed to support fleet monitoring capabilities. The same Drasi-driven architecture now supports multiple business cases, from tracking and health monitoring to route optimization.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Streamlined operator experience&lt;/STRONG&gt;: Teams moved from several monitoring tools to a single Drasi-powered Grafana interface, simplifying daily operations and eliminating time spent reconciling conflicting data from different systems.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Industry Context and What’s Next&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Demand for real-time supply chain visibility has intensified as global logistics disruptions highlight the risks of delayed reporting.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;"Our customers don't just want historical reports anymore. They need to know what's happening right now and be alerted the moment something changes,"&lt;/EM&gt; Daniel Joubert explains. &lt;EM&gt;"That shift from batch processing to continuous monitoring is becoming table stakes in fleet management."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Building on this foundation, Netstar is now investigating how Drasi can support predictive maintenance: spotting patterns in vehicle health data early enough to prevent failures altogether. The same change-driven architecture could also streamline coordination across broader supply chain workflows.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;The Broader Architectural Shift&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Netstar’s implementation reflects a wider architectural move emerging across operational solutions: from systems that store and query data to platforms that detect and react to changes as they happen. In fleet logistics, financial systems, and industrial operations, the competitive advantage increasingly lies in eliminating the lag between event and response.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;"Building custom integrations for every use case was slowing us down and limiting what we could deliver to customers," &lt;/EM&gt;Dustyn Lightfoot reflects&lt;EM&gt;. "Drasi gave us a reusable foundation that handles the hard parts, integrating disparate data sources and detecting meaningful changes, so we can focus on solving business problems rather than rebuilding infrastructure."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The collaboration between Drasi and Netstar demonstrates how open source change-driven platforms can simplify complex operational challenges whilst providing actionable insights across distributed systems. As logistics operations evolve, architectures like Drasi’s may define the next era of competitive advantage, one where actionable insight arrives the moment conditions change.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To learn more about Drasi, visit &lt;A class="lia-external-url" href="https://drasi.io/" target="_blank"&gt;Drasi.io&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Mar 2026 19:23:46 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/how-netstar-streamlined-fleet-monitoring-and-reduced-custom/ba-p/4499592</guid>
      <dc:creator>CollinBrian</dc:creator>
      <dc:date>2026-03-05T19:23:46Z</dc:date>
    </item>
    <item>
      <title>Retina 1.0 Is Now Available</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/retina-1-0-is-now-available/ba-p/4489003</link>
      <description>&lt;P&gt;We are excited to announce the first major release of &lt;A class="lia-external-url" href="https://retina.sh/" target="_blank" rel="noopener"&gt;Retina&lt;/A&gt; - a significant milestone for the project. This version brings along many new features, enhancements and bug fixes.&lt;/P&gt;
&lt;P&gt;The Retina maintainer team would like to thank all contributors, community members, and early adopters who helped make this 1.0 release possible.&lt;/P&gt;
&lt;H1&gt;What is Retina?&lt;/H1&gt;
&lt;P&gt;Retina is an open-source, Kubernetes network observability platform. It enables you to continuously observe and measure network health, and investigate network issues on-demand with integrated Kubernetes-native workflows.&lt;/P&gt;
&lt;H1&gt;Why Retina?&lt;/H1&gt;
&lt;P&gt;Kubernetes networking failures are rarely isolated or easy to reproduce. Pods are ephemeral, services span multiple nodes, and network traffic crosses multiple layers (CNI, kube-proxy, node networking, policies), making crucial evidence difficult to capture. Manually connecting to nodes and stitching together logs or packet captures simply does not scale as clusters grow in size and complexity.&lt;/P&gt;
&lt;P&gt;A modern approach to observability must automate and centralize data collection while exposing rich, actionable insights.&lt;/P&gt;
&lt;P&gt;Retina represents a major step forward in solving the complexities of Kubernetes observability by leveraging the power of&amp;nbsp;&lt;A class="lia-external-url" href="https://ebpf.io/" target="_blank" rel="noopener"&gt;eBPF&lt;/A&gt;. Its cloud-agnostic design, deep integration with &lt;A class="lia-external-url" href="https://github.com/cilium/hubble?tab=readme-ov-file#what-is-hubble" target="_blank" rel="noopener"&gt;Hubble&lt;/A&gt;, and support for both real-time metrics and on-demand packet captures make it an invaluable tool for DevOps, SecOps, and compliance teams across diverse environments.&lt;/P&gt;
&lt;H1&gt;What Does It Do?&lt;/H1&gt;
&lt;P&gt;Retina can collect two types of telemetry: metrics and packet captures.&lt;/P&gt;
&lt;P&gt;The Retina shell enables ad-hoc troubleshooting via pre-installed networking tools.&lt;/P&gt;
&lt;H2&gt;Metrics&lt;/H2&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://retina.sh/docs/Metrics/metrics-intro" target="_blank" rel="noopener"&gt;Metrics&lt;/A&gt;&amp;nbsp;provide continuous observability. They can be exported to multiple storage options such as Prometheus or Azure Monitor, and visualized in a variety of ways, including Grafana or Azure Log Analytics.&lt;/P&gt;
&lt;P&gt;Retina supports two control planes: Hubble and Standard. Both are supported regardless of the underlying CNI. The choice of control plane affects the metrics which are collected.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://retina.sh/docs/Metrics/hubble_metrics" target="_blank" rel="noopener"&gt;Hubble metrics&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://retina.sh/docs/Metrics/modes/" target="_blank" rel="noopener"&gt;Standard metrics&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You can customize which metrics are collected by enabling or disabling their corresponding &lt;A class="lia-external-url" href="https://retina.sh/docs/Metrics/plugins/" target="_blank" rel="noopener"&gt;plugins&lt;/A&gt;. Examples of metrics include:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Incoming/outgoing traffic&lt;/LI&gt;
&lt;LI&gt;Dropped packets&lt;/LI&gt;
&lt;LI&gt;TCP/UDP&lt;/LI&gt;
&lt;LI&gt;DNS&lt;/LI&gt;
&lt;LI&gt;API Server latency&lt;/LI&gt;
&lt;LI&gt;Node/interface statistics&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;Grafana dashboard visualizing metrics from Retina - showing packets being dropped on the cluster.&lt;/img&gt;
&lt;H2&gt;Packet Captures&lt;/H2&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://retina.sh/docs/Captures/overview" target="_blank" rel="noopener"&gt;Captures&lt;/A&gt; provide on-demand observability. They allow users to perform distributed packet captures across the cluster, based on specified Nodes/Pods and other supported filters. They can be triggered&amp;nbsp;&lt;A class="lia-external-url" href="https://retina.sh/docs/Captures/cli" target="_blank" rel="noopener"&gt;via the CLI&lt;/A&gt; or &lt;A class="lia-external-url" href="https://retina.sh/docs/Captures/crd" target="_blank" rel="noopener"&gt;through the capture CRD&lt;/A&gt;, and may be output to persistent storage options such as the host filesystem, a PVC, or a storage blob.&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The &lt;A class="lia-external-url" href="https://retina.sh/docs/Captures/cli#file-and-directory-structure-inside-the-tarball" target="_blank" rel="noopener"&gt;result of the capture&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;contains more than just a &lt;EM&gt;.pcap&lt;/EM&gt; file&lt;SPAN data-contrast="auto"&gt;. Retina also captures networking metadata such as iptables rules, socket statistics, kernel network information from&amp;nbsp;&lt;EM&gt;/proc/net&lt;/EM&gt;, and more.&lt;/SPAN&gt;&lt;/P&gt;
&lt;img&gt;Retina packet capture performed through the CLI.&lt;/img&gt;
&lt;H2&gt;Shell&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://retina.sh/docs/Troubleshooting/shell" target="_blank" rel="noopener"&gt;Retina shell&lt;/A&gt; enables deep ad-hoc troubleshooting by providing a suite of networking tools. The CLI command starts an interactive shell on a Kubernetes node that runs a container image which includes standard tools such as ping or curl, as well as specialized tools like&amp;nbsp;&lt;A class="lia-external-url" href="https://retina.sh/docs/Troubleshooting/shell#bpftool" target="_blank" rel="noopener"&gt;bpftool&lt;/A&gt;, &lt;A class="lia-external-url" href="https://retina.sh/docs/Troubleshooting/shell#pwru" target="_blank" rel="noopener"&gt;pwru&lt;/A&gt;, &lt;A class="lia-external-url" href="https://retina.sh/docs/Troubleshooting/shell#inspektor-gadget-ig" target="_blank" rel="noopener"&gt;Inspektor Gadget&lt;/A&gt; and more.&lt;/P&gt;
&lt;P&gt;The Retina shell is currently only available on Linux. Note that some tools require particular capabilities to execute. These can be passed as&amp;nbsp;&lt;A class="lia-external-url" href="https://retina.sh/docs/Troubleshooting/shell#getting-started" target="_blank" rel="noopener"&gt;parameters through the CLI&lt;/A&gt;.&lt;/P&gt;
&lt;img&gt;Retina shell CLI - showcasing some of the available tools, including ping, dig, bpftool and pwru.&lt;/img&gt;
&lt;H2&gt;Use Cases&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Debugging Pod Connectivity Issues&lt;/STRONG&gt;: When services can’t communicate, Retina enables rapid, automated distributed packet capture and drop metrics, drastically reducing troubleshooting time. The Retina shell also brings specialized tools for deep manual investigations.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Continuous Monitoring of Network Health&lt;/STRONG&gt;: Operators can set up alerts and dashboards for DNS failures, API server latency, or packet drops, gaining ongoing visibility into cluster networking.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security Auditing and Compliance&lt;/STRONG&gt;: Flow logs (in Hubble mode) and metrics support security investigations and compliance reporting, enabling quick identification of unexpected connections or data transfers.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-Cluster / Multi-Cloud Visibility&lt;/STRONG&gt;: Retina standardizes network observability across clouds, supporting unified dashboards and processes for SRE teams.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H1&gt;Where Does It Run?&lt;/H1&gt;
&lt;P&gt;Retina is designed for broad compatibility across Kubernetes distributions, cloud providers, and operating systems. There are no Azure-specific dependencies - Retina runs anywhere Kubernetes does.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Operating Systems&lt;/STRONG&gt;: Both Linux and Windows nodes are supported.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG style="color: rgb(30, 30, 30);"&gt;Kubernetes Distributions&lt;/STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;: Retina is distribution-agnostic, deployable on managed services (AKS, EKS, GKE) or self-managed clusters.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;CNI / Network Stack&lt;/STRONG&gt;: Retina works with any CNI, focusing on kernel-level events rather than CNI-specific logs.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cloud Integration&lt;/STRONG&gt;: Retina exports metrics to Azure Monitor and Log Analytics, with pre-built Grafana dashboards for AKS. Integration with AWS CloudWatch or GCP Stackdriver is possible via Prometheus.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Observability Stacks&lt;/STRONG&gt;: Retina integrates with Prometheus &amp;amp; Grafana, Cilium Hubble (for flow logs and UI), and can be extended to other exporters.&lt;/LI&gt;
&lt;/UL&gt;
&lt;img&gt;High-level overview of where Retina runs.&lt;/img&gt;
&lt;H1 class="lia-clear-both"&gt;Design Overview&lt;/H1&gt;
&lt;P&gt;Retina’s architecture consists of two layers: a data collection layer in kernel space, and a processing layer in user space that converts low-level signals into Kubernetes-aware telemetry.&lt;/P&gt;
&lt;P&gt;When Retina is installed, each node in the cluster runs a Retina agent which collects raw network telemetry from the host kernel - backed by eBPF on Linux, and HNS/VFP on Windows. The agent &lt;SPAN style="color: rgb(30, 30, 30);"&gt;processes the raw network data and enriches it with Kubernetes metadata, which is then exported&lt;/SPAN&gt;&lt;SPAN style="color: rgb(30, 30, 30);"&gt; for consumption by monitoring tools such as Prometheus, Grafana, or Hubble UI.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Modularity and extensibility are central to the design philosophy. Retina's plugin model lets you enable only the telemetry you need, and add new sources by implementing a common plugin interface.&amp;nbsp;Built-in plugins include Drop Reason, DNS, Packet Forward, and more.&lt;/P&gt;
&lt;P&gt;Check out our &lt;A class="lia-external-url" href="https://retina.sh/docs/Introduction/architecture" target="_blank" rel="noopener"&gt;architecture docs&lt;/A&gt; for a deeper dive into Retina's design.&lt;/P&gt;
&lt;H1&gt;Get Started&lt;/H1&gt;
&lt;P&gt;Thanks to &lt;A class="lia-external-url" href="https://helm.sh/" target="_blank" rel="noopener"&gt;Helm charts&lt;/A&gt;, deploying Retina is streamlined across all environments and can be done with one configurable command. For complete documentation, visit our &lt;A href="https://retina.sh/docs/Installation/Setup" target="_blank" rel="noopener"&gt;installation docs&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To &lt;A class="lia-external-url" href="https://retina.sh/docs/Installation/Setup" target="_blank" rel="noopener"&gt;install Retina&lt;/A&gt; with the Standard control plane and Basic metrics mode:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --version $VERSION \
    --namespace kube-system \
    --set image.tag=$VERSION \
    --set operator.tag=$VERSION \
    --set logLevel=info \
    --set operator.enabled=true \
    --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\]"&lt;/LI-CODE&gt;
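&lt;P&gt;The VERSION lookup above depends on a successful GitHub API response. If the request fails or the response has no name field, VERSION ends up empty or set to the literal string null, and the Helm command then fails with a less obvious error. The following is a minimal sketch of a guard, assuming a hypothetical helper named version_resolved and an example version value; it is not part of the Retina docs.&lt;/P&gt;

```shell
# Hypothetical guard: reject an empty value or the literal "null" that
# "jq -r .name" prints when the name field is missing from the response.
version_resolved() {
  case "$1" in
    ''|null) return 1 ;;
    *) return 0 ;;
  esac
}

# Example value only; in practice pass the VERSION variable set above.
EXAMPLE_VERSION="v1.0.0"
if version_resolved "${EXAMPLE_VERSION}"; then
  echo "installing Retina chart version ${EXAMPLE_VERSION}"
else
  echo "could not resolve a Retina release version; aborting"
fi
```

&lt;P&gt;Calling the guard between the VERSION assignment and the helm command lets a script stop early instead of attempting an install with an invalid version.&lt;/P&gt;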
&lt;P&gt;Once Retina is running in your cluster, you can then configure &lt;A class="lia-external-url" href="https://retina.sh/docs/Installation/prometheus" target="_blank" rel="noopener"&gt;Prometheus&lt;/A&gt;&amp;nbsp;and &lt;A class="lia-external-url" href="https://retina.sh/docs/Installation/grafana" target="_blank" rel="noopener"&gt;Grafana&lt;/A&gt; to scrape and visualize your metrics.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://retina.sh/docs/Installation/CLI" target="_blank" rel="noopener"&gt;Install the Retina CLI&lt;/A&gt; with &lt;A class="lia-external-url" href="https://krew.sigs.k8s.io/" target="_blank"&gt;Krew&lt;/A&gt;:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;kubectl krew install retina&lt;/LI-CODE&gt;
&lt;H1&gt;Get Involved&lt;/H1&gt;
&lt;P&gt;Retina is open-source under the &lt;A class="lia-external-url" href="https://retina.sh/docs/Contributing/overview#licensing" target="_blank" rel="noopener"&gt;MIT License&lt;/A&gt; and welcomes community contributions. Since its announcement in early 2024, the project has gained significant traction, with contributors from multiple organizations helping to expand its capabilities.&lt;/P&gt;
&lt;P&gt;The project is hosted on &lt;A class="lia-external-url" href="https://github.com/microsoft/retina" target="_blank" rel="noopener"&gt;GitHub · microsoft/retina&lt;/A&gt; and documentation is available at &lt;A href="https://retina.sh" target="_blank" rel="noopener"&gt;retina.sh&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;If you would like to contribute to Retina you can follow our &lt;A class="lia-external-url" href="https://retina.sh/docs/Contributing/overview" target="_blank" rel="noopener"&gt;contributor guide&lt;/A&gt;.&lt;/P&gt;
&lt;H1&gt;What's Next?&lt;/H1&gt;
&lt;P&gt;Retina 1.1, of course!&lt;/P&gt;
&lt;P&gt;We are also discussing the future roadmap, and exploring the possibility of moving the project to community ownership. Stay tuned!&lt;/P&gt;
&lt;P&gt;In the meantime, we welcome you to &lt;A class="lia-external-url" href="https://github.com/microsoft/retina/issues" target="_blank" rel="noopener"&gt;raise an issue&lt;/A&gt; if you find any bugs, or start a &lt;A class="lia-external-url" href="https://github.com/microsoft/retina/discussions" target="_blank" rel="noopener"&gt;discussion&lt;/A&gt; if you have any questions or suggestions.&lt;/P&gt;
&lt;P&gt;You can also &lt;A class="lia-external-url" href="mailto:retina@microsoft.com" target="_blank" rel="noopener"&gt;reach out to the Retina team via email&lt;/A&gt;. We would love to hear from you!&lt;/P&gt;
&lt;H1&gt;References&lt;/H1&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;EM&gt;&lt;A href="https://retina.sh/" target="_blank" rel="noopener"&gt;Retina&lt;/A&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;EM&gt;&lt;A href="https://www.srodi.com/posts/kubernetes-ebpf-observability-retina-deepdive/" target="_blank" rel="noopener"&gt;Deep Dive into Retina Open-Source Kubernetes Network Observability&lt;/A&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;EM&gt;&lt;A href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/troubleshooting-network-issues-with-retina/4446071" target="_blank" rel="noopener"&gt;Troubleshooting Network Issues with Retina&lt;/A&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-ccp-props="{}"&gt;&lt;EM&gt;&lt;A href="https://techcommunity.microsoft.com/blog/linuxandopensourceblog/ebpf-powered-observability-beyond-azure-a-multi-cloud-perspective-with-retina/4403361" target="_blank" rel="noopener"&gt;Retina: Bridging Kubernetes Observability and eBPF Across the Clouds&lt;/A&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 03 Feb 2026 14:36:06 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/retina-1-0-is-now-available/ba-p/4489003</guid>
      <dc:creator>kamilp</dc:creator>
      <dc:date>2026-02-03T14:36:06Z</dc:date>
    </item>
    <item>
      <title>Scaling DNS on AKS with Cilium: NodeLocal DNSCache, LRP, and FQDN Policies</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/scaling-dns-on-aks-with-cilium-nodelocal-dnscache-lrp-and-fqdn/ba-p/4486323</link>
      <description>&lt;H2&gt;Why Adopt NodeLocal DNSCache?&lt;/H2&gt;
&lt;P&gt;The primary drivers for adoption are usually:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Eliminating Conntrack Pressure&lt;/STRONG&gt;: In high-QPS UDP DNS scenarios, conntrack contention and UDP tracking can cause intermittent DNS response loss and retries. Depending on the resolver's retry and timeout settings, this can appear as multi-second lookup delays, sometimes with much longer tails.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Reducing Latency&lt;/STRONG&gt;: By placing a cache on every node, you remove the network hop to the CoreDNS service. Responses are practically instantaneous for cached records.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Offloading CoreDNS&lt;/STRONG&gt;: A DaemonSet architecture effectively shards the DNS query load across the entire cluster, preventing the central CoreDNS deployment from becoming a single point of congestion during bursty scaling events.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;Who needs this?&lt;/H4&gt;
&lt;P&gt;You should prioritize this architecture if you run:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Large-scale clusters&lt;/STRONG&gt; (hundreds of nodes or thousands of pods), where CoreDNS scaling becomes difficult to manage.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;High-churn endpoints&lt;/STRONG&gt;, such as spot instances or frequent auto-scaling jobs that trigger massive waves of DNS queries.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Real-time applications&lt;/STRONG&gt; where multi-second (and occasionally longer) DNS lookup delays are unacceptable.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;The Challenge with Cilium&lt;/H2&gt;
&lt;P&gt;Deploying NodeLocal DNSCache on a cluster managed by &lt;STRONG&gt;Cilium&lt;/STRONG&gt; (CNI) requires a specific approach. Standard NodeLocal DNSCache relies on node-level &lt;EM&gt;interface&lt;/EM&gt;/&lt;EM&gt;iptables &lt;/EM&gt;setup. In Cilium environments, you can instead implement the interception via &lt;STRONG&gt;Cilium Local Redirect Policy (LRP)&lt;/STRONG&gt;, which redirects traffic destined to the &lt;EM&gt;kube-dns&lt;/EM&gt; ClusterIP service to a node-local backend pod.&lt;/P&gt;
&lt;P&gt;This post details a production-ready deployment strategy aligned with Cilium’s Local Redirect Policy model. It covers necessary configuration tweaks to avoid conflicts and explains how to maintain security filtering.&lt;/P&gt;
&lt;H2&gt;Architecture Overview&lt;/H2&gt;
&lt;P&gt;In a standard Kubernetes deployment, NodeLocal DNSCache creates a dummy network interface and uses extensive iptables rules to hijack traffic destined for the Cluster DNS IP.&lt;/P&gt;
&lt;P&gt;When using Cilium, we can achieve this more elegantly and efficiently using &lt;STRONG&gt;Local Redirect Policies&lt;/STRONG&gt;.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;DaemonSet&lt;/STRONG&gt;: Runs &lt;EM&gt;node-local-dns&lt;/EM&gt; on every node.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Configuration&lt;/STRONG&gt;: Configured to &lt;U&gt;skip&lt;/U&gt; interface creation and iptables manipulation.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Redirection&lt;/STRONG&gt;: Cilium LRP intercepts traffic to the &lt;EM&gt;kube-dns&lt;/EM&gt; Service IP and redirects it to the local pod on the same node.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;1. The NodeLocal DNSCache DaemonSet&lt;/H3&gt;
&lt;P&gt;The critical difference in this manifest is the arguments passed to the&amp;nbsp;&lt;EM&gt;node-local-dns&lt;/EM&gt; binary. We must explicitly disable its networking setup functions to let Cilium handle the traffic.&lt;/P&gt;
&lt;P&gt;The NodeLocal DNSCache deployment also requires the &lt;EM&gt;node-local-dns ConfigMap&lt;/EM&gt; and the &lt;EM&gt;kube-dns-upstream Service&lt;/EM&gt; (plus &lt;EM&gt;RBAC/ServiceAccount&lt;/EM&gt;). For brevity, the snippet below shows only the DaemonSet arguments that differ in the Cilium/LRP approach. The &lt;EM&gt;node-cache&lt;/EM&gt; reads the template &lt;EM&gt;Corefile &lt;/EM&gt;(&lt;EM&gt;/etc/coredns/Corefile.base&lt;/EM&gt;) and generates the active &lt;EM&gt;Corefile &lt;/EM&gt;(&lt;EM&gt;/etc/Corefile&lt;/EM&gt;). The &lt;EM&gt;-conf&lt;/EM&gt; flag points CoreDNS at the active &lt;EM&gt;Corefile &lt;/EM&gt;it should load.&lt;/P&gt;
&lt;P&gt;The node-cache binary accepts &lt;EM&gt;-localip&lt;/EM&gt; as an IP list; &lt;EM&gt;0.0.0.0&lt;/EM&gt; is a valid value and makes it listen on all interfaces, appropriate for the LRP-based redirection model.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    k8s-app: node-local-dns
spec:
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
      annotations:
        # Optional: policy.cilium.io/no-track-port can be used to bypass conntrack for DNS.
        # Validate the impact on your Cilium version and your observability/troubleshooting needs.
        policy.cilium.io/no-track-port: "53"
    spec:
      # IMPORTANT for the "LRP + listen broadly" approach:
      # keep hostNetwork off so you don't hijack node-wide :53
      hostNetwork: false
      # Don't use cluster DNS
      dnsPolicy: Default
      containers:
      - name: node-cache
        image: registry.k8s.io/dns/k8s-dns-node-cache:1.15.16
        args: 
        - "-localip"
        # Use a bind-all approach. Ensure server blocks bind broadly in your Corefile.
        - "0.0.0.0" 
        - "-conf"
        - "/etc/Corefile"
        - "-upstreamsvc"
        - "kube-dns-upstream"
        # CRITICAL: Disable internal setup
        - "-skipteardown=true"
        - "-setupinterface=false"
        - "-setupiptables=false"
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # Ensure your Corefile includes health :8080 so the liveness probe works
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          timeoutSeconds: 5
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
        - name: kube-dns-config
          mountPath: /etc/kube-dns
      volumes:
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      - name: config-volume
        configMap:
          name: node-local-dns
          items:
            - key: Corefile
              path: Corefile.base&lt;/LI-CODE&gt;
&lt;H3&gt;2. The Cilium Local Redirect Policy (LRP)&lt;/H3&gt;
&lt;P&gt;Instead of iptables, we define a custom resource that tells Cilium: "When you see traffic for &lt;EM&gt;kube-dns&lt;/EM&gt;, send it to the &lt;EM&gt;node-local-dns&lt;/EM&gt; pod on this same node."&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
  name: "nodelocaldns"
  namespace: kube-system
spec:
  redirectFrontend:
    # ServiceMatcher mode is for ClusterIP services
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    # The backend pods selected by localEndpointSelector must be in the same namespace as the LRP
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP&lt;/LI-CODE&gt;
&lt;P&gt;This is an&amp;nbsp;&lt;STRONG&gt;LRP-based NodeLocal DNSCache deployment&lt;/STRONG&gt;: we disable node-cache’s &lt;EM&gt;iptables&lt;/EM&gt;/&lt;EM&gt;interface &lt;/EM&gt;setup and let &lt;STRONG&gt;Cilium LRP&lt;/STRONG&gt; handle local redirection. This differs from the upstream NodeLocal DNSCache manifest, which uses &lt;EM&gt;hostNetwork &lt;/EM&gt;+ &lt;EM&gt;dummy interface&lt;/EM&gt; + &lt;EM&gt;iptables&lt;/EM&gt;.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;LRP must be enabled in Cilium (e.g., &lt;EM&gt;localRedirectPolicies.enabled=true&lt;/EM&gt;) before applying the CRD. &lt;A class="lia-external-url" href="https://docs.cilium.io/en/stable/network/kubernetes/local-redirect-policy/#prerequisites" target="_blank" rel="noopener"&gt;Official Cilium LRP doc&lt;/A&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
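&lt;P&gt;As a rough sketch, assuming a self-managed Cilium installed with Helm, the flag quoted above corresponds to the values fragment below. The key name is taken from that flag; older Cilium releases may use a differently named key, and managed offerings may not expose Helm values at all, so check the linked documentation for your version.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# values.yaml fragment (assumes a Helm-managed Cilium install)
localRedirectPolicies:
  enabled: true&lt;/LI-CODE&gt;
&lt;P&gt;After changing the setting, the Cilium agent and operator pods typically need a restart before the policy type takes effect.&lt;/P&gt;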
&lt;H2 data-start="80" data-end="122"&gt;DNS-Based FQDN Policy Enforcement Flow&lt;/H2&gt;
&lt;P data-start="124" data-end="702"&gt;The diagram below illustrates how Cilium enforces FQDN-based egress policies using DNS observation and datapath programming. During the DNS resolution phase, queries are redirected to NodeLocal DNS (or CoreDNS), where responses are observed and used to populate Cilium’s FQDN-to-IP cache. Cilium then programs these mappings into eBPF maps in the datapath. In the connection phase, when the client initiates an HTTPS connection to the resolved IP, the datapath checks the IP against the learned FQDN map and applies the policy decision before allowing or denying the connection.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;img&gt;End-to-end flow of DNS resolution, FQDN learning, and eBPF-based policy enforcement in Cilium.&lt;/img&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;The Network Policy "Gotcha"&lt;/H2&gt;
&lt;P&gt;If you use &lt;STRONG&gt;CiliumNetworkPolicy &lt;/STRONG&gt;to restrict egress traffic, specifically for &lt;STRONG&gt;FQDN filtering, &lt;/STRONG&gt;you typically allow access to CoreDNS like this:&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s:k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY&lt;/LI-CODE&gt;
&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;This will break with local redirection.&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;Why? Because LRP redirects the DNS request to the&amp;nbsp;&lt;STRONG&gt;node-local-dns backend endpoint&lt;/STRONG&gt;; strict egress policies must therefore allow both &lt;EM&gt;kube-dns&lt;/EM&gt; (upstream) &lt;STRONG&gt;and&lt;/STRONG&gt; &lt;EM&gt;node-local-dns&lt;/EM&gt; (the redirected destination).&lt;/P&gt;
&lt;H3&gt;The Repro Setup&lt;/H3&gt;
&lt;P&gt;To demonstrate this failure, the cluster is configured with:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;NodeLocal DNSCache&lt;/STRONG&gt;: Deployed as a DaemonSet (&lt;EM&gt;node-local-dns&lt;/EM&gt;) to cache DNS requests locally on every node.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Local Redirect Policy (LRP)&lt;/STRONG&gt;: An active LRP intercepts traffic destined for the &lt;EM&gt;kube-dns&lt;/EM&gt; Service IP and redirects it to the local &lt;EM&gt;node-local-dns&lt;/EM&gt; pod.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Incomplete Network Policy&lt;/STRONG&gt;: A strict &lt;EM&gt;CiliumNetworkPolicy &lt;/EM&gt;(CNP) is enforced on the client pod. While it explicitly allows egress to &lt;EM&gt;kube-dns&lt;/EM&gt;, it &lt;STRONG&gt;misses &lt;/STRONG&gt;the corresponding rule for &lt;EM&gt;node-local-dns&lt;/EM&gt;.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;Reveal the issue using Hubble:&lt;/H4&gt;
&lt;P&gt;In this scenario, the client pod &lt;EM&gt;dns-client&lt;/EM&gt; is attempting to resolve the external domain &lt;EM&gt;github.com&lt;/EM&gt;.&lt;/P&gt;
&lt;P&gt;When inspecting the traffic flows, you will see &lt;EM&gt;EGRESS DENIED&lt;/EM&gt; verdicts. Crucially, notice the destination pod in the logs below:&lt;EM&gt; kube-system/node-local-dns&lt;/EM&gt;, not &lt;EM&gt;kube-dns&lt;/EM&gt;.&lt;/P&gt;
&lt;P&gt;Although the application originally sent the packet to the Cluster IP of CoreDNS, Cilium's Local Redirect Policy modified the destination to the local node cache. Since strictly defined Network Policies assume traffic is going to the &lt;EM&gt;kube-dns&lt;/EM&gt; identity, this redirected traffic falls outside the allowed rules and is dropped by the default deny stance.&lt;/P&gt;
&lt;img /&gt;
&lt;H3&gt;The Fix: You must allow egress to &lt;U&gt;both&lt;/U&gt; labels.&lt;/H3&gt;
&lt;LI-CODE lang="yaml"&gt;  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s:k8s-app: kube-dns
    # Add this selector for the local cache
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s:k8s-app: node-local-dns 
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY&lt;/LI-CODE&gt;
&lt;P&gt;Without this addition, pods protected by strict egress policies will time out when resolving DNS, even though the cache is running.&lt;/P&gt;
&lt;H4&gt;Use Hubble to observe the network flows:&lt;/H4&gt;
&lt;P&gt;After adding &lt;EM&gt;matchLabels: k8s:k8s-app: node-local-dns&lt;/EM&gt;, the traffic is now allowed. Hubble confirms a policy verdict of &lt;EM&gt;EGRESS ALLOWED&lt;/EM&gt; for UDP traffic on port 53. Because DNS resolution now succeeds, the response populates the Cilium FQDN cache, subsequently allowing the TCP traffic to &lt;EM&gt;github.com&lt;/EM&gt; on port 443 as intended.&lt;/P&gt;
&lt;img /&gt;
&lt;H3&gt;Real-World Example: Restricting Egress with FQDN Policies&lt;/H3&gt;
&lt;P&gt;Here is a complete &lt;EM&gt;CiliumNetworkPolicy&lt;/EM&gt; that locks down a workload so it can access only &lt;EM&gt;api.example.com&lt;/EM&gt;. Note how the DNS rule explicitly allows traffic to both &lt;EM&gt;kube-dns&lt;/EM&gt; (for upstream) and &lt;EM&gt;node-local-dns&lt;/EM&gt; (for the local cache).&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: secure-workload-policy
spec:
  endpointSelector:
    matchLabels:
      app: critical-workload
  egress:
  # 1. Allow DNS Resolution (REQUIRED for FQDN policies)
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s:k8s-app: kube-dns
    # Allow traffic to the local cache redirection target
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s:k8s-app: node-local-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"

  # 2. Allow specific FQDN traffic (populated via DNS lookups)
  - toFQDNs:
    - matchName: "api.example.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP&lt;/LI-CODE&gt;
&lt;H2&gt;Configuration &amp;amp; Upstream Loops&lt;/H2&gt;
&lt;P&gt;When configuring the &lt;EM&gt;ConfigMap &lt;/EM&gt;for &lt;EM&gt;node-local-dns&lt;/EM&gt;, use the standard placeholders provided by the image. The binary replaces them at runtime:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;__PILLAR__CLUSTER__DNS__: The Upstream Service IP (&lt;EM&gt;kube-dns-upstream&lt;/EM&gt;).&lt;/LI&gt;
&lt;LI&gt;__PILLAR__UPSTREAM__SERVERS__: The system resolvers (usually &lt;EM&gt;/etc/resolv.conf&lt;/EM&gt;).&lt;/LI&gt;
&lt;/UL&gt;
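&lt;P&gt;A trimmed sketch of such a ConfigMap, modeled on the upstream NodeLocal DNSCache manifest and adapted to the bind-all approach used in this post, might look like the following. The zone list, cache sizes, and Prometheus port are assumptions carried over from upstream defaults, and the upstream manifest also defines reverse-lookup zones (&lt;EM&gt;in-addr.arpa&lt;/EM&gt;, &lt;EM&gt;ip6.arpa&lt;/EM&gt;) that are omitted here for brevity.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  # Mounted as /etc/coredns/Corefile.base by the DaemonSet
  Corefile: |
    cluster.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        # Bind broadly; Cilium LRP delivers traffic to the pod IP
        bind 0.0.0.0
        # Cache misses for cluster names go to CoreDNS via kube-dns-upstream
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
        prometheus :9253
        # Matches the liveness probe on port 8080
        health :8080
    }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 0.0.0.0
        # External names go to the node's resolvers
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }&lt;/LI-CODE&gt;
&lt;P&gt;The &lt;EM&gt;loop&lt;/EM&gt; plugin is kept in both server blocks so that an accidental forwarding loop is detected at startup rather than surfacing as silent query amplification.&lt;/P&gt;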
&lt;P&gt;Ensure &lt;EM&gt;kube-dns-upstream&lt;/EM&gt; exists as a Service selecting the CoreDNS pods so cache misses are forwarded to the actual CoreDNS backends.&lt;/P&gt;
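&lt;P&gt;A minimal sketch of that Service, again modeled on the upstream NodeLocal DNSCache manifest, is shown below. It assumes the CoreDNS pods carry the &lt;EM&gt;k8s-app: kube-dns&lt;/EM&gt; label, as they do in the policy examples earlier in this post; adjust the selector if your cluster labels them differently.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  # Selects the real CoreDNS pods, not the node-local cache
  selector:
    k8s-app: kube-dns
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53&lt;/LI-CODE&gt;
&lt;P&gt;Because this Service has its own ClusterIP, traffic sent to it is not matched by the LRP on &lt;EM&gt;kube-dns&lt;/EM&gt;, which is what keeps cache misses from looping back into the local cache. Creating the Service before the DaemonSet is a safe ordering, so the upstream address is available when the cache pods start.&lt;/P&gt;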
&lt;H2&gt;Alternative: AKS LocalDNS&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;LocalDNS &lt;/STRONG&gt;is an Azure Kubernetes Service (AKS)-managed node-local DNS proxy/cache.&lt;/P&gt;
&lt;H4&gt;Pros:&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;Managed lifecycle at the node pool level.&lt;/LI&gt;
&lt;LI&gt;Support for custom configuration via &lt;EM&gt;localdnsconfig.json&lt;/EM&gt; (e.g., custom server blocks, cache tuning).&lt;/LI&gt;
&lt;LI&gt;No manual DaemonSet management required.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4&gt;Cons &amp;amp; Limitations:&lt;/H4&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Incompatibility with FQDN Policies&lt;/STRONG&gt;: As noted in the &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/localdns-custom" target="_blank" rel="noopener"&gt;official documentation&lt;/A&gt;, LocalDNS isn’t compatible with applied FQDN filter policies in ACNS/Cilium; if you rely on FQDN enforcement, prefer a DNS path that preserves FQDN learning/enforcement.&lt;/LI&gt;
&lt;LI&gt;Updating configuration requires reimaging the node pool.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For environments heavily relying on strict Cilium Network Policies and FQDN filtering, the manual deployment method described above (using LRP) can be more reliable and transparent.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;AKS recommends not enabling both upstream NodeLocal DNSCache and LocalDNS in the same node pool, as DNS traffic is routed through LocalDNS and results may be unexpected.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;References&lt;/H2&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/" target="_blank" rel="noopener"&gt;Kubernetes Documentation: NodeLocal DNSCache&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://docs.cilium.io/en/stable/network/kubernetes/local-redirect-policy/" target="_blank" rel="noopener"&gt;Cilium Documentation: Local Redirect Policy&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/aks/localdns-custom" target="_blank" rel="noopener"&gt;AKS Documentation: Configure LocalDNS&lt;/A&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Tue, 10 Mar 2026 09:23:51 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/scaling-dns-on-aks-with-cilium-nodelocal-dnscache-lrp-and-fqdn/ba-p/4486323</guid>
      <dc:creator>Simone_Rodigari</dc:creator>
      <dc:date>2026-03-10T09:23:51Z</dc:date>
    </item>
    <item>
      <title>Event-Driven to Change-Driven: Low-cost dependency inversion</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/event-driven-to-change-driven-low-cost-dependency-inversion/ba-p/4478948</link>
      <description>&lt;P&gt;Event-driven architectures tout scalability, loose coupling, and eventual consistency. The architectural patterns are sound, the theory is compelling, and the blog posts make it look straightforward.&lt;/P&gt;
&lt;P&gt;Then you implement it.&lt;/P&gt;
&lt;P&gt;Suddenly you're maintaining separate event stores, implementing transactional outboxes, debugging projection rebuilds, versioning events across a dozen microservices, and writing mountains of boilerplate to handle what should be simple queries.&lt;/P&gt;
&lt;P&gt;Your domain events that were supposed to capture rich business meaning have devolved into glorified database change notifications. Downstream services diff field values to extract intent from "&lt;EM&gt;OrderUpdated&lt;/EM&gt;" events because developers just don't get what constitutes a proper domain event.&lt;/P&gt;
&lt;P&gt;The complexity tax is real. Don't get me wrong: it's very elegant, but for many systems it's unjustified.&lt;/P&gt;
&lt;P&gt;Drasi offers an alternative: &lt;EM&gt;change-driven architecture&lt;/EM&gt; that delivers reactive, real-time capabilities across multiple data sources without requiring you to rewrite your application or overcomplicate your architecture.&lt;/P&gt;
&lt;H3&gt;What do we mean by “Event-driven” architecture&lt;/H3&gt;
&lt;P&gt;As &lt;A href="https://www.youtube.com/watch?v=STKCRSUsyP0" data-test-app-aware-link="" target="_blank"&gt;Martin Fowler notes&lt;/A&gt;, event-driven architecture isn't a single pattern; it's at least four distinct patterns that are often confused, each with its own benefits and traps.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Event Notification&lt;/STRONG&gt; is the simplest form. Here, events act as signals that something has happened, but carry minimal data, often just an identifier. The recipient must query the source system for more details if needed. For example, a service emits an &lt;EM&gt;OrderPlaced &lt;/EM&gt;event with just the order ID. Downstream consumers must query the order service to retrieve full order details.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Event Carried State Transfer&lt;/STRONG&gt; broadcasts full state changes through events. When an order ships, you publish an &lt;EM&gt;OrderShipped&lt;/EM&gt; event containing all the order details. Downstream services maintain their own materialized views or projections by consuming these events.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Event Sourcing&lt;/STRONG&gt; goes further: events become your source of truth. Instead of storing current state, you store the sequence of events that led to that state. Your order isn't a row in a database; it's the sum of &lt;EM&gt;OrderPlaced&lt;/EM&gt;, &lt;EM&gt;ItemAdded&lt;/EM&gt;, &lt;EM&gt;PaymentProcessed&lt;/EM&gt;, and &lt;EM&gt;OrderShipped &lt;/EM&gt;events.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;CQRS (Command Query Responsibility Segregation)&lt;/STRONG&gt; separates write operations (commands) from read operations (queries). While not inherently event-driven, CQRS is often paired with event sourcing or event-carried state transfer to optimize for scalability and maintainability. Originally derived from Bertrand Meyer's Command-Query Separation principle and popularized by Greg Young, CQRS addresses a specific architectural challenge: the tension between optimizing for writes versus optimizing for reads.&lt;/P&gt;
&lt;P&gt;The pattern promises several benefits:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Optimized data models&lt;/STRONG&gt;: Your write model can focus on transactional consistency while read models optimize for query performance&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Scalability&lt;/STRONG&gt;: Read and write sides can scale independently&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Temporal queries&lt;/STRONG&gt;: With event sourcing, you get time travel for free—reconstruct state at any point in history&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Audit trail&lt;/STRONG&gt;: Every change is captured as an immutable event&lt;/P&gt;
&lt;P&gt;While CQRS isn't inherently tied to Domain-Driven Design (DDD), the pattern complements DDD well. In DDD contexts, CQRS enables different bounded contexts to maintain their own read models tailored to their specific ubiquitous language, while the write model protects domain invariants. This is why you'll often see them discussed together, though each can be applied independently.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;The core motivation for these patterns is often to invert the dependency between systems, so that your downstream services do not need to know about your upstream services.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;H3&gt;The Developer's Struggle: When Domain Events Become Database Events&lt;/H3&gt;
&lt;P&gt;Chris Kiehl puts it bluntly in his article "&lt;A href="https://chriskiehl.com/article/event-sourcing-is-hard" data-test-app-aware-link="" target="_blank"&gt;Don't Let the Internet Dupe You, Event Sourcing is Hard&lt;/A&gt;": &lt;EM&gt;"The sheer volume of plumbing code involved is staggering—instead of a friendly N-tier setup, you now have classes for commands, command handlers, command validators, events, aggregates, and then projections, model classes, access classes, custom materialization code, and so on."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;But the real tragedy isn't the boilerplate; it's what happens to those carefully crafted domain events. When developers are disconnected from the real-world business and struggle to understand the nuances of domain events, a dangerous pattern emerges. Instead of modeling meaningful business processes, teams default to what they know: CRUD.&lt;/P&gt;
&lt;P&gt;Your event stream starts looking like this:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;OrderCreated&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;OrderUpdated&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;OrderUpdated&lt;/EM&gt;&lt;/STRONG&gt; (again)&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;OrderUpdated&lt;/EM&gt;&lt;/STRONG&gt; (wait, what changed?)&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;OrderDeleted&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;As one developer &lt;A href="https://www.linkedin.com/pulse/anti-patterns-event-driven-architecture-arpit-jain" data-test-app-aware-link="" target="_blank"&gt;noted on LinkedIn&lt;/A&gt;, these "CRUD events" are really just &lt;EM&gt;"leaky events that lack clarity and should not be used to replicate databases as this leaks implementation details and couples services to a shared data model."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Dennis Doomen, reflecting on &lt;A href="https://www.dennisdoomen.com/2017/11/the-ugly-of-event-sourcingreal-world.html" data-test-app-aware-link="" target="_blank"&gt;real-world production issues&lt;/A&gt;, observes: &lt;EM&gt;"It's only once you have a living, breathing machine, users which depend on you, consumers which you can't break, and all the other real-world complexities that plague software projects that the hard problems in event sourcing will rear their heads."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The result? Your elegant event-driven architecture devolves into an expensive, brittle form of self-maintained Change Data Capture (CDC). You're not modeling business processes; you're just broadcasting database mutations with extra steps.&lt;/P&gt;
&lt;H3&gt;The Anti-Corruption Layer: Your Defense Against the Outside World&lt;/H3&gt;
&lt;P&gt;In DDD, an Anti-Corruption Layer (ACL) protects your bounded context from external models that would corrupt your domain. Think of it as a translator that speaks both languages: the messy external model and your clean internal model.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;The ACL ensures that changes to the external system don't ripple through your domain. If the legacy system changes its schema, you update the translator, not your entire domain model.&lt;/P&gt;
&lt;H3&gt;When Event Taxonomies Become Your ACL (And Why They Fail)&lt;/H3&gt;
&lt;P&gt;In most event-driven architectures, your event taxonomy is supposed to serve as the shared contract between services. Each service publishes events using its own ubiquitous language, and consumers translate these into their own models; this translation is the ACL.&lt;/P&gt;
&lt;P&gt;The theory looks beautiful:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But reality? Most teams end up with this:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Instead of &lt;STRONG&gt;&lt;EM&gt;OrderPaid&lt;/EM&gt;&lt;/STRONG&gt; events that carry business meaning, we get &lt;STRONG&gt;&lt;EM&gt;OrderUpdated&lt;/EM&gt;&lt;/STRONG&gt; events that force every consumer to reconstruct intent by diffing fields. When you change your database schema, say splitting the orders table or switching from SQL to NoSQL, every downstream service breaks because they're all coupled to your internal data model.&lt;/P&gt;
&lt;P&gt;You haven't built an anti-corruption layer. You've built a corruption pipeline that efficiently distributes your internal implementation details across the entire system, forcing you to deploy all services in lock step and eroding the decoupling benefits you were supposed to get.&lt;/P&gt;
&lt;H2&gt;Enter Drasi: Continuous Queries&lt;/H2&gt;
&lt;P&gt;This is where Drasi changes the game. Instead of publishing events and hoping downstream services can make sense of them, Drasi tails the changelog of the data source itself and derives meaning through continuous queries.&lt;/P&gt;
&lt;P&gt;A continuous query in Drasi isn't just a query that runs repeatedly; it's a living, breathing projection that reacts to changes in real time. Here's the key insight: instead of imperative code that processes events ("when this happens, do that"), you write declarative queries that describe the state you care about ("I want to know about orders that are ready and have drivers waiting").&lt;/P&gt;
&lt;P&gt;Let's break down what makes this powerful:&lt;/P&gt;
&lt;H3&gt;Declarative vs. Imperative&lt;/H3&gt;
&lt;P&gt;Traditional event processing:&lt;/P&gt;
&lt;img /&gt;
&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Drasi continuous query:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Semantic Mapping from Low-Level Changes&lt;/H3&gt;
&lt;P&gt;Drasi excels at transforming database-level changes into business-meaningful events. You're not reacting to "row updated in orders table"; you're reacting to "order ready for curbside pickup."&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;This enables the same core benefits of dependency inversion we get from event-driven architectures but at a fraction of the effort.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;
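&lt;P&gt;To make this concrete, a continuous query for "order ready for curbside pickup" might look roughly like the sketch below, loosely modeled on Drasi's curbside pickup scenario. The source IDs (&lt;EM&gt;retail-ops&lt;/EM&gt;, &lt;EM&gt;physical-ops&lt;/EM&gt;), the query name, the table labels, and the column names (&lt;EM&gt;status&lt;/EM&gt;, &lt;EM&gt;plate&lt;/EM&gt;, &lt;EM&gt;location&lt;/EM&gt;) are assumptions for illustration, not a verified schema.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Sketch only: source IDs, labels, and property names are assumptions.
apiVersion: v1
kind: ContinuousQuery
name: ready-for-pickup
spec:
  mode: query
  sources:
    subscriptions:
      - id: retail-ops        # hypothetical source holding the orders table
        nodes:
          - sourceLabel: orders
      - id: physical-ops      # hypothetical source holding the vehicles table
        nodes:
          - sourceLabel: vehicles
    joins:
      # Synthetic relation linking an order to a vehicle by license plate
      - id: PICKUP_BY
        keys:
          - label: orders
            property: plate
          - label: vehicles
            property: plate
  query: |
    MATCH (o:orders)-[:PICKUP_BY]-&amp;gt;(v:vehicles)
    WHERE o.status = 'ready' AND v.location = 'Curbside'
    RETURN o.id AS orderId, o.plate AS plate, v.location AS location&lt;/LI-CODE&gt;
&lt;P&gt;The result set contains a row only while both conditions hold, so consumers see "ready for pickup" appear and disappear rather than a stream of row-level updates they have to interpret.&lt;/P&gt;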
&lt;H3&gt;Advanced Temporal Features&lt;/H3&gt;
&lt;P&gt;Remember those developers struggling with "&lt;EM&gt;OrderUpdated&lt;/EM&gt;" events, trying to figure out if something just happened or has been true for a while? Drasi handles this elegantly:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This query only fires when a driver has been waiting for more than 10 minutes. No timestamp tracking, no state machines, no complex event correlation logic. Imagine trying to implement this manually in a downstream event consumer. 😱&lt;/P&gt;
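&lt;P&gt;As a hedged sketch, a "driver has waited more than 10 minutes" condition can be expressed with Drasi's &lt;EM&gt;drasi.trueFor&lt;/EM&gt; temporal function. The source IDs, labels, and property names below are hypothetical, and the exact function semantics should be checked against the Drasi documentation.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Sketch only: source IDs, labels, and property names are assumptions.
apiVersion: v1
kind: ContinuousQuery
name: wait-orders
spec:
  mode: query
  sources:
    subscriptions:
      - id: retail-ops
        nodes:
          - sourceLabel: orders
      - id: physical-ops
        nodes:
          - sourceLabel: vehicles
    joins:
      - id: PICKUP_BY
        keys:
          - label: orders
            property: plate
          - label: vehicles
            property: plate
  query: |
    MATCH (o:orders)-[:PICKUP_BY]-&amp;gt;(v:vehicles)
    WHERE o.status &amp;lt;&amp;gt; 'ready'
      AND drasi.trueFor(v.location = 'Curbside', duration({ minutes: 10 }))
    RETURN o.id AS orderId, o.plate AS plate&lt;/LI-CODE&gt;
&lt;P&gt;The temporal condition lives in the query itself, so no consumer needs to store arrival timestamps or run its own timers.&lt;/P&gt;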
&lt;H3&gt;Cross-Source Aggregation Without Code&lt;/H3&gt;
&lt;P&gt;With Drasi, you can have live projections across PostgreSQL, MySQL, SQL Server, and Cosmos DB as if they were a single graph:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;No custom aggregation service. No event stitching logic. No custom downstream datastore to track the sum or keep a materialized projection. Just a query.&lt;/P&gt;
&lt;H3&gt;Continuous Queries as Your Shared Contract&lt;/H3&gt;
&lt;P&gt;Drasi's continuous queries, combined with pre-processing middleware, can form the shared contract that your anti-corruption layer can depend on.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The continuous query becomes your contract. Downstream systems don't know or care whether orders come from PostgreSQL, MongoDB, or a CSV file. They don't know if you normalized your database, denormalized it, or moved to event sourcing. They just consume the query results. Clean, semantic, and stable.&lt;/P&gt;
&lt;H3&gt;Reactions as your Declarative Consumers&lt;/H3&gt;
&lt;P&gt;Drasi does not simply output a stream of raw change diffs. Instead, it has a library of interchangeable &lt;A href="https://drasi.io/concepts/reactions/" data-test-app-aware-link="" target="_blank"&gt;Reactions&lt;/A&gt; that can act on the output of continuous queries. These are declared in YAML and can do anything from hosting a WebSocket endpoint that provides a live projection to your UI, to calling an HTTP endpoint or publishing a message on a queue.&lt;/P&gt;
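&lt;P&gt;A Reaction declaration might look roughly like the sketch below, which subscribes a SignalR reaction to two continuous queries. The reaction name and the query names (&lt;EM&gt;delivery&lt;/EM&gt;, &lt;EM&gt;wait-orders&lt;/EM&gt;) are hypothetical, and the field layout should be checked against the Reactions documentation linked above.&lt;/P&gt;
&lt;LI-CODE lang="yaml"&gt;# Sketch only: names are assumptions.
apiVersion: v1
kind: Reaction
name: dashboard-signalr
spec:
  kind: SignalR
  # Continuous queries whose result changes this reaction receives
  queries:
    delivery:
    wait-orders:&lt;/LI-CODE&gt;
&lt;P&gt;Swapping the reaction kind changes how results are delivered without touching the queries themselves, which is what makes the consumers declarative.&lt;/P&gt;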
&lt;H2&gt;Example: The Curbside Pickup System&lt;/H2&gt;
&lt;P&gt;Let's see how this works in Drasi's &lt;A href="https://drasi.io/tutorials/curbside-pickup/" data-test-app-aware-link="" target="_blank"&gt;curbside pickup tutorial&lt;/A&gt;. This example has two independent databases and serves as a great illustration of a real-time projection built from multiple upstream services.&lt;/P&gt;
&lt;H3&gt;The Business Problem&lt;/H3&gt;
&lt;P&gt;A retail system needs to:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Match ready orders with drivers who've arrived at pickup zones&lt;/LI&gt;
&lt;LI&gt;Alert staff when drivers wait more than 10 minutes without their order being ready&lt;/LI&gt;
&lt;LI&gt;Coordinate data from two different systems (retail ops in PostgreSQL, physical ops in MySQL)&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Traditional Event-Driven Approach&lt;/H3&gt;
&lt;P&gt;In this architecture, you'd need something like:&lt;/P&gt;
&lt;img /&gt;&lt;img /&gt;&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;That's just the happy path. We haven't handled:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Event ordering issues&lt;/LI&gt;
&lt;LI&gt;Partial failures&lt;/LI&gt;
&lt;LI&gt;Cache invalidation&lt;/LI&gt;
&lt;LI&gt;Service restarts and replay&lt;/LI&gt;
&lt;LI&gt;Duplicate events&lt;/LI&gt;
&lt;LI&gt;Transactional outboxing&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;The Drasi Approach&lt;/H3&gt;
&lt;P&gt;With Drasi, the entire aggregation service above becomes two queries:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Delivery Dashboard Query:&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Wait Detection Query:&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;That's it. No event handlers. No caching. No timers. No state management. Drasi handles:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Change detection across both databases&lt;/LI&gt;
&lt;LI&gt;Correlation between orders and vehicles&lt;/LI&gt;
&lt;LI&gt;Temporal logic for wait detection&lt;/LI&gt;
&lt;LI&gt;Pushing updates to dashboards via SignalR&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The queries define your business logic declaratively. When data changes in either database, Drasi automatically re-evaluates the queries and triggers reactions for any changes in the result set.&lt;/P&gt;
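&lt;P&gt;To make "changes in the result set" concrete, here is a minimal Python sketch of the idea. It is not Drasi's implementation or API: the function names, the field names (id, status, plate, location), and the "Curbside" value are all assumptions chosen for illustration. It recomputes a joined result from two in-memory tables and reports which rows were added, updated, or removed.&lt;/P&gt;

```python
# Illustrative sketch only: NOT Drasi's implementation or API.
# All names and fields below are hypothetical.

def ready_for_pickup(orders, vehicles):
    """Join ready orders with vehicles that are in the curbside zone."""
    at_curb = {v["plate"] for v in vehicles if v["location"] == "Curbside"}
    return {
        o["id"]: o
        for o in orders
        if o["status"] == "ready" and o["plate"] in at_curb
    }


def diff_results(previous, current):
    """Return (added, updated, removed) rows between two result sets."""
    added = [current[k] for k in current if k not in previous]
    updated = [
        current[k]
        for k in current
        if k in previous and previous[k] != current[k]
    ]
    removed = [previous[k] for k in previous if k not in current]
    return added, updated, removed


orders = [
    {"id": 1, "status": "ready", "plate": "ABC123"},
    {"id": 2, "status": "preparing", "plate": "XYZ789"},
]
vehicles_before = [{"plate": "ABC123", "location": "Parking"}]
vehicles_after = [{"plate": "ABC123", "location": "Curbside"}]

before = ready_for_pickup(orders, vehicles_before)
after = ready_for_pickup(orders, vehicles_after)

# A reaction would be triggered once per entry in these lists.
added, updated, removed = diff_results(before, after)
print("added:", added)
```

&lt;P&gt;In this sketch the vehicle moving to the curbside zone causes order 1 to enter the result set, which is the kind of change a reaction would act on. A real engine would evaluate this incrementally rather than recomputing the whole join.&lt;/P&gt;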
&lt;H2&gt;Drasi: The Non-Invasive Alternative to Legacy System Rewrites&lt;/H2&gt;
&lt;P&gt;Here's perhaps the most compelling argument for Drasi: it doesn't require you to rewrite anything.&lt;/P&gt;
&lt;P&gt;Traditional event sourcing means:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Redesigning your application around events&lt;/LI&gt;
&lt;LI&gt;Rewriting your persistence layer&lt;/LI&gt;
&lt;LI&gt;Implementing transactional outboxes&lt;/LI&gt;
&lt;LI&gt;Managing snapshots and replays&lt;/LI&gt;
&lt;LI&gt;Training your team on new patterns with a steep learning curve&lt;/LI&gt;
&lt;LI&gt;Migrating existing data to event streams&lt;/LI&gt;
&lt;LI&gt;Building projection infrastructure&lt;/LI&gt;
&lt;LI&gt;Updating all consumers to handle events&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;As one developer noted about their &lt;A href="https://www.infoq.com/news/2019/09/cqrs-event-sourcing-production/" data-test-app-aware-link="" target="_blank"&gt;event sourcing journey&lt;/A&gt;: &lt;EM&gt;"Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Drasi's approach:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Keep your existing databases&lt;/LI&gt;
&lt;LI&gt;Keep your existing services&lt;/LI&gt;
&lt;LI&gt;Keep your existing deployment model&lt;/LI&gt;
&lt;LI&gt;Add continuous queries where you need reactive behavior&lt;/LI&gt;
&lt;LI&gt;Get the benefits of dependency inversion&lt;/LI&gt;
&lt;LI&gt;Gradually migrate complexity from code to queries&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You can start with a single query on a single table and expand from there. No big bang. No feature freeze. No three-month architecture sprint or risky multi-year investment.&lt;/P&gt;
&lt;H3&gt;Migration Example: From Polling to Reactive&lt;/H3&gt;
&lt;P&gt;Let's say you have a legacy order system where a scheduled job polls for ready orders every 30 seconds:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With Drasi, you:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Point Drasi at your existing database&lt;/LI&gt;
&lt;LI&gt;Write the continuous query&lt;/LI&gt;
&lt;LI&gt;Update your dashboard to receive pushes instead of polls&lt;/LI&gt;
&lt;LI&gt;Turn off the polling job&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Your database hasn't changed. Your order service hasn't changed. You've just added a reactive layer on top that eliminates polling overhead and reduces notification latency from 30 seconds to milliseconds.&lt;/P&gt;
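&lt;P&gt;The latency claim follows from simple arithmetic, sketched below in Python. The helper name polling_delay is hypothetical, and the sketch assumes polls fire at exact multiples of the interval. A change that lands just after a poll waits almost the full 30 seconds, and a change that arrives at a random moment waits about half the interval on average.&lt;/P&gt;

```python
import math


def polling_delay(event_time, interval=30.0):
    """Seconds a change at event_time waits until the next poll tick.

    Assumes polls run at t = 0, interval, 2 * interval, and so on.
    """
    next_poll = math.ceil(event_time / interval) * interval
    return next_poll - event_time


# Changes spread evenly through one 30 second polling window.
sample_times = [1.0, 10.0, 15.0, 20.0, 29.0]
delays = [polling_delay(t) for t in sample_times]
average_delay = sum(delays) / len(delays)

print(delays)
print(average_delay)
```

&lt;P&gt;A push-based reaction removes this waiting term entirely; what remains is the time to detect the change and deliver the notification.&lt;/P&gt;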
&lt;P&gt;The intellectually satisfying complexity of event sourcing often obscures a simple truth: most systems don't need it. They need to know when interesting things change in their data and react accordingly. They need to combine data from multiple sources without writing bespoke aggregation services. They need to transform low-level changes into business-meaningful events.&lt;/P&gt;
&lt;P&gt;Drasi delivers these capabilities without the ceremony.&lt;/P&gt;
&lt;H2&gt;Where Do We Go from Here?&lt;/H2&gt;
&lt;P&gt;If you're building a new system and your team has deep event sourcing experience, embrace the pattern. Event sourcing shines for certain domains.&lt;/P&gt;
&lt;P&gt;But if you're like many teams, trying to add reactive capabilities to existing systems, struggling with data synchronization across services, or finding that your "events" are just CRUD operations in disguise, consider the change-driven approach.&lt;/P&gt;
&lt;P&gt;Start small:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Identify one painful polling loop or batch job&lt;/LI&gt;
&lt;LI&gt;Set up Drasi to monitor those same data sources&lt;/LI&gt;
&lt;LI&gt;Write a continuous query that captures the business condition&lt;/LI&gt;
&lt;LI&gt;Replace the polling with push-based reactions&lt;/LI&gt;
&lt;LI&gt;Measure the reduction in latency, overhead, and code complexity&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The best architecture isn't the most sophisticated one; it's the one your team can understand, maintain, and evolve. Sometimes that means acknowledging that we've been mid-curving it with overly complex event-driven architectures.&lt;/P&gt;
&lt;P&gt;Drasi and change-driven architecture offer the power of reactive systems without the complexity tax. Your data changes. Your queries notice. Your systems react.&lt;/P&gt;
&lt;P&gt;It makes it a non-event.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Want to explore Drasi further? Check out the &lt;/EM&gt;&lt;A href="https://drasi.io/" data-test-app-aware-link="" target="_blank"&gt;&lt;EM&gt;official documentation&lt;/EM&gt;&lt;/A&gt;&lt;EM&gt; and try the &lt;/EM&gt;&lt;A href="https://drasi.io/tutorials/curbside-pickup/" data-test-app-aware-link="" target="_blank"&gt;&lt;EM&gt;curbside pickup tutorial&lt;/EM&gt;&lt;/A&gt;&lt;EM&gt; to see change-driven architecture in action.&amp;nbsp;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2025 22:22:37 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/event-driven-to-change-driven-low-cost-dependency-inversion/ba-p/4478948</guid>
      <dc:creator>CollinBrian</dc:creator>
      <dc:date>2025-12-17T22:22:37Z</dc:date>
    </item>
    <item>
      <title>Building Bridges: Microsoft’s Participation in the Fedora Linux Community</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/building-bridges-microsoft-s-participation-in-the-fedora-linux/ba-p/4478461</link>
      <description>&lt;P&gt;At Microsoft, we believe that meaningful open source participation is driven by people, not corporations. But companies can - and should - create the conditions that empower individuals to contribute. Over the past year, our Community Linux Engineering team has been doing just that, focusing on Fedora Linux and working closely with the community to improve infrastructure, tooling, and collaboration. This post shares some of the highlights of that work and outlines where we’re headed next.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Modernizing Fedora Cloud Image Delivery&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;One of our most impactful contributions this year has been expanding the availability of Fedora Cloud images across major cloud platforms. We introduced support for publishing images to both the Azure Community Gallery and Google Cloud Platform—capabilities that didn’t exist before. At the same time, we modernized the existing AWS image publishing process by migrating it to a new, OpenShift-hosted automation framework. This new system, developed by our team and led by engineer Jeremy Cline, streamlines image delivery across all three platforms and positions the project to scale and adapt more easily in the future.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We partnered with Adam Williamson in Fedora QE to extend this tooling to support container image uploads, replacing fragile shell scripts with a robust, maintainable system. Nightly Fedora builds are now uploaded to Azure, with one periodically promoted to “latest” after manual validation and basic functionality testing. This ensures cloud users get up-to-date, ready-to-run images - critical for workloads that demand fast boot times and minimal setup. As you’ll see, we have ideas for improving this testing.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Enabling Secure Boot on ARM with Sigul&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;Secure Boot is essential for trusted cloud workloads across architectures. Our current focus includes enabling it on ARM-based systems. Fedora currently signs most artifacts with Sigul, but UEFI applications are handled separately via a dedicated x86_64 builder with a smart card. We’re working to enable Sigul-based signing for UEFI applications across architectures, but Sigul is a complex project with unmaintained dependencies. We’ve stepped in to help modernize Sigul, starting with a Rust-based client and a roadmap to re-architect the code and structure for easier maintenance and improved performance. &amp;nbsp;&lt;/P&gt;
&lt;P&gt;This work is about more than just Microsoft’s needs - it’s about enabling Secure Boot support out of the box, just as users expect on x86_64 systems.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Bringing Inspektor Gadget to Fedora&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;Inspektor Gadget is an eBPF-based toolkit for kernel instrumentation, enabling powerful observability use cases like performance profiling and syscall tracing. &lt;SPAN data-olk-copy-source="MessageBody"&gt;The Community Linux Engineering team consulted with the Inspektor Gadget maintainers at Microsoft about putting the project in Fedora.&amp;nbsp; This led to the maintainers natively packaging it for Fedora and assuming ongoing maintenance of the package.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;We are encouraging teams to become active Fedora participants, to maintain their own packages, and to engage directly with the community. We believe in bi-directional feedback: upstream contributions should benefit both the project and the contributors.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Azure VM Utils: Simplifying Cloud Enablement&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;To streamline Fedora’s compatibility with Azure, we’ve introduced a package called azure-vm-utils. It consolidates udev rules and low-level utilities that make Fedora work better on Azure infrastructure, particularly with NVMe devices. This package is a step toward greater transparency and maintainability and could serve as a model for other cloud providers.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Fedora WSL: A Layer 9 Success&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;Fedora is now officially available in the Windows Subsystem for Linux (WSL) catalog - a milestone that required both technical and organizational effort. While the engineering work was substantial, the real challenge was navigating the legal and governance landscape. This success reflects deep collaboration between Fedora leadership, Red Hat, and Microsoft.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Looking Ahead: Strategic Participation and Testing&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;We’re not stopping here. Our roadmap includes:&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Replacing Sigul&lt;/STRONG&gt; with a modern, maintainable signing infrastructure.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Expanding participation&lt;/STRONG&gt; in Fedora SIGs (Cloud, Go, Rust) where Microsoft has relevant expertise.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Improving automated testing&lt;/STRONG&gt; using Microsoft’s open source LISA framework to validate Fedora images at cloud scale.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Enhancing the Fedora-on-Azure experience&lt;/STRONG&gt;, including exploring mirrors within Azure and expanding agent/extension support.&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;We’re also working closely with the Azure Linux team, which is aligning its development model with Fedora - much like RHEL does. While Azure Linux has used some Fedora sources in the past, its upcoming 4.0 release is intended to align much more closely with Fedora as an upstream.&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;A Call for Collaboration&amp;nbsp;&lt;/H2&gt;
&lt;P&gt;While contributing patches is a good start, we intend to do much more. We aim to be a deeply involved member of the Fedora community - participating in SIGs, maintaining packages, and listening to feedback. If you have ideas for where Microsoft can make strategic investments that benefit Fedora, we want to hear them. &amp;nbsp;You’ll find us alongside you in Fedora meetings, forums, and at conferences like Flock.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Open source thrives when contributors bring their whole selves to the table. At Microsoft, we’re working to ensure our engineers can do just that - by aligning company goals with community value.&lt;/P&gt;
&lt;P&gt;(This post is based on a &lt;A class="lia-external-url" href="https://www.youtube.com/live/YhoFPG7Ack0?si=v5KH_0nRXl_bKtBD&amp;amp;t=4290" target="_blank" rel="noopener"&gt;talk delivered at Flock to Fedora 2025&lt;/A&gt;.)&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2025 13:47:31 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/building-bridges-microsoft-s-participation-in-the-fedora-linux/ba-p/4478461</guid>
      <dc:creator>bexelbie</dc:creator>
      <dc:date>2025-12-17T13:47:31Z</dc:date>
    </item>
    <item>
      <title>Beyond the Chat Window: How Change-Driven Architecture Enables Ambient AI Agents</title>
      <link>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/beyond-the-chat-window-how-change-driven-architecture-enables/ba-p/4475026</link>
      <description>&lt;P&gt;AI agents are everywhere now, powering chat interfaces, answering questions, and helping with code. We've gotten remarkably good at this conversational paradigm. But while the world has been focused on chat experiences, something new is quietly emerging: ambient agents. These aren't replacements for chat; they're an entirely new category of AI system that operates in the background, sensing, processing, and responding to the world in real time. And here's the thing: this is a new frontier. The infrastructure we need to build these systems barely exists yet.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Or at least, it didn't until now.&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;Two Worlds: Conversational and Ambient&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Let me paint you a picture of the conversational AI paradigm we know well.&amp;nbsp;You open a chat window. You type a question. You wait. The AI responds. Rinse and repeat.&amp;nbsp;It's&amp;nbsp;the digital equivalent of having a brilliant assistant sitting at a desk, ready to help when you tap them on the shoulder.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Now imagine a completely different kind of assistant. One that watches for&amp;nbsp;important changes,&amp;nbsp;anticipates&amp;nbsp;needs, and springs into action without being asked.&amp;nbsp;That's&amp;nbsp;the promise of ambient agents.&amp;nbsp;AI systems that, as&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://blog.langchain.com/introducing-ambient-agents/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;LangChain puts it&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;:&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;"listen to an event stream and act on&amp;nbsp;it&amp;nbsp;accordingly, potentially acting on multiple events at a time."&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This&amp;nbsp;isn't&amp;nbsp;an evolution of&amp;nbsp;chat;&amp;nbsp;it's&amp;nbsp;a fundamentally different interaction paradigm. Both have their place. Chat is great for collaboration and back-and-forth reasoning. Ambient agents excel at continuous monitoring and autonomous response. Instead of human-initiated conversations, ambient agents&amp;nbsp;operate&amp;nbsp;through detecting changes in upstream systems and&amp;nbsp;maintaining&amp;nbsp;context across time without constant prompting.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The use cases are compelling and distinct from chat. Imagine&amp;nbsp;a&amp;nbsp;project management&amp;nbsp;assistant that&amp;nbsp;operates&amp;nbsp;in two modes: you can&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;chat&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;with it to&amp;nbsp;ask,&amp;nbsp;"summarize&amp;nbsp;project status",&amp;nbsp;but it also runs in the background,&amp;nbsp;constantly&amp;nbsp;monitoring&amp;nbsp;new tickets that are created, or deployment pipelines that fail, automatically reassigning tasks. Or consider a DevOps agent that you can query conversationally ("what's our current CPU usage?") but also&amp;nbsp;monitors&amp;nbsp;your infrastructure continuously, detecting anomalies and starting remediation before you even know&amp;nbsp;there's&amp;nbsp;a problem.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;The Challenge: Real-Time Change Detection&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Here's&amp;nbsp;where building ambient agents gets tricky. While chat-based agents work perfectly within the request-response paradigm, ambient agents need something entirely different: continuous monitoring and real-time change detection. How do you efficiently detect changes across multiple data sources? How do you avoid the performance nightmare of constant polling? How do you ensure your agent reacts instantly when something critical happens?&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Developers trying to build ambient agents hit the same wall: creating a reliable, scalable change detection system is&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;hard&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;. You either end up with:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Polling hell&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt; Constantly querying databases, burning through resources, and still missing changes between polls&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Legacy system rewrites&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt;&amp;nbsp;Massive, expensive multi-year projects to rewrite legacy systems so that they produce&amp;nbsp;domain events&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Webhook spaghetti&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt; Managing dozens of event sources, each with different formats and reliability guarantees&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This is where the story takes an interesting turn.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;Enter&amp;nbsp;Drasi: The Change Detection Engine You Didn't Know You Needed&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;A href="https://drasi.io/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Drasi&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;is not another AI framework. Instead, it solves the problem that ambient agents&amp;nbsp;need&amp;nbsp;solved: intelligent change detection. Think of it as the sensory system for your AI agents,&amp;nbsp;the infrastructure that lets them perceive changes in the world.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Drasi&amp;nbsp;is built around three simple components:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Sources&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt;&amp;nbsp;Connectivity to the systems that&amp;nbsp;Drasi&amp;nbsp;can&amp;nbsp;observe&amp;nbsp;as sources of change&amp;nbsp;(PostgreSQL, MySQL, Cosmos DB, Kubernetes, EventHub)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Continuous Queries&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt; Graph-based queries (using Cypher/GQL) that&amp;nbsp;monitor&amp;nbsp;for specific change patterns&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Reactions&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt;&amp;nbsp;What happens when a continuous query&amp;nbsp;detects&amp;nbsp;changes, or lack thereof&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;But&amp;nbsp;here's&amp;nbsp;the killer feature:&amp;nbsp;Drasi&amp;nbsp;doesn't&amp;nbsp;just detect that&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;something&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;changed. It understands&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;what&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;changed and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;why it matters&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;, and even if something should have changed but did not. Using continuous queries, you can define complex conditions that your agents care about, and&amp;nbsp;Drasi&amp;nbsp;handles all the plumbing to deliver those insights in real time.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;The Bridge:&amp;nbsp;langchain-drasi&amp;nbsp;Integration&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Now, detecting changes is only&amp;nbsp;part of the challenge. You need to connect those changes to your AI agents in a way that makes sense.&amp;nbsp;That's&amp;nbsp;where&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://github.com/drasi-project/langchain-drasi" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;langchain-drasi&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;comes in,&amp;nbsp;a purpose-built integration that bridges&amp;nbsp;Drasi's&amp;nbsp;change detection with&amp;nbsp;LangChain's&amp;nbsp;agent frameworks. It achieves this by&amp;nbsp;leveraging&amp;nbsp;the&amp;nbsp;Drasi&amp;nbsp;MCP Reaction, which exposes&amp;nbsp;Drasi&amp;nbsp;continuous queries as&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://modelcontextprotocol.io/specification/2025-06-18/server/resources" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;MCP resources&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The integration provides&amp;nbsp;a&amp;nbsp;simple&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Tool&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;that agents can use to:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Discover&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;available queries automatically&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Read&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;current query results on demand&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Subscribe&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;to real-time updates that flow directly into agent memory and workflow&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Here's what this looks like in practice:&lt;/SPAN&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="width: 100%; border-width: 1px;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;
&lt;PRE&gt;&lt;SPAN class="lia-text-color-9"&gt;from &lt;/SPAN&gt;langchain_drasi &lt;SPAN class="lia-text-color-9"&gt;import&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN class="lia-text-color-9"&gt;&amp;nbsp;&lt;/SPAN&gt;create_drasi_tool,&amp;nbsp;MCPConnectionConfig&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:2,&amp;quot;335557856&amp;quot;:16777215,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:270,&amp;quot;335572071&amp;quot;:4,&amp;quot;335572072&amp;quot;:1,&amp;quot;335572073&amp;quot;:4278190080,&amp;quot;335572075&amp;quot;:4,&amp;quot;335572076&amp;quot;:4,&amp;quot;335572077&amp;quot;:4278190080,&amp;quot;335572079&amp;quot;:4,&amp;quot;335572080&amp;quot;:1,&amp;quot;335572081&amp;quot;:4278190080,&amp;quot;335572083&amp;quot;:4,&amp;quot;335572084&amp;quot;:4,&amp;quot;335572085&amp;quot;:4278190080,&amp;quot;469789798&amp;quot;:&amp;quot;single&amp;quot;,&amp;quot;469789802&amp;quot;:&amp;quot;single&amp;quot;,&amp;quot;469789806&amp;quot;:&amp;quot;single&amp;quot;,&amp;quot;469789810&amp;quot;:&amp;quot;single&amp;quot;}"&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;# Configure connection to Drasi MCP server&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;mcp_config&amp;nbsp;=&amp;nbsp;MCPConnectionConfig(&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-20"&gt;server_url&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;=&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-12"&gt;"http://localhost:8083"&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;# Create the tool with notification handlers&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;drasi_tool&amp;nbsp;=&amp;nbsp;create_drasi_tool(&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-20"&gt;mcp_config&lt;/SPAN&gt;&lt;SPAN 
data-contrast="none"&gt;=mcp_config,&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-20"&gt;notification_handlers&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;=[buffer_handler,&amp;nbsp;console_handler]&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:2,&amp;quot;335557856&amp;quot;:16777215,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:270,&amp;quot;335572071&amp;quot;:4,&amp;quot;335572072&amp;quot;:1,&amp;quot;335572073&amp;quot;:4278190080,&amp;quot;335572075&amp;quot;:4,&amp;quot;335572076&amp;quot;:4,&amp;quot;335572077&amp;quot;:4278190080,&amp;quot;335572079&amp;quot;:4,&amp;quot;335572080&amp;quot;:1,&amp;quot;335572081&amp;quot;:4278190080,&amp;quot;335572083&amp;quot;:4,&amp;quot;335572084&amp;quot;:4,&amp;quot;335572085&amp;quot;:4278190080,&amp;quot;469789798&amp;quot;:&amp;quot;single&amp;quot;,&amp;quot;469789802&amp;quot;:&amp;quot;single&amp;quot;,&amp;quot;469789806&amp;quot;:&amp;quot;single&amp;quot;,&amp;quot;469789810&amp;quot;:&amp;quot;single&amp;quot;}"&gt;&amp;nbsp;&lt;BR /&gt;)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;# Now your agent can discover and subscribe to data changes&lt;/SPAN&gt;&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;SPAN class="lia-text-color-11"&gt;# No more polling, no more webhooks, just reactive intelligence&lt;/SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The beauty is in the notification handlers:&amp;nbsp;pre-built components that&amp;nbsp;determine&amp;nbsp;how changes flow into your agent's consciousness:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;BufferHandler&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;: Queues changes for sequential processing&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;LangGraphMemoryHandler&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;: Automatically&amp;nbsp;integrates&amp;nbsp;changes&amp;nbsp;into agent checkpoints&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;LoggingHandler&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;:&amp;nbsp;Integrates with standard logging infrastructure&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This&amp;nbsp;isn't&amp;nbsp;just&amp;nbsp;plumbing;&amp;nbsp;it's&amp;nbsp;the foundation for what we might call "change-driven architecture" for AI systems.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;Example: The&amp;nbsp;Seeker&amp;nbsp;Agent&amp;nbsp;Has Entered the Chat&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Let's&amp;nbsp;make this concrete with my favorite example from the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;langchain-drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;repository: a&amp;nbsp;hide-and-seek-inspired&amp;nbsp;non-player character (NPC)&amp;nbsp;AI&amp;nbsp;agent that&amp;nbsp;seeks human players in a multi-player game environment.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H5 aria-level="3"&gt;&lt;SPAN class="lia-text-color-15"&gt;The Scenario&amp;nbsp;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Imagine a game where players move around a 2D map, updating their positions in a PostgreSQL database. But&amp;nbsp;here's&amp;nbsp;the twist: the NPC agent&amp;nbsp;doesn't&amp;nbsp;have omniscient vision. It can only detect players under specific conditions:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Stationary targets&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;When a player&amp;nbsp;doesn't&amp;nbsp;move for more than 3 seconds (they're exposed)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Frantic movement&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;When a player moves more than once in less than a second (panicking reveals your position)&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;This creates interesting strategic&amp;nbsp;gameplay:&amp;nbsp;players must balance staying still (safe from detection but vulnerable if found) with moving carefully (one move per second is the sweet spot). The NPC agent&amp;nbsp;seeks players&amp;nbsp;based on these glimpses of player activity.&amp;nbsp;These detection rules are defined as&amp;nbsp;Drasi&amp;nbsp;continuous queries that&amp;nbsp;monitor&amp;nbsp;the player positions table.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;For reference, these are the two continuous queries we will use:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The first query catches a player who&amp;nbsp;doesn't&amp;nbsp;move for more than 3 seconds. It is&amp;nbsp;a great example&amp;nbsp;of detecting the absence of change with the&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://drasi.io/reference/query-language/drasi-custom-functions/#drasi-future-functions" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;trueLater function&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN class="lia-text-color-9"&gt;MATCH &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; (&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;player&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;{&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;type&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-12"&gt;'human'&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;}&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;)&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN class="lia-text-color-9"&gt;WHERE&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-10"&gt;true&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;Later&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;changeDateTime&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;) &amp;lt;= (&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;datetime&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;realtime&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;() -&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;duration&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;{&lt;/SPAN&gt;&lt;SPAN 
data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;seconds&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;3&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;}&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;)),&amp;nbsp;&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;changeDateTime&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;) +&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;duration&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;{&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;seconds&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-11"&gt;3&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;}&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;)&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; )&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN class="lia-text-color-9"&gt;RETURN &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;id&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;,&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN 
data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;x&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;,&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;y&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:2,&amp;quot;335557856&amp;quot;:16777215,&amp;quot;335559739&amp;quot;:0,&amp;quot;335559740&amp;quot;:270}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The second query catches a player who moves more than once in less than a second. It uses the&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://drasi.io/reference/query-language/drasi-custom-functions/#drasi-delta-functions" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;previousValue function&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;to compare&amp;nbsp;the current state with a prior state:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;SPAN class="lia-text-color-9"&gt;MATCH&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; (&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;player&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;{&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;type&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;:&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-12"&gt;'human'&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="lia-text-color-9"&gt;}&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;)&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN class="lia-text-color-9"&gt;WHERE&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;changeDateTime&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;).&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;epochMillis&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;-&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;previousValue&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;drasi&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;changeDateTime&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;(&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;).&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;epochMillis&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;) &amp;lt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN 
class="lia-text-color-11"&gt;1000&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN class="lia-text-color-9"&gt;RETURN &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;id&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;,&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;x&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;,&lt;/SPAN&gt; &lt;BR /&gt;&lt;SPAN data-contrast="none"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;p&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;.&lt;/SPAN&gt;&lt;SPAN data-contrast="none"&gt;y&lt;/SPAN&gt; &lt;/PRE&gt;
&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
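&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;For readers less familiar with the query syntax, the logic of the two rules can be approximated in a few lines of plain Python. This is only a sketch under stated assumptions: PlayerTracker and its methods are hypothetical names, timestamps are passed in by hand as milliseconds, and nothing here reflects how Drasi evaluates queries internally. Unlike Drasi, which pushes a notification when a condition becomes true, the sketch has to be asked.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
STATIONARY_SECONDS = 3.0  # mirrors duration({ seconds: 3 }) in the first query
FRANTIC_MILLIS = 1000     # mirrors the 1000 ms threshold in the second query


class PlayerTracker:
    """Hypothetical in-memory model of the two detection rules."""

    def __init__(self):
        # Maps each player id to the timestamp (ms) of that player's latest move.
        self.last_change_ms = {}

    def record_move(self, player_id, now_ms):
        # Returns True when this move follows the previous one by under a second.
        previous = self.last_change_ms.get(player_id)
        self.last_change_ms[player_id] = now_ms
        return previous is not None and FRANTIC_MILLIS > now_ms - previous

    def stationary_players(self, now_ms):
        # Players whose latest change is at least three seconds old.
        limit = STATIONARY_SECONDS * 1000
        return sorted(p for p, t in self.last_change_ms.items() if now_ms - t >= limit)


tracker = PlayerTracker()
first_move = tracker.record_move("p1", 0)      # no earlier move to compare with
second_move = tracker.record_move("p1", 400)   # 400 ms after the first move
tracker.record_move("p2", 500)
idle = tracker.stationary_players(3450)        # p1 idle 3050 ms, p2 idle 2950 ms
print(first_move, second_move, idle)
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Here the second move by p1 counts as frantic, and at 3450 ms only p1 has been idle long enough to be exposed.&lt;/SPAN&gt;&lt;/P&gt;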
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Here's the neat part: you can dynamically adjust the game's difficulty by adding or removing queries with different conditions; no code changes required, just deploy new Drasi queries.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The traditional approach would have your agent constantly polling the data source to check these conditions: "Any player moves? How about now? Now?&amp;nbsp;Now?"&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H5 aria-level="3"&gt;&lt;SPAN class="lia-text-color-15"&gt;The Workflow in Action&amp;nbsp;&lt;/SPAN&gt;&lt;/H5&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The agent&amp;nbsp;operates&amp;nbsp;through a&amp;nbsp;LangGraph-based&amp;nbsp;state machine with two distinct phases:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;1. &lt;STRONG&gt;Setup Phase (First Run Only)&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Setup&amp;nbsp;queries&amp;nbsp;prompt&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt; &lt;/STRONG&gt;- Prompts the&amp;nbsp;AI model&amp;nbsp;to discover available&amp;nbsp;Drasi&amp;nbsp;queries&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Setup&amp;nbsp;queries&amp;nbsp;call&amp;nbsp;model&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt; -&amp;nbsp;The AI model&amp;nbsp;calls the&amp;nbsp;Drasi&amp;nbsp;tool with the discover operation&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Setup&amp;nbsp;queries&amp;nbsp;tools&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt; - Executes the&amp;nbsp;Drasi&amp;nbsp;tool calls to subscribe to relevant queries&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;This phase loops until the&amp;nbsp;AI model&amp;nbsp;has discovered and subscribed to all relevant queries&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;2. &lt;STRONG&gt;Main Seeking Loop (Continuous)&lt;/STRONG&gt;&lt;SPAN style="color: rgb(30, 30, 30);" data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Check&amp;nbsp;sensors&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt; -&amp;nbsp;Consumes any&amp;nbsp;new&amp;nbsp;Drasi&amp;nbsp;notifications&amp;nbsp;from the&amp;nbsp;buffer into&amp;nbsp;the workflow state&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Evaluate&amp;nbsp;targets&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt; - Uses&amp;nbsp;the AI model&amp;nbsp;to parse sensor data and extract target positions&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Select&amp;nbsp;and&amp;nbsp;plan&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt; - Selects the closest target and plans a path to it&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Execute&amp;nbsp;move&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt; &lt;/STRONG&gt;- Executes the next move via the game API&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Loop continues indefinitely, reacting to new notifications&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;201341983&amp;quot;:0,&amp;quot;335559740&amp;quot;:240}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
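&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The main seeking loop can be sketched in plain Python to show how the four steps hand state to one another. This is a simplified illustration, not the code from the repository: it does not use LangGraph, the function bodies and state keys are assumptions, the AI model step is replaced by a direct read of the coordinates, and the "game API" is just an update to an in-memory position.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import deque


def check_sensors(buffer, state):
    # Move every queued notification into the workflow state.
    while buffer:
        state["sightings"].append(buffer.popleft())


def evaluate_targets(state):
    # Stand-in for the AI model step: read target positions from sightings.
    state["targets"] = [(s["x"], s["y"]) for s in state["sightings"]]
    state["sightings"] = []


def select_and_plan(state):
    # Pick the closest target by Manhattan distance; keep the previous goal
    # when no new target was seen in this iteration.
    x, y = state["position"]
    state["goal"] = min(
        state["targets"],
        key=lambda t: abs(t[0] - x) + abs(t[1] - y),
        default=state["goal"],
    )


def execute_move(state):
    # Take one grid step toward the goal, horizontal axis first.
    if state["goal"] is None:
        return
    x, y = state["position"]
    gx, gy = state["goal"]
    if gx != x:
        x += 1 if gx > x else -1
    elif gy != y:
        y += 1 if gy > y else -1
    state["position"] = (x, y)


# Two illustrative notifications waiting in the buffer (made-up data).
buffer = deque([{"id": "p1", "x": 3, "y": 1}, {"id": "p2", "x": 0, "y": 2}])
state = {"position": (0, 0), "sightings": [], "targets": [], "goal": None}

for _ in range(3):  # the real loop runs indefinitely; three passes suffice here
    check_sensors(buffer, state)
    evaluate_targets(state)
    select_and_plan(state)
    execute_move(state)

print(state["goal"], state["position"])
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;In this run the NPC picks the nearer player at (0, 2) and reaches that square after two moves. Because the goal is kept between iterations, the agent continues toward its last known target when the buffer is empty.&lt;/SPAN&gt;&lt;/P&gt;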
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;No polling. No delays. No wasted resources checking positions that&amp;nbsp;don't&amp;nbsp;meet the detection criteria. Just pure, reactive intelligence flowing from meaningful data changes to agent actions. The continuous queries act as intelligent filters,&amp;nbsp;only alerting the agent when&amp;nbsp;relevant changes occur.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://github.com/drasi-project/langchain-drasi/tree/main/examples/hide_and_seek" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Click here for the full implementation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;The Bigger Picture: Change-Driven Architecture&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;What&amp;nbsp;we're&amp;nbsp;seeing with&amp;nbsp;Drasi&amp;nbsp;and ambient agents&amp;nbsp;isn't&amp;nbsp;just a new tool;&amp;nbsp;it's&amp;nbsp;a new architectural pattern for AI systems. The core idea is profound:&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;AI agents can react to the world changing, not just wait to be asked about it.&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;This pattern enables entirely new categories of applications that complement traditional chat interfaces.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The example might seem playful, but it&amp;nbsp;demonstrates&amp;nbsp;that&amp;nbsp;AI agents&amp;nbsp;can&amp;nbsp;perceive and react to their environment in real time. Today&amp;nbsp;it's&amp;nbsp;seeking players in a game. Tomorrow it could be:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Managing city traffic flows based on real-time sensor data&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Coordinating disaster response as situations evolve&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Optimizing&amp;nbsp;supply chains as demand patterns shift&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN data-contrast="auto"&gt;Protecting networks as threats&amp;nbsp;emerge&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;The change detection infrastructure is here. The patterns are&amp;nbsp;emerging. The only question is: what will you build?&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;H4 aria-level="2"&gt;&lt;SPAN class="lia-text-color-15"&gt;Where to Go&amp;nbsp;from&amp;nbsp;Here&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{&amp;quot;134245418&amp;quot;:true,&amp;quot;134245529&amp;quot;:true,&amp;quot;335559738&amp;quot;:160,&amp;quot;335559739&amp;quot;:80}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/H4&gt;
&lt;P&gt;&lt;SPAN data-contrast="auto"&gt;Ready to dive deeper? Here are your next steps:&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Explore&amp;nbsp;Drasi&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;Head to&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://drasi.io/" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;drasi.io&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt; and discover the power of the change detection platform&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Try&amp;nbsp;langchain-drasi&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt; Clone the&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://github.com/drasi-project/langchain-drasi" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;GitHub repository&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;&amp;nbsp;and run the&amp;nbsp;Hide-and-Seek&amp;nbsp;example yourself&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;SPAN data-contrast="auto"&gt;Join the conversation&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;SPAN data-contrast="auto"&gt;&lt;STRONG&gt;:&lt;/STRONG&gt; The space is new and needs diverse perspectives. Join the&amp;nbsp;community&amp;nbsp;on&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://discord.gg/AX7FneckBq" target="_blank"&gt;&lt;SPAN data-contrast="none"&gt;&lt;SPAN data-ccp-charstyle="Hyperlink"&gt;Discord&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN data-contrast="auto"&gt;.&amp;nbsp;Let us know if&amp;nbsp;you&amp;nbsp;have&amp;nbsp;built ambient agents&amp;nbsp;and&amp;nbsp;what&amp;nbsp;challenges&amp;nbsp;you faced&amp;nbsp;with real-time change detection.&lt;/SPAN&gt;&lt;SPAN data-ccp-props="{}"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2025 23:09:40 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/linux-and-open-source-blog/beyond-the-chat-window-how-change-driven-architecture-enables/ba-p/4475026</guid>
      <dc:creator>CollinBrian</dc:creator>
      <dc:date>2025-12-03T23:09:40Z</dc:date>
    </item>
  </channel>
</rss>

