Announcing Azure Linux 4.0: Purpose-Built for Azure, Now in Public Preview
Today at Microsoft Build, we're announcing the public preview of Azure Linux 4.0 - Microsoft's first-party Linux distribution, purpose-built for Azure. Azure Linux 4.0 is available now for Azure Virtual Machines, VM Scale Sets, and container images, with Azure Kubernetes Service (AKS) support and Windows Subsystem for Linux (WSL) coming soon after.

Why Azure Linux

Running Linux on Azure often involves a mix of distributions - one for VMs, another for Kubernetes nodes, a third for container base images, and sometimes something different on developer machines. That flexibility is powerful, but it can also introduce operational overhead: multiple patch schedules to coordinate, multiple security baselines to validate, and more moving parts for SRE and security teams to stay ahead of. A more consistent baseline - especially one with a smaller footprint - can help reduce exposure and simplify day-to-day maintenance.

Azure Linux was built with that principle in mind: a single, Microsoft-supported Linux foundation designed to work across every Azure compute surface. From kernel updates to CVE patches, Azure Linux is built and maintained by Microsoft with a predictable update cadence designed around Azure infrastructure. Azure Linux is included with Azure compute at no additional cost.

What Is Azure Linux 4.0

Azure Linux is a Fedora-derived, RPM-based Linux distribution built and maintained by Microsoft. It is open source, free to use, and optimized specifically for Azure. Minimal by choice, secure by default: Azure Linux ships only the packages required for cloud workloads. Azure Linux is built exclusively for cloud and server workloads; it is not intended to support desktop usage or GUI applications. Azure Linux already powers millions of cores across Azure's internal services, including AKS, Azure SQL, Azure Cosmos DB, and many others.
With 4.0, we're bringing the same OS - same security posture, same performance tuning, same operational simplicity - to every Azure customer. When Azure Linux 4.0 reaches General Availability, you can expect seamless integration with the Azure services you already rely on, including:

- Microsoft Defender for Cloud - vulnerability assessment and threat detection
- Azure Monitor - telemetry, logs, and performance monitoring
- Azure Migrate - discovery and migration tooling
- Trusted Launch and Secure Boot - hardware-rooted security
- Azure Portal, CLI, ARM, Bicep, Terraform, Ansible - deploy and manage with your existing tools

What's New in Azure Linux 4.0

| Component | Version | What Changed |
|---|---|---|
| Kernel | 6.18 LTS | Azure-tuned with new hardware drivers, improved Hyper-V integration, GPU/AI accelerator support |
| Package Manager | dnf5 | Complete rewrite away from Python to reduce dependencies; faster package resolution, lower memory usage |
| glibc | 2.42 | Performance improvements in string operations, memory allocation, and thread handling |
| OpenSSL | 3.5 | Post-quantum cryptography support, improved QUIC support, and other crypto updates |
| systemd | 258 | Faster boot sequences, improved service management |
| Python | 3.14 | JIT compiler, new syntax features |
| RPM | 6.0 | Modernized database backend, improved signature verification |
| FIPS 140-3 | In progress | Will be available at GA |

Azure Linux on Virtual Machines

Deploy Azure Linux 4.0 directly from the Azure Marketplace on any Azure VM or VM Scale Set. Azure Linux images are validated across Azure VM SKUs and tuned for Azure compute, storage, and networking, delivering faster VM startup and provisioning with a reduced package footprint. Whether you're running web applications, databases, or GPU-accelerated AI/ML workloads, Azure Linux provides a consistent, secure foundation with no additional OS licensing cost. You pay only for the underlying Azure compute resources. Deploy your first Azure Linux VM in minutes from the Azure Marketplace.
Azure Linux on Azure Kubernetes Service

Azure Linux has been the container host for AKS since 2023, already powering mission-critical Kubernetes workloads at massive scale. With 4.0, we're also introducing Azure Container Linux (ACL), an immutable, container-optimized variant for environments with stricter security and compliance requirements. To learn more about Azure Container Linux, see the ACL blog post.

| | Azure Linux (General purpose) | Azure Container Linux (ACL) |
|---|---|---|
| Update model | Package-based (dnf5) | Image-based, immutable, auto-updating |
| Customization | Full package management | Locked-down, minimal surface |
| Best for | General AKS workloads | Regulated, high-security environments |
| SELinux | Supported | Enforcing by default |

Both options share the same kernel, security update cadence, and Azure integration, and both are fully supported by Microsoft, end to end.

Azure Linux Container Images

Build and run containerized applications on Microsoft-maintained base images from the same Azure Linux supply chain. One Linux experience from VMs to containers with the same security updates, same compliance posture, and same operational model.

| Image Type | Use Case |
|---|---|
| Base | Full flexibility - install any packages you need |
| Runtime (Python, Node.js, Java, .NET) [Not available at Preview] | Pre-configured for your language stack |
| Distroless | Minimal attack surface - no shell, no package manager |

All images are available on Microsoft Container Registry (MCR) and follow the same monthly security update cadence as Azure Linux VM images.

Azure Linux on WSL

Familiar Linux, optimized for Azure. Develop locally on the same Linux you run in production. Azure Linux for Windows Subsystem for Linux brings your production OS to your developer workstation, eliminating environment drift and giving your team a consistent dev-to-cloud workflow. Azure Linux for WSL will be available shortly after Build.

Secure by Default, Backed by Microsoft

Security is not an add-on in Azure Linux; it's foundational.
Built with security in mind from day one, Azure Linux applies defense-in-depth from the kernel through to the supply chain. A reduced package footprint means fewer vulnerabilities to manage, and Microsoft's ownership of the full supply chain enables fast-track CVE response. Below is a summary of security capabilities that you should expect to see in Azure Linux at the time of general availability.

| Capability | Details |
|---|---|
| Secure Boot & Trusted Launch | Signed shim, GRUB, kernel, and systemd-boot. |
| SELinux | Supported on all images. Enforcing by default. |
| FIPS 140-3 | Certification in progress. Built-in crypto module support. |
| Kernel hardening | ASLR, stack protection, seccomp, systemd service sandboxing. |
| Supply chain security | All packages and repos cryptographically signed. SBOMs published. |
| Identity | Entra ID SSH support. |
| CVE response | Microsoft-owned supply chain enables fast-track Critical/High CVE patches. |
| Lifecycle | LTS kernels maintained for lifetime of the distribution. |

Day-1 Ecosystem Partner Support

Azure Linux already has validated support from a broad ecosystem of security, monitoring, networking, and data partners via AKS and VM support:

- Dynatrace - Application performance monitoring and observability
- Aquasec - database platform support
- Qualys - Vulnerability management, compliance scanning, and asset inventory
- Isovalent - eBPF-powered networking, security, and observability via Cilium
- Elastic - Log analytics, infrastructure monitoring, and SIEM/XDR
- Upwind - Runtime cloud security and behavioral threat detection
- SAP - Enterprise workload certification for S/4HANA and NetWeaver
- Databricks - Data and AI platform powering lakehouse workloads at scale
- Arm - Native Arm64 architecture support for cost-efficient cloud compute

Proven at Scale

Azure Linux isn't new; it has been running production workloads at massive scale across Azure's internal services and early adopters.
Azure Linux has been powering production workloads at massive scale since 2022 across AKS, Azure SQL, Azure Cosmos DB, and other core Azure services, along with LinkedIn and Databricks. With version 4.0, we're building on that proven foundation with a modernized stack, expanded compute surface support, and a new Fedora-derived base, bringing the same reliability our internal services depend on to every Azure customer.

Databricks

Databricks migrated over 100,000 VMs and more than 1 million CPU cores to Azure Linux with zero customer-facing incidents. The migration eliminated separate hardened images by leveraging Azure Linux's built-in FIPS support and delivered measurable performance gains: 27% faster image pull times and approximately 5% faster query execution across their serverless compute fleet.

LinkedIn

LinkedIn completed a major stack upgrade, migrating to Azure Linux 3 across their infrastructure. The transition enabled adoption of configuration as code and modern kernel integration, resulting in a more resilient, secure, and future-proof environment. LinkedIn's Grid team reported significant performance improvements following the migration.

Predictable Lifecycle and Updates

Patch faster. Operate simpler. Azure Linux follows a clear, predictable lifecycle designed for teams running large Azure fleets:

- LTS kernel - Maintained with monthly CVE backports.
- HWE kernels - Introduced annually for new hardware platforms, GPU, and AI accelerator enablement.
- Predictable updates - Packages (language runtimes, tools) are refreshed in predictable windows. Between windows, only critical/high CVE patches are backported.
- Monthly security updates - Predictable cadence for all supported packages.

For full details on the lifecycle model, kernel tracks, and package tiers, see the Azure Linux Release Cadence and Lifecycle documentation.

Get Started

Azure Linux 4.0 is available now in public preview.
Choose the path that fits your workload:

| Scenario | How to Start |
|---|---|
| Azure Virtual Machines | Deploy from Azure Marketplace via Portal, CLI, ARM, Bicep, or Terraform |
| Azure Kubernetes Service [Not available at Preview] | Set --os-sku to AzureLinux when creating a node pool |
| Container Images | Pull from Microsoft Container Registry (MCR) |
| WSL [Not available at Preview] | wsl --install -d AzureLinux |

Learn More

- //Build Session: Build, deploy, and run Linux workloads on Azure
- Azure Linux documentation
- To learn more and get started, visit aka.ms/AzureLinuxProduct
- Azure Linux on GitHub
- Release notes
- Joining the ISV partner program: AzureLinuxPartners@microsoft.com

We're excited to put Azure Linux in your hands. Try it today and let us know what you think.

Introducing Azure Container Linux (ACL)
Today at Microsoft Build 2026, we’re announcing the general availability of Azure Container Linux (ACL): a secure, immutable container host designed to help platform teams run Kubernetes workloads at scale on Azure Kubernetes Service (AKS) with greater consistency, reduced operational overhead, and a stronger default security posture.

This release builds on Microsoft’s long-standing commitment to the Flatcar Container Linux ecosystem as a foundation for secure, minimal, and container-optimized operating systems. This commitment includes the acquisition of Kinvolk in 2021, bringing deep expertise in Flatcar development and cloud-native systems into Azure, and the subsequent donation of Flatcar to the Cloud Native Computing Foundation (CNCF), ensuring its continued growth as a community-driven project.

Flatcar has played a critical role in helping customers run cloud-native infrastructure at scale, introducing an immutable, minimal OS model that reduces configuration drift, minimizes attack surface, and simplifies lifecycle management. As customer needs continue to grow, there is an increasing demand for deeper integration with cloud platforms, stronger default security enforcement, and a more tightly managed supply chain experience in managed environments like AKS.

Building on this foundation, Azure Container Linux (ACL) represents the next evolution of this approach. ACL is intentionally built downstream of Flatcar to preserve compatibility with its ecosystem and leverage its mature, battle-tested design. ACL integrates Azure Linux binaries as the core foundation, providing consistency and compatibility with other Azure Linux use cases (including Azure Linux VMs), while bringing enterprise-hardened security and supportability into the platform. Looking ahead, ACL will further incorporate optional advanced code integrity capabilities from Azure Linux with OS Guard.
We remain committed to the Flatcar community and will continue contributing innovations upstream while bringing a fully managed, enterprise-ready product to customers through ACL.

Why a Trusted, Immutable Host Model Matters for AKS

As Kubernetes adoption scales, platform teams face increasing complexity in managing node-level consistency, security, and lifecycle operations across large fleets. Traditional OS models introduce challenges such as:

- Configuration drift across nodes, leading to inconsistent behavior and harder-to-debug issues
- Fragmented update mechanisms that increase operational overhead and risk during upgrades
- Expanding attack surface due to unnecessary packages and mutable system state
- Limited visibility and guarantees around the provenance and integrity of OS components

In managed environments like AKS, these challenges are amplified as teams look to operate clusters reliably at scale while meeting stricter security and compliance requirements.

Azure Container Linux: Built for Consistency and Trust

ACL addresses these challenges with a fully image-based operating system model that eliminates configuration drift, ensuring consistent behavior across nodes. Updates are delivered through AKS node image upgrades, providing a consistent and repeatable way to roll out OS changes across clusters without relying on in-place modifications. By standardizing how nodes are built, updated, and operated, ACL helps ensure clusters remain in a known-good, reproducible state over time, even as they scale. Over time, this model will continue to evolve to support A/B update mechanisms to further improve reliability, speed, and operational efficiency.

Secure from the Start, and Designed for the Future

ACL is engineered with a hardened security posture from the moment it boots. Its immutable design protects the integrity of the operating system, prevents unauthorized changes, and ensures consistent, reproducible behavior across your Kubernetes fleet.
By removing unnecessary components and tightly constraining how the system can be modified, ACL reduces the attack surface and provides a strong foundation for running production workloads with confidence. Under the hood, ACL incorporates several safeguards that reinforce its secure-by-default model:

- Read-only /usr filesystem to prevent tampering with core system components.
- A minimal package set purpose-built for container workloads, reducing CVE exposure.
- Mandatory access control with SELinux, enforcing strict least-privilege policies.
- Trusted Launch using a Unified Kernel Image (UKI) to bundle the kernel, initramfs, and kernel command line into a single signed artifact, ensuring integrity from the earliest stage.
- Signed Azure Linux RPMs delivered through a trusted, end-to-end Microsoft supply chain.

Going forward, we will continue to evolve ACL’s security posture as we bring over additional innovations from Azure Linux with OS Guard. This includes integrating code integrity into the ACL image, using the Integrity Policy Enforcement (IPE) Linux security module, to ensure that only binaries from trusted, signed volumes are allowed to execute. IPE will also extend to container images, ensuring that only binaries matching a trusted signature can be executed from verified dm-verity backed layers. Where applicable, we are committed to contributing these advancements upstream to the Flatcar project, helping strengthen the ecosystem and ensuring that improvements benefit the broader cloud-native community.

Differentiating between Azure Container Linux and Existing Container Hosts on AKS

AKS now provides multiple generally available Linux OS options, including general-purpose container hosts (Azure Linux and Ubuntu) and an immutable container host (Azure Container Linux). While all options are fully supported by Microsoft, they are designed to address distinct operational and security use cases.
The table below highlights the key differences to help you choose and position the right OS for your scenario.

| | General Purpose OS | Azure Container Linux |
|---|---|---|
| Filesystem | Writable (read-write) | Immutable (read-only) /usr with dm-verity guarantees |
| Focus on | Extensibility, flexibility, and choice. | Out of the box security and compliance guarantees. |
| Mandatory Access Control | AppArmor (optional) | SELinux (enforcing by default)* |
| Secure Boot | Optional (supported with certain VM sizes) | Supported by default with UKI (Unified Kernel Image) |
| Updates | Package and image-based updates supported | Only image-based updates supported (A/B update support on the roadmap) |

*SELinux policies are subject to change over time based on customer feedback.

Day-1 Ecosystem Partner Support

Azure Container Linux is launching with support from a broad ecosystem of security, monitoring, networking, and data partners. The following partners are expected to offer support or validated integrations at Day-1 availability:

- Dynatrace - application performance monitoring and observability.
- Aquasec - database platform support on ACL.
- Qualys - vulnerability, compliance, and container security.
- Upwind - runtime cloud security and risk prioritization.
- Elastic - logs, metrics, and observability for Kubernetes.
- Isovalent - Kubernetes networking, observability, and security powered by eBPF (Cilium).

If you’re interested in becoming a supported Azure Container Linux partner, please reach out to: AzureLinuxPartners@microsoft.com

What Customers Are Saying

Early customer feedback highlights the real-world impact of Azure Container Linux on improving security posture and operational consistency at scale.

“We’ve found working closely with the Microsoft product team throughout the Azure Container Linux preview to be invaluable. The product's immutability, minimal footprint, and built-in security controls (such as SELinux and Trusted Launch) will strengthen our AKS security posture across every deployment instance in Nationwide.
Furthermore, its focus on secure-by-design foundations is especially timely as we face advanced threat detection capabilities within the industry.” - Enterprise Container Platform, Cloud - Nationwide

Engineered for AKS from Day One

Azure Container Linux is deeply integrated with AKS to ensure a seamless operational experience. It is compatible with many critical AKS extensions and add-ons, and works smoothly with existing application containers and deployment workflows. ACL is available across AMD64 and Arm64 architectures, ensuring consistent behavior across environments, and includes support for GPU-enabled workloads. Enabling ACL is as simple as specifying the following in your node pool configuration:

    --os-sku AzureContainerLinux

Whether you're onboarding new clusters or migrating existing ones, ACL is designed to integrate into your environment with minimal friction.

A Clear Path Forward for AKS Preview Users

With the release of Azure Container Linux, AKS will transition to offer one unified immutable host offering. This work started with our use of Flatcar Container Linux in Preview and now continues with the GA release of ACL. As part of this release, Flatcar will no longer be available via --os-sku on AKS. Please note, this change applies specifically to the AKS preview experience; Flatcar is not being retired.

Later this year we will complete the convergence of our immutable OS offerings by incorporating remaining kernel and runtime features of the current OS Guard preview into ACL. At that time, existing users of OS Guard will receive a guided transition to ACL, ensuring operational continuity while consolidating to a single container host.

Get Started with Azure Container Linux

ACL is GA and available today for all AKS customers.
To begin using ACL in your clusters and explore documentation, best practices, and deployment guidance, visit: aka.ms/azurecontainerlinux

ACL represents the future of secure, cloud-optimized Linux on AKS, building on the proven foundation of Flatcar, advancing it with Azure Linux innovations, and contributing back to the open-source ecosystem that customers depend on. We’re thrilled to bring this new foundation to our customers and can’t wait to see what you build with it.

Learn More

- //Build Session: Build, deploy, and run Linux workloads on Azure
- Azure Container Linux documentation: https://aka.ms/azurecontainerlinux
- Azure Container Linux on GitHub: https://github.com/microsoft/azure-container-linux
- Azure Linux product page: https://aka.ms/AzureLinuxProduct
- Azure Linux documentation: https://aka.ms/azurelinux
- Joining the ISV partner program: AzureLinuxPartners@microsoft.com

Four open source projects to explore at Microsoft Build
Open source is where developers experiment, collaborate, and turn new ideas into tools that others can build on. At Microsoft Build, we’re creating a dedicated space for that energy: the Open Source Zone.

This year, the Open Source Zone will bring together maintainers, contributors, and developers working on some of the most interesting open source projects in AI. Whether you’re building agents, experimenting with local models, exploring prompt workflows, or looking for practical ways to bring AI into your development process, this is a place to meet the people behind the projects and see what they’re building. The Open Source Zone is inspired by similar community spaces we’ve hosted at GitHub Universe: hands-on, conversation-driven, and centered on the people and projects moving open source forward.

Meet the projects

OpenClaw

OpenClaw, originally Clawbot, formerly Clawdbot, and briefly Moltbot before landing on its current name (because naming is hard), is a personal AI assistant project built for developers who want more control over how AI agents run across tools, devices, and workflows. Its repository describes it as “your own personal AI assistant” across operating systems and platforms, with support for agent workspaces, skills, and device nodes. It has also become one of the fastest-growing open source projects on GitHub, with over 370,000 stars to date. At the Open Source Zone, attendees can learn how OpenClaw approaches personal agents, extensibility, and local-first experimentation.

AutoGPT

AutoGPT is one of the best-known open source projects in the autonomous agent space. The project’s mission is to make AI accessible for everyone to use and build on, with tools for building, testing, and delegating work to agents. Visit AutoGPT in the Open Source Zone to learn how the project is evolving agent development, benchmarking, frontend experiences, and practical workflows for building agent-powered applications.
Come for the autonomous agents; stay for the very human maintainers. AutoGPT is also a member of GitHub’s Secure Open Source Fund, with a goal of enhancing AI security across the open source ecosystem.

Open WebUI

Open WebUI is a self-hosted, extensible AI platform for working with large language models. The project supports Ollama and OpenAI-compatible APIs and includes built-in RAG capabilities, making it a strong option for developers and organizations exploring local, private, or provider-flexible AI experiences. At Build, the Open WebUI team will show how developers can run, customize, and extend AI interfaces for their own environments.

prompts.chat

prompts.chat, formerly Awesome ChatGPT Prompts, is a curated collection of prompt examples for AI chat models. The project is designed to help people discover, share, and build better prompts for modern AI assistants. Created by Fatih Kadir Akın, a GitHub Star from Istanbul, prompts.chat reflects his work at the intersection of open source, developer education, and AI-assisted development. Fatih leads Developer Relations at Teknasyon, has authored books on JavaScript and prompt engineering, and is active in the community as a speaker, organizer, and contributor. Stop by to explore prompt libraries, prompt engineering resources, self-hosting options, and ways the community is making prompting more reusable and collaborative.

Register for Microsoft Build

Microsoft Build takes place June 2–3, 2026, in San Francisco and online. In-person passes are available, and online registration is free for livestreamed keynote and select session access. Register for Microsoft Build and come visit the Open Source Zone to meet the teams behind OpenClaw, AutoGPT, Open WebUI, and prompts.chat. We’ll see you there. <3

Governing AI Agents Against Every OWASP Agentic Risk: A Deep Dive with the Agent Governance Toolkit
AI agents are moving from prototypes to production. They book flights, write code, negotiate contracts, and operate across enterprise systems with minimal human oversight. The attack surface is not theoretical: OWASP has catalogued the top 10 risks specific to agentic applications, and every one of them maps to a real-world failure mode.

The Agent Governance Toolkit (AGT) is an open-source, MIT-licensed framework that enforces deterministic governance at runtime, before every tool call, message, and action an agent takes. This is not prompt engineering or guardrails bolted on after the fact. AGT provides policy-as-code enforcement, zero-trust identity, execution isolation, and tamper-evident audit trails across the full agent lifecycle.

In this post, we walk through all 10 OWASP Agentic risks with real code from the AGT repository. By the end, you will have concrete examples for every risk category and a clear path to production-grade agent governance.

Coverage at a Glance

| # | OWASP Risk | AGT Component | Key Mechanism |
|---|---|---|---|
| ASI-01 | Agent Goal Hijack | Agent OS | Policy Engine + Action Interception |
| ASI-02 | Tool Misuse & Exploitation | Agent OS | Capability Sandboxing + Input Sanitization |
| ASI-03 | Identity & Privilege Abuse | AgentMesh | DID Identity + Trust Scoring |
| ASI-04 | Supply Chain Vulnerabilities | AgentMesh | AI-BOM (Model + Data + Weights Provenance) |
| ASI-05 | Unexpected Code Execution | Agent Runtime | Execution Rings (Ring 0-3) |
| ASI-06 | Memory & Context Poisoning | Agent OS | VFS Policies + CMVK Verification |
| ASI-07 | Insecure Inter-Agent Comms | AgentMesh | IATP + E2E Encrypted Channels |
| ASI-08 | Cascading Agent Failures | Agent SRE | Circuit Breakers + SLOs |
| ASI-09 | Human-Agent Trust Exploitation | Agent OS | Approval Workflows + Quorum Logic |
| ASI-10 | Rogue Agents | Agent Runtime | Kill Switch + Ring Isolation + Merkle Audit |

ASI-01: Agent Goal Hijack

The risk: Attackers manipulate the agent's objectives via indirect prompt injection or poisoned inputs.
The agent believes it is following its original instructions, but it has been redirected.

AGT mitigates this through the Agent OS policy engine. Every agent action passes through a declarative policy evaluation layer before execution. The policy engine supports three modes: strict (deny by default), permissive (allow by default), and audit (log only). Unauthorized goal changes are blocked at the action layer, not at the prompt layer.

    import asyncio

    from agent_os import StatelessKernel, ExecutionContext

    async def main():
        kernel = StatelessKernel()
        ctx = ExecutionContext(agent_id="my-agent", policies=["read_only"])

        # This action is blocked by policy -- goal hijack prevented
        result = await kernel.execute(
            action="delete_database",
            params={"target": "production"},
            context=ctx,
        )
        # result.success = False, result.error = "Policy violation: read_only"

    # kernel.execute is awaited, so it has to run inside an event loop
    asyncio.run(main())

The MCP Governance Proxy extends this to Model Context Protocol tool calls, evaluating policy before any tool invocation reaches the agent runtime.

ASI-02: Tool Misuse & Exploitation

The risk: An agent's authorized tools are abused in unintended ways, such as exfiltrating data via read operations or chaining benign tools into dangerous workflows.

AGT provides capability-based security inspired by POSIX. Agents receive explicit capability grants (read, write, execute, network), not blanket tool access. The built-in strict mode blocks dangerous tools like run_shell, execute_command, and eval. Tool inputs are sanitized for command injection patterns and shell metacharacters. The verify_code_safety MCP tool checks generated code before execution, and tool allowlists/denylists give operators fine-grained control over which tools each agent can invoke.

ASI-03: Identity & Privilege Abuse

The risk: Agents escalate privileges by abusing identities or inheriting excessive credentials. Without proper identity, agents operate as ambient authority, and any compromise cascades.

AgentMesh implements zero-trust identity using Decentralized Identifiers (DIDs).
Every agent gets a cryptographic identity, did:agentmesh:{agentId}:{fingerprint}, backed by Ed25519 key pairs. Trust is earned through a tiered model: Untrusted, Provisional, Trusted, Verified. Trust decays over time without positive signals, and delegation chains must always narrow scope (child capabilities must be a subset of parent capabilities).

    from agentmesh import AgentIdentity

    identity = AgentIdentity.create(
        name="data-analyst",
        sponsor="admin@contoso.com",
        capabilities=["read:data"],  # Scoped -- cannot write or delete
    )

    # Delegation MUST narrow, never widen
    child = identity.delegate(
        name="chart-helper",
        capabilities=["read:data:charts"],  # Subset of parent
    )

ASI-04: Agentic Supply Chain Vulnerabilities

The risk: Vulnerabilities in third-party tools, plugins, agent registries, or runtime dependencies that agents use to act, plan, or delegate.

AgentMesh implements the AI-BOM (AI Bill of Materials), a comprehensive standard for tracking the full AI supply chain. This includes:

- Model provenance: base model ancestry, fine-tuning history, training cutoff dates.
- Dataset tracking: training data, RAG sources, and evaluation benchmarks, with data cards including PII status, bias assessment, and consent tracking.
- Weights versioning: SHA-256 hashes, quantization records, LoRA adapter metadata, SLSA build provenance.
- Software dependencies: SPDX-aligned package tracking with CI security scanning.
# AI-BOM tracks the full supply chain ai_bom = { "modelProvenance": { "primary": {"provider": "anthropic", "model": "claude-3-sonnet"}, "fineTuning": {"method": "LoRA", "evaluationMetrics": {"accuracy": 0.94}}, }, "datasets": [ {"name": "FAQ KB", "type": "fine-tuning", "dataCard": {"piiStatus": "redacted"}}, {"name": "Product Docs", "type": "rag-source", "updateFrequency": "weekly"}, ], "weights": {"hash": "sha256:...", "format": "safetensors", "precision": "bf16"}, } ASI-05: Unexpected Code Execution The risk: Agents trigger remote code execution through tools, interpreters, or APIs. Without isolation, a single compromised tool call can escalate to full system access. Agent Runtime implements CPU ring-inspired execution isolation. Agents run in one of four execution rings: Ring 0 (root/supervisor), Ring 1 (privileged), Ring 2 (standard), and Ring 3 (sandbox/untrusted). Each ring has resource limits and the kill switch provides instant termination of runaway agents. from hypervisor.models import ( ActionDescriptor, ExecutionRing, ReversibilityLevel, ) from hypervisor.rings.enforcer import RingEnforcer from hypervisor.security.kill_switch import KillSwitch, KillReason # Define agent privilege levels AGENTS = { "supervisor": {"ring": ExecutionRing.RING_0_ROOT, "role": "Orchestrator"}, "data-agent": {"ring": ExecutionRing.RING_1_PRIVILEGED, "role": "Data Engineer"}, "analyst": {"ring": ExecutionRing.RING_2_STANDARD, "role": "Analyst"}, "user-bot": {"ring": ExecutionRing.RING_3_SANDBOX, "role": "User-Facing"}, } # Create a sandboxed action descriptor action = ActionDescriptor( name="run_query", required_ring=ExecutionRing.RING_2_STANDARD, reversibility=ReversibilityLevel.REVERSIBLE, ) # Enforce: sandbox agent cannot run a Ring 2 action enforcer = RingEnforcer() result = enforcer.check(agent_ring=ExecutionRing.RING_3_SANDBOX, action=action) # result.allowed = False -- ring violation prevented # Kill switch for runaway agents kill_switch = KillSwitch() 
kill_switch.terminate(agent_id="user-bot", reason=KillReason.RING_BREACH)

ASI-06: Memory & Context Poisoning

The risk: Persistent memory or long-running context is poisoned with malicious instructions. An attacker embeds hostile content in a document the agent later retrieves, causing it to follow injected goals.

Agent OS provides a policy-controlled virtual filesystem (VFS) for agent memory. The VFS uses POSIX-style mount points: /mem/working for current context, /mem/episodic for past interactions, /mem/semantic for knowledge, /policy for read-only policy files, and /tools for tool interfaces. Each mount point has enforced permissions (read, write, execute, append). The policy directory is always read-only from user-space, preventing agents from modifying their own governance rules.

from agent_control_plane.vfs import AgentVFS, MemoryBackend, FileMode

# Create agent VFS with POSIX-style memory abstraction
vfs = AgentVFS(agent_id="data-analyst")

# Mount memory backends with explicit permissions
vfs.mount("/mem/working", MemoryBackend(), mode=FileMode.READ | FileMode.WRITE)
vfs.mount("/mem/semantic", MemoryBackend(), mode=FileMode.READ)  # Read-only knowledge
vfs.mount("/policy", MemoryBackend(), mode=FileMode.READ)  # Policies always read-only

# Agent can read working memory
data = vfs.read("/mem/working/context.json")

# Agent CANNOT write to policy -- enforced at VFS layer
# vfs.write("/policy/rules.yaml", content)  # Raises PermissionError

# Agent CANNOT read a path that was never mounted (no procedural memory is mounted here)
# vfs.read("/mem/procedural/skills")  # Raises FileNotFoundError

The CMVK (Cross-Model Verification Kernel) adds a second layer: claims from agent context are verified across multiple AI models to detect poisoned content. Prompt injection patterns like 'ignore previous instructions' and 'disregard prior' are detected and blocked by the MCP proxy sanitizer before reaching the agent.
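To make the pattern-matching idea concrete, here is a minimal sketch of a phrase-based filter. It is not AGT's MCP proxy sanitizer: the names `INJECTION_PATTERNS`, `find_injection_patterns`, and `sanitize` are hypothetical, and the pattern list contains only the two phrases quoted in the text. A production sanitizer would need a much broader, maintained rule set.

```python
import re

# Hypothetical pattern list: only the two phrases quoted in the text.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+prior", re.IGNORECASE),
]


def find_injection_patterns(text: str) -> list[str]:
    """Return the regex source of every pattern that matches the text."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]


def sanitize(text: str) -> str:
    """Return the text unchanged, or raise ValueError if a pattern matches."""
    hits = find_injection_patterns(text)
    if hits:
        raise ValueError(f"blocked: matched {len(hits)} injection pattern(s)")
    return text
```

Phrase matching of this kind is cheap but easy to evade with paraphrasing, which is presumably why the text pairs it with cross-model verification as a second layer.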
ASI-07: Insecure Inter-Agent Communication The risk: Agents collaborate without adequate authentication, confidentiality, or validation. Messages between agents can be intercepted, forged, or replayed. AgentMesh provides IATP (Inter-Agent Trust Protocol) with E2E encrypted channels using the Signal protocol (X3DH key agreement + Double Ratchet). Every message gets per-message forward secrecy and post-compromise security. The EncryptedTrustBridge requires a successful trust handshake before any encrypted channel can be established, and mutual authentication via Ed25519 challenge-response ensures both parties prove identity at connection time. from agentmesh.encryption.bridge import EncryptedTrustBridge bridge = EncryptedTrustBridge(agent_did="did:mesh:alice", key_manager=keys) channel = await bridge.open_secure_channel("did:mesh:bob", bob_bundle) ciphertext = channel.send(b"governed action") # E2E encrypted ASI-08: Cascading Agent Failures The risk: An initial error or compromise triggers multi-step compound failures across chained agents. One agent's failure propagates through the entire system. Agent SRE brings production-grade reliability engineering to agent fleets. Circuit breakers automatically isolate failing agents before failures cascade. SLO enforcement with error budgets provides quantified failure tolerance that triggers automatic intervention. Cascading failure detection monitors dependency chains for propagation patterns, and canary deploys enable gradual rollout of agent changes to detect issues early. OpenTelemetry integration provides distributed tracing across multi-agent workflows. The key insight: treat AI agents like microservices. Apply the same SRE discipline (SLOs, error budgets, circuit breakers, chaos testing) that keeps cloud infrastructure reliable. ASI-09: Human-Agent Trust Exploitation The risk: Attackers leverage misplaced user trust in agents' autonomy to authorize dangerous actions. 
Users rubber-stamp agent requests because they trust the agent, and attackers exploit this approval fatigue. Agent OS implements approval workflows that require explicit human confirmation for high-risk actions. The system supports configurable risk assessment (critical, high, medium, low), quorum logic for critical actions requiring multiple approvals, and expiration tracking to prevent stale authorizations. The escalation handler includes fatigue detection: if an agent floods reviewers with escalation requests, subsequent requests are auto-denied to prevent the approval-fatigue attack. from agent_os.integrations.escalation import ( EscalationHandler, InMemoryApprovalQueue, DefaultTimeoutAction, QuorumConfig, ) # Configure approval workflow with fatigue protection handler = EscalationHandler( backend=InMemoryApprovalQueue(), timeout_seconds=300, # 5-minute approval window default_action=DefaultTimeoutAction.DENY, # Deny if no human responds quorum=QuorumConfig(required=2, total=3), # 2-of-3 approvers for critical fatigue_threshold=5, # Auto-deny after 5 rapid requests fatigue_window_seconds=60, # Within a 60-second window ) # Three-outcome model: allow, deny, or escalate # High-risk actions trigger escalation to human reviewers # If the agent triggers too many escalations, fatigue detection kicks in ASI-10: Rogue Agents The risk: Agents operating outside their defined scope through configuration drift, reprogramming, or emergent misbehavior. A rogue agent might gradually expand its actions beyond its mandate without any single action triggering a block. AGT combines runtime behavioral monitoring with instant kill capability. Ring isolation confines rogue agents to their execution ring, preventing privilege escalation. The kill switch provides immediate termination for agents exhibiting rogue behavior (behavioral drift, rate limit violations, ring breaches). 
Trust score decay tracks agent behavior over time, and the Merkle audit chain provides tamper-evident, cryptographic proof of every agent action. from agentmesh.governance.audit import AuditEntry, MerkleAuditChain from hypervisor.security.kill_switch import KillSwitch, KillReason # Tamper-evident audit trail chain = MerkleAuditChain() entry = AuditEntry( event_type="tool_call", agent_did="did:agentmesh:data-bot:abc123", action="query_database", outcome="allowed", policy_decision="permit", matched_rule="read_only_policy", ) chain.add_entry(entry) # Auto-computes hash chain # Verify integrity -- any tampering breaks the chain proof = chain.get_proof(entry.entry_id) assert chain.verify_proof(proof) # Cryptographic verification # Kill switch for rogue behavior kill = KillSwitch() kill.terminate( agent_id="data-bot", reason=KillReason.BEHAVIORAL_DRIFT, # Also: RATE_LIMIT, RING_BREACH, MANUAL ) Cross-Cutting Principle: Least Agency The Least Agency principle is emphasized throughout the OWASP Agentic Top 10 as a foundational design principle. Agents should be granted the minimum capabilities, permissions, and autonomy necessary to complete their assigned tasks. Layer Least Agency Mechanism Agent OS Policy engine enforces deny-by-default; agents must be explicitly granted each capability AgentMesh DID identity with scoped capabilities; delegation requires narrowing (child <= parent) Agent Runtime Execution rings (Ring 0-3) enforce privilege tiers; untrusted agents run in Ring 3 Agent SRE Resource limits and error budgets cap agent impact radius Performance: Governance Without Latency Tax A common concern with runtime governance is performance overhead. 
AGT's benchmarks demonstrate that policy enforcement adds negligible latency:

| Metric | Value |
| --- | --- |
| Single rule evaluation | 84,000 ops/sec |
| 1000 concurrent agents | 47,000 ops/sec |
| Policy evaluation latency | <0.1ms (p99) |
| Prompt-based violation rate | 26.67% |
| AGT policy violation rate | 0.00% |
| Conformance tests | 992 |
| Architecture Decision Records | 25 |

The key takeaway: deterministic policy enforcement is orders of magnitude more reliable than prompt-based guardrails, and it runs fast enough for real-time agent workloads.

Framework Integrations

AGT is framework-agnostic. SDKs are available in Python, TypeScript, .NET, Rust, and Go. Native integrations exist for:

- LangChain and LangGraph
- CrewAI
- AutoGen (Microsoft)
- Semantic Kernel (Microsoft)
- OpenAI Agents SDK
- PydanticAI
- Model Context Protocol (MCP)
- Agent-to-Agent Protocol (A2A)

Each integration wraps the agent framework's tool-calling and message-passing interfaces with AGT's policy engine, trust scoring, and audit logging. Adding governance to an existing agent takes minutes, not weeks.

Compliance Framework Alignment

| Framework | AGT Coverage |
| --- | --- |
| OWASP Agentic Top 10 (2026) | All 10 risk categories mapped |
| NIST AI RMF | Govern, Map, Measure, Manage functions addressed |
| EU AI Act | Risk classification, audit trails, human oversight |
| SOC 2 Type II | Audit logging, access controls, change management |
| CSA ATF | Zero-trust agent architecture alignment |
| Singapore MGF | Zero-trust, accountability, oversight layers |

Getting Started

# Install the complete governance stack
pip install agent-governance-toolkit[full]

# Or install individual components
pip install agent-os-kernel      # Policy engine, VFS, approval workflows
pip install agentmesh-platform   # Identity, trust, encryption, audit
pip install agentmesh-runtime    # Execution rings, kill switch, saga
pip install agent-sre            # Circuit breakers, SLOs, chaos testing

The quickstart tutorial walks through adding policy enforcement to an existing LangChain agent in under 10 minutes.
Start with a single policy rule and expand as your governance requirements grow. Contribute and Collaborate AGT is open source under the MIT license. The project has over 2,000 GitHub stars and contributors from 40+ countries. Whether you are building agent governance for your enterprise, integrating a new framework, or extending the policy engine with OPA/Rego or Cedar policies, we welcome contributions. Repository: https://github.com/microsoft/agent-governance-toolkit Documentation: https://microsoft.github.io/agent-governance-toolkit Discussions: GitHub Discussions on the repository Disclaimer: This document is provided for informational purposes. Code examples are from the public AGT repository and may evolve. Always refer to the latest repository documentation for current APIs.
Inspektor Gadget Completes Its First Independent Security Audit
Inspektor Gadget, the CNCF eBPF tool for Kubernetes and Linux observability, has completed its first independent security audit, conducted by Shielder and coordinated by OSTIF and CNCF. The audit found two Medium and one Low-severity issue, now patched in release v0.50.1. Learn what the auditors discovered, the hardening recommendations the maintainers are acting on, and why this milestone matters for the open source community.
Applying Site Reliability Engineering to Autonomous AI Agents
If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity. That mental model transfers directly to AI agents. It just needs four new ideas. In the Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents, we covered Agent SRE briefly as one of AGT's nine packages: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it. Agent SRE is one of the more novel parts of the toolkit. The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Established patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook. This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this. The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query. When an AI agent fails, your observability stack is silent. The agent returned HTTP 200. 
Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed. These are not infrastructure failures. They are behavioral failures. And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior. This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI. The Safety SLI: A New Reliability Dimension Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly? For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect. The Safety SLI answers a different question: did the agent act within policy? from agent_sre import SLO, ErrorBudget from agent_sre.slo.indicators import PolicyCompliance # Define a safety SLO: 99% of agent actions must comply with policy safety_slo = SLO( name="safety-compliance", indicators=[ PolicyCompliance( target=0.99, window="7d", ), ], error_budget=ErrorBudget( total=0.01, # 1% budget (1 - 0.99 target) window_seconds=2592000, # 30-day window burn_rate_alert=2.0, # warn at 2x sustainable rate burn_rate_critical=5.0, # page at 5x sustainable rate ), ) When an agent's policy compliance rate drops below 99%, the error budget starts burning. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method. 
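The burn-rate thresholds in the configuration above are easier to reason about with the arithmetic written out. The sketch below uses the standard SRE definition of burn rate (observed failure fraction divided by the allowed failure fraction); whether `ErrorBudget.burn_rate()` computes it exactly this way is an assumption, and the function names here are hypothetical helpers, not part of `agent_sre`.

```python
def burn_rate(bad_events: int, total_events: int, budget_fraction: float) -> float:
    """Observed failure fraction divided by the allowed failure fraction."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / budget_fraction


def seconds_to_exhaustion(rate: float, window_seconds: int) -> float:
    """Time until a full budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_seconds / rate


def classify(rate: float, alert: float = 2.0, critical: float = 5.0) -> str:
    """Map a burn rate onto the two thresholds used in the SLO above."""
    if rate >= critical:
        return "critical"
    if rate >= alert:
        return "alert"
    return "ok"


# 30 policy violations in 1,000 actions against a 1% budget
rate = burn_rate(bad_events=30, total_events=1000, budget_fraction=0.01)
print(classify(rate), seconds_to_exhaustion(rate, window_seconds=2592000))
```

At roughly 3x the sustainable rate, a 30-day budget lasts about 10 days, which is why the 2x threshold is a warning and the 5x threshold is worth a page.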
When the budget is exhausted, the configured exhaustion_action determines the system response: from agent_sre.slo.objectives import ExhaustionAction # Configure what happens when error budget is exhausted safety_slo = SLO( name="safety-compliance", indicators=[PolicyCompliance(target=0.99, window="7d")], error_budget=ErrorBudget( total=0.01, window_seconds=2592000, burn_rate_alert=2.0, # fires at 2x sustainable burn rate burn_rate_critical=5.0, # fires at 5x sustainable burn rate exhaustion_action=ExhaustionAction.CIRCUIT_BREAK, # suspend agent when budget is gone ), ) # In your monitoring loop, check for firing alerts alerts = safety_slo.error_budget.firing_alerts() for alert in alerts: print(f"Alert firing: {alert.name} (severity: {alert.severity})") # Check budget status print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%") print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x") print(f"Exhausted: {safety_slo.error_budget.is_exhausted}") This is the governance dial from the other direction. The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions. An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior. There are multiple SLI dimensions built into Agent SRE. 
Safety SLIs and Performance SLIs track different aspects of the same agent: SLI Type What It Measures Target Pattern When Budget Burns Safety SLI PolicyCompliance -- fraction of actions within authorized scope >= 99% Restrict capabilities, increase human oversight Performance SLI TaskSuccessRate, ResponseLatency, CostPerTask Configurable per workload Alert, throttle, or circuit-break LLM provider Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI. Both SLOs feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. You need both dimensions to understand whether an agent is production-ready. Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -> OPEN -> HALF_OPEN. You know it well. 
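For readers who want the three-state pattern spelled out, here is a minimal, generic sketch. It is not the `agent_sre` implementation: `SimpleBreaker` and `State` are hypothetical names, the parameter names are borrowed from the article's configuration for readability, and two behaviors are assumptions (failures are counted consecutively while CLOSED, and `half_open_max_calls` is treated as the number of successful probes needed to close again).

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SimpleBreaker:
    """Illustrative three-state breaker with an injectable clock."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60.0,
                 half_open_max_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._probe_successes = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; OPEN becomes HALF_OPEN after the timeout."""
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout_seconds:
                return False
            self.state = State.HALF_OPEN
            self._probe_successes = 0
        return True

    def record_success(self) -> None:
        if self.state is State.HALF_OPEN:
            self._probe_successes += 1
            if self._probe_successes >= self.half_open_max_calls:
                self.state = State.CLOSED
                self._failures = 0
        elif self.state is State.CLOSED:
            self._failures = 0  # assumption: only consecutive failures count

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # any failed probe reopens the breaker
        elif self.state is State.CLOSED:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self._opened_at = self._clock()
```

Passing the clock in as a parameter keeps the state machine testable without sleeping through the recovery timeout.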
Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures: from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker from agent_sre.chaos.engine import FaultType config = CircuitBreakerConfig( failure_threshold=5, # Open after 5 failures in the window recovery_timeout_seconds=60, # Stay OPEN for 60s before HALF_OPEN half_open_max_calls=3, # Allow 3 probes in HALF_OPEN ) breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config) # Failure modes tracked by the circuit breaker: tracked_faults = [ FaultType.POLICY_BYPASS, # Agent exceeds authorized scope FaultType.ERROR_INJECTION, # Upstream model API fails FaultType.TIMEOUT_INJECTION, # Tool calls exceed time budget FaultType.TRUST_PERTURBATION, # Agent trust score falls below threshold FaultType.DEADLOCK_INJECTION, # Agent stuck in iterative reasoning ] Each failure mode has different circuit-breaking semantics: Failure Mode What Triggers It Circuit-Break Behavior Policy bypass Action denied by policy engine Count toward threshold; log with full context LLM provider error HTTP 5xx from model API Immediately open; route to fallback model if configured Tool timeout Tool call exceeds timeout_ms Count toward threshold; cancel in-flight call Trust score degradation Agent trust score drops below configured floor Open; escalate to Ring 3 (untrusted) until score recovers Reasoning loop / deadlock Token or iteration count exceeds budget Open; trigger human review before resuming The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely. 
The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop: # Reasoning loop detection configuration loop_detection_config = { "max_iterations": 15, # Hard stop after 15 reasoning steps "max_tokens_per_session": 50000, # Hard stop on token consumption "repetition_threshold": 0.85, # Stop if >85% of recent actions repeat prior ones "on_detection": "circuit_break_and_escalate", } The state machine behaves identically to what you know from Hystrix or Resilience4j. What changes is the definition of "failure." CLOSED (serving) | | failure_threshold crossed for any tracked fault v OPEN (rejecting -- agent action denied, fallback or human-in-loop fires) | | recovery_timeout expires v HALF_OPEN (probe -- limited requests allowed through) | |-- success_threshold met --> CLOSED |-- any failure --> OPEN (reset timeout) Chaos Engineering for Agents: Fault Injection for Autonomous Systems The only way to know if your agent system is resilient is to break it intentionally. Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems. 
Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production: from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType # Experiment 1: LLM provider degrades -- model returns valid responses but with # increased latency and occasional malformed outputs experiment = ChaosExperiment( name="llm-degradation-resilience", target_agent="analyst-agent-001", description="Test agent behavior under degraded LLM provider", faults=[ Fault.latency_injection(target="llm-provider", delay_ms=8000), Fault.error_injection(target="llm-provider", rate=0.05), ], duration_seconds=300, ) # Experiment 2: Trust score manipulation -- simulates an agent receiving # messages from a peer with a spoofed trust score trust_experiment = ChaosExperiment( name="trust-manipulation-resilience", target_agent="orchestrator-001", faults=[ Fault( fault_type=FaultType.TRUST_PERTURBATION, target="did:mesh:orchestrator-001", params={"spoofed_score": 950}, ), ], duration_seconds=120, ) # Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously, # testing whether the agent abandons gracefully or enters a reasoning loop cascade_experiment = ChaosExperiment( name="tool-timeout-cascade", target_agent="analyst-agent-001", faults=[ Fault.timeout_injection(target="database.read", delay_ms=30000), Fault.timeout_injection(target="api.call", delay_ms=30000), ], duration_seconds=180, ) # Run the experiment experiment.start() # ... inject faults during agent execution ... 
resilience = experiment.calculate_resilience( baseline_success_rate=0.95, experiment_success_rate=0.87, recovery_time_ms=48000, ) experiment.complete(resilience=resilience) print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}") Additional fault types built into the chaos engine cover: prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments. Important: The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose. Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness. Replay Debugging: Reproduce Behavioral Failures Exactly Infrastructure incidents are reproducible because infrastructure is deterministic. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends. Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message. 
from agent_sre.replay.capture import TraceStore from agent_sre.replay.engine import ReplayEngine, ReplayMode # Traces are captured automatically when SRE tracing is active store = TraceStore( backend="azure_blob", retention_days=30, ) # When an incident occurs, replay the session exactly engine = ReplayEngine(store=store) # Full replay: re-run the session against the same recorded inputs # Uses recorded tool outputs -- no live tool calls -- so replay is deterministic result = await engine.replay( trace_id="trace_2026_05_a7f3b2", mode=ReplayMode.FULL, ) for step in result.steps: print(f"Step {step.index}: {step.action} -> {step.decision}") # Divergence analysis: replay with a policy change applied # Shows exactly which actions would have been blocked under the new policy diff_result = await engine.diff( trace_id="trace_2026_05_a7f3b2", policy_override="policies/stricter-v2.yaml", ) for diff in diff_result.diffs: if diff.description: print(f"Step {diff.span_name}: was {diff.original}, " f"would be {diff.replayed} under new policy") The divergence analysis is the feature teams use most. When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork. Progressive Delivery: Safely Rolling Out New Agent Capabilities When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back. Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope, giving it write access it did not have, connecting it to a new tool, or raising its trust floor, you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage. 
from agent_sre.delivery.rollout import ( AnalysisCriterion, CanaryRollout, RollbackCondition, RolloutStep, ) rollout = CanaryRollout( name="database-write-capability", steps=[ RolloutStep( name="canary", weight=0.05, # 5% of agents get the new capability duration_seconds=86400, # 24 hours analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.995), AnalysisCriterion(metric="performance_sli", threshold=0.90), AnalysisCriterion( metric="error_budget_consumed", threshold=0.10, comparator="lte", # canary can burn at most 10% ), ], ), RolloutStep( name="early-adopters", weight=0.25, # 25% traffic duration_seconds=172800, # 48 hours analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.990), AnalysisCriterion(metric="performance_sli", threshold=0.88), ], ), RolloutStep( name="general-availability", weight=1.0, # 100% traffic duration_seconds=604800, # 1 week of full observation analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.990), AnalysisCriterion(metric="performance_sli", threshold=0.85), ], ), ], rollback_conditions=[ RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"), ], ) # Start the rollout -- SLO gates evaluate at each step rollout.start() # Advance to next step when analysis criteria pass if rollout.advance(): print(f"Advanced to step: {rollout.current_step.name}") print(f"Progress: {rollout.progress_percent:.0f}%") The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically. This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic. Health Checks and Backpressure Traditional health checks answer: is the service alive? For agents, alive is not enough. 
A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned. # Agent health check covering multiple dimensions health = await agent_health_check( agent_id="analyst-agent-001", dimensions=[ "liveness", # Is the agent process running? "policy_compliance", # Is safety SLI above threshold? "trust_score", # Is trust score above Ring floor? "resource_budget", # Is token/API spend within limits? "tool_availability", # Are the tools the agent needs reachable? ], ) # health.status: "healthy" | "degraded" | "unhealthy" # health.dimensions: per-dimension pass/fail with values # health.recommended_action: "none" | "restrict" | "suspend" | "terminate" When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, drain in-flight tasks gracefully before the situation escalates. # Backpressure configuration backpressure_config = { "backpressure_threshold": 0.80, # Engage when resource utilization > 80% "max_concurrent": 5, # Hard cap on simultaneous agent tasks "priority_shedding": True, # Drop low-priority tasks first "drain_timeout_seconds": 30, # Allow in-flight tasks to complete } The ordering matters: backpressure first, then circuit breaker, then suspension. Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services. Observability: Governance Metrics Flow Into Your Existing Stack Agent SRE does not ask you to adopt a new observability platform. Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others. 
from agent_sre.tracing.exporters import configure_exporters configure_exporters( backends=[ {"type": "prometheus", "endpoint": "http://prometheus:9090"}, {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"}, ], include_metrics=[ "slo.safety_sli", # Per-agent safety compliance rate "slo.error_budget_remaining", # Error budget in percentage "slo.burn_rate", # Current burn rate vs sustainable "circuit_breaker.state", # CLOSED / OPEN / HALF_OPEN "circuit_breaker.failure_count", "trust_score.current", # Agent trust score (0-1000) "trust_score.ring", # Current execution ring "chaos.experiments_run", # Chaos experiment telemetry "health.status", # Aggregate health status "backpressure.load", # Current load vs threshold ], ) Key governance metrics available in your existing dashboards: Metric What It Tells You Alert Condition slo.safety_sli Fraction of agent actions within policy < 0.99 slo.burn_rate Rate at which error budget is consumed > 2.0 (warn), > 5.0 (page) slo.error_budget_remaining Budget left for the SLO window < 20% circuit_breaker.state Current breaker state per agent OPEN or HALF_OPEN trust_score.ring Execution ring (privilege level) Ring 3 (untrusted) health.status Aggregate health across all dimensions degraded or unhealthy If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack. 
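The alert conditions in the table above translate directly into a small rule check. The sketch below is one possible reading, not part of Agent SRE: `evaluate_alerts` is a hypothetical helper, and it assumes that `slo.error_budget_remaining` is reported as a percentage from 0 to 100, that `trust_score.ring` is an integer, and that states are reported as plain strings. In practice these rules would more likely live in your alerting system than in application code.

```python
def evaluate_alerts(metrics: dict) -> list[str]:
    """Return one message per alert condition that the metric snapshot trips."""
    alerts = []
    if metrics.get("slo.safety_sli", 1.0) < 0.99:
        alerts.append("slo.safety_sli below 0.99")
    burn = metrics.get("slo.burn_rate", 0.0)
    if burn > 5.0:
        alerts.append("slo.burn_rate above 5.0 (page)")
    elif burn > 2.0:
        alerts.append("slo.burn_rate above 2.0 (warn)")
    # Assumption: remaining budget is a percentage in the range 0-100.
    if metrics.get("slo.error_budget_remaining", 100.0) < 20.0:
        alerts.append("slo.error_budget_remaining below 20%")
    if metrics.get("circuit_breaker.state", "CLOSED") in ("OPEN", "HALF_OPEN"):
        alerts.append("circuit_breaker.state is not CLOSED")
    # Assumption: the ring is exported as an integer, with 3 meaning untrusted.
    if metrics.get("trust_score.ring") == 3:
        alerts.append("trust_score.ring is Ring 3 (untrusted)")
    if metrics.get("health.status", "healthy") in ("degraded", "unhealthy"):
        alerts.append("health.status is degraded or unhealthy")
    return alerts


snapshot = {"slo.safety_sli": 0.995, "slo.burn_rate": 3.1, "circuit_breaker.state": "CLOSED"}
print(evaluate_alerts(snapshot))
```

Missing metrics default to their healthy value here, which keeps the check quiet for agents that do not export every metric; a stricter deployment might treat a missing metric as an alert in itself.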
The SRE Mental Model for Agents: Four New Concepts

Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems:

| Traditional SRE | Agent SRE Equivalent | What Changes |
| --- | --- | --- |
| Latency SLI | Safety SLI | Correctness of *action*, not speed of *response* |
| Error budget | Autonomy budget | Burns on policy violations, not just errors |
| Circuit breaker | Behavioral circuit breaker | Opens on wrong *behavior*, not just failure codes |
| Canary deployment | Capability rollout | Rolls out *scope*, not just code |

The governance insight is that error budgets work in both directions for agents. A service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when it degrades. The error budget mechanism becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions.

Getting Started with Agent SRE

pip install agent-sre

A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them.
from agent_sre import SLO, ErrorBudget from agent_sre.slo.indicators import TaskSuccessRate from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker # Step 1: Define your safety SLO slo = SLO( name="production-safety", indicators=[TaskSuccessRate(target=0.99, window="24h")], error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0), ) # Step 2: Configure a circuit breaker breaker_config = CircuitBreakerConfig( failure_threshold=5, recovery_timeout_seconds=60, half_open_max_calls=3, ) breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config) # Step 3: Wire into your existing agent loop async def governed_agent_loop(agent, task): # Check health first if not await agent_is_healthy(agent.id): return {"error": "agent suspended", "reason": "health check failed"} # Run within circuit breaker protection async with breaker: result = await agent.run(task) slo.record_event(good=result.policy_compliant) return result The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines. Why This Matters Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down. Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it. The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it. Agent SRE is that infrastructure. 
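The breaker states exported as circuit_breaker.state (CLOSED / OPEN / HALF_OPEN) follow the usual circuit-breaker state machine. The following is a minimal, synchronous sketch of that state machine, assuming consecutive-failure counting and "all trial calls must succeed" recovery. It is not the agent-sre `CircuitBreaker`; the class `TinyBreaker` and its methods are hypothetical, and only the three parameter names mirror the `CircuitBreakerConfig` fields used earlier.

```python
import time


class TinyBreaker:
    """Hypothetical three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=60.0,
                 half_open_max_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock            # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0             # consecutive failures while CLOSED
        self.opened_at = 0.0
        self.trial_calls = 0          # calls admitted while HALF_OPEN
        self.trial_successes = 0

    def allow(self) -> bool:
        """Return True if a call may proceed right now."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return False
            # Recovery timeout elapsed: admit a limited number of trial calls.
            self.state = "HALF_OPEN"
            self.trial_calls = 0
            self.trial_successes = 0
        if self.state == "HALF_OPEN":
            if self.trial_calls >= self.half_open_max_calls:
                return False
            self.trial_calls += 1
        return True

    def record(self, ok: bool) -> None:
        """Report the outcome of a call that allow() admitted."""
        if self.state == "HALF_OPEN":
            if not ok:
                self._trip()          # any trial failure reopens the breaker
                return
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_max_calls:
                self.state = "CLOSED"
                self.failures = 0
        elif self.state == "CLOSED":
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self.opened_at = self.clock()
```

For a behavioral breaker, `ok` would be fed from a policy-compliance signal rather than from an exception or status code, which is the distinction the comparison table draws.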
Resources
GitHub: github.com/microsoft/agent-governance-toolkit
Install: pip install agent-sre
Tutorials: 40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery
Architecture reference: ARCHITECTURE.md
OWASP compliance mapping: OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection
Part 1 -- Runtime governance: Policy engines, trust, and SRE overview
Part 2 -- Shift-left governance: Catching violations before production
Part 3 -- Post-hoc accountability: After the agent acts

The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The `agent-sre` package is currently in public preview; APIs may change before general availability. Questions about Agent SRE in your environment? Open an issue at aka.ms/agent-governance-toolkit or start a discussion in the comments below.

Decoupling Memory from Startup Time in AKS Sandbox Pods
What if a 96 GB sandboxed pod could start as fast as a 2 GB one? Before recent improvements in AKS Pod Sandboxing, large-memory pods could take over a minute longer to start than smaller ones. For customers running latency-sensitive, autoscaling, AI/ML, or bursty workloads, that startup delay directly impacted scale-out responsiveness, job completion time, and overall cluster efficiency.

AKS Pod Sandboxing provides strong workload isolation by running pods inside lightweight virtual machines. This model is especially valuable for security-sensitive, untrusted, or multi-tenant workloads, but it came with a tradeoff: memory size directly impacted startup latency. With recent updates to the Azure Linux kernel used by AKS on Microsoft Hypervisor (MSHV), AKS has significantly improved startup time for large-memory sandboxed pods. This article explains what changed, why it matters, and what AKS customers should expect in practice.

The Problem: Large-Memory Pod Startup Was Expensive

Before this change, Kata-based pod sandboxes on AKS using the Microsoft Hypervisor (MSHV) followed an eager memory allocation model: when a pod sandbox VM was created, all memory specified in the pod resource request was committed up front on the host. For example, a pod requesting 32 GB, 64 GB, or 96 GB of memory forced the host to allocate and pin those virtual memory pages in physical memory before the VM could boot. As a result, sandbox startup time scaled linearly with memory size. Measurements showed startup time growing by roughly 0.65 seconds per GB of requested memory:

Pod Sandbox Memory | E2E Startup Time (Before)
32 GB | ~21 seconds
64 GB | ~41 seconds
96 GB | ~62 seconds

This led to:
Slower startup and scale-out for memory-heavy workloads.
Inefficient node utilization, because memory was reserved but unused at startup.

What Changed: Deferred Page Allocation in MSHV Host Kernel

With deferred page allocation, the kernel no longer commits all virtual machine memory at sandbox creation time.
The pod sandbox VM boots with a small initial memory footprint.
Host memory pages are committed lazily, only when the guest faults them.
The total available memory remains bounded by the pod memory limit defined in the pod specification.

This behavior aligns with how KVM-based systems handle guest memory today but is implemented for MSHV in Azure Linux. In short: memory is provisioned on demand, not up front.

Results

1. Pod Startup Time Is Now Effectively Constant

The most visible benefit for AKS customers is dramatically improved pod startup time for large-memory pods. With deferred page allocation enabled, startup time becomes approximately O(1) with respect to memory size:

Pod Sandbox Memory | E2E Startup Time (After)
32 GB | ~3 seconds
64 GB | ~3 seconds
96 GB | ~3.5 seconds

~7x faster startup for 32 GB pods
~12x faster startup for 64 GB pods
~17x faster startup for 96 GB pods

2. Higher Density and Better Memory Utilization

Deferred page allocation also reduces wasted reserved memory at pod start. This allows AKS nodes to safely oversubscribe memory for cold pods, pack more sandboxed pods per node, and improve overall workload density and infrastructure efficiency.

Tradeoff: First-Touch Page Fault Cost

Deferred page allocation introduces a first-touch cost: when a workload accesses a memory page for the first time, a page fault triggers host allocation. This cost is incurred once per page. After memory is populated, steady-state performance matches eager allocation in benchmarks. For most workloads, especially those that ramp memory gradually or benefit from faster startup, the improvement outweighs this one-time cost.

What AKS Pod Sandboxing Customers Need To Do

Here's the good part: no changes are required for workloads to benefit from this improvement. However, customers are encouraged to:
Specify realistic memory requests and limits.
Take advantage of improved startup behavior for scale-out scenarios.
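To illustrate the difference between eager and deferred allocation, the following toy model counts how many host pages are committed at boot versus on first touch. It is a conceptual sketch only and does not reflect how MSHV or the Azure Linux kernel implements this: the class names, the 4 KiB page size, and the per-page bookkeeping are all assumptions chosen for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for illustration only


class EagerGuestMemory:
    """Toy model: every page is committed before the guest can boot."""

    def __init__(self, limit_bytes: int):
        self.total_pages = limit_bytes // PAGE_SIZE
        self.committed = self.total_pages  # all work happens up front

    def touch(self, address: int) -> bool:
        """Return True if the access faulted; eager memory never faults."""
        if not 0 <= address < self.total_pages * PAGE_SIZE:
            raise MemoryError("access beyond pod memory limit")
        return False


class DeferredGuestMemory:
    """Toy model: pages are committed lazily, on first touch."""

    def __init__(self, limit_bytes: int):
        self.total_pages = limit_bytes // PAGE_SIZE
        self._pages = set()  # page numbers committed so far

    @property
    def committed(self) -> int:
        return len(self._pages)

    def touch(self, address: int) -> bool:
        """Return True if this access paid the one-time first-touch cost."""
        if not 0 <= address < self.total_pages * PAGE_SIZE:
            raise MemoryError("access beyond pod memory limit")
        page = address // PAGE_SIZE
        if page in self._pages:
            return False          # already populated: steady-state access
        self._pages.add(page)     # first touch: commit the page now
        return True


if __name__ == "__main__":
    limit = 96 * 2**30  # a 96 GiB pod memory limit
    print(EagerGuestMemory(limit).committed)     # pages committed at boot
    print(DeferredGuestMemory(limit).committed)  # pages committed at boot
```

In this model the boot-time work for the deferred case does not depend on the memory limit at all, which is the property behind the near-constant startup times, while each page still pays a single first-touch cost.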
Deferred page allocation is available in AKS Pod Sandboxing on AKS Azure Linux version 202603.18.1 or later, running kernel-mshv 6.6.121 or newer.

Run OpenClaw Agents on Azure Linux VMs (with Secure Defaults)
Many teams want an enterprise-ready personal AI assistant, but they need it on infrastructure they control, with security boundaries they can explain to IT. That is exactly where OpenClaw fits on Azure. OpenClaw is a self-hosted, always-on personal agent runtime you run in your enterprise environment and Azure infrastructure. Instead of relying only on a hosted chat app from a third-party provider, you can deploy, operate, and experiment with an agent on an Azure Linux VM you control — using your existing GitHub Copilot licenses, Azure OpenAI deployments, or API plans from OpenAI, Anthropic Claude, Google Gemini, and other model providers you already subscribe to. Once deployed on Azure, you can interact with an OpenClaw agent through familiar channels like Microsoft Teams, Slack, Telegram, WhatsApp, and many more! For Azure users, this gives you a practical middle ground: modern personal-agent workflows on familiar Azure infrastructure. What is OpenClaw, and how is it different from ChatGPT/Claude/chat apps? OpenClaw is a self-hosted personal agent runtime that can be hosted on Azure compute infrastructure. How it differs: ChatGPT/Claude apps are primarily hosted chat experiences tied to one provider's models OpenClaw is an always-on runtime you operate yourself, backed by your choice of model provider — GitHub Copilot, Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, and others OpenClaw lets you keep the runtime boundary in your own Azure VM environment within your Azure enterprise subscription In practice, OpenClaw is useful when you want a persistent assistant for operational and workflow tasks, with your own infrastructure as the control point. You bring whatever model provider and API plan you already have — OpenClaw connects to it. Why Azure Linux VMs? 
Azure Linux VMs are a strong fit because they provide: A suitable host machine for the OpenClaw agent to run on Enterprise-friendly infrastructure and identity workflows Repeatable provisioning via the Azure CLI Network hardening with NSG rules Managed SSH access through Azure Bastion instead of public SSH exposure How to Set Up OpenClaw on an Azure Linux VM This guide sets up an Azure Linux VM, applies NSG (Network Security Group) hardening, configures Azure Bastion for managed SSH access, and installs an always-on OpenClaw agent within the VM that you can interact with through various messaging channels. What you'll do Create Azure networking (VNet, subnets, NSG) and compute resources with the Azure CLI Apply Network Security Group rules so VM SSH is allowed only from Azure Bastion Use Azure Bastion for SSH access (no public IP on the VM) Install OpenClaw on the Azure VM Verify OpenClaw installation and configuration on the VM What you need An Azure subscription with permission to create compute and network resources Azure CLI installed (install steps) An SSH key pair (the guide covers generating one if needed) ~20–30 minutes Configure deployment Step 1: Sign in to Azure CLI az login # Select a suitable Azure subscription during Azure login az extension add -n ssh # SSH extension is required for Azure Bastion SSH The ssh extension is required for Azure Bastion native SSH tunneling. Step 2: Register required resource providers (one-time) Register required Azure Resource Providers (one time registration): az provider register --namespace Microsoft.Compute az provider register --namespace Microsoft.Network Verify registration. Wait until both show Registered. az provider show --namespace Microsoft.Compute --query registrationState -o tsv az provider show --namespace Microsoft.Network --query registrationState -o tsv Step 3: Set deployment variables Set the deployment environment variables that will be needed throughout this guide. 
RG="rg-openclaw" LOCATION="westus2" VNET_NAME="vnet-openclaw" VNET_PREFIX="10.40.0.0/16" VM_SUBNET_NAME="snet-openclaw-vm" VM_SUBNET_PREFIX="10.40.2.0/24" BASTION_SUBNET_PREFIX="10.40.1.0/26" NSG_NAME="nsg-openclaw-vm" VM_NAME="vm-openclaw" ADMIN_USERNAME="openclaw" BASTION_NAME="bas-openclaw" BASTION_PIP_NAME="pip-openclaw-bastion" Adjust names and CIDR ranges to fit your environment. The Bastion subnet must be at least /26. Step 4: Select SSH key Use your existing public key if you have one: SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)" If you don't have an SSH key yet, generate one: ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519 -C "you@example.com" SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)" Step 5: Select VM size and OS disk size VM_SIZE="Standard_B2as_v2" OS_DISK_SIZE_GB=64 Choose a VM size and OS disk size available in your subscription and region: Start smaller for light usage and scale up later Use more vCPU/RAM/disk for heavier automation, more channels, or larger model/tool workloads If a VM size is unavailable in your region or subscription quota, pick the closest available SKU List VM sizes available in your target region: az vm list-skus --location "${LOCATION}" --resource-type virtualMachines -o table Check your current vCPU and disk usage/quota: az vm list-usage --location "${LOCATION}" -o table Deploy Azure resources Step 1: Create the resource group The Azure resource group will contain all of the Azure resources that the OpenClaw agent needs. az group create -n "${RG}" -l "${LOCATION}" Step 2: Create the network security group Create the NSG and add rules so only the Bastion subnet can SSH into the VM. 
az network nsg create \ -g "${RG}" -n "${NSG_NAME}" -l "${LOCATION}" # Allow SSH from the Bastion subnet only az network nsg rule create \ -g "${RG}" --nsg-name "${NSG_NAME}" \ -n AllowSshFromBastionSubnet --priority 100 \ --access Allow --direction Inbound --protocol Tcp \ --source-address-prefixes "${BASTION_SUBNET_PREFIX}" \ --destination-port-ranges 22 # Deny SSH from the public internet az network nsg rule create \ -g "${RG}" --nsg-name "${NSG_NAME}" \ -n DenyInternetSsh --priority 110 \ --access Deny --direction Inbound --protocol Tcp \ --source-address-prefixes Internet \ --destination-port-ranges 22 # Deny SSH from other VNet sources az network nsg rule create \ -g "${RG}" --nsg-name "${NSG_NAME}" \ -n DenyVnetSsh --priority 120 \ --access Deny --direction Inbound --protocol Tcp \ --source-address-prefixes VirtualNetwork \ --destination-port-ranges 22 The rules are evaluated by priority (lowest number first): Bastion traffic is allowed at 100, then all other SSH is blocked at 110 and 120. Step 3: Create the virtual network and subnets Create the VNet with the VM subnet (NSG attached), then add the Bastion subnet. az network vnet create \ -g "${RG}" -n "${VNET_NAME}" -l "${LOCATION}" \ --address-prefixes "${VNET_PREFIX}" \ --subnet-name "${VM_SUBNET_NAME}" \ --subnet-prefixes "${VM_SUBNET_PREFIX}" # Attach the NSG to the VM subnet az network vnet subnet update \ -g "${RG}" --vnet-name "${VNET_NAME}" \ -n "${VM_SUBNET_NAME}" --nsg "${NSG_NAME}" # AzureBastionSubnet — name is required by Azure az network vnet subnet create \ -g "${RG}" --vnet-name "${VNET_NAME}" \ -n AzureBastionSubnet \ --address-prefixes "${BASTION_SUBNET_PREFIX}" Step 4: Create the Virtual Machine Create the VM with no public IP. SSH access for OpenClaw configuration will be exclusively through Azure Bastion. 
az vm create \ -g "${RG}" -n "${VM_NAME}" -l "${LOCATION}" \ --image "Canonical:ubuntu-24_04-lts:server:latest" \ --size "${VM_SIZE}" \ --os-disk-size-gb "${OS_DISK_SIZE_GB}" \ --storage-sku StandardSSD_LRS \ --admin-username "${ADMIN_USERNAME}" \ --ssh-key-values "${SSH_PUB_KEY}" \ --vnet-name "${VNET_NAME}" \ --subnet "${VM_SUBNET_NAME}" \ --public-ip-address "" \ --nsg "" --public-ip-address "" prevents a public IP from being assigned. --nsg "" skips creating a per-NIC NSG (the subnet-level NSG created earlier handles security). Reproducibility: The command above uses latest for the Ubuntu image. To pin a specific version, list available versions and replace latest: az vm image list \ --publisher Canonical --offer ubuntu-24_04-lts \ --sku server --all -o table Step 5: Create Azure Bastion Azure Bastion provides secure-managed SSH access to the VM without exposing a public IP. Bastion Standard SKU with tunneling is required for CLI-based "az network bastion ssh" command. az network public-ip create \ -g "${RG}" -n "${BASTION_PIP_NAME}" -l "${LOCATION}" \ --sku Standard --allocation-method Static az network bastion create \ -g "${RG}" -n "${BASTION_NAME}" -l "${LOCATION}" \ --vnet-name "${VNET_NAME}" \ --public-ip-address "${BASTION_PIP_NAME}" \ --sku Standard --enable-tunneling true Bastion provisioning typically takes 5–10 minutes but can take up to 15–30 minutes in some regions. 
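The NSG rules from Step 2 are matched in priority order, and the first matching rule decides the outcome. The sketch below models that evaluation for inbound traffic using the CIDR values from this guide. It is a simplification, not Azure's implementation: the Internet and VirtualNetwork service tags are approximated as "outside the VNet prefix" and "inside the VNet prefix", Azure's built-in default rules are omitted, and the `Rule` type and `evaluate` function are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

VNET = ip_network("10.40.0.0/16")            # VNET_PREFIX from this guide
BASTION_SUBNET = ip_network("10.40.1.0/26")  # BASTION_SUBNET_PREFIX


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    access: str   # "Allow" or "Deny"
    source: str   # a CIDR, or the tag "Internet" / "VirtualNetwork"
    port: int


def source_matches(source: str, addr) -> bool:
    """Approximate service tags using the VNet prefix (assumption)."""
    if source == "VirtualNetwork":
        return addr in VNET
    if source == "Internet":
        return addr not in VNET
    return addr in ip_network(source)


RULES = [
    Rule("AllowSshFromBastionSubnet", 100, "Allow", str(BASTION_SUBNET), 22),
    Rule("DenyInternetSsh", 110, "Deny", "Internet", 22),
    Rule("DenyVnetSsh", 120, "Deny", "VirtualNetwork", 22),
]


def evaluate(src: str, port: int, rules=RULES) -> str:
    """Return the decision of the lowest-priority-number rule that matches."""
    addr = ip_address(src)
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and source_matches(rule.source, addr):
            return f"{rule.access} ({rule.name})"
    return "no custom rule matched"


if __name__ == "__main__":
    for src in ("10.40.1.5", "203.0.113.7", "10.40.2.10"):
        print(src, "->", evaluate(src, 22))
```

The Bastion subnet lies inside the VNet prefix, so it would also match the VirtualNetwork deny rule at priority 120. It is allowed only because the rule at priority 100 is evaluated first, which is why the ordering of the three rules matters.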
Step 6: Verify Deployments

After all resources are deployed, your resource group should contain the resources created above: the VM, VNet, NSG, Bastion host, and Bastion public IP.

Install OpenClaw

Step 1: SSH into the VM through Azure Bastion

VM_ID="$(az vm show -g "${RG}" -n "${VM_NAME}" --query id -o tsv)"
az network bastion ssh \
  --name "${BASTION_NAME}" \
  --resource-group "${RG}" \
  --target-resource-id "${VM_ID}" \
  --auth-type ssh-key \
  --username "${ADMIN_USERNAME}" \
  --ssh-key ~/.ssh/id_ed25519

Step 2: Install OpenClaw (in the Bastion SSH shell)

curl -fsSL https://openclaw.ai/install.sh | bash

The installer installs Node LTS and dependencies if not already present, installs OpenClaw, and launches the OpenClaw onboarding wizard. For more information, see the open source OpenClaw install docs.

OpenClaw Onboarding: Choosing an AI Model Provider

During OpenClaw onboarding, you'll choose the AI model provider for the OpenClaw agent. This can be GitHub Copilot, Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, or another supported provider. See the open source OpenClaw install docs for details on choosing an AI model provider when going through the onboarding wizard. Most enterprise Azure teams already have GitHub Copilot licenses. If that is your case, we recommend choosing the GitHub Copilot provider in the OpenClaw onboarding wizard. See the open source OpenClaw docs on configuring GitHub Copilot as the AI model provider.

OpenClaw Onboarding: Setting up Messaging Channels

During OpenClaw onboarding, there is an optional step where you can set up messaging channels to interact with your OpenClaw agent. For first-time users, we recommend Telegram because it is the easiest to set up. Other messaging channels such as Microsoft Teams, Slack, and WhatsApp can also be set up. To configure OpenClaw for messaging through chat channels, see the open source OpenClaw chat channels docs.
Step 3: Verify OpenClaw Configuration

To validate that everything was set up correctly, run the following commands within the same Bastion SSH session:

openclaw status
openclaw gateway status

If any issues are reported, you can run the onboarding wizard again with the steps above. Alternatively, you can run the following command:

openclaw doctor

Message OpenClaw

Once you have configured the OpenClaw agent to be reachable via your chosen messaging channels, you can verify that it is responsive by messaging it.

Enhancing OpenClaw for Use Cases

There you go! You now have a 24/7, always-on personal AI agent running in its own Azure VM environment. For awesome OpenClaw use cases, check out the awesome-openclaw-usecases repository. To enhance your OpenClaw agent with additional AI skills so that it can autonomously perform multi-step operations in any domain, check out the awesome-openclaw-skills repository. You can also check out ClawHub and ClawSkills, two popular open source skills directories that can enhance your OpenClaw agent.

Cleanup

To delete all resources created by this guide:

az group delete -n "${RG}" --yes --no-wait

This removes the resource group and everything inside it (VM, VNet, NSG, Bastion, public IP). This also deletes the OpenClaw agent running within the VM. If you'd like to dive deeper into deploying OpenClaw on Azure, please check out the open source OpenClaw on Azure docs.

Shift-Left Governance for AI Agents: How the Agent Governance Toolkit Helps You Catch Violations
In part one of this series, we covered AGT’s runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. That post focused on what happens when an agent acts: policy evaluation at the moment a tool call fires, trust scoring when agents communicate, audit logging when decisions are made. Runtime governance is essential. But it is the last line of defense. After that post went live, a pattern emerged in conversations with teams adopting AGT. The same question kept coming up: runtime checks are useful, but what about everything before production? We realized runtime governance was only half the story. So we went back and built tooling for every stage of your software development lifecycle, from the moment a developer saves a file to the moment an artifact ships to users. Why Runtime Governance Is Not Enough AI agents are a new class of workload. They reason about what to do, select tools, call APIs, read databases, and spawn sub-processes, often in loops that run without direct human oversight. The OWASP Agentic AI Top 10 (published December 2025) identifies risks like excessive agency, insecure tool use, privilege escalation, and supply chain compromise. These risks span the entire lifecycle, not just runtime. Consider a few scenarios that runtime governance alone cannot prevent: A developer commits a policy YAML file with a typo that silently disables all deny rules. The agent runs unprotected until someone notices. A dependency update introduces a package with a known critical CVE. The agent starts using a vulnerable library before any security team reviews it. A contributor adds a raw cryptographic import to an application module, bypassing the security-audited signing library. The code compiles and ships. A GitHub Actions workflow uses an expression injection pattern that allows an attacker to execute arbitrary code in CI. 
A release ships without a Software Bill of Materials (SBOM), making it impossible to trace which components are affected when the next log4j-style vulnerability drops. Each of these is a governance failure, but none of them happens at runtime. They happen at commit time, at PR review time, at build time, or at release time. A comprehensive governance strategy needs coverage at every stage. Four Stages of Pre-Runtime Governance Governance violations can enter a codebase at four distinct stages of the development lifecycle. Each stage has a different class of risk, and each needs a different kind of check: Stage When It Runs What It Catches AGT Tooling Commit-time Before code leaves the developer machine Malformed policies, schema violations, secrets, stub code, unauthorized crypto Pre-commit hooks, quality gates PR-time When a pull request is opened or updated Vulnerable dependencies, missing attestation, secrets in history, unpinned versions GitHub Actions (attestation, dependency review, secret scanning, supply chain checks) CI/Build-time On every push and pull request to main Compliance violations, binary security issues, dependency confusion, workflow injection Governance Verify action, Security Scan action, CodeQL, BinSkim, policy validation Release-time Before artifacts are published Missing provenance, unsigned artifacts, incomplete SBOMs SBOM generation, Sigstore signing, build attestation, OpenSSF Scorecard Just as with bugs, the earlier you catch a governance violation, the cheaper it is to fix. A malformed policy file caught at commit time costs zero CI minutes. A secret caught in PR review never reaches the default branch. A dependency confusion attack blocked in CI never reaches production. An unsigned artifact blocked at release time never reaches users. Stage 1: Commit-Time Governance with Pre-Commit Hooks The fastest governance feedback loop is local. 
Within the AGT project, we’ve implemented three pre-commit hooks that run automatically whenever a developer stages files for commit, validating governance artifacts before they ever leave the developer's machine. Built-In Hooks The toolkit's .pre-commit-hooks.yaml defines three hooks that any repository can adopt: Hook ID What It Validates File Pattern validate-policy YAML/JSON policy files against the AGT policy schema, checking for required fields, valid operators, and structural correctness Files matching *polic*.yaml, *polic*.yml, *polic*.json validate-plugin-manifest Plugin manifest files for required fields and schema compliance Files matching plugin.json, plugin.yaml, plugin.yml evaluate-plugin-policy Plugin manifests against a governance policy file, evaluating whether the plugin would be allowed under the organization's rules Files matching plugin.json, plugin.yaml, plugin.yml To adopt these hooks, add AGT as a pre-commit hook source: # .pre-commit-config.yaml repos: - repo: https://github.com/microsoft/agent-governance-toolkit rev: main # pin to a release tag in production hooks: - id: validate-policy - id: validate-plugin-manifest - id: evaluate-plugin-policy args: ['--policy', 'policies/marketplace-policy.yaml'] Then install and run: pip install pre-commit pre-commit install pre-commit run --all-files Extended Quality Gates Beyond schema validation, we built a pre-commit rollout template (see the full example in the repository) with additional governance-specific quality gates designed to help prevent common security anti-patterns from entering the codebase: Policy validation (agt-validate): Runs the full AGT policy CLI in strict mode, catching not just schema errors but semantic issues like conflicting rules. Health check (agt-doctor): Runs on pre-push (before code leaves the machine entirely), performing a broader health check of the governance configuration. 
Plugin metadata check (agency-json-required): Ensures every plugin directory contains the required agency.json metadata file. Stub detection (no-stubs): Blocks TODO, FIXME, HACK, and raise NotImplementedError markers in staged production code. Test files are excluded. Unauthorized crypto detection (no-custom-crypto): Blocks raw cryptographic imports (hashlib, hmac, crypto.subtle, System.Security.Cryptography, ring, ed25519-dalek) outside designated security modules. This helps ensure all cryptographic operations go through the audited AGT signing libraries. Secret scanning (detect-secrets): Integrates Yelp's detect-secrets for pattern-based secret detection on every commit. Phased Rollout for Teams Adopting pre-commit hooks across a team requires a thoughtful rollout. The AGT documentation includes a phased adoption guide: Week 1: Install hooks in permissive mode. Hooks warn on violations but do not block the commit. This lets developers see what would be caught without disrupting workflow. Week 2: Switch to strict mode for policy validation only. Policy files must pass schema validation to be committed. Week 3: Enable all hooks as blocking. Stubs, unauthorized crypto, and secrets are now blocked at commit time. Week 4: Graduate to full blocking mode and remove the permissive fallback. This approach helps teams build confidence in the governance tooling before it becomes a hard gate. Stage 2: PR-Time Gates Pre-commit hooks catch issues on the developer's machine, but they can be bypassed (force push, direct GitHub edits, hooks not installed). PR-time gates provide the second layer of defense, running in GitHub Actions on every pull request before merge is allowed. Governance Attestation The Governance Attestation action validates that PR authors have completed a structured attestation checklist before their code can merge. 
The default checklist covers seven sections: Security review Privacy review Legal review Responsible AI review Accessibility review Release Readiness / Safe Deployment Org-specific Launch Gates The action is fully configurable. Organizations can customize the required sections, set a minimum PR body length, and choose their own attestation format. Outputs include the validation status, a list of errors for missing sections, and a JSON mapping of sections to checkbox counts. Here is an example workflow: # .github/workflows/pr-governance.yml name: PR Governance on: pull_request: types: [opened, edited, synchronize] jobs: attestation: runs-on: ubuntu-latest steps: - uses: microsoft/agent-governance-toolkit/action/governance-attestation@main with: required-sections: | 1) Security review 2) Privacy review 3) Responsible AI review Dependency Review The dependency review workflow helps block PRs that introduce dependencies with known CVEs or disallowed licenses. It uses the GitHub dependency-review-action with a curated license allowlist: - uses: actions/dependency-review-action@v4 with: fail-on-severity: moderate comment-summary-in-pr: always allow-licenses: > MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC, PSF-2.0, Python-2.0, 0BSD, Unlicense, CC0-1.0, CC-BY-4.0, Zlib, BSL-1.0, MPL-2.0 This runs on every PR that touches dependency manifests (package.json, Cargo.toml, pyproject.toml, requirements.txt). Dependencies with moderate or higher CVEs are flagged, and dependencies with licenses not on the allowlist are blocked. Secret Scanning The secret scanning workflow runs on every PR to the main branch and on a weekly schedule. It combines two complementary approaches: Gitleaks: Pattern-based secret detection across the full git history, catching API keys, tokens, and credentials that may have been committed at any point. 
High-entropy string scanning: Regex-based detection of common secret patterns including GitHub tokens (ghp_, gho_), AWS access keys (AKIA), Slack tokens (xox), and base64-encoded strings with high entropy. Supply Chain Integrity A dedicated supply chain check workflow triggers when dependency manifest files change. It enforces two rules that help prevent supply chain attacks: Exact version pinning: No ^ or ~ version ranges in package.json files. This prevents unexpected minor/patch version updates that could introduce compromised code. Lockfile presence: Every package directory with dependencies must have a corresponding lockfile (package-lock.json, pnpm-lock.yaml, or yarn.lock). Lockfiles help ensure reproducible builds with verified integrity hashes. Quality Gates The quality gates workflow mirrors the pre-commit hooks at the PR level, providing defense in depth. It runs four checks on every pull request: Gate Purpose No Stubs/TODOs Blocks TODO, FIXME, HACK markers in production code (test files excluded) No Unauthorized Crypto Blocks raw cryptographic imports outside designated security modules Security Audit Required Changes to security-sensitive paths require accompanying audit documentation Dependency Audit Trail Vendored patches must have an audit trail explaining the patch and its provenance These gates catch anything that bypasses pre-commit hooks: force-pushed commits, direct GitHub web edits, commits from contributors who have not installed the hooks. Stage 3: CI/Build-Time Governance Once a PR passes the gate workflows, the main CI pipeline and specialized workflows perform deeper, more computationally intensive analysis. The Governance Verify Action The Governance Verify action is the primary CI-time governance check. It is a GitHub Actions composite action that installs the toolkit and runs the compliance CLI against your repository. 
It supports four modes: Command What It Does governance-verify Runs the full compliance verification suite, checking governance controls and reporting how many pass marketplace-verify Validates a plugin manifest against marketplace requirements (required fields, signing, metadata) policy-evaluate Evaluates a specific policy file against a JSON context, returning the allow/deny decision with the matched rule all Runs governance-verify, then marketplace-verify and policy-evaluate if the corresponding paths are provided Here is an example: # .github/workflows/governance-ci.yml name: Governance CI on: [push, pull_request] jobs: verify: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: microsoft/agent-governance-toolkit/action@main with: command: all policy-path: policies/ manifest-path: plugin.json output-format: json fail-on-warning: 'true' The action outputs structured data including controls-passed, controls-total, violations count, and full command output in JSON format. This makes it straightforward to integrate with dashboards, Slack notifications, or downstream decision logic. The Security Scan Action A separate security scan action scans directories for secrets, CVEs, and dangerous code patterns. Unlike the PR-time secret scanning (which focuses on git history), this action performs deep content analysis of the current codebase: - uses: microsoft/agent-governance-toolkit/action/security-scan@main with: paths: 'plugins/ scripts/' min-severity: high exemptions-file: .security-exemptions.json The action supports configurable severity thresholds (critical, high, medium, low), an exemptions file for acknowledged findings, and structured JSON output with findings-count, blocking-count, and detailed findings. Policy Validation Workflow A dedicated policy validation workflow triggers whenever YAML files or the policy engine source code changes. 
It performs two jobs in sequence:

- Validate policies: Discovers all policy files matching the *policy* naming convention, then validates each file using the AGT policy CLI.
- Test policies: Runs the policy CLI unit tests to verify that policy evaluation behavior is correct after the changes.

This ensures that policy file edits do not break the policy engine and that policy semantics are preserved.

CodeQL and Static Analysis

AGT uses GitHub's CodeQL for semantic static analysis of Python and TypeScript code. The CodeQL workflow runs on pushes and PRs, performing deep dataflow analysis that goes beyond pattern matching. Results are uploaded as SARIF to GitHub's Security tab, providing a centralized view of code quality issues.

Dependency Confusion Scanning

A dedicated CI job runs a dependency confusion scanner on every build. This is a targeted defense against a specific supply chain attack vector where an attacker registers a public package with the same name as an internal package. The scanner checks that:

- Internal package names do not collide with public PyPI or npm packages
- Notebook pip install commands only reference packages that are registered and expected

Workflow Security Auditing

When GitHub Actions workflow files change, a workflow security job scans for common CI/CD security issues:

- Expression injection: Detects patterns like ${{ github.event.pull_request.title }} used directly in run: blocks, which can allow arbitrary code execution.
- Overly permissive permissions: Flags workflows that request more permissions than necessary.
- Unpinned action references: Detects actions referenced by branch name instead of commit SHA, which is a supply chain risk.

.NET Binary Analysis with BinSkim

For the .NET SDK (Microsoft.AgentGovernance), the CI pipeline runs Microsoft BinSkim binary security analysis on compiled assemblies.
BinSkim checks for security-relevant compiler and linker settings in compiled binaries, such as DEP (Data Execution Prevention), ASLR (Address Space Layout Randomization), and stack protection. Results are uploaded as SARIF to GitHub code scanning alongside the CodeQL results.

The ci-complete Gate Pattern

Because many CI jobs run conditionally based on path filters, AGT uses a pattern called ci-complete: a single gate job that is configured as the sole required status check in branch protection. This job runs unconditionally (if: always()), depends on all other CI jobs, and checks that none of them failed. Jobs that were skipped (because no relevant files changed) are acceptable. This pattern lets branch protection work correctly with conditional CI jobs, avoiding the common problem where a required status check tied to a skipped job never reports success and blocks the merge.

Language-Specific Compile-Time Enforcement

Beyond the language-agnostic CI checks, each AGT SDK uses its language's native compiler and tooling to enforce governance standards at compile time.
.NET: The Strictest Compile-Time Checks

The .NET SDK (Microsoft.AgentGovernance) enforces compile-time governance through MSBuild properties in Directory.Build.props and Directory.Build.targets, which apply automatically to every project in the SDK:

| Feature | MSBuild Property | Effect |
| --- | --- | --- |
| Nullable reference types | <Nullable>enable</Nullable> | The compiler warns on every possible null dereference, helping catch potential NullReferenceException bugs at compile time |
| Warnings as errors | <TreatWarningsAsErrors>true</TreatWarningsAsErrors> | All compiler warnings become build errors for packable projects; no warnings can be shipped to consumers |
| Strong-name signing | <SignAssembly>true</SignAssembly> | Assemblies are signed with a strong-name key (AgentGovernance.snk), enabling identity verification |
| Deterministic builds | <ContinuousIntegrationBuild>true</ContinuousIntegrationBuild> | Identical source code produces bit-for-bit identical binaries in CI, enabling build verification |
| SourceLink | Microsoft.SourceLink.GitHub package | Users can step into AGT source code when debugging, supporting transparency and auditability |
| Symbol packages | <IncludeSymbols>true</IncludeSymbols> | .snupkg symbol packages are published alongside NuGet packages for debugging support |

TypeScript: Strict Compilation and Linting

The TypeScript SDK (@microsoft/agentmesh-sdk) uses strict compiler settings and ESLint for build-time governance:

- Strict mode ("strict": true in tsconfig.json) enables all strict type-checking options, including noImplicitAny, strictNullChecks, strictFunctionTypes, and strictBindCallApply.
- Consistent file naming (forceConsistentCasingInFileNames) prevents cross-platform issues where imports work on case-insensitive file systems (Windows, macOS) but fail on case-sensitive ones (Linux CI).
- Declaration generation (declaration: true with declarationMap: true) produces .d.ts files for consumers, enabling downstream type checking.
- ESLint with @typescript-eslint provides static analysis during the build process, catching issues beyond what the TypeScript compiler checks.
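The compiler options listed above only help while they stay switched on, so one option is to assert them in CI. The following Python sketch illustrates that idea and is not part of AGT: the function name missing_strict_options and the REQUIRED_OPTIONS tuple are hypothetical, and the sketch assumes the tsconfig.json content is plain JSON (real tsconfig files may contain comments, which json.loads rejects).

```python
import json

# Compiler options named in the section above; each is expected to be true.
# This tuple is a hypothetical policy, not something defined by AGT.
REQUIRED_OPTIONS = (
    "strict",
    "forceConsistentCasingInFileNames",
    "declaration",
    "declarationMap",
)


def missing_strict_options(tsconfig_text: str) -> list[str]:
    """Return the required compiler options that are absent or not true."""
    config = json.loads(tsconfig_text)
    options = config.get("compilerOptions", {})
    return [name for name in REQUIRED_OPTIONS if options.get(name) is not True]


if __name__ == "__main__":
    # Invented sample configuration that omits declarationMap.
    sample = json.dumps({
        "compilerOptions": {
            "strict": True,
            "forceConsistentCasingInFileNames": True,
            "declaration": True,
        }
    })
    print(missing_strict_options(sample))  # ['declarationMap']
```

A check like this only inspects the file it is given; options inherited through "extends" would need to be resolved separately.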
Python: Type Safety and Fast Linting

Python packages in AGT use typed package markers and static analysis tooling configured in pyproject.toml:

- py.typed marker: Each package includes a py.typed file, signaling to type checkers (mypy, pyright, Pylance) that the package supports type checking. Consumers get type errors if they misuse the AGT API.
- mypy: Configured as a dev dependency with project-specific settings in pyproject.toml. Provides static type checking that catches type mismatches before runtime.
- ruff: A fast Python linter written in Rust, configured in pyproject.toml and enforced in CI. Ruff checks code against hundreds of lint rules at build time.

Stage 4: Release-Time Gates

Before artifacts reach users, the release pipeline adds a final layer of verification. These gates help ensure that what ships is exactly what was built, is signed by the expected publisher, and has a complete inventory of its components.

| Gate | Tool | What It Produces |
| --- | --- | --- |
| SBOM generation | Anchore/Syft | SPDX and CycloneDX software bills of materials listing every component, dependency, and license |
| Python signing | Sigstore | Cryptographic signature using OpenID Connect identity, verifiable without manual key distribution |
| .NET signing | Release pipeline | Microsoft Authenticode and NuGet signing through the release pipeline |
| Build provenance | actions/attest-build-provenance | SLSA provenance attestation linking the artifact to its source commit and build environment |
| SBOM attestation | actions/attest-sbom | Binds the SBOM to the specific release artifact, creating a verifiable link between the inventory and the binary |

Additionally, the OpenSSF Scorecard runs on a schedule, providing an automated security posture assessment that covers branch protection, dependency management, CI/CD practices, and more. The score is published to the OpenSSF Scorecard website, giving consumers a transparent view of the project's security practices.
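To make the SBOM row in the table above more concrete, here is a minimal Python sketch that reads a CycloneDX JSON document and lists components that carry no license entry. It assumes the usual CycloneDX JSON layout (a top-level components array whose entries have name, version, and an optional licenses list). The function name and the sample data are hypothetical, and nothing here is taken from AGT's release pipeline.

```python
import json


def components_without_license(sbom_text: str) -> list[str]:
    """Return 'name@version' for each component lacking a licenses entry."""
    sbom = json.loads(sbom_text)
    if sbom.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX JSON document")
    flagged = []
    for component in sbom.get("components", []):
        # A missing or empty licenses list counts as "no license recorded".
        if not component.get("licenses"):
            name = component.get("name", "<unnamed>")
            version = component.get("version", "<unversioned>")
            flagged.append(f"{name}@{version}")
    return flagged


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = json.dumps({
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "components": [
            {"name": "example-lib", "version": "1.2.3",
             "licenses": [{"license": {"id": "MIT"}}]},
            {"name": "mystery-lib", "version": "0.0.1"},
        ],
    })
    print(components_without_license(sample))  # ['mystery-lib@0.0.1']
```

A review step along these lines could run against the generated SBOM before it is attested, so that gaps in the inventory surface before release rather than after.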
How It All Fits Together: Defense in Depth

This approach follows a defense-in-depth principle: key checks exist at multiple layers, so that bypassing one layer does not compromise the whole system.

Secret scanning, for example, runs at three levels: detect-secrets at commit time (pre-commit hook), Gitleaks at PR time (secret scanning workflow), and the Security Scan action at CI time (content analysis). A developer who bypasses pre-commit hooks will still be caught by the PR-time gate. A contributor who force-pushes past the PR gate will still be caught by the CI pipeline.

Similarly, policy validation runs at commit time (validate-policy hook), at PR time (quality gates), and at CI time (policy validation workflow). Each layer adds depth: the commit-time hook catches schema errors, while the CI pipeline catches semantic issues and runs regression tests.

The ci-complete gate job ties everything together. By depending on every CI job and serving as the single required status check, it ensures that no code merges to the main branch unless every applicable check has passed.

Getting Started

You can adopt AGT's shift-left governance incrementally. Here are three starting points, from lowest to highest effort:

1. Add the Governance Verify Action (5 minutes)

Add a single GitHub Actions workflow that runs the compliance check on every PR:

```yaml
# .github/workflows/governance.yml
name: Governance
on: [pull_request]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: governance-verify   # full compliance verification suite
```

2. Enable Pre-Commit Hooks (15 minutes)

Add a .pre-commit-config.yaml referencing AGT's hooks, install them, and run against all existing files to establish a baseline. Start in permissive mode and graduate to strict over four weeks.

3. Full Pipeline Integration (1-2 hours)

Add the complete set of PR-time gates (attestation, dependency review, secret scanning, supply chain checks, quality gates), configure the Security Scan action for your plugin directories, and enable SBOM generation and signing in your release workflow.

The AGT repository itself serves as a reference implementation: every workflow described in this post is running in production at aka.ms/agent-governance-toolkit.

Important Notes

The policy files, workflow configurations, and code samples in this post are illustrative examples. Your organization's governance requirements may differ. Review and customize all configurations before deploying to production. The Agent Governance Toolkit is designed to help organizations implement governance controls for AI agents; it does not guarantee compliance with any specific regulatory framework. Always consult your organization's security and legal teams when defining governance policies.

What Comes Next

Pre-runtime governance is one piece of the puzzle. Combined with the runtime governance capabilities covered in part one of this series (policy engines, zero-trust identity, execution sandboxing, audit logging), it provides coverage across the full lifecycle.

The project continues to grow. Since the initial release, we’ve added a multi-stage policy pipeline (pre_input, pre_tool, post_tool, pre_output stages), approval workflows with human-in-the-loop gates, DLP attribute ratchets for monotonic session state, and OpenTelemetry instrumentation for governance operations. Over 45 step-by-step tutorials are available in the documentation.

Everything described in this post is available today in the public GitHub repository. The full source, documentation, tutorials, and examples are at aka.ms/agent-governance-toolkit, open source under the MIT license. We welcome contributions, feedback, and issue reports from the community.