monitoring
22 TopicsFour open source projects to explore at Microsoft Build
Open source is where developers experiment, collaborate, and turn new ideas into tools that others can build on. At Microsoft Build, we’re creating a dedicated space for that energy: the Open Source Zone. This year, the Open Source Zone will bring together maintainers, contributors, and developers working on some of the most interesting open source projects in AI. Whether you’re building agents, experimenting with local models, exploring prompt workflows, or looking for practical ways to bring AI into your development process, this is a place to meet the people behind the projects and see what they’re building. The Open Source Zone is inspired by similar community spaces we’ve hosted at GitHub Universe: hands-on, conversation-driven, and centered on the people and projects moving open source forward. Meet the projects OpenClaw OpenClaw, originally Clawbot, formerly Clawdbot and briefly Moltbot,before landing on its current name (because naming is hard), is a personal AI assistant project built for developers who want more control over how AI agents run across tools, devices, and workflows. Its repository describes it as “your own personal AI assistant” across operating systems and platforms, with support for agent workspaces, skills, and device nodes. It has also become one of the fastest-growing open source projects on GitHub, with over 370,000 stars to date. At the Open Source Zone, attendees can learn how OpenClaw approaches personal agents, extensibility, and local-first experimentation. AutoGPT AutoGPT is one of the best-known open source projects in the autonomous agent space. The project’s mission is to make AI accessible for everyone to use and build on, with tools for building, testing, and delegating work to agents. Visit AutoGPT in the Open Source Zone to learn how the project is evolving agent development, benchmarking, frontend experiences, and practical workflows for building agent-powered applications. Come for the autonomous agents; stay for the very human maintainers. AutoGPT is also a member of GitHub’s Secure Open Source Fund, with a goal of enhancing AI security across the open source ecosystem. Open WebUI Open WebUI is a self-hosted, extensible AI platform for working with large language models. The project supports Ollama and OpenAI-compatible APIs and includes built-in RAG capabilities, making it a strong option for developers and organizations exploring local, private, or provider-flexible AI experiences. At Build, the Open WebUI team will show how developers can run, customize, and extend AI interfaces for their own environments. prompts.chat prompts.chat, formerly Awesome ChatGPT Prompts, is a curated collection of prompt examples for AI chat models. The project is designed to help people discover, share, and build better prompts for modern AI assistants. Created by Fatih Kadir Akın, a GitHub Star from Istanbul, prompts.chat reflects his work at the intersection of open source, developer education, and AI-assisted development. Fatih leads Developer Relations at Teknasyon, has authored books on JavaScript and prompt engineering, and is active in the community as a speaker, organizer, and contributor. Stop by to explore prompt libraries, prompt engineering resources, self-hosting options, and ways the community is making prompting more reusable and collaborative. Register for Microsoft Build Microsoft Build takes place June 2–3, 2026, in San Francisco and online. In-person passes are available, and online registration is free for livestreamed keynote and select session access. Register for Microsoft Build and come visit the Open Source Zone to meet the teams behind OpenClaw, AutoGPT, Open WebUI, and prompts.chat. We’ll see you there. <3498Views0likes0CommentsGoverning AI Agents Against Every OWASP Agentic Risk: A Deep Dive with the Agent Governance Toolkit
AI agents are moving from prototypes to production. They book flights, write code, negotiate contracts, and operate across enterprise systems with minimal human oversight. The attack surface is not theoretical: OWASP has catalogued the top 10 risks specific to agentic applications, and every one of them maps to a real-world failure mode. The Agent Governance Toolkit (AGT) is an open-source, MIT-licensed framework that enforces deterministic governance at runtime, before every tool call, message, and action an agent takes. This is not prompt engineering or guardrails bolted on after the fact. AGT provides policy-as-code enforcement, zero-trust identity, execution isolation, and tamper-evident audit trails across the full agent lifecycle. In this post, we walk through all 10 OWASP Agentic risks with real code from the AGT repository. By the end, you will have concrete examples for every risk category and a clear path to production-grade agent governance. Coverage at a Glance # OWASP Risk AGT Component Key Mechanism ASI-01 Agent Goal Hijack Agent OS Policy Engine + Action Interception ASI-02 Tool Misuse & Exploitation Agent OS Capability Sandboxing + Input Sanitization ASI-03 Identity & Privilege Abuse AgentMesh DID Identity + Trust Scoring ASI-04 Supply Chain Vulnerabilities AgentMesh AI-BOM (Model + Data + Weights Provenance) ASI-05 Unexpected Code Execution Agent Runtime Execution Rings (Ring 0-3) ASI-06 Memory & Context Poisoning Agent OS VFS Policies + CMVK Verification ASI-07 Insecure Inter-Agent Comms AgentMesh IATP + E2E Encrypted Channels ASI-08 Cascading Agent Failures Agent SRE Circuit Breakers + SLOs ASI-09 Human-Agent Trust Exploitation Agent OS Approval Workflows + Quorum Logic ASI-10 Rogue Agents Agent Runtime Kill Switch + Ring Isolation + Merkle Audit ASI-01: Agent Goal Hijack The risk: Attackers manipulate the agent's objectives via indirect prompt injection or poisoned inputs. The agent believes it is following its original instructions, but it has been redirected. AGT mitigates this through the Agent OS policy engine. Every agent action passes through a declarative policy evaluation layer before execution. The policy engine supports three modes: strict (deny by default), permissive (allow by default), and audit (log only). Unauthorized goal changes are blocked at the action layer, not at the prompt layer. from agent_os import StatelessKernel, ExecutionContext kernel = StatelessKernel() ctx = ExecutionContext(agent_id="my-agent", policies=["read_only"]) # This action is blocked by policy -- goal hijack prevented result = await kernel.execute( action="delete_database", params={"target": "production"}, context=ctx, ) # result.success = False, result.error = "Policy violation: read_only" The MCP Governance Proxy extends this to Model Context Protocol tool calls, evaluating policy before any tool invocation reaches the agent runtime. ASI-02: Tool Misuse & Exploitation The risk: An agent's authorized tools are abused in unintended ways, such as exfiltrating data via read operations or chaining benign tools into dangerous workflows. AGT provides capability-based security inspired by POSIX. Agents receive explicit capability grants (read, write, execute, network), not blanket tool access. The built-in strict mode blocks dangerous tools like run_shell, execute_command, and eval. Tool inputs are sanitized for command injection patterns and shell metacharacters. The verify_code_safety MCP tool checks generated code before execution, and tool allowlists/denylists give operators fine-grained control over which tools each agent can invoke. ASI-03: Identity & Privilege Abuse The risk: Agents escalate privileges by abusing identities or inheriting excessive credentials. Without proper identity, agents operate as ambient authority, and any compromise cascades. AgentMesh implements zero-trust identity using Decentralized Identifiers (DIDs). Every agent gets a cryptographic identity: did:agentmesh:{agentId}:{fingerprint} backed by Ed25519 key pairs. Trust is earned through a tiered model: Untrusted, Provisional, Trusted, Verified. Trust decays over time without positive signals, and delegation chains must always narrow scope (child capabilities must be a subset of parent capabilities). from agentmesh import AgentIdentity identity = AgentIdentity.create( name="data-analyst", sponsor="admin@contoso.com", capabilities=["read:data"], # Scoped -- cannot write or delete ) # Delegation MUST narrow, never widen child = identity.delegate( name="chart-helper", capabilities=["read:data:charts"], # Subset of parent ) ASI-04: Agentic Supply Chain Vulnerabilities The risk: Vulnerabilities in third-party tools, plugins, agent registries, or runtime dependencies that agents use to act, plan, or delegate. AgentMesh implements the AI-BOM (AI Bill of Materials), a comprehensive standard for tracking the full AI supply chain. This includes model provenance (base model ancestry, fine-tuning history, training cutoff dates), dataset tracking (training data, RAG sources, evaluation benchmarks with data cards including PII status, bias assessment, and consent tracking), weights versioning (SHA-256 hashes, quantization records, LoRA adapter metadata, SLSA build provenance), and software dependencies (SPDX-aligned package tracking with CI security scanning). # AI-BOM tracks the full supply chain ai_bom = { "modelProvenance": { "primary": {"provider": "anthropic", "model": "claude-3-sonnet"}, "fineTuning": {"method": "LoRA", "evaluationMetrics": {"accuracy": 0.94}}, }, "datasets": [ {"name": "FAQ KB", "type": "fine-tuning", "dataCard": {"piiStatus": "redacted"}}, {"name": "Product Docs", "type": "rag-source", "updateFrequency": "weekly"}, ], "weights": {"hash": "sha256:...", "format": "safetensors", "precision": "bf16"}, } ASI-05: Unexpected Code Execution The risk: Agents trigger remote code execution through tools, interpreters, or APIs. Without isolation, a single compromised tool call can escalate to full system access. Agent Runtime implements CPU ring-inspired execution isolation. Agents run in one of four execution rings: Ring 0 (root/supervisor), Ring 1 (privileged), Ring 2 (standard), and Ring 3 (sandbox/untrusted). Each ring has resource limits and the kill switch provides instant termination of runaway agents. from hypervisor.models import ( ActionDescriptor, ExecutionRing, ReversibilityLevel, ) from hypervisor.rings.enforcer import RingEnforcer from hypervisor.security.kill_switch import KillSwitch, KillReason # Define agent privilege levels AGENTS = { "supervisor": {"ring": ExecutionRing.RING_0_ROOT, "role": "Orchestrator"}, "data-agent": {"ring": ExecutionRing.RING_1_PRIVILEGED, "role": "Data Engineer"}, "analyst": {"ring": ExecutionRing.RING_2_STANDARD, "role": "Analyst"}, "user-bot": {"ring": ExecutionRing.RING_3_SANDBOX, "role": "User-Facing"}, } # Create a sandboxed action descriptor action = ActionDescriptor( name="run_query", required_ring=ExecutionRing.RING_2_STANDARD, reversibility=ReversibilityLevel.REVERSIBLE, ) # Enforce: sandbox agent cannot run a Ring 2 action enforcer = RingEnforcer() result = enforcer.check(agent_ring=ExecutionRing.RING_3_SANDBOX, action=action) # result.allowed = False -- ring violation prevented # Kill switch for runaway agents kill_switch = KillSwitch() kill_switch.terminate(agent_id="user-bot", reason=KillReason.RING_BREACH) ASI-06: Memory & Context Poisoning The risk: Persistent memory or long-running context is poisoned with malicious instructions. An attacker embeds hostile content in a document the agent later retrieves, causing it to follow injected goals. Agent OS provides a policy-controlled virtual filesystem (VFS) for agent memory. The VFS uses POSIX-style mount points: /mem/working for current context, /mem/episodic for past interactions, /mem/semantic for knowledge, /policy for read-only policy files, and /tools for tool interfaces. Each mount point has enforced permissions (read, write, execute, append). The policy directory is always read-only from user-space, preventing agents from modifying their own governance rules. from agent_control_plane.vfs import AgentVFS, MemoryBackend, FileMode # Create agent VFS with POSIX-style memory abstraction vfs = AgentVFS(agent_id="data-analyst") # Mount memory backends with explicit permissions vfs.mount("/mem/working", MemoryBackend(), mode=FileMode.READ | FileMode.WRITE) vfs.mount("/mem/semantic", MemoryBackend(), mode=FileMode.READ) # Read-only knowledge vfs.mount("/policy", MemoryBackend(), mode=FileMode.READ) # Policies always read-only # Agent can read working memory data = vfs.read("/mem/working/context.json") # Agent CANNOT write to policy -- enforced at VFS layer # vfs.write("/policy/rules.yaml", content) # Raises PermissionError # Agent CANNOT read semantic memory if not mounted # vfs.read("/mem/procedural/skills") # Raises FileNotFoundError The CMVK (Cross-Model Verification Kernel) adds a second layer: claims from agent context are verified across multiple AI models to detect poisoned content. Prompt injection patterns like 'ignore previous instructions' and 'disregard prior' are detected and blocked by the MCP proxy sanitizer before reaching the agent. ASI-07: Insecure Inter-Agent Communication The risk: Agents collaborate without adequate authentication, confidentiality, or validation. Messages between agents can be intercepted, forged, or replayed. AgentMesh provides IATP (Inter-Agent Trust Protocol) with E2E encrypted channels using the Signal protocol (X3DH key agreement + Double Ratchet). Every message gets per-message forward secrecy and post-compromise security. The EncryptedTrustBridge requires a successful trust handshake before any encrypted channel can be established, and mutual authentication via Ed25519 challenge-response ensures both parties prove identity at connection time. from agentmesh.encryption.bridge import EncryptedTrustBridge bridge = EncryptedTrustBridge(agent_did="did:mesh:alice", key_manager=keys) channel = await bridge.open_secure_channel("did:mesh:bob", bob_bundle) ciphertext = channel.send(b"governed action") # E2E encrypted ASI-08: Cascading Agent Failures The risk: An initial error or compromise triggers multi-step compound failures across chained agents. One agent's failure propagates through the entire system. Agent SRE brings production-grade reliability engineering to agent fleets. Circuit breakers automatically isolate failing agents before failures cascade. SLO enforcement with error budgets provides quantified failure tolerance that triggers automatic intervention. Cascading failure detection monitors dependency chains for propagation patterns, and canary deploys enable gradual rollout of agent changes to detect issues early. OpenTelemetry integration provides distributed tracing across multi-agent workflows. The key insight: treat AI agents like microservices. Apply the same SRE discipline (SLOs, error budgets, circuit breakers, chaos testing) that keeps cloud infrastructure reliable. ASI-09: Human-Agent Trust Exploitation The risk: Attackers leverage misplaced user trust in agents' autonomy to authorize dangerous actions. Users rubber-stamp agent requests because they trust the agent, and attackers exploit this approval fatigue. Agent OS implements approval workflows that require explicit human confirmation for high-risk actions. The system supports configurable risk assessment (critical, high, medium, low), quorum logic for critical actions requiring multiple approvals, and expiration tracking to prevent stale authorizations. The escalation handler includes fatigue detection: if an agent floods reviewers with escalation requests, subsequent requests are auto-denied to prevent the approval-fatigue attack. from agent_os.integrations.escalation import ( EscalationHandler, InMemoryApprovalQueue, DefaultTimeoutAction, QuorumConfig, ) # Configure approval workflow with fatigue protection handler = EscalationHandler( backend=InMemoryApprovalQueue(), timeout_seconds=300, # 5-minute approval window default_action=DefaultTimeoutAction.DENY, # Deny if no human responds quorum=QuorumConfig(required=2, total=3), # 2-of-3 approvers for critical fatigue_threshold=5, # Auto-deny after 5 rapid requests fatigue_window_seconds=60, # Within a 60-second window ) # Three-outcome model: allow, deny, or escalate # High-risk actions trigger escalation to human reviewers # If the agent triggers too many escalations, fatigue detection kicks in ASI-10: Rogue Agents The risk: Agents operating outside their defined scope through configuration drift, reprogramming, or emergent misbehavior. A rogue agent might gradually expand its actions beyond its mandate without any single action triggering a block. AGT combines runtime behavioral monitoring with instant kill capability. Ring isolation confines rogue agents to their execution ring, preventing privilege escalation. The kill switch provides immediate termination for agents exhibiting rogue behavior (behavioral drift, rate limit violations, ring breaches). Trust score decay tracks agent behavior over time, and the Merkle audit chain provides tamper-evident, cryptographic proof of every agent action. from agentmesh.governance.audit import AuditEntry, MerkleAuditChain from hypervisor.security.kill_switch import KillSwitch, KillReason # Tamper-evident audit trail chain = MerkleAuditChain() entry = AuditEntry( event_type="tool_call", agent_did="did:agentmesh:data-bot:abc123", action="query_database", outcome="allowed", policy_decision="permit", matched_rule="read_only_policy", ) chain.add_entry(entry) # Auto-computes hash chain # Verify integrity -- any tampering breaks the chain proof = chain.get_proof(entry.entry_id) assert chain.verify_proof(proof) # Cryptographic verification # Kill switch for rogue behavior kill = KillSwitch() kill.terminate( agent_id="data-bot", reason=KillReason.BEHAVIORAL_DRIFT, # Also: RATE_LIMIT, RING_BREACH, MANUAL ) Cross-Cutting Principle: Least Agency The Least Agency principle is emphasized throughout the OWASP Agentic Top 10 as a foundational design principle. Agents should be granted the minimum capabilities, permissions, and autonomy necessary to complete their assigned tasks. Layer Least Agency Mechanism Agent OS Policy engine enforces deny-by-default; agents must be explicitly granted each capability AgentMesh DID identity with scoped capabilities; delegation requires narrowing (child <= parent) Agent Runtime Execution rings (Ring 0-3) enforce privilege tiers; untrusted agents run in Ring 3 Agent SRE Resource limits and error budgets cap agent impact radius Performance: Governance Without Latency Tax A common concern with runtime governance is performance overhead. AGT's benchmarks demonstrate that policy enforcement adds negligible latency: Metric Value Single rule evaluation 84,000 ops/sec 1000 concurrent agents 47,000 ops/sec Policy evaluation latency <0.1ms (p99) Prompt-based violation rate 26.67% AGT policy violation rate 0.00% Conformance tests 992 Architecture Decision Records 25 The key takeaway: deterministic policy enforcement is orders of magnitude more reliable than prompt-based guardrails, and it runs fast enough for real-time agent workloads. Framework Integrations AGT is framework-agnostic. SDKs are available in Python, TypeScript, .NET, Rust, and Go. Native integrations exist for: LangChain and LangGraph CrewAI AutoGen (Microsoft) Semantic Kernel (Microsoft) OpenAI Agents SDK PydanticAI Model Context Protocol (MCP) Agent-to-Agent Protocol (A2A) Each integration wraps the agent framework's tool-calling and message-passing interfaces with AGT's policy engine, trust scoring, and audit logging. Adding governance to an existing agent takes minutes, not weeks. Compliance Framework Alignment Framework AGT Coverage OWASP Agentic Top 10 (2026) All 10 risk categories mapped NIST AI RMF Govern, Map, Measure, Manage functions addressed EU AI Act Risk classification, audit trails, human oversight SOC 2 Type II Audit logging, access controls, change management CSA ATF Zero-trust agent architecture alignment Singapore MGF Zero-trust, accountability, oversight layers Getting Started # Install the complete governance stack pip install agent-governance-toolkit[full] # Or install individual components pip install agent-os-kernel # Policy engine, VFS, approval workflows pip install agentmesh-platform # Identity, trust, encryption, audit pip install agentmesh-runtime # Execution rings, kill switch, saga pip install agent-sre # Circuit breakers, SLOs, chaos testing The quickstart tutorial walks through adding policy enforcement to an existing LangChain agent in under 10 minutes. Start with a single policy rule and expand as your governance requirements grow. Contribute and Collaborate AGT is open source under the MIT license. The project has over 2,000 GitHub stars and contributors from 40+ countries. Whether you are building agent governance for your enterprise, integrating a new framework, or extending the policy engine with OPA/Rego or Cedar policies, we welcome contributions. Repository: https://github.com/microsoft/agent-governance-toolkit Documentation: https://microsoft.github.io/agent-governance-toolkit Discussions: GitHub Discussions on the repository Disclaimer: This document is provided for informational purposes. Code examples are from the public AGT repository and may evolve. Always refer to the latest repository documentation for current APIs.331Views0likes0CommentsApplying Site Reliability Engineering to Autonomous AI Agents
If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity. That mental model transfers directly to AI agents. It just needs four new ideas. In the Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents, we covered Agent SRE briefly as one of AGT's nine packages: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it. Agent SRE is one of the more novel parts of the toolkit. The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Established patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook. This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this. The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query. When an AI agent fails, your observability stack is silent. The agent returned HTTP 200. Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed. These are not infrastructure failures. They are behavioral failures. And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior. This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI. The Safety SLI: A New Reliability Dimension Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly? For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect. The Safety SLI answers a different question: did the agent act within policy? from agent_sre import SLO, ErrorBudget from agent_sre.slo.indicators import PolicyCompliance # Define a safety SLO: 99% of agent actions must comply with policy safety_slo = SLO( name="safety-compliance", indicators=[ PolicyCompliance( target=0.99, window="7d", ), ], error_budget=ErrorBudget( total=0.01, # 1% budget (1 - 0.99 target) window_seconds=2592000, # 30-day window burn_rate_alert=2.0, # warn at 2x sustainable rate burn_rate_critical=5.0, # page at 5x sustainable rate ), ) When an agent's policy compliance rate drops below 99%, the error budget starts burning. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method. When the budget is exhausted, the configured exhaustion_action determines the system response: from agent_sre.slo.objectives import ExhaustionAction # Configure what happens when error budget is exhausted safety_slo = SLO( name="safety-compliance", indicators=[PolicyCompliance(target=0.99, window="7d")], error_budget=ErrorBudget( total=0.01, window_seconds=2592000, burn_rate_alert=2.0, # fires at 2x sustainable burn rate burn_rate_critical=5.0, # fires at 5x sustainable burn rate exhaustion_action=ExhaustionAction.CIRCUIT_BREAK, # suspend agent when budget is gone ), ) # In your monitoring loop, check for firing alerts alerts = safety_slo.error_budget.firing_alerts() for alert in alerts: print(f"Alert firing: {alert.name} (severity: {alert.severity})") # Check budget status print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%") print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x") print(f"Exhausted: {safety_slo.error_budget.is_exhausted}") This is the governance dial from the other direction. The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions. An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior. There are multiple SLI dimensions built into Agent SRE. Safety SLIs and Performance SLIs track different aspects of the same agent: SLI Type What It Measures Target Pattern When Budget Burns Safety SLI PolicyCompliance -- fraction of actions within authorized scope >= 99% Restrict capabilities, increase human oversight Performance SLI TaskSuccessRate, ResponseLatency, CostPerTask Configurable per workload Alert, throttle, or circuit-break LLM provider Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI. Both SLOs feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. You need both dimensions to understand whether an agent is production-ready. Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -> OPEN -> HALF_OPEN. You know it well. Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures: from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker from agent_sre.chaos.engine import FaultType config = CircuitBreakerConfig( failure_threshold=5, # Open after 5 failures in the window recovery_timeout_seconds=60, # Stay OPEN for 60s before HALF_OPEN half_open_max_calls=3, # Allow 3 probes in HALF_OPEN ) breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config) # Failure modes tracked by the circuit breaker: tracked_faults = [ FaultType.POLICY_BYPASS, # Agent exceeds authorized scope FaultType.ERROR_INJECTION, # Upstream model API fails FaultType.TIMEOUT_INJECTION, # Tool calls exceed time budget FaultType.TRUST_PERTURBATION, # Agent trust score falls below threshold FaultType.DEADLOCK_INJECTION, # Agent stuck in iterative reasoning ] Each failure mode has different circuit-breaking semantics: Failure Mode What Triggers It Circuit-Break Behavior Policy bypass Action denied by policy engine Count toward threshold; log with full context LLM provider error HTTP 5xx from model API Immediately open; route to fallback model if configured Tool timeout Tool call exceeds timeout_ms Count toward threshold; cancel in-flight call Trust score degradation Agent trust score drops below configured floor Open; escalate to Ring 3 (untrusted) until score recovers Reasoning loop / deadlock Token or iteration count exceeds budget Open; trigger human review before resuming The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely. The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop: # Reasoning loop detection configuration loop_detection_config = { "max_iterations": 15, # Hard stop after 15 reasoning steps "max_tokens_per_session": 50000, # Hard stop on token consumption "repetition_threshold": 0.85, # Stop if >85% of recent actions repeat prior ones "on_detection": "circuit_break_and_escalate", } The state machine behaves identically to what you know from Hystrix or Resilience4j. What changes is the definition of "failure." CLOSED (serving) | | failure_threshold crossed for any tracked fault v OPEN (rejecting -- agent action denied, fallback or human-in-loop fires) | | recovery_timeout expires v HALF_OPEN (probe -- limited requests allowed through) | |-- success_threshold met --> CLOSED |-- any failure --> OPEN (reset timeout) Chaos Engineering for Agents: Fault Injection for Autonomous Systems The only way to know if your agent system is resilient is to break it intentionally. Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems. Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production: from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType # Experiment 1: LLM provider degrades -- model returns valid responses but with # increased latency and occasional malformed outputs experiment = ChaosExperiment( name="llm-degradation-resilience", target_agent="analyst-agent-001", description="Test agent behavior under degraded LLM provider", faults=[ Fault.latency_injection(target="llm-provider", delay_ms=8000), Fault.error_injection(target="llm-provider", rate=0.05), ], duration_seconds=300, ) # Experiment 2: Trust score manipulation -- simulates an agent receiving # messages from a peer with a spoofed trust score trust_experiment = ChaosExperiment( name="trust-manipulation-resilience", target_agent="orchestrator-001", faults=[ Fault( fault_type=FaultType.TRUST_PERTURBATION, target="did:mesh:orchestrator-001", params={"spoofed_score": 950}, ), ], duration_seconds=120, ) # Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously, # testing whether the agent abandons gracefully or enters a reasoning loop cascade_experiment = ChaosExperiment( name="tool-timeout-cascade", target_agent="analyst-agent-001", faults=[ Fault.timeout_injection(target="database.read", delay_ms=30000), Fault.timeout_injection(target="api.call", delay_ms=30000), ], duration_seconds=180, ) # Run the experiment experiment.start() # ... inject faults during agent execution ... resilience = experiment.calculate_resilience( baseline_success_rate=0.95, experiment_success_rate=0.87, recovery_time_ms=48000, ) experiment.complete(resilience=resilience) print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}") Additional fault types built into the chaos engine cover: prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments. Important: The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose. Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness. Replay Debugging: Reproduce Behavioral Failures Exactly Infrastructure incidents are reproducible because infrastructure is deterministic. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends. Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message. from agent_sre.replay.capture import TraceStore from agent_sre.replay.engine import ReplayEngine, ReplayMode # Traces are captured automatically when SRE tracing is active store = TraceStore( backend="azure_blob", retention_days=30, ) # When an incident occurs, replay the session exactly engine = ReplayEngine(store=store) # Full replay: re-run the session against the same recorded inputs # Uses recorded tool outputs -- no live tool calls -- so replay is deterministic result = await engine.replay( trace_id="trace_2026_05_a7f3b2", mode=ReplayMode.FULL, ) for step in result.steps: print(f"Step {step.index}: {step.action} -> {step.decision}") # Divergence analysis: replay with a policy change applied # Shows exactly which actions would have been blocked under the new policy diff_result = await engine.diff( trace_id="trace_2026_05_a7f3b2", policy_override="policies/stricter-v2.yaml", ) for diff in diff_result.diffs: if diff.description: print(f"Step {diff.span_name}: was {diff.original}, " f"would be {diff.replayed} under new policy") The divergence analysis is the feature teams use most. When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork. Progressive Delivery: Safely Rolling Out New Agent Capabilities When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back. Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope, giving it write access it did not have, connecting it to a new tool, or raising its trust floor, you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage. from agent_sre.delivery.rollout import ( AnalysisCriterion, CanaryRollout, RollbackCondition, RolloutStep, ) rollout = CanaryRollout( name="database-write-capability", steps=[ RolloutStep( name="canary", weight=0.05, # 5% of agents get the new capability duration_seconds=86400, # 24 hours analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.995), AnalysisCriterion(metric="performance_sli", threshold=0.90), AnalysisCriterion( metric="error_budget_consumed", threshold=0.10, comparator="lte", # canary can burn at most 10% ), ], ), RolloutStep( name="early-adopters", weight=0.25, # 25% traffic duration_seconds=172800, # 48 hours analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.990), AnalysisCriterion(metric="performance_sli", threshold=0.88), ], ), RolloutStep( name="general-availability", weight=1.0, # 100% traffic duration_seconds=604800, # 1 week of full observation analysis=[ AnalysisCriterion(metric="safety_sli", threshold=0.990), AnalysisCriterion(metric="performance_sli", threshold=0.85), ], ), ], rollback_conditions=[ RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"), ], ) # Start the rollout -- SLO gates evaluate at each step rollout.start() # Advance to next step when analysis criteria pass if rollout.advance(): print(f"Advanced to step: {rollout.current_step.name}") print(f"Progress: {rollout.progress_percent:.0f}%") The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically. This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic. Health Checks and Backpressure Traditional health checks answer: is the service alive? For agents, alive is not enough. A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned. # Agent health check covering multiple dimensions health = await agent_health_check( agent_id="analyst-agent-001", dimensions=[ "liveness", # Is the agent process running? "policy_compliance", # Is safety SLI above threshold? "trust_score", # Is trust score above Ring floor? "resource_budget", # Is token/API spend within limits? "tool_availability", # Are the tools the agent needs reachable? ], ) # health.status: "healthy" | "degraded" | "unhealthy" # health.dimensions: per-dimension pass/fail with values # health.recommended_action: "none" | "restrict" | "suspend" | "terminate" When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, drain in-flight tasks gracefully before the situation escalates. # Backpressure configuration backpressure_config = { "backpressure_threshold": 0.80, # Engage when resource utilization > 80% "max_concurrent": 5, # Hard cap on simultaneous agent tasks "priority_shedding": True, # Drop low-priority tasks first "drain_timeout_seconds": 30, # Allow in-flight tasks to complete } The ordering matters: backpressure first, then circuit breaker, then suspension. Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services. Observability: Governance Metrics Flow Into Your Existing Stack Agent SRE does not ask you to adopt a new observability platform. Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others. from agent_sre.tracing.exporters import configure_exporters configure_exporters( backends=[ {"type": "prometheus", "endpoint": "http://prometheus:9090"}, {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"}, ], include_metrics=[ "slo.safety_sli", # Per-agent safety compliance rate "slo.error_budget_remaining", # Error budget in percentage "slo.burn_rate", # Current burn rate vs sustainable "circuit_breaker.state", # CLOSED / OPEN / HALF_OPEN "circuit_breaker.failure_count", "trust_score.current", # Agent trust score (0-1000) "trust_score.ring", # Current execution ring "chaos.experiments_run", # Chaos experiment telemetry "health.status", # Aggregate health status "backpressure.load", # Current load vs threshold ], ) Key governance metrics available in your existing dashboards: Metric What It Tells You Alert Condition slo.safety_sli Fraction of agent actions within policy < 0.99 slo.burn_rate Rate at which error budget is consumed > 2.0 (warn), > 5.0 (page) slo.error_budget_remaining Budget left for the SLO window < 20% circuit_breaker.state Current breaker state per agent OPEN or HALF_OPEN trust_score.ring Execution ring (privilege level) Ring 3 (untrusted) health.status Aggregate health across all dimensions degraded or unhealthy If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack. The SRE Mental Model for Agents: Four New Concepts Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems: Traditional SRE Agent SRE Equivalent What Changes Latency SLI Safety SLI Correctness of *action*, not speed of *response* Error budget Autonomy budget Burns on policy violations, not just errors Circuit breaker Behavioral circuit breaker Opens on wrong *behavior*, not just failure codes Canary deployment Capability rollout Rolls out *scope*, not just code The governance insight is that error budgets work in both directions for agents. A service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when it degrades. The error budget mechanism becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions. Getting Started with Agent SRE pip install agent-sre A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them. from agent_sre import SLO, ErrorBudget from agent_sre.slo.indicators import TaskSuccessRate from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker # Step 1: Define your safety SLO slo = SLO( name="production-safety", indicators=[TaskSuccessRate(target=0.99, window="24h")], error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0), ) # Step 2: Configure a circuit breaker breaker_config = CircuitBreakerConfig( failure_threshold=5, recovery_timeout_seconds=60, half_open_max_calls=3, ) breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config) # Step 3: Wire into your existing agent loop async def governed_agent_loop(agent, task): # Check health first if not await agent_is_healthy(agent.id): return {"error": "agent suspended", "reason": "health check failed"} # Run within circuit breaker protection async with breaker: result = await agent.run(task) slo.record_event(good=result.policy_compliant) return result The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines. Why This Matters Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down. Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it. The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it. Agent SRE is that infrastructure. Resources GitHub: github.com/microsoft/agent-governance-toolkit Install: pip install agent-sre Tutorials: 40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery Architecture reference: ARCHITECTURE.md OWASP compliance mapping: OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection Part 1 -- Runtime governance: Policy engines, trust, and SRE overview Part 2 -- Shift-left governance: Catching violations before production Part 3 -- Post-hoc accountability: After the agent acts The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The `agent-sre` package is currently in public preview; APIs may change before general availability. Questions about Agent SRE in your environment? Open an issue at aka.ms/agent-governance-toolkit or start a discussion in the comments below.321Views1like0CommentsAfter the Agent Acts: Proving What Happened and Who Authorized It
In part one of this series, we covered AGT's runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. In part two, we moved earlier in the lifecycle: shift-left governance, CI/CD gates, attestation workflows, and supply chain integrity. Both posts focused on governance that happens around the moment of action, before it, during it, or right after it. That coverage is essential. But after those posts went live, a different pattern emerged in conversations with teams deploying agents in production. The question was more pointed: "An agent executed a financial transfer last Tuesday. A compliance officer is asking us to show who authorized it, through what chain, and exactly what scope it was granted. We have logs. But can we prove they weren't altered?" No policy engine prevents a past action. No CI gate reconstructs a delegation chain after the fact. No shift-left tool tells an auditor whether the cryptographic identity that authorized a trade was legitimately derived from a human principal, or was injected mid-chain. This is the accountability gap. It is the governance question that neither runtime enforcement nor pre-runtime checks were designed to answer. Regulatory frameworks are tightening: the EU AI Act includes high-risk obligations with enforcement timelines in 2026, and the Colorado AI Act introduces requirements for automated decision-making. Courts are beginning to encounter AI agents in the evidentiary record. The accountability infrastructure has not caught up. This post covers what post-hoc accountability means for autonomous agents, what the Agent Governance Toolkit has to help address it, and three value propositions that are real but not yet visible in how governance tooling is typically described. Note: The policy files, workflow configurations, and code samples in this post are illustrative examples designed to show the concepts. For working implementations, see the QUICKSTART.md in the repository. The Accountability Gap in Multi-Agent Systems The accountability problem is architectural. When a single agent takes a single action, accountability is straightforward: you know which model ran, what prompt it received, and what it called. When agents delegate to sub-agents, which delegate further to tool-execution agents, the chain of authorization becomes progressively disconnected from the original human instruction that started it. Consider this delegation topology, common in any production orchestration scenario: Human Principal └── Orchestrator Agent (did:mesh:orchestrator-001) └── Data Analyst Agent (did:mesh:analyst-001) └── File Write Tool (write /reports/q3-summary.csv) By the time file_write fires, three delegation hops have occurred. The file write tool has no reliable way to know whether the human principal actually authorized file writes, what scope they granted to the orchestrator, or whether the analyst agent's instructions arrived through a legitimate delegation or were injected by a prompt injection attack. This gap has three concrete consequences: Consequence Operational Impact Post-hoc audits cannot reconstruct authorization Incident investigations are limited to "the agent did this," not "here is who authorized this, through what chain, at what time, with what scope" Agents cannot distinguish legitimate delegation from injection A prompt injection attack that inserts itself into a delegation chain is indistinguishable from a real orchestrator instruction without cryptographic verification Accountability cannot be attributed to a human authorization event When a regulator asks "who is responsible for this action," the answer is a shrug and a log file AGT already has the technical foundations designed to help close all three. The gap is not capability, it is visibility. What AGT Has: The Cryptographic Accountability Stack AGT's accountability infrastructure spans three components that work together: cryptographic agent identity, delegation chains, and tamper-evident audit logs. 1. Ed25519 Agent Identity with Lifecycle Management Every agent in an AGT-governed system carries a cryptographic identity: a verifiable Ed25519 keypair with a W3C DID Document that can be exported, shared, and verified by any participant in the system. from agentmesh import AgentIdentity, IdentityRegistry # Create a verifiable agent identity identity = AgentIdentity.create( name="data-analyst", sponsor="operator@contoso.com", capabilities=["data.read", "report.write"], organization="data-team", description="Q3 close data analyst agent" ) # Export as W3C DID Document for cross-system verification did_document = identity.to_did_document() # Register in the shared identity registry registry = IdentityRegistry() registry.register(identity) Identity lifecycle states, active, suspended, revoked, are tracked and cascaded. When an orchestrator identity is revoked, every downstream agent delegated from it is also invalidated. This cascade revocation behavior lets you kill a compromised delegation chain from its root rather than hunting sub-agents individually. 2. Delegation Chains with Scope Inheritance When an orchestrator delegates to a sub-agent, AGT records the delegation cryptographically: who delegated, to whom, what capabilities were transferred, and what restrictions were applied. Sub-agents are designed to be unable to exceed the scope of their delegating principal. from agentmesh import ScopeChain, DelegationLink # Create a scope chain rooted in a human sponsor chain, root_link = ScopeChain.create_root( sponsor_email="operator@contoso.com", root_agent_did=str(orchestrator_identity.did), capabilities=["data.read", "report.write", "data.delete"], sponsor_verified=True, ) # Orchestrator delegates narrowed scope to analyst agent link = DelegationLink( link_id="link-analyst-001", depth=1, parent_did=str(orchestrator_identity.did), child_did=str(analyst_identity.did), parent_capabilities=["data.read", "report.write", "data.delete"], delegated_capabilities=["data.read", "report.write"], # narrowed: no delete parent_signature=orchestrator_identity.sign( f"{orchestrator_identity.did}:{analyst_identity.did}:data.read,report.write".encode() ), link_hash="", # computed on add previous_link_hash=root_link.link_hash, ) link.link_hash = link.compute_hash() chain.add_link(link) # Verify the entire chain: scope narrowing + hash integrity + signatures valid, reason = chain.verify() if not valid: raise ValueError(f"Chain verification failed: {reason}") The scope chain carries the human authorization context: the root sponsor email, when the chain was created, and what capabilities were granted at the top. Every downstream agent can trace any capability back through the chain using chain.trace_capability("data.read"). A file write tool executing three hops from the human principal can verify that the original sponsor authorized file writes in this scope. This is the mechanism designed to help close the prompt injection gap: an injected instruction cannot produce a valid signed delegation link from a legitimate orchestrator identity. 3. Tamper-Evident Audit Logs Every policy decision, every delegation event, every tool call, every trust score evaluation: AGT writes a signed, append-only audit record. The signature covers the content hash of the log entry plus the hash of the preceding entry, forming a chain where tampering is designed to be detectable. from agentmesh import PolicyEngine, AuditLog # Create the audit log (with optional external sink for production) audit_log = AuditLog() # Log a governance decision entry = audit_log.log( event_type="policy_decision", agent_did=str(analyst_identity.did), action="report.write", resource="/reports/q3-summary.csv", data={"task_id": "q3-close-2026"}, outcome="success", policy_decision="allow", ) # Verify the audit chain has not been tampered with valid, reason = audit_log.verify_chain() # valid == True: all hashes and chain links are intact # Query audit trail for a specific agent trail = audit_log.get_entries_for_agent(str(analyst_identity.did)) The audit trail for a single task session includes the complete delegation chain, from human authorization event at the top to tool execution at the bottom, with cryptographic signatures at every step. Validating a Compliance Evidence Package The three components above are most powerful when used together. At runtime, AGT's audit chain, identity registry, and delegation system each produce structured records. Assembling these into a single evidence package for compliance submission or incident investigation is a deployment-level concern: your CI pipeline or orchestration layer collects the outputs into a JSON artifact. Once assembled, AGT's agt verify --evidence flag validates the package: checking that signatures are intact, delegation chains are complete, and audit entries have not been tampered with. # Validate a runtime evidence package agt verify --evidence ./agt-evidence.json # Strict mode: fail if evidence is missing, incomplete, or signatures don't verify agt verify --evidence ./agt-evidence.json --strict Future direction: A built-in agt evidence collect command to automate evidence assembly is on the backlog. The evidence package helps answer the audit questions directly: Auditor Question Where It Lives in the Evidence Package Which agent executed this action? identity.agent_id with Ed25519 public key Who authorized it? delegation_chain[0].human_principal with timestamp What scope was granted? delegation_chain[*].granted_capabilities at each hop Was the delegation legitimate? delegation_chain[*].signature, verifiable against issuer's public key Was the audit log altered? audit_trail.chain_valid: true/false with entry-level hash verification What policy governed the action? policy_decision.rule_name with the policy YAML snapshot at decision time This is the difference between "we have logs" and "here is a verifiable chain of custody backed by cryptographic signatures." The Governance Dial: Enabling Autonomy, Not Just Blocking Risk There is a framing problem in how agent governance is typically described. Governance is described almost entirely as a constraint: what agents cannot do, what gets blocked, what violations get caught. This framing is accurate but incomplete. Governance is the mechanism that helps you safely expand what your agents can do. Without governance evidence, every expansion of agent autonomy is a leap of faith. With it, expansions are decisions with a measured risk profile: Scenario Without Governance Evidence With AGT Accountability Stack Expand agent to write to production databases Requires human approval on every write indefinitely Pilot with human-in-loop for 500 writes; audit trail shows 0 violations; graduate to autonomous Deploy agent in a regulated data environment Blocked by legal until "we can prove it" Evidence package helps satisfy audit requirement; deployment proceeds Respond to a security incident involving an agent Manually reconstruct what happened from scattered logs Pull the task session's evidence package; full chain of custody in minutes The governance layer is the dial between supervised and autonomous operation. Audit evidence is what helps justify turning the dial further in the autonomous direction. Blast Radius: The Governance Assurance You're Not Advertising The sandboxing and privilege ring system in AGT is typically described in security terms: isolation, privilege reduction, process-level enforcement. But there is a more concrete operational value: blast radius definition before an incident occurs. The question every operations team needs to answer before deploying an autonomous agent at scale is: *"If this agent goes wrong, not if, when, what is the worst-case outcome?"* Without governance-enforced privilege boundaries, the answer is uncomfortably open-ended. With AGT's capability model and execution rings, the blast radius is a policy configuration: a bounded, declared set of resources the agent can touch, scoped to what the task requires. # policies/financial-agent.yaml apiVersion: governance.toolkit/v1 version: "1.0" name: financial-agent-policy default_action: deny rules: - name: allow-report-write condition: "tool_name == 'report.write' and path.startswith('/data/reports/')" action: allow priority: 10 - name: allow-data-read condition: "tool_name == 'data.read' and path.startswith('/data/processed/')" action: allow priority: 10 With this policy in place, the worst-case outcome for this agent is declared in the policy file, not discovered during a post-incident review. The audit log records not just what the agent did, but also every action that was blocked, giving you a full picture of how close any session came to the declared blast boundary. Regulatory Alignment The OWASP-COMPLIANCE.md in the AGT repository maps the toolkit's controls to each of the 10 OWASP Agentic AI risks. The compliance picture for specific regulatory frameworks: Regulatory Requirement Relevant Framework AGT Control Technical documentation for high-risk AI EU AI Act, Art. 9-11 Evidence package, policy audit trail, OWASP attestation Logging for automated decisions EU AI Act, Art. 12 Tamper-evident audit log with entry-level signatures Human oversight mechanisms EU AI Act, Art. 14 Circuit breakers, privilege rings, delegation scope limits Algorithmic impact assessment Colorado AI Act Policy snapshot at decision time, signed governance evidence Audit trail for automated decisions HIPAA, SOC 2 Type II Immutable audit log with W3C DID-based agent identity Non-repudiation of agent actions Financial services (MiFID II, SEC) Ed25519-signed audit entries, delegation chain with human auth context Note: The Agent Governance Toolkit does not guarantee compliance with any specific regulatory framework. The mappings above show how the toolkit's controls align with common requirements. Consult legal counsel for your specific obligations. Putting It Together The three posts in this series cover three distinct layers of the governance lifecycle: Layer Timing Primary Value Post Shift-left governance Before production Catch policy violations at commit, PR, and CI time Part 2 Runtime governance At the moment of action Deterministic policy enforcement, zero-trust identity, sandboxing Part 1 Post-hoc accountability After the action Cryptographic chain of custody, blast radius evidence, regulatory proof This post None of these layers substitutes for the others. Pre-runtime governance cannot prevent a runtime violation. Runtime enforcement cannot retroactively prove authorization. Post-hoc accountability cannot undo an action that runtime governance should have blocked. They compose. Getting Started If you already have the AGT policy engine in place, the path to full accountability coverage is incremental: Add agent identity - Create identities for each agent and register them. Export DID documents for cross-service verification. Record delegation tokens - At each orchestrator-to-agent delegation boundary, create and sign a delegation link. Pass tokens as context to the policy engine. Configure a tamper-evident audit backend - Configure the audit chain with a signing key and chain verification. For production, use an immutable backend: Azure Blob with WORM retention, S3 Object Lock, or equivalent. Generate your first evidence package: agt verify --evidence ./agt-evidence.json --strict Add evidence generation to your CI/CD release gate: # .github/workflows/release.yml - name: Governance Evidence Gate uses: microsoft/agent-governance-toolkit/action@<sha> #v3.5.0 with: command: governance-verify evidence-path: ./agt-evidence.json strict: true fail-on-missing-chain: true Conclusion Runtime governance and shift-left governance answer the question: did we apply the right controls? Post-hoc accountability answers the question: can we prove it? The Agent Governance Toolkit has the technical infrastructure designed to help answer it: Ed25519 agent identity with cascade revocation, cryptographically signed delegation chains with human authorization context, and tamper-evident audit logs that form a verifiable chain of custody from human principal to terminal tool call. The governance dial analogy is worth keeping. Every autonomous agent deployment exists on a spectrum between fully supervised and fully autonomous. The limiting factor on where you can set that dial is not model capability or framework maturity. It is how much governance evidence you have, and how verifiable that evidence is. Resources GitHub: microsoft/agent-governance-toolkit: AI Agent Governance Toolkit — Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers 10/10 OWASP Agentic Top 10. Quickstart: Quick Start - Agent Governance Toolkit OWASP Compliance Mapping: OWASP Compliance - Agent Governance Toolkit PyPI: pip install agent-governance-toolkit[full] npm: npm install microsoft/agent-governance-sdk NuGet: dotnet add package Microsoft.AgentGovernance Have questions about deploying AGT in your environment? Open an issue at aka.ms/agent-governance-toolkit or join the conversation in the comments below.216Views0likes0CommentsShift-Left Governance for AI Agents: How the Agent Governance Toolkit Helps You Catch Violations
In part one of this series, we covered AGT’s runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. That post focused on what happens when an agent acts: policy evaluation at the moment a tool call fires, trust scoring when agents communicate, audit logging when decisions are made. Runtime governance is essential. But it is the last line of defense. After that post went live, a pattern emerged in conversations with teams adopting AGT. The same question kept coming up: runtime checks are useful, but what about everything before production? We realized runtime governance was only half the story. So we went back and built tooling for every stage of your software development lifecycle, from the moment a developer saves a file to the moment an artifact ships to users. Why Runtime Governance Is Not Enough AI agents are a new class of workload. They reason about what to do, select tools, call APIs, read databases, and spawn sub-processes, often in loops that run without direct human oversight. The OWASP Agentic AI Top 10 (published December 2025) identifies risks like excessive agency, insecure tool use, privilege escalation, and supply chain compromise. These risks span the entire lifecycle, not just runtime. Consider a few scenarios that runtime governance alone cannot prevent: A developer commits a policy YAML file with a typo that silently disables all deny rules. The agent runs unprotected until someone notices. A dependency update introduces a package with a known critical CVE. The agent starts using a vulnerable library before any security team reviews it. A contributor adds a raw cryptographic import to an application module, bypassing the security-audited signing library. The code compiles and ships. A GitHub Actions workflow uses an expression injection pattern that allows an attacker to execute arbitrary code in CI. A release ships without a Software Bill of Materials (SBOM), making it impossible to trace which components are affected when the next log4j-style vulnerability drops. Each of these is a governance failure, but none of them happens at runtime. They happen at commit time, at PR review time, at build time, or at release time. A comprehensive governance strategy needs coverage at every stage. Four Stages of Pre-Runtime Governance Governance violations can enter a codebase at four distinct stages of the development lifecycle. Each stage has a different class of risk, and each needs a different kind of check: Stage When It Runs What It Catches AGT Tooling Commit-time Before code leaves the developer machine Malformed policies, schema violations, secrets, stub code, unauthorized crypto Pre-commit hooks, quality gates PR-time When a pull request is opened or updated Vulnerable dependencies, missing attestation, secrets in history, unpinned versions GitHub Actions (attestation, dependency review, secret scanning, supply chain checks) CI/Build-time On every push and pull request to main Compliance violations, binary security issues, dependency confusion, workflow injection Governance Verify action, Security Scan action, CodeQL, BinSkim, policy validation Release-time Before artifacts are published Missing provenance, unsigned artifacts, incomplete SBOMs SBOM generation, Sigstore signing, build attestation, OpenSSF Scorecard Just as with bugs, the earlier you catch a governance violation, the cheaper it is to fix. A malformed policy file caught at commit time costs zero CI minutes. A secret caught in PR review never reaches the default branch. A dependency confusion attack blocked in CI never reaches production. An unsigned artifact blocked at release time never reaches users. Stage 1: Commit-Time Governance with Pre-Commit Hooks The fastest governance feedback loop is local. Within the AGT project, we’ve implemented three pre-commit hooks that run automatically whenever a developer stages files for commit, validating governance artifacts before they ever leave the developer's machine. Built-In Hooks The toolkit's .pre-commit-hooks.yaml defines three hooks that any repository can adopt: Hook ID What It Validates File Pattern validate-policy YAML/JSON policy files against the AGT policy schema, checking for required fields, valid operators, and structural correctness Files matching *polic*.yaml, *polic*.yml, *polic*.json validate-plugin-manifest Plugin manifest files for required fields and schema compliance Files matching plugin.json, plugin.yaml, plugin.yml evaluate-plugin-policy Plugin manifests against a governance policy file, evaluating whether the plugin would be allowed under the organization's rules Files matching plugin.json, plugin.yaml, plugin.yml To adopt these hooks, add AGT as a pre-commit hook source: # .pre-commit-config.yaml repos: - repo: https://github.com/microsoft/agent-governance-toolkit rev: main # pin to a release tag in production hooks: - id: validate-policy - id: validate-plugin-manifest - id: evaluate-plugin-policy args: ['--policy', 'policies/marketplace-policy.yaml'] Then install and run: pip install pre-commit pre-commit install pre-commit run --all-files Extended Quality Gates Beyond schema validation, we built a pre-commit rollout template (see the full example in the repository) with additional governance-specific quality gates designed to help prevent common security anti-patterns from entering the codebase: Policy validation (agt-validate): Runs the full AGT policy CLI in strict mode, catching not just schema errors but semantic issues like conflicting rules. Health check (agt-doctor): Runs on pre-push (before code leaves the machine entirely), performing a broader health check of the governance configuration. Plugin metadata check (agency-json-required): Ensures every plugin directory contains the required agency.json metadata file. Stub detection (no-stubs): Blocks TODO, FIXME, HACK, and raise NotImplementedError markers in staged production code. Test files are excluded. Unauthorized crypto detection (no-custom-crypto): Blocks raw cryptographic imports (hashlib, hmac, crypto.subtle, System.Security.Cryptography, ring, ed25519-dalek) outside designated security modules. This helps ensure all cryptographic operations go through the audited AGT signing libraries. Secret scanning (detect-secrets): Integrates Yelp's detect-secrets for pattern-based secret detection on every commit. Phased Rollout for Teams Adopting pre-commit hooks across a team requires a thoughtful rollout. The AGT documentation includes a phased adoption guide: Week 1: Install hooks in permissive mode. Hooks warn on violations but do not block the commit. This lets developers see what would be caught without disrupting workflow. Week 2: Switch to strict mode for policy validation only. Policy files must pass schema validation to be committed. Week 3: Enable all hooks as blocking. Stubs, unauthorized crypto, and secrets are now blocked at commit time. Week 4: Graduate to full blocking mode and remove the permissive fallback. This approach helps teams build confidence in the governance tooling before it becomes a hard gate. Stage 2: PR-Time Gates Pre-commit hooks catch issues on the developer's machine, but they can be bypassed (force push, direct GitHub edits, hooks not installed). PR-time gates provide the second layer of defense, running in GitHub Actions on every pull request before merge is allowed. Governance Attestation The Governance Attestation action validates that PR authors have completed a structured attestation checklist before their code can merge. The default checklist covers seven sections: Security review Privacy review Legal review Responsible AI review Accessibility review Release Readiness / Safe Deployment Org-specific Launch Gates The action is fully configurable. Organizations can customize the required sections, set a minimum PR body length, and choose their own attestation format. Outputs include the validation status, a list of errors for missing sections, and a JSON mapping of sections to checkbox counts. Here is an example workflow: # .github/workflows/pr-governance.yml name: PR Governance on: pull_request: types: [opened, edited, synchronize] jobs: attestation: runs-on: ubuntu-latest steps: - uses: microsoft/agent-governance-toolkit/action/governance-attestation@main with: required-sections: | 1) Security review 2) Privacy review 3) Responsible AI review Dependency Review The dependency review workflow helps block PRs that introduce dependencies with known CVEs or disallowed licenses. It uses the GitHub dependency-review-action with a curated license allowlist: - uses: actions/dependency-review-action@v4 with: fail-on-severity: moderate comment-summary-in-pr: always allow-licenses: > MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC, PSF-2.0, Python-2.0, 0BSD, Unlicense, CC0-1.0, CC-BY-4.0, Zlib, BSL-1.0, MPL-2.0 This runs on every PR that touches dependency manifests (package.json, Cargo.toml, pyproject.toml, requirements.txt). Dependencies with moderate or higher CVEs are flagged, and dependencies with licenses not on the allowlist are blocked. Secret Scanning The secret scanning workflow runs on every PR to the main branch and on a weekly schedule. It combines two complementary approaches: Gitleaks: Pattern-based secret detection across the full git history, catching API keys, tokens, and credentials that may have been committed at any point. High-entropy string scanning: Regex-based detection of common secret patterns including GitHub tokens (ghp_, gho_), AWS access keys (AKIA), Slack tokens (xox), and base64-encoded strings with high entropy. Supply Chain Integrity A dedicated supply chain check workflow triggers when dependency manifest files change. It enforces two rules that help prevent supply chain attacks: Exact version pinning: No ^ or ~ version ranges in package.json files. This prevents unexpected minor/patch version updates that could introduce compromised code. Lockfile presence: Every package directory with dependencies must have a corresponding lockfile (package-lock.json, pnpm-lock.yaml, or yarn.lock). Lockfiles help ensure reproducible builds with verified integrity hashes. Quality Gates The quality gates workflow mirrors the pre-commit hooks at the PR level, providing defense in depth. It runs four checks on every pull request: Gate Purpose No Stubs/TODOs Blocks TODO, FIXME, HACK markers in production code (test files excluded) No Unauthorized Crypto Blocks raw cryptographic imports outside designated security modules Security Audit Required Changes to security-sensitive paths require accompanying audit documentation Dependency Audit Trail Vendored patches must have an audit trail explaining the patch and its provenance These gates catch anything that bypasses pre-commit hooks: force-pushed commits, direct GitHub web edits, commits from contributors who have not installed the hooks. Stage 3: CI/Build-Time Governance Once a PR passes the gate workflows, the main CI pipeline and specialized workflows perform deeper, more computationally intensive analysis. The Governance Verify Action The Governance Verify action is the primary CI-time governance check. It is a GitHub Actions composite action that installs the toolkit and runs the compliance CLI against your repository. It supports four modes: Command What It Does governance-verify Runs the full compliance verification suite, checking governance controls and reporting how many pass marketplace-verify Validates a plugin manifest against marketplace requirements (required fields, signing, metadata) policy-evaluate Evaluates a specific policy file against a JSON context, returning the allow/deny decision with the matched rule all Runs governance-verify, then marketplace-verify and policy-evaluate if the corresponding paths are provided Here is an example: # .github/workflows/governance-ci.yml name: Governance CI on: [push, pull_request] jobs: verify: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: microsoft/agent-governance-toolkit/action@main with: command: all policy-path: policies/ manifest-path: plugin.json output-format: json fail-on-warning: 'true' The action outputs structured data including controls-passed, controls-total, violations count, and full command output in JSON format. This makes it straightforward to integrate with dashboards, Slack notifications, or downstream decision logic. The Security Scan Action A separate security scan action scans directories for secrets, CVEs, and dangerous code patterns. Unlike the PR-time secret scanning (which focuses on git history), this action performs deep content analysis of the current codebase: - uses: microsoft/agent-governance-toolkit/action/security-scan@main with: paths: 'plugins/ scripts/' min-severity: high exemptions-file: .security-exemptions.json The action supports configurable severity thresholds (critical, high, medium, low), an exemptions file for acknowledged findings, and structured JSON output with findings-count, blocking-count, and detailed findings. Policy Validation Workflow A dedicated policy validation workflow triggers whenever YAML files or the policy engine source code changes. It performs two jobs in sequence: Validate policies: Discovers all policy files matching the *policy* naming convention, then validates each file using the AGT policy CLI. Test policies: Runs the policy CLI unit tests to verify that policy evaluation behavior is correct after the changes. This ensures that policy file edits do not break the policy engine and that policy semantics are preserved. CodeQL and Static Analysis AGT uses GitHub's CodeQL for semantic static analysis of Python and TypeScript code. The CodeQL workflow runs on pushes and PRs, performing deep dataflow analysis that goes beyond pattern matching. Results are uploaded as SARIF to GitHub's Security tab, providing a centralized view of code quality issues. Dependency Confusion Scanning A dedicated CI job runs a dependency confusion scanner on every build. This is a targeted defense against a specific supply chain attack vector where an attacker registers a public package with the same name as an internal package. The scanner checks that: Internal package names do not collide with public PyPI or npm packages Notebook pip install commands only reference packages that are registered and expected Workflow Security Auditing When GitHub Actions workflow files change, a workflow security job scans for common CI/CD security issues: Expression injection: Detects patterns like ${{ github.event.pull_request.title }} used directly in run: blocks, which can allow arbitrary code execution. Overly permissive permissions: Flags workflows that request more permissions than necessary. Unpinned action references: Detects actions referenced by branch name instead of commit SHA, which is a supply chain risk. .NET Binary Analysis with BinSkim For the .NET SDK (Microsoft.AgentGovernance), the CI pipeline runs Microsoft BinSkim binary security analysis on compiled assemblies. BinSkim checks for security-relevant compiler and linker settings in compiled binaries, such as DEP (Data Execution Prevention), ASLR (Address Space Layout Randomization), and stack protection. Results are uploaded as SARIF to GitHub code scanning alongside the CodeQL results. The ci-complete Gate Pattern With many CI jobs that conditionally run based on path filters, AGT uses a pattern called ci-complete: a single gate job that is configured as the sole required status check in branch protection. This job runs unconditionally (if: always()), depends on all other CI jobs, and checks that none of them failed. Jobs that were skipped (because no relevant files changed) are acceptable. This pattern ensures that branch protection works correctly with conditional CI jobs, preventing the common issue where skipped jobs report as "skipped" and fail required status checks. Language-Specific Compile-Time Enforcement Beyond the language-agnostic CI checks, each AGT SDK uses its language's native compiler and tooling to enforce governance standards at compile time. .NET: The Strictest Compile-Time Checks The .NET SDK (Microsoft.AgentGovernance) enforces compile-time governance through MSBuild properties in Directory.Build.props and Directory.Build.targets, which apply automatically to every project in the SDK: Feature MSBuild Property Effect Nullable reference types <Nullable>enable</Nullable> The compiler warns on every possible null dereference, helping prevent NullReferenceException at compile time Warnings as errors <TreatWarningsAsErrors>true All compiler warnings become build errors for packable projects; no warnings can be shipped to consumers Strong-name signing <SignAssembly>true</SignAssembly> Assemblies are signed with a strong-name key (AgentGovernance.snk), enabling identity verification Deterministic builds <ContinuousIntegrationBuild>true Identical source code produces bit-for-bit identical binaries in CI, enabling build verification SourceLink Microsoft.SourceLink.GitHub package Users can step into AGT source code when debugging, supporting transparency and auditability Symbol packages <IncludeSymbols>true</IncludeSymbols> .snupkg symbol packages are published alongside NuGet packages for debugging support TypeScript: Strict Compilation and Linting The TypeScript SDK (@microsoft/agentmesh-sdk) uses strict compiler settings and ESLint for build-time governance: Strict mode ("strict": true in tsconfig.json) enables all strict type-checking options, including noImplicitAny, strictNullChecks, strictFunctionTypes, and strictBindCallApply. Consistent file naming (forceConsistentCasingInFileNames) prevents cross-platform issues where imports work on case-insensitive file systems (Windows, macOS) but fail on case-sensitive ones (Linux CI). Declaration generation (declaration: true with declarationMap: true) produces .d.ts files for consumers, enabling downstream type checking. ESLint with @typescript-eslint provides static analysis during the build process, catching issues beyond what the TypeScript compiler checks. Python: Type Safety and Fast Linting Python packages in AGT use typed package markers and static analysis tooling configured in pyproject.toml: py.typed marker: Each package includes a py.typed file, signalling to type checkers (mypy, pyright, Pylance) that the package supports type checking. Consumers get type errors if they misuse the AGT API. mypy: Configured as a dev dependency with project-specific settings in pyproject.toml. Provides static type checking that catches type mismatches before runtime. ruff: A fast Python linter written in Rust, configured in pyproject.toml and enforced in CI. Ruff checks for hundreds of code quality rules at build time. Stage 4: Release-Time Gates Before artifacts reach users, the release pipeline adds a final layer of verification. These gates help ensure that what ships is exactly what was built, is signed by the expected publisher, and has a complete inventory of its components. Gate Tool What It Produces SBOM generation Anchore/Syft SPDX and CycloneDX software bills of materials listing every component, dependency, and licence Python signing Sigstore Cryptographic signature using OpenID Connect identity, verifiable without manual key distribution .NET signing RELEASE PIPELINE Microsoft Authenticode and NuGet signing through the release pipeline Build provenance actions/attest-build-provenance SLSA provenance attestation linking the artifact to its source commit and build environment SBOM attestation actions/attest-sbom Binds the SBOM to the specific release artifact, creating a verifiable link between the inventory and the binary Additionally, the OpenSSF Scorecard runs on schedule, providing an automated security posture assessment that covers branch protection, dependency management, CI/CD practices, and more. The score is published to the OpenSSF Scorecard website, giving consumers a transparent view of the project security practices. How It All Fits Together: Defense in Depth This approach follows a defense-in-depth principle: every check exists at multiple layers, so that bypassing one layer does not compromise the whole system. Secret scanning, for example, runs at three levels: detect-secrets at commit time (pre-commit hook), Gitleaks at PR time (secret scanning workflow), and the Security Scan action at CI time (content analysis). A developer who bypasses pre-commit hooks will still be caught by the PR-time gate. A contributor who force-pushes past the PR gate will still be caught by the CI pipeline. Similarly, policy validation runs at commit time (validate-policy hook), at PR time (quality gates), and at CI time (policy validation workflow). Each layer adds depth: the commit-time hook catches schema errors, the CI pipeline catches semantic issues and runs regression tests. The ci-complete gate job ties everything together. By depending on every CI job and serving as the single required status check, it ensures that no code merges to the main branch unless every applicable check has passed. Getting Started You can adopt AGT's shift-left governance incrementally. Here are three starting points, from lowest to highest effort: 1. Add the Governance Verify Action (5 minutes) Add a single GitHub Actions workflow that runs the compliance check on every PR: # .github/workflows/governance.yml name: Governance on: [pull_request] jobs: verify: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: microsoft/agent-governance-toolkit/action@main with: command: governance-verify 2. Enable Pre-Commit Hooks (15 minutes) Add a .pre-commit-config.yaml referencing AGT's hooks, install them, and run against all existing files to establish a baseline. Start in permissive mode and graduate to strict over four weeks. 3. Full Pipeline Integration (1-2 hours) Add the complete set of PR-time gates (attestation, dependency review, secret scanning, supply chain checks, quality gates), configure the Security Scan action for your plugin directories, and enable SBOM generation and signing in your release workflow. The AGT repository itself serves as a reference implementation: every workflow described in this post is running in production at aka.ms/agent-governance-toolkit. Important Notes The policy files, workflow configurations, and code samples in this post are illustrative examples. Your organization's governance requirements may differ. Review and customize all configurations before deploying to production. The Agent Governance Toolkit is designed to help organizations implement governance controls for AI agents; it does not guarantee compliance with any specific regulatory framework. Always consult your organization's security and legal teams when defining governance policies. What Comes Next Pre-runtime governance is one piece of the puzzle. Combined with the runtime governance capabilities covered in part one of this series (policy engines, zero-trust identity, execution sandboxing, audit logging), it provides coverage across the full lifecycle. The project continues to grow. Since the initial release, we’ve added a multi-stage policy pipeline (pre_input, pre_tool, post_tool, pre_output stages), approval workflows with human-in-the-loop gates, DLP attribute ratchets for monotonic session state, and OpenTelemetry instrumentation for governance operations. Over 45 step-by-step tutorials are available in the documentation. Everything described in this post is available today in the public GitHub repository. The full source, documentation, tutorials, and examples are at aka.ms/agent-governance-toolkit, open source under the MIT license. We welcome contributions, feedback, and issue reports from the community.491Views0likes0CommentsProject Pavilion Presence at KubeCon EU 2026
KubeCon + CloudNativeCon Europe 2026 took place from 23 to 26 March at RAI Amsterdam, and it was a strong one. The themes running through the week reflected where the cloud native community is right now: AI moving from experimentation into production, platform engineering continuing to mature, and security and sovereignty top of mind for organizations across Europe. Microsoft was there throughout, and once again supported a range of open source projects in the Project Pavilion. The Project Pavilion is a dedicated, vendor-neutral space on the show floor reserved for CNCF projects. It is where the work gets talked about honestly. Maintainers and contributors meet directly with end users, share what they are building, get real feedback on what is and is not working, and have the kinds of technical conversations that are hard to have anywhere else. For open source communities, it is one of the most valuable parts of the event. Why Our Presence Matters Microsoft's products and services are built on and alongside many of the technologies represented in the pavilion, and the health of these communities matters to us directly. Showing up means our teams hear firsthand what is working, what is missing, and where these projects need to go next. It also means we get to contribute as community members, not just as a company name on a sponsor board. That distinction matters to us, and to the communities we are part of. Microsoft-Supported Pavilion Projects Confidential Containers Representative: Jeremi Piotrowski The Confidential Containers booth gave attendees a chance to learn more about the project and its approach to protecting workloads using hardware-based trusted execution environments. Jeremi was on hand throughout the kiosk hours, fielding questions from interested users and developers exploring confidential computing in Kubernetes environments. Conversations touched on use cases around data privacy, regulated workloads, and the role Confidential Containers plays in the broader cloud-native security landscape. Drasi Representative: Daniel Gerlag and Nandita Valsan The Drasi team had a busy time in the pavilion, engaging around 40 attendees across two kiosk shifts in focused technical conversations. Most visitors were developers and platform engineers curious about change-driven architectures and real-time data processing. There was strong positive feedback on the newly introduced Drasi Server modes and embeddable library, which complement Drasi for Kubernetes. The team came away with useful validation of current design decisions and good input for the roadmap ahead. Envoy Representative: Mikhail Krinkin The Envoy booth was staffed for the full duration of KubeCon EU by maintainers from Microsoft, Google, Isovalent, and Tetrate, reflecting the broad and healthy contributor base behind the project. The biggest topic at the booth was migration from ingress-nginx to Gateway API implementations. The archival of ingress-nginx pushed a lot of users into making changes they were not quite ready for, and questions ranged from technical specifics like HTTP default differences between Envoy and Nginx, to more foundational questions about what Envoy and Gateway API actually are. The team had anticipated this and invested in the ingress2gateway project to give users a clear migration path. Extensibility was another frequent conversation topic, with dynamic modules increasingly becoming the go-to answer for user-specific requirements. Starting with the 1.38 release of Envoy, dynamic modules will have a backward compatible ABI, a sign of real production readiness for that feature. Flatcar Representative: Thilo Fromm and Mathieu Tortuyaux The Flatcar booth had great energy, with maintainers from Microsoft, STACKIT, and CloudBase joining for conversations throughout the pavilion hours. Operational sovereignty came up again and again as a theme, with users and consulting partners sharing how they are building their Kubernetes offerings on Flatcar because of how reliable and secure it is. There were a lot of meaningful conversations. Lambda.ai currently runs Flatcar on their control plane and is looking at extending it to worker and customer clusters, with interest in contributing to the project. ReeVo has built their hosted Kubernetes distro on Flatcar across multi-cloud and bare metal environments and is planning to move hundreds of customer clusters over soon. Users from ClearScore, Avassa, Recorded Future, and several other organizations also stopped by with positive feedback on the project's robustness and security. STACKIT uses Flatcar as the default OS for their hosted Kubernetes offering and sponsors a full-time maintainer for the project. The team also connected with TAG Infrastructure to talk through Flatcar's CNCF graduation progress. Headlamp Representatives: René Dudfield and Santhosh Nagaraj S The Headlamp booth was a busy one, with users, contributors, and partner projects all stopping by throughout the pavilion hours. Conversations covered real-world deployments, federation challenges, multi-tenant namespace visibility, and feature requests like multi-CR data aggregation. There was notable interest from consultancies deploying Headlamp across hundreds of customer clusters, as well as from companies already running it at cloud scale. Several CNCF projects expressed interest in building UIs for their own projects inside Headlamp, with a few even getting started right there at the conference. The team also heard from users getting budget approved to migrate from the deprecated Kubernetes Dashboard, which is a good sign for the project's growing momentum. Demand for air-gapped AI agent support and deeper Azure and AKS integrations for internal developer platforms came up as clear areas to watch. Hyperlight Representative: Ralph Squillace The Hyperlight booth ran as a half-day session on Tuesday, in line with the project's current Sandbox status, but the corner location in the project area made a real difference in visibility. Ralph was fielding questions from the moment the doors opened, with a steady stream of visitors right up until the shift ended. Live and recorded demos were central to the conversations, helping attendees quickly grasp what Hyperlight does and how it fits into their environments. One standout visit came from an engineer at SAP who spent nearly an hour at the booth, pushing the conversation from fundamentals and embedding examples all the way through to agentic protection scenarios in Kubernetes. That conversation continued beyond KubeCon and turned into a scheduled meeting to explore a proof of concept, a good example of the kind of follow-on engagement the pavilion can generate. Inspektor Gadget Representative: Michael Friese and Qasim Sarfraz The Inspektor Gadget booth had a lot of great energy, drawing in contributors, new users, and people just discovering the project for the first time. There was genuine excitement around Inspektor Gadget Desktop and its visual troubleshooting experience for Kubernetes and Linux environments. The integration with HolmesGPT, which was also featured in the keynote, came up frequently and was one of the main talking points throughout the event. A theme that surfaced consistently in conversations with platform engineers was multi-tenancy, with teams looking for ways to safely give developers ad-hoc access to troubleshoot issues independently while keeping overall control at the platform level. It was a good set of conversations that reflected both the project's maturity and the growing demand for a flexible observability framework. Istio Representative: Mitch Connors, Mikhail Krinkin, Jackie Maertens and Mike Morris The Istio booth had steady traffic throughout the conference, with a noticeable shift in who was stopping by. More visitors came from teams with existing sidecar-based production deployments looking for guidance on moving to ambient mode, which is a change from previous years when ambient interest was mostly coming from greenfield users. The motivation to make the move was often tied to cost optimization and performance, with teams having read case studies and feeling more confident about the direction. That said, the increased interest also surfaced some real gaps, including requests for clearer migration guidance, more clarity around architectural differences like mTLS egress workflows, and better support for VM-based workloads. The team is planning to prioritize migration guidance over the coming months. The updated Istio Day format, with a half day of sessions at the Cloud Native Theater stage, also drew a strong crowd with standing room only throughout. Notary Project Representative: Toddy Mladenov and Flora Taagen The Notary Project kiosk drew a wide range of visitors, from people learning about container image signing for the first time to experienced engineers asking detailed questions about what is coming next on the roadmap. A major highlight of the week was the project's conference talk on per-layer dm-verity signing, which drew a packed room and over 660 online sign-ups, one of the stronger turnouts for a project-level session at the event. The talk walked through how the new capability moves container security beyond pull-time verification to continuous runtime protection, backed by dm-verity, which generated a lengthy Q&A and a lot of enthusiasm from the audience. The team also sees a real opportunity ahead as AI workloads push organizations to think harder about the integrity of models, datasets, and container images, and the interest at the booth reinforced that Notary Project is well positioned to play a big role in securing those workflows. ORAS Representative: Toddy Mladenov The ORAS kiosk was staffed by maintainers from Microsoft, NVIDIA, and Red Hat, a good reflection of the healthy multi-vendor community the project has built. Attendees engaged with maintainers on ORAS use cases and adoption, with conversations ranging from how artifacts are tagged and packaged to how ORAS fits into broader multi-cloud workflows. One practical takeaway from maintainer conversations was around leveraging the ORAS SDK more often as a substitute for CLI operations when working with container registries, which helps teams build simpler and more robust tooling. Radius Representative: Sylvain Niles and Will Tsai The Radius booth, supported by the Microsoft Azure Incubations team, attracted a good mix of enterprise platform teams, prospective adopters, and fellow open source maintainers throughout the pavilion hours. There was strong interest in the extensible Radius Resource Types feature and how it helps teams abstract infrastructure complexity and move workloads across different environments. Conversations also surfaced useful feedback on where the project should focus next, including agent-driven infrastructure workflows and using the Radius application graph to improve observability and operational visibility for cloud-native applications. Conclusion KubeCon EU 2026 was a good reminder of why this community continues to grow. The conversations in the Project Pavilion were substantive, the feedback was honest, and the connections made there will carry forward into the work. Microsoft will be back for KubeCon NA in Salt Lake City this November, and we are already looking forward to it. If you are interested in getting involved with any of these projects, the best starting point is each project's community directly. You are also welcome to reach out to Lexi Nadolski at lexinadolski@microsoft.com with any questions.78Views0likes0CommentsAgent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents
Last week we announced the Agent Governance Toolkit on the Microsoft Open Source Blog, an open-source project that brings runtime security governance to autonomous AI agents. In that announcement, we covered the why: AI agents are making autonomous decisions in production, and the security patterns that kept systems safe for decades need to be applied to this new class of workload. In this post, we'll go deeper into the how: the architecture, the implementation details, and what it takes to run governed agents in production. The Problem: Production Infrastructure Meets Autonomous Agents If you manage production infrastructure, you already know the playbook: least privilege, mandatory access controls, process isolation, audit logging, and circuit breakers for cascading failures. These patterns have kept production systems safe for decades. Now imagine a new class of workload arriving on your infrastructure, AI agents that autonomously execute code, call APIs, read databases, and spawn sub-processes. They reason about what to do, select tools, and act in loops. And in many current deployments, they do all of this without the security controls you'd demand of any other production workload. That gap is what led us to build the Agent Governance Toolkit: an open-source project, that applies proven security concepts from operating systems, service meshes, and SRE to the emerging world of autonomous AI agents. To frame this in familiar terms: most AI agent frameworks today are like running every process as root, no access controls, no isolation, no audit trail. The Agent Governance Toolkit is the kernel, the service mesh, and the SRE platform for AI agents. When an agent calls a tool, say, `DELETE FROM users WHERE created_at < NOW()`, there is typically no policy layer checking whether that action is within scope. There is no identity verification when one agent communicates with another. There is no resource limit preventing an agent from making 10,000 API calls in a minute. And there is no circuit breaker to contain cascading failures when things go wrong. OWASP Agentic Security Initiative In December 2025, OWASP published the Agentic AI Top 10: the first formal taxonomy of risks specific to autonomous AI agents. The list reads like a security engineer's nightmare: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents, and more. If you've ever hardened a production server, these risks will feel both familiar and urgent. The Agent Governance Toolkit is designed to help address all 10 of these risks through deterministic policy enforcement, cryptographic identity, execution isolation, and reliability engineering patterns. Note: The OWASP Agentic Security Initiative has since adopted the ASI 2026 taxonomy (ASI01–ASI10). The toolkit's copilot-governance package now uses these identifiers with backward compatibility for the original AT numbering. Architecture: Nine Packages, One Governance Stack The toolkit is structured as a v3.0.0 Public Preview monorepo with nine independently installable packages: Package What It Does Agent OS Stateless policy engine, intercepts agent actions before execution with configurable pattern matching and semantic intent classification Agent Mesh Cryptographic identity (DIDs with Ed25519), Inter-Agent Trust Protocol (IATP), and trust-gated communication between agents Agent Hypervisor Execution rings inspired by CPU privilege levels, saga orchestration for multi-step transactions, and shared session management Agent Runtime Runtime supervision with kill switches, dynamic resource allocation, and execution lifecycle management Agent SRE SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, production reliability practices adapted for AI agents Agent Compliance Automated governance verification with compliance grading and regulatory framework mapping (EU AI Act, NIST AI RMF, HIPAA, SOC 2) Agent Lightning Reinforcement learning training governance with policy-enforced runners and reward shaping Agent Marketplace Plugin lifecycle management with Ed25519 signing, trust-tiered capability gating, and SBOM generation Integrations 20+ framework adapters for LangChain, CrewAI, AutoGen, Semantic Kernel, Google ADK, Microsoft Agent Framework, OpenAI Agents SDK, and more Agent OS: The Policy Engine Agent OS intercepts agent tool calls before they execute: from agent_os import StatelessKernel, ExecutionContext, Policy kernel = StatelessKernel() ctx = ExecutionContext( agent_id="analyst-1", policies=[ Policy.read_only(), # No write operations Policy.rate_limit(100, "1m"), # Max 100 calls/minute Policy.require_approval( actions=["delete_*", "write_production_*"], min_approvals=2, approval_timeout_minutes=30, ), ], ) result = await kernel.execute( action="delete_user_record", params={"user_id": 12345}, context=ctx, ) The policy engine works in two layers: configurable pattern matching (with sample rule sets for SQL injection, privilege escalation, and prompt injection that users customize for their environment) and a semantic intent classifier that helps detect dangerous goals regardless of phrasing. When an action is classified as `DESTRUCTIVE_DATA`, `DATA_EXFILTRATION`, or `PRIVILEGE_ESCALATION`, the engine blocks it, routes it for human approval, or downgrades the agent's trust level, depending on the configured policy. Important: All policy rules, detection patterns, and sensitivity thresholds are externalized to YAML configuration files. The toolkit ships with sample configurations in `examples/policies/` that must be reviewed and customized before production deployment. No built-in rule set should be considered exhaustive. Policy languages supported: YAML, OPA Rego, and Cedar. The kernel is stateless by design, each request carries its own context. This means you can deploy it behind a load balancer, as a sidecar container in Kubernetes, or in a serverless function, with no shared state to manage. On AKS or any Kubernetes cluster, it fits naturally into existing deployment patterns. Helm charts are available for agent-os, agent-mesh, and agent-sre. Agent Mesh: Zero-Trust Identity for Agents In service mesh architectures, services prove their identity via mTLS certificates before communicating. AgentMesh applies the same principle to AI agents using decentralized identifiers (DIDs) with Ed25519 cryptography and the Inter-Agent Trust Protocol (IATP): from agentmesh import AgentIdentity, TrustBridge identity = AgentIdentity.create( name="data-analyst", sponsor="alice@company.com", # Human accountability capabilities=["read:data", "write:reports"], ) # identity.did -> "did:mesh:data-analyst:a7f3b2..." bridge = TrustBridge() verification = await bridge.verify_peer( peer_id="did:mesh:other-agent", required_trust_score=700, # Must score >= 700/1000 ) A critical feature is trust decay: an agent's trust score decreases over time without positive signals. An agent trusted last week but silent since then gradually becomes untrusted, modeling the reality that trust requires ongoing demonstration, not a one-time grant. Delegation chains enforce scope narrowing: a parent agent with read+write permissions can delegate only read access to a child agent, never escalate. Agent Hypervisor: Execution Rings CPU architectures use privilege rings (Ring 0 for kernel, Ring 3 for userspace) to isolate workloads. The Agent Hypervisor applies this model to AI agents: Ring Trust Level Capabilities Ring 0 (Kernel) Score ≥ 900 Full system access, can modify policies Ring 1 (Supervisor) Score ≥ 700 Cross-agent coordination, elevated tool access Ring 2 (User) Score ≥ 400 Standard tool access within assigned scope Ring 3 (Untrusted) Score < 400 Read-only, sandboxed execution only New and untrusted agents start in Ring 3 and earn their way up, exactly the principle of least privilege that production engineers apply to every other workload. Each ring enforces per-agent resource limits: maximum execution time, memory caps, CPU throttling, and request rate limits. If a Ring 2 agent attempts a Ring 1 operation, it gets blocked, just like a userspace process trying to access kernel memory. These ring definitions and their associated trust score thresholds are fully configurable via policy. Organizations can define custom ring structures, adjust the number of rings, set different trust score thresholds for transitions, and configure per-ring resource limits to match their security requirements. The hypervisor also provides saga orchestration for multi-step operations. When an agent executes a sequence, draft email → send → update CRM, and the final step fails, compensating actions fire in reverse. Borrowed from distributed transaction patterns, this ensures multi-agent workflows maintain consistency even when individual steps fail. Agent SRE: SLOs and Circuit Breakers for Agents If you practice SRE, you measure services by SLOs and manage risk through error budgets. Agent SRE extends this to AI agents: When an agent's safety SLI drops below 99 percent, meaning more than 1 percent of its actions violate policy, the system automatically restricts the agent's capabilities until it recovers. This is the same error-budget model that SRE teams use for production services, applied to agent behavior. We also built nine chaos engineering fault injection templates: network delays, LLM provider failures, tool timeouts, trust score manipulation, memory corruption, and concurrent access races. Because the only way to know if your agent system is resilient is to break it intentionally. Agent SRE integrates with your existing observability stack through adapters for Datadog, PagerDuty, Prometheus, OpenTelemetry, Langfuse, LangSmith, Arize, MLflow, and more. Message broker adapters support Kafka, Redis, NATS, Azure Service Bus, AWS SQS, and RabbitMQ. Compliance and Observability If your organization already maps to CIS Benchmarks, NIST AI RMF, or other frameworks for infrastructure compliance, the OWASP Agentic Top 10 is the equivalent standard for AI agent workloads. The toolkit's agent-compliance package provides automated governance grading against these frameworks. The toolkit is framework-agnostic, with 20+ adapters that hook into each framework's native extension points, so adding governance to an existing agent is typically a few lines of configuration, not a rewrite. The toolkit exports metrics to any OpenTelemetry-compatible platform, Prometheus, Grafana, Datadog, Arize, or Langfuse. If you're already running an observability stack for your infrastructure, agent governance metrics flow through the same pipeline. Key metrics include: policy decisions per second, trust score distributions, ring transitions, SLO burn rates, circuit breaker state, and governance workflow latency. Getting Started # Install all packages pip install agent-governance-toolkit[full] # Or individual packages pip install agent-os-kernel agent-mesh agent-sre The toolkit is available across language ecosystems: Python, TypeScript (`@microsoft/agentmesh-sdk` on npm), Rust, Go, and .NET (`Microsoft.AgentGovernance` on NuGet). Azure Integrations While the toolkit is platform-agnostic, we've included integrations that help enable the fastest path to production, on Azure: Azure Kubernetes Service (AKS): Deploy the policy engine as a sidecar container alongside your agents. Helm charts provide production-ready manifests for agent-os, agent-mesh, and agent-sre. Azure AI Foundry Agent Service: Use the built-in middleware integration for agents deployed through Azure AI Foundry. OpenClaw Sidecar: One compelling deployment scenario is running OpenClaw, the open-source autonomous agent, inside a container with the Agent Governance Toolkit deployed as a sidecar. This gives you policy enforcement, identity verification, and SLO monitoring over OpenClaw's autonomous operations. On Azure Kubernetes Service (AKS), the deployment is a standard pod with two containers: OpenClaw as the primary workload and the governance toolkit as the sidecar, communicating over localhost. We have a reference architecture and Helm chart available in the repository. The same sidecar pattern works with any containerized agent, OpenClaw is a particularly compelling example because of the interest in autonomous agent safety. Tutorials and Resources 34+ step-by-step tutorials covering policy engines, trust, compliance, MCP security, observability, and cross-platform SDK usage are available in the repository. git clone https://github.com/microsoft/agent-governance-toolkit cd agent-governance-toolkit pip install -e "packages/agent-os[dev]" -e "packages/agent-mesh[dev]" -e "packages/agent-sre[dev]" # Run the demo python -m agent_os.demo What's Next AI agents are becoming autonomous decision-makers in production infrastructure, executing code, managing databases, and orchestrating services. The security patterns that kept production systems safe for decades, least privilege, mandatory access controls, process isolation, audit logging, are exactly what these new workloads need. We built them. They're open source. We're building this in the open because agent security is too important for any single organization to solve alone: Security research: Adversarial testing, red-team results, and vulnerability reports strengthen the toolkit for everyone. Community contributions: Framework adapters, detection rules, and compliance mappings from the community expand coverage across ecosystems. We are committed to open governance. We're releasing this project under Microsoft today, and we aspire to move it into a foundation home, such as the AI and Data Foundation (AAIF), where it can benefit from cross-industry stewardship. We're actively engaging with foundation partners on this path. The Agent Governance Toolkit is open source under the MIT license. Contributions welcome at github.com/microsoft/agent-governance-toolkit.1.8KViews0likes0CommentsHow Netstar Streamlined Fleet Monitoring and Reduced Custom Integrations with Drasi
When a high-value container goes silent between waystations, logistics teams lose critical visibility, risking delays that can cascade into port congestion and missed connections. Netstar, a connected fleet solutions provider supporting customers like Maersk, faced this challenge as its operations scaled. Timely notifications of delays, arrivals, and status changes became critical to keeping cargo moving efficiently through port systems. To address growing integration complexity and the need for real-time responsiveness, Netstar adopted Drasi. Drasi, built for change-driven solutions, provides continuously updated query results and automated reactions to data changes, enabling systems to detect and respond to critical changes as they happen. This shift to Drasi became foundational to how Netstar unified its fleet data, reduced engineering overhead, and improved monitoring workflows. The Fragmentation Challenge Growing operational complexity made an underlying challenge increasingly apparent. Tracking a container's journey from pickup to port terminal required reconciling data such as vehicle identifiers, waypoints, GPS location feeds, and IoT telemetry signals from siloed systems. With each new operational or business requirement, whether monitoring vehicle health or detecting route deviations, development teams found themselves repeatedly rebuilding similar patterns. "We were essentially rebuilding the same integration architecture for every use case," explains Daniel Joubert, General Manager and technical lead at Netstar. "One week we'd build a dashboard for location tracking. The next week, we'd build another one for breakdown detection. The engineering overhead was unsustainable." Batch-based processing compounded the issue. Critical signals such as missed health reports or route delays can surface long after they occur, potentially limiting Netstar’s ability to take timely action. Introducing Drasi for Change-driven Architecture Rather than continue building point solutions, Netstar adopted Drasi as the backbone of its real-time data architecture. Drasi simplifies systems that must detect, evaluate, and react to data changes quickly and efficiently at scale, aligning directly with Netstar’s needs. A Unified, Continuously Updated View Drasi connected directly to Netstar's existing data sources- Azure SQL databases for information such as vehicle identifiers and waypoints, and Azure EventHub for GPS location data and IoT telemetry. Drasi Continuous Queries joined this information into a single, always-current operational picture. Instead of multiple custom-built pipelines, Netstar gained a single source of truth for its fleet. Using Drasi Reactions, Netstar defined actions that trigger when specific events occur. When a truck fails to send a health signal within its expected window, or when a delay notification indicates potential supply chain disruption, the system responds immediately without human intervention, reducing the likelihood of missed events. Improvements Enabled by Drasi Using the Drasi plugin for Grafana, Netstar consolidated results from Continuous Queries into one monitoring interface. Operators no longer reconciled conflicting views across separate tools; they now track vehicle health, location, alerts, and route deviations in real time from a single dashboard. "The transformation was remarkable," says Dustyn Lightfoot, Solution Architect. "We were able to use a single Drasi instance to support multiple business use cases without building new infrastructure or writing additional code, for example, to stand up Blazor websites. More importantly, it eliminated the ongoing maintenance burden of managing dozens of custom pipelines." Drasi’s flexibility also extended beyond fleet tracking. By attaching an additional data source and defining new Continuous Queries, the same Drasi instance now surfaces changes in customer billing status and the legal contracts. This work required no new infrastructure, just connecting the source and writing queries (leveraging Drasi’s custom Delta functions), providing business teams with up-to-date information without a separate integration effort. Measurable Impact Netstar reports tangible improvements across engineering operations and real-time responsiveness: Faster incident response: Missing health signals now trigger alerts immediately rather than being discovered later through manual checks, improving the speed and reliability of operational response. Improved logistics coordination: Real time visibility into container movement through waystations and toward port terminals has enabled Netstar and partners like Maersk to coordinate shipments more efficiently, with automated alerts keeping all stakeholders informed as conditions change. Reduced development overhead: Using Drasi has reduced the amount of custom development previously needed to support fleet monitoring capabilities. The same Drasi-driven architecture now supports multiple business cases, from tracking and health monitoring to route optimization. Streamlined operator experience: Teams moved from several monitoring tools to a single Drasi-powered Grafana interface, simplifying daily operations and eliminating time spent reconciling conflicting data from different systems. Industry Context and What’s Next Demand for real-time supply chain visibility has intensified as global logistics disruptions highlight the risks of delayed reporting. "Our customers don't just want historical reports anymore. They need to know what's happening right now and be alerted the moment something changes," Daniel Joubert explains. "That shift from batch processing to continuous monitoring is becoming table stakes in fleet management." Building on this foundation, Netstar is now investigating how Drasi can support predictive maintenance- spotting patterns in vehicle health data early enough to prevent failures altogether. The same change-driven architecture could also streamline coordination across broader supply chain workflows. The Broader Architectural Shift Netstar’s implementation reflects a wider architectural move emerging across operational solutions: from systems that store and query data to platforms that detect and react to changes as they happen. In fleet logistics, financial systems, and industrial operations, the competitive advantage increasingly lies in eliminating the lag between event and response. "Building custom integrations for every use case was slowing us down and limiting what we could deliver to customers," Dustyn Lightfoot reflects. "Drasi gave us a reusable foundation that handles the hard parts, integrating disparate data sources and detecting meaningful changes, so we can focus on solving business problems rather than rebuilding infrastructure." The collaboration between Drasi and Netstar demonstrates how open source change-driven platforms can simplify complex operational challenges whilst providing actionable insights across distributed systems. As logistics operations evolve, architectures like Drasi’s may define the next era of competitive advantage- one where actionable insight arrives the moment conditions change. To learn more about Drasi visit Drasi.io.163Views0likes0CommentsEvent-Driven to Change-Driven: Low-cost dependency inversion
Event-driven architectures tout scalability, loose coupling, and eventual consistency. The architectural patterns are sound, the theory is compelling, and the blog posts make it look straightforward. Then you implement it. Suddenly you're maintaining separate event stores, implementing transactional outboxes, debugging projection rebuilds, versioning events across a dozen micro-services, and writing mountains of boilerplate to handle what should be simple queries. Your domain events that were supposed to capture rich business meaning have devolved into glorified database change notifications. Downstream services diff field values to extract intent from "OrderUpdated" events because developers just don't get what constitutes a proper domain event. The complexity tax is real, don't get me wrong, it's very elegant but for many systems it's unjustified. Drasi offers an alternative: change-driven architecture that delivers reactive, real-time capabilities across multiple data sources without requiring you to rewrite your application or over complicate your architecture. What do we mean by “Event-driven” architecture As Martin Fowler notes, event-driven architecture isn't a single pattern, it's at least four distinct patterns that are often confused, each with its own benefits and traps. Event Notification is the simplest form. Here, events act as signals that something has happened, but carry minimal data, often just an identifier. The recipient must query the source system for more details if needed. For example, a service emits an OrderPlaced event with just the order ID. Downstream consumers must query the order service to retrieve full order details. Event Carried State Transfer broadcasts full state changes through events. When an order ships, you publish an OrderShipped event containing all the order details. Downstream services maintain their own materialized views or projections by consuming these events. Event Sourcing goes further, events become your source of truth. Instead of storing current state, you store the sequence of events that led to that state. Your order isn't a row in a database; it's the sum of OrderPlaced, ItemAdded, PaymentProcessed, and OrderShipped events. CQRS (Command Query Responsibility Segregation) separates write operations (commands) from read operations (queries). While not inherently event-driven, CQRS is often paired with event sourcing or event-carried state transfer to optimize for scalability and maintainability. Originally derived from Bertrand Meyer's Command-Query Separation principle and popularized by Greg Young, CQRS addresses a specific architectural challenge: the tension between optimizing for writes versus optimizing for reads. The pattern promises several benefits: Optimized data models: Your write model can focus on transactional consistency while read models optimize for query performance Scalability: Read and write sides can scale independently Temporal queries: With event sourcing, you get time travel for free—reconstruct state at any point in history Audit trail: Every change is captured as an immutable event While CQRS isn't inherently tied to Domain-Driven Design (DDD), the pattern complements DDD well. In DDD contexts, CQRS enables different bounded contexts to maintain their own read models tailored to their specific ubiquitous language, while the write model protects domain invariants. This is why you'll often see them discussed together, though each can be applied independently. The core motivation for these patterns is often to invert the dependency between systems, so that your downstream services do not need to know about your upstream services. The Developer's Struggle: When Domain Events Become Database Events Chris Kiehl puts it bluntly in his article "Don't Let the Internet Dupe You, Event Sourcing is Hard": "The sheer volume of plumbing code involved is staggering—instead of a friendly N-tier setup, you now have classes for commands, command handlers, command validators, events, aggregates, and then projections, model classes, access classes, custom materialization code, and so on." But the real tragedy isn't the boilerplate, it's what happens to those carefully crafted domain events. As developers are disconnected from the real-world business, they struggle to understand the nuances of domain events, a dangerous pattern emerges. Instead of modeling meaningful business processes, teams default to what they know: CRUD. Your event stream starts looking like this: OrderCreated OrderUpdated OrderUpdated (again) OrderUpdated (wait, what changed?) OrderDeleted As one developer noted on LinkedIn, these "CRUD events" are really just "leaky events that lack clarity and should not be used to replicate databases as this leaks implementation details and couples services to a shared data model." Dennis Doomen, reflecting on real-world production issues, observes: "It's only once you have a living, breathing machine, users which depend on you, consumers which you can't break, and all the other real-world complexities that plague software projects that the hard problems in event sourcing will rear their heads." The result? Your elegant event-driven architecture devolves into an expensive, brittle form of self-maintained Change Data Capture (CDC). You're not modeling business processes; you're just broadcasting database mutations with extra steps. The Anti-Corruption Layer: Your Defense Against the Outside World In DDD, an Anti-Corruption Layer (ACL) protects your bounded context from external models that would corrupt your domain. Think of it as a translator that speaks both languages, the messy external model and your clean internal model. The ACL ensures that changes to the external system don't ripple through your domain. If the legacy system changes its schema, you update the translator, not your entire domain model. When Event Taxonomies Become Your ACL (And Why They Fail) In most event-driven architectures, your event taxonomy is supposed to serve as the shared contract between services. Each service publishes events using its own ubiquitous language, and consumers translate these into their own models, this translation is the ACL. The theory looks beautiful: But reality? Most teams end up with this: Instead of OrderPaid events that carry business meaning, we get OrderUpdated events that force every consumer to reconstruct intent by diffing fields. When you change your database schema, say splitting the orders table or switching from SQL to NoSQL, every downstream service breaks because they're all coupled to your internal data model. You haven't built an anti-corruption layer. You've built a corruption pipeline that efficiently distributes your internal implementation details across the entire system, forcing you to deploy all services in lock step and eroding the decoupling benefits you were supposed to get. Enter Drasi: Continuous Queries This is where Drasi changes the game. Instead of publishing events and hoping downstream services can make sense of them, Drasi tails the changelog of the data source itself and derives meaning through continuous queries. A continuous query in Drasi isn't just a query that runs repeatedly, it's a living, breathing projection that reacts to changes in real-time. Here's the key insight: instead of imperative code that processes events ("when this happens, do that"), you write declarative queries that describe the state you care about ("I want to know about orders that are ready and have drivers waiting"). Let's break down what makes this powerful: Declarative vs. Imperative Traditional event processing: Drasi continuous query: Semantic Mapping from Low-Level Changes Drasi excels at transforming database-level changes into business-meaningful events. You're not reacting to "row updated in orders table", you're reacting to "order ready for curbside pickup." This enables the same core benefits of dependency inversion we get from event-driven architectures but at a fraction of the effort. Advanced Temporal Features Remember those developers struggling with "OrderUpdated" events, trying to figure out if something just happened or has been true for a while? Drasi handles this elegantly: This query only fires when a driver has been waiting for more than 10 minutes, no timestamp tracking, no state machines, no complex event correlation logic, imagine trying to manually implement this in a downstream event consumer. 😱 Cross-Source Aggregation Without Code With Drasi, you can have live projections across PostgreSQL, MySQL, SQL Server, and Cosmos DB as if they were a single graph: No custom aggregation service. No event stitching logic. No custom downstream datastore to track the sum or keep a materialized projection. Just a query. Continuous Queries as Your Shared Contract Drasi's continuous queries, combined with pre-processing middleware, can form the shared contract that your anti-corruption layer can depend on. The continuous query becomes your contract. Downstream systems don't know or care whether orders come from PostgreSQL, MongoDB, or a CSV file. They don't know if you normalized your database, denormalized it, or moved to event sourcing. They just consume the query results. Clean, semantic, and stable. Reactions as your Declarative Consumers Drasi does not simply output a stream of raw change diffs, instead it has a library of interchangeable Reactions, that can act on the output of continuous queries. These are declared using YAML and can do anything from host a web-socket endpoint that provides a live projection to your UI, to calling an Http endpoint or publishing a message on a queue. Example: The Curbside Pickup System Let's see how this works in Drasi's curbside pickup tutorial. This example has two independent databases and serves as a great illustration of a real-time projection built from multiple upstream services. The Business Problem A retail system needs to: Match ready orders with drivers who've arrived at pickup zones Alert staff when drivers wait more than 10 minutes without their order being ready Coordinate data from two different systems (retail ops in PostgreSQL, physical ops in MySQL) Traditional Event-Driven Approach In this architecture, you'd need something like: That's just the happy path. We haven't handled: Event ordering issues Partial failures Cache invalidation Service restarts and replay Duplicate events Transactional outboxing The Drasi Approach With Drasi, the entire aggregation service above becomes two queries: Delivery Dashboard Query: Wait Detection Query: That's it. No event handlers. No caching. No timers. No state management. Drasi handles: Change detection across both databases Correlation between orders and vehicles Temporal logic for wait detection Pushing updates to dashboards via SignalR The queries define your business logic declaratively. When data changes in either database, Drasi automatically re-evaluates the queries and triggers reactions for any changes in the result set. Drasi: The Non-Invasive Alternative to Legacy System Rewrites Here's perhaps the most compelling argument for Drasi: it doesn't require you to rewrite anything. Traditional event sourcing means: Redesigning your application around events Rewriting your persistence layer Implementing transactional outboxes Managing snapshots and replays Training your team on new patterns, steep learning curve Migrating existing data to event streams Building projection infrastructure Updating all consumers to handle events As one developer noted about their event sourcing journey: "Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about." Drasi's approach: Keep your existing databases Keep your existing services Keep your existing deployment model Add continuous queries where you need reactive behavior Get the benefits of dependency inversion Gradually migrate complexity from code to queries You can start with a single query on a single table and expand from there. No big bang. No feature freeze. No three-month architecture sprint or large multi-year investments, full of risk. Migration Example: From Polling to Reactive Let's say you have a legacy order system where a scheduled job polls for ready orders every 30 seconds: With Drasi, you: Point Drasi at your existing database Write the continuous query Update your dashboard to receive pushes instead of polls Turn off the polling job Your database hasn't changed. Your order service hasn't changed. You've just added a reactive layer on top that eliminates polling overhead and reduces notification latency from 30 seconds to milliseconds. The intellectually satisfying complexity of event sourcing often obscures a simple truth: most systems don't need it. They need to know when interesting things change in their data and react accordingly. They need to combine data from multiple sources without writing bespoke aggregation services. They need to transform low-level changes into business-meaningful events. Drasi delivers these capabilities without the ceremony. Where Do We Go from Here? If you're building a new system and your team has deep event sourcing experience embrace the pattern. Event sourcing shines for certain domains. But if you're like many teams, trying to add reactive capabilities to existing systems, struggling with data synchronization across services, or finding that your "events" are just CRUD operations in disguise, consider the change-driven approach. Start small: Identify one painful polling loop or batch job Set up Drasi to monitor those same data sources Write a continuous query that captures the business condition Replace the polling with push-based reactions Measure the reduction in latency, overhead, and code complexity The best architecture isn't the most sophisticated one, it's the one your team can understand, maintain, and evolve. Sometimes that means acknowledging that we've been mid-curving it with overly complex event-driven architectures. Drasi and change-driven architecture offer the power of reactive systems without the complexity tax. Your data changes. Your queries notice. Your systems react. It makes it a non-event. Want to explore Drasi further? Check out the official documentation and try the curbside pickup tutorial to see change-driven architecture in action.653Views1like0Comments