# Shift-Left Governance for AI Agents: How the Agent Governance Toolkit Helps You Catch Violations
In part one of this series, we covered AGT’s runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. That post focused on what happens when an agent acts: policy evaluation at the moment a tool call fires, trust scoring when agents communicate, audit logging when decisions are made. Runtime governance is essential. But it is the last line of defense. After that post went live, a pattern emerged in conversations with teams adopting AGT. The same question kept coming up: runtime checks are useful, but what about everything before production? We realized runtime governance was only half the story. So we went back and built tooling for every stage of your software development lifecycle, from the moment a developer saves a file to the moment an artifact ships to users. Why Runtime Governance Is Not Enough AI agents are a new class of workload. They reason about what to do, select tools, call APIs, read databases, and spawn sub-processes, often in loops that run without direct human oversight. The OWASP Agentic AI Top 10 (published December 2025) identifies risks like excessive agency, insecure tool use, privilege escalation, and supply chain compromise. These risks span the entire lifecycle, not just runtime. Consider a few scenarios that runtime governance alone cannot prevent: A developer commits a policy YAML file with a typo that silently disables all deny rules. The agent runs unprotected until someone notices. A dependency update introduces a package with a known critical CVE. The agent starts using a vulnerable library before any security team reviews it. A contributor adds a raw cryptographic import to an application module, bypassing the security-audited signing library. The code compiles and ships. A GitHub Actions workflow uses an expression injection pattern that allows an attacker to execute arbitrary code in CI. A release ships without a Software Bill of Materials (SBOM), making it impossible to trace which components are affected when the next log4j-style vulnerability drops. Each of these is a governance failure, but none of them happens at runtime. They happen at commit time, at PR review time, at build time, or at release time. A comprehensive governance strategy needs coverage at every stage. Four Stages of Pre-Runtime Governance Governance violations can enter a codebase at four distinct stages of the development lifecycle. Each stage has a different class of risk, and each needs a different kind of check: Stage When It Runs What It Catches AGT Tooling Commit-time Before code leaves the developer machine Malformed policies, schema violations, secrets, stub code, unauthorized crypto Pre-commit hooks, quality gates PR-time When a pull request is opened or updated Vulnerable dependencies, missing attestation, secrets in history, unpinned versions GitHub Actions (attestation, dependency review, secret scanning, supply chain checks) CI/Build-time On every push and pull request to main Compliance violations, binary security issues, dependency confusion, workflow injection Governance Verify action, Security Scan action, CodeQL, BinSkim, policy validation Release-time Before artifacts are published Missing provenance, unsigned artifacts, incomplete SBOMs SBOM generation, Sigstore signing, build attestation, OpenSSF Scorecard Just as with bugs, the earlier you catch a governance violation, the cheaper it is to fix. A malformed policy file caught at commit time costs zero CI minutes. 
A secret caught in PR review never reaches the default branch. A dependency confusion attack blocked in CI never reaches production. An unsigned artifact blocked at release time never reaches users.

## Stage 1: Commit-Time Governance with Pre-Commit Hooks

The fastest governance feedback loop is local. Within the AGT project, we've implemented three pre-commit hooks that run automatically whenever a developer stages files for commit, validating governance artifacts before they ever leave the developer's machine.

### Built-In Hooks

The toolkit's .pre-commit-hooks.yaml defines three hooks that any repository can adopt:

| Hook ID | What It Validates | File Pattern |
|---|---|---|
| validate-policy | YAML/JSON policy files against the AGT policy schema, checking for required fields, valid operators, and structural correctness | Files matching *polic*.yaml, *polic*.yml, *polic*.json |
| validate-plugin-manifest | Plugin manifest files for required fields and schema compliance | Files matching plugin.json, plugin.yaml, plugin.yml |
| evaluate-plugin-policy | Plugin manifests against a governance policy file, evaluating whether the plugin would be allowed under the organization's rules | Files matching plugin.json, plugin.yaml, plugin.yml |

To adopt these hooks, add AGT as a pre-commit hook source:

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/microsoft/agent-governance-toolkit
    rev: main  # pin to a release tag in production
    hooks:
      - id: validate-policy
      - id: validate-plugin-manifest
      - id: evaluate-plugin-policy
        args: ['--policy', 'policies/marketplace-policy.yaml']
```

Then install and run:

```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

### Extended Quality Gates

Beyond schema validation, we built a pre-commit rollout template (see the full example in the repository) with additional governance-specific quality gates designed to help prevent common security anti-patterns from entering the codebase:

- Policy validation (agt-validate): Runs the full AGT policy CLI in strict mode, catching not just schema errors but semantic issues like conflicting rules.
- Health check (agt-doctor): Runs on pre-push (before code leaves the machine entirely), performing a broader health check of the governance configuration.
- Plugin metadata check (agency-json-required): Ensures every plugin directory contains the required agency.json metadata file.
- Stub detection (no-stubs): Blocks TODO, FIXME, HACK, and raise NotImplementedError markers in staged production code. Test files are excluded.
- Unauthorized crypto detection (no-custom-crypto): Blocks raw cryptographic imports (hashlib, hmac, crypto.subtle, System.Security.Cryptography, ring, ed25519-dalek) outside designated security modules. This helps ensure all cryptographic operations go through the audited AGT signing libraries. A sketch of how checks like this can be implemented appears at the end of this section.
- Secret scanning (detect-secrets): Integrates Yelp's detect-secrets for pattern-based secret detection on every commit.

### Phased Rollout for Teams

Adopting pre-commit hooks across a team requires a thoughtful rollout. The AGT documentation includes a phased adoption guide:

- Week 1: Install hooks in permissive mode. Hooks warn on violations but do not block the commit. This lets developers see what would be caught without disrupting workflow.
- Week 2: Switch to strict mode for policy validation only. Policy files must pass schema validation to be committed.
- Week 3: Enable all hooks as blocking. Stubs, unauthorized crypto, and secrets are now blocked at commit time.
- Week 4: Graduate to full blocking mode and remove the permissive fallback.

This approach helps teams build confidence in the governance tooling before it becomes a hard gate.
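The built-in hooks and quality gates above ship with the toolkit, but the underlying mechanic is easy to picture. Below is a minimal, illustrative sketch of a commit-time gate in the spirit of no-stubs and no-custom-crypto; the pattern lists, excluded paths, and restriction to Python files are assumptions made for the example, not the toolkit's actual implementation.

```python
#!/usr/bin/env python3
"""Illustrative commit-time gate: block stub markers and raw crypto imports.

A sketch only. The patterns, the excluded test/security paths, and the
restriction to Python files are assumptions made for this example.
"""
import re
import subprocess
import sys

STUB_PATTERNS = [r"\bTODO\b", r"\bFIXME\b", r"\bHACK\b", r"raise NotImplementedError"]
CRYPTO_PATTERNS = [r"^\s*import hashlib", r"^\s*import hmac"]  # assumed subset of raw crypto imports
EXCLUDED_PREFIXES = ("tests/", "security/")  # assumed test and audited-crypto locations


def staged_python_files() -> list[str]:
    """List staged (added/copied/modified) files, as pre-commit would hand them over."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(".py")]


def main() -> int:
    violations = []
    for path in staged_python_files():
        if path.startswith(EXCLUDED_PREFIXES):
            continue
        try:
            lines = open(path, encoding="utf-8").read().splitlines()
        except OSError:
            continue
        for lineno, line in enumerate(lines, start=1):
            for pattern in STUB_PATTERNS + CRYPTO_PATTERNS:
                if re.search(pattern, line):
                    violations.append(f"{path}:{lineno}: matches {pattern!r}")
    for violation in violations:
        print(violation, file=sys.stderr)
    return 1 if violations else 0  # a non-zero exit blocks the commit


if __name__ == "__main__":
    sys.exit(main())
```

pre-commit runs a script like this against staged files and treats a non-zero exit code as a failed hook, which is what turns these checks into a blocking gate once a team reaches the strict phase of the rollout.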
## Stage 2: PR-Time Gates

Pre-commit hooks catch issues on the developer's machine, but they can be bypassed (force push, direct GitHub edits, hooks not installed). PR-time gates provide the second layer of defense, running in GitHub Actions on every pull request before merge is allowed.

### Governance Attestation

The Governance Attestation action validates that PR authors have completed a structured attestation checklist before their code can merge. The default checklist covers seven sections:

- Security review
- Privacy review
- Legal review
- Responsible AI review
- Accessibility review
- Release Readiness / Safe Deployment
- Org-specific Launch Gates

The action is fully configurable. Organizations can customize the required sections, set a minimum PR body length, and choose their own attestation format. Outputs include the validation status, a list of errors for missing sections, and a JSON mapping of sections to checkbox counts. Here is an example workflow:

```yaml
# .github/workflows/pr-governance.yml
name: PR Governance
on:
  pull_request:
    types: [opened, edited, synchronize]
jobs:
  attestation:
    runs-on: ubuntu-latest
    steps:
      - uses: microsoft/agent-governance-toolkit/action/governance-attestation@main
        with:
          required-sections: |
            1) Security review
            2) Privacy review
            3) Responsible AI review
```

### Dependency Review

The dependency review workflow helps block PRs that introduce dependencies with known CVEs or disallowed licenses. It uses the GitHub dependency-review-action with a curated license allowlist:

```yaml
- uses: actions/dependency-review-action@v4
  with:
    fail-on-severity: moderate
    comment-summary-in-pr: always
    allow-licenses: >
      MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC, PSF-2.0, Python-2.0,
      0BSD, Unlicense, CC0-1.0, CC-BY-4.0, Zlib, BSL-1.0, MPL-2.0
```

This runs on every PR that touches dependency manifests (package.json, Cargo.toml, pyproject.toml, requirements.txt). Dependencies with moderate or higher CVEs are flagged, and dependencies with licenses not on the allowlist are blocked.

### Secret Scanning

The secret scanning workflow runs on every PR to the main branch and on a weekly schedule. It combines two complementary approaches:

- Gitleaks: Pattern-based secret detection across the full git history, catching API keys, tokens, and credentials that may have been committed at any point.
- High-entropy string scanning: Regex-based detection of common secret patterns including GitHub tokens (ghp_, gho_), AWS access keys (AKIA), Slack tokens (xox), and base64-encoded strings with high entropy.

### Supply Chain Integrity

A dedicated supply chain check workflow triggers when dependency manifest files change. It enforces two rules that help prevent supply chain attacks:

- Exact version pinning: No ^ or ~ version ranges in package.json files. This prevents unexpected minor/patch version updates that could introduce compromised code. A sketch of what such a check can look like follows this list.
- Lockfile presence: Every package directory with dependencies must have a corresponding lockfile (package-lock.json, pnpm-lock.yaml, or yarn.lock). Lockfiles help ensure reproducible builds with verified integrity hashes.
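To make the pinning and lockfile rules concrete, here is a small, illustrative check in the same spirit. The dependency sections it inspects, the set of rejected range prefixes, and the lockfile names are assumptions for the example rather than the exact behavior of the AGT workflow.

```python
"""Illustrative supply chain check: flag version ranges and missing lockfiles.

A sketch only. The dependency sections, rejected range prefixes, and lockfile
names below are assumptions for the example, not the AGT workflow itself.
"""
import json
import pathlib
import sys

SECTIONS = ("dependencies", "devDependencies")  # assumed sections to inspect
LOCKFILES = ("package-lock.json", "pnpm-lock.yaml", "yarn.lock")


def unpinned_versions(manifest: pathlib.Path) -> list[str]:
    data = json.loads(manifest.read_text(encoding="utf-8"))
    problems = []
    for section in SECTIONS:
        for name, version in data.get(section, {}).items():
            # ^1.2.3, ~1.2.3, wildcard, and comparator ranges all allow silent upgrades
            if not isinstance(version, str) or version.startswith(("^", "~", ">", "<", "*")) or version == "latest":
                problems.append(f"{manifest}: {name} uses non-exact version {version!r}")
    return problems


def main() -> int:
    problems = []
    for manifest in pathlib.Path(".").rglob("package.json"):
        if "node_modules" in manifest.parts:
            continue
        problems.extend(unpinned_versions(manifest))
        if not any((manifest.parent / lock).exists() for lock in LOCKFILES):
            problems.append(f"{manifest.parent}: no lockfile found")
    print("\n".join(problems) if problems else "all manifests pinned and locked")
    return 1 if problems else 0


if __name__ == "__main__":
    sys.exit(main())
```

Run in CI whenever a manifest changes, a check like this fails the pull request before an unpinned range can reach the default branch.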
### Quality Gates

The quality gates workflow mirrors the pre-commit hooks at the PR level, providing defense in depth. It runs four checks on every pull request:

| Gate | Purpose |
|---|---|
| No Stubs/TODOs | Blocks TODO, FIXME, HACK markers in production code (test files excluded) |
| No Unauthorized Crypto | Blocks raw cryptographic imports outside designated security modules |
| Security Audit Required | Changes to security-sensitive paths require accompanying audit documentation |
| Dependency Audit Trail | Vendored patches must have an audit trail explaining the patch and its provenance |

These gates catch anything that bypasses pre-commit hooks: force-pushed commits, direct GitHub web edits, commits from contributors who have not installed the hooks.

## Stage 3: CI/Build-Time Governance

Once a PR passes the gate workflows, the main CI pipeline and specialized workflows perform deeper, more computationally intensive analysis.

### The Governance Verify Action

The Governance Verify action is the primary CI-time governance check. It is a GitHub Actions composite action that installs the toolkit and runs the compliance CLI against your repository. It supports four modes:

| Command | What It Does |
|---|---|
| governance-verify | Runs the full compliance verification suite, checking governance controls and reporting how many pass |
| marketplace-verify | Validates a plugin manifest against marketplace requirements (required fields, signing, metadata) |
| policy-evaluate | Evaluates a specific policy file against a JSON context, returning the allow/deny decision with the matched rule |
| all | Runs governance-verify, then marketplace-verify and policy-evaluate if the corresponding paths are provided |

Here is an example:

```yaml
# .github/workflows/governance-ci.yml
name: Governance CI
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: all
          policy-path: policies/
          manifest-path: plugin.json
          output-format: json
          fail-on-warning: 'true'
```

The action outputs structured data including controls-passed, controls-total, violations count, and full command output in JSON format. This makes it straightforward to integrate with dashboards, Slack notifications, or downstream decision logic.

### The Security Scan Action

A separate security scan action scans directories for secrets, CVEs, and dangerous code patterns. Unlike the PR-time secret scanning (which focuses on git history), this action performs deep content analysis of the current codebase:

```yaml
- uses: microsoft/agent-governance-toolkit/action/security-scan@main
  with:
    paths: 'plugins/ scripts/'
    min-severity: high
    exemptions-file: .security-exemptions.json
```

The action supports configurable severity thresholds (critical, high, medium, low), an exemptions file for acknowledged findings, and structured JSON output with findings-count, blocking-count, and detailed findings.

### Policy Validation Workflow

A dedicated policy validation workflow triggers whenever YAML files or the policy engine source code changes. It performs two jobs in sequence:

- Validate policies: Discovers all policy files matching the *policy* naming convention, then validates each file using the AGT policy CLI.
- Test policies: Runs the policy CLI unit tests to verify that policy evaluation behavior is correct after the changes.

This ensures that policy file edits do not break the policy engine and that policy semantics are preserved.
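Because both the Governance Verify and Security Scan actions emit structured JSON (controls-passed, controls-total, findings-count, and so on), downstream automation is straightforward. The following is an illustrative reporting step that consumes such output; the file name, exact JSON shape, and the optional webhook notification are assumptions for the example.

```python
"""Illustrative post-processing of governance CI output.

A sketch only: it assumes the verify step wrote JSON to governance-report.json
with fields like controls-passed, controls-total, and violations. The file name,
schema, and the optional webhook notification are assumptions for this example.
"""
import json
import os
import sys
import urllib.request

REPORT_FILE = "governance-report.json"            # assumed output location
WEBHOOK_URL = os.environ.get("CHAT_WEBHOOK_URL")  # optional notification target


def main() -> int:
    with open(REPORT_FILE, encoding="utf-8") as fh:
        report = json.load(fh)

    passed = report.get("controls-passed", 0)
    total = report.get("controls-total", 0)
    violations = report.get("violations", 0)
    summary = f"Governance: {passed}/{total} controls passed, {violations} violation(s)"
    print(summary)

    if WEBHOOK_URL:  # post a one-line summary to chat when a webhook is configured
        body = json.dumps({"text": summary}).encode("utf-8")
        request = urllib.request.Request(
            WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(request)

    return 1 if violations else 0  # fail the job while violations remain


if __name__ == "__main__":
    sys.exit(main())
```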
### CodeQL and Static Analysis

AGT uses GitHub's CodeQL for semantic static analysis of Python and TypeScript code. The CodeQL workflow runs on pushes and PRs, performing deep dataflow analysis that goes beyond pattern matching. Results are uploaded as SARIF to GitHub's Security tab, providing a centralized view of code quality issues.

### Dependency Confusion Scanning

A dedicated CI job runs a dependency confusion scanner on every build. This is a targeted defense against a specific supply chain attack vector where an attacker registers a public package with the same name as an internal package. The scanner checks that:

- Internal package names do not collide with public PyPI or npm packages
- Notebook pip install commands only reference packages that are registered and expected

### Workflow Security Auditing

When GitHub Actions workflow files change, a workflow security job scans for common CI/CD security issues:

- Expression injection: Detects patterns like ${{ github.event.pull_request.title }} used directly in run: blocks, which can allow arbitrary code execution.
- Overly permissive permissions: Flags workflows that request more permissions than necessary.
- Unpinned action references: Detects actions referenced by branch name instead of commit SHA, which is a supply chain risk.

### .NET Binary Analysis with BinSkim

For the .NET SDK (Microsoft.AgentGovernance), the CI pipeline runs Microsoft BinSkim binary security analysis on compiled assemblies. BinSkim checks for security-relevant compiler and linker settings in compiled binaries, such as DEP (Data Execution Prevention), ASLR (Address Space Layout Randomization), and stack protection. Results are uploaded as SARIF to GitHub code scanning alongside the CodeQL results.

### The ci-complete Gate Pattern

With many CI jobs that conditionally run based on path filters, AGT uses a pattern called ci-complete: a single gate job that is configured as the sole required status check in branch protection. This job runs unconditionally (if: always()), depends on all other CI jobs, and checks that none of them failed. Jobs that were skipped (because no relevant files changed) are acceptable. This pattern ensures that branch protection works correctly with conditional CI jobs, preventing the common issue where skipped jobs report as "skipped" and fail required status checks.

## Language-Specific Compile-Time Enforcement

Beyond the language-agnostic CI checks, each AGT SDK uses its language's native compiler and tooling to enforce governance standards at compile time.
.NET: The Strictest Compile-Time Checks The .NET SDK (Microsoft.AgentGovernance) enforces compile-time governance through MSBuild properties in Directory.Build.props and Directory.Build.targets, which apply automatically to every project in the SDK: Feature MSBuild Property Effect Nullable reference types <Nullable>enable</Nullable> The compiler warns on every possible null dereference, helping prevent NullReferenceException at compile time Warnings as errors <TreatWarningsAsErrors>true All compiler warnings become build errors for packable projects; no warnings can be shipped to consumers Strong-name signing <SignAssembly>true</SignAssembly> Assemblies are signed with a strong-name key (AgentGovernance.snk), enabling identity verification Deterministic builds <ContinuousIntegrationBuild>true Identical source code produces bit-for-bit identical binaries in CI, enabling build verification SourceLink Microsoft.SourceLink.GitHub package Users can step into AGT source code when debugging, supporting transparency and auditability Symbol packages <IncludeSymbols>true</IncludeSymbols> .snupkg symbol packages are published alongside NuGet packages for debugging support TypeScript: Strict Compilation and Linting The TypeScript SDK (@microsoft/agentmesh-sdk) uses strict compiler settings and ESLint for build-time governance: Strict mode ("strict": true in tsconfig.json) enables all strict type-checking options, including noImplicitAny, strictNullChecks, strictFunctionTypes, and strictBindCallApply. Consistent file naming (forceConsistentCasingInFileNames) prevents cross-platform issues where imports work on case-insensitive file systems (Windows, macOS) but fail on case-sensitive ones (Linux CI). Declaration generation (declaration: true with declarationMap: true) produces .d.ts files for consumers, enabling downstream type checking. ESLint with @typescript-eslint provides static analysis during the build process, catching issues beyond what the TypeScript compiler checks. Python: Type Safety and Fast Linting Python packages in AGT use typed package markers and static analysis tooling configured in pyproject.toml: py.typed marker: Each package includes a py.typed file, signalling to type checkers (mypy, pyright, Pylance) that the package supports type checking. Consumers get type errors if they misuse the AGT API. mypy: Configured as a dev dependency with project-specific settings in pyproject.toml. Provides static type checking that catches type mismatches before runtime. ruff: A fast Python linter written in Rust, configured in pyproject.toml and enforced in CI. Ruff checks for hundreds of code quality rules at build time. Stage 4: Release-Time Gates Before artifacts reach users, the release pipeline adds a final layer of verification. These gates help ensure that what ships is exactly what was built, is signed by the expected publisher, and has a complete inventory of its components. 
| Gate | Tool | What It Produces |
|---|---|---|
| SBOM generation | Anchore/Syft | SPDX and CycloneDX software bills of materials listing every component, dependency, and license |
| Python signing | Sigstore | Cryptographic signature using OpenID Connect identity, verifiable without manual key distribution |
| .NET signing | Release pipeline | Microsoft Authenticode and NuGet signing through the release pipeline |
| Build provenance | actions/attest-build-provenance | SLSA provenance attestation linking the artifact to its source commit and build environment |
| SBOM attestation | actions/attest-sbom | Binds the SBOM to the specific release artifact, creating a verifiable link between the inventory and the binary |

Additionally, the OpenSSF Scorecard runs on a schedule, providing an automated security posture assessment that covers branch protection, dependency management, CI/CD practices, and more. The score is published to the OpenSSF Scorecard website, giving consumers a transparent view of the project's security practices.

## How It All Fits Together: Defense in Depth

This approach follows a defense-in-depth principle: every check exists at multiple layers, so that bypassing one layer does not compromise the whole system.

Secret scanning, for example, runs at three levels: detect-secrets at commit time (pre-commit hook), Gitleaks at PR time (secret scanning workflow), and the Security Scan action at CI time (content analysis). A developer who bypasses pre-commit hooks will still be caught by the PR-time gate. A contributor who force-pushes past the PR gate will still be caught by the CI pipeline.

Similarly, policy validation runs at commit time (validate-policy hook), at PR time (quality gates), and at CI time (policy validation workflow). Each layer adds depth: the commit-time hook catches schema errors, the CI pipeline catches semantic issues and runs regression tests.

The ci-complete gate job ties everything together. By depending on every CI job and serving as the single required status check, it ensures that no code merges to the main branch unless every applicable check has passed.

## Getting Started

You can adopt AGT's shift-left governance incrementally. Here are three starting points, from lowest to highest effort:

### 1. Add the Governance Verify Action (5 minutes)

Add a single GitHub Actions workflow that runs the compliance check on every PR:

```yaml
# .github/workflows/governance.yml
name: Governance
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: governance-verify
```

### 2. Enable Pre-Commit Hooks (15 minutes)

Add a .pre-commit-config.yaml referencing AGT's hooks, install them, and run against all existing files to establish a baseline. Start in permissive mode and graduate to strict over four weeks.

### 3. Full Pipeline Integration (1-2 hours)

Add the complete set of PR-time gates (attestation, dependency review, secret scanning, supply chain checks, quality gates), configure the Security Scan action for your plugin directories, and enable SBOM generation and signing in your release workflow.

The AGT repository itself serves as a reference implementation: every workflow described in this post is running in production at aka.ms/agent-governance-toolkit.
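Once SBOM generation is part of the release workflow (step 3 above), the SBOM itself becomes an artifact you can run simple governance checks against. Here is a hedged illustration that assumes an SPDX JSON document with a top-level packages list carrying name, versionInfo, and licenseConcluded fields; verify the shape against your own generator's output before relying on a check like this.

```python
"""Illustrative sanity check over an SPDX JSON SBOM.

A sketch only. It assumes the SPDX JSON layout commonly produced by SBOM
generators (a top-level "packages" list with "name", "versionInfo", and
"licenseConcluded"); confirm the field names for your tooling.
"""
import json
import sys

FLAG_VALUES = {"NOASSERTION", "NONE", ""}  # license conclusions worth reviewing


def main(path: str) -> int:
    with open(path, encoding="utf-8") as fh:
        sbom = json.load(fh)

    packages = sbom.get("packages", [])
    flagged = []
    for pkg in packages:
        license_id = pkg.get("licenseConcluded", "NOASSERTION")
        if license_id in FLAG_VALUES:
            flagged.append(f"{pkg.get('name', '<unnamed>')}@{pkg.get('versionInfo', '?')}: "
                           f"license is {license_id or 'empty'}")

    print(f"{len(packages)} packages, {len(flagged)} without a concluded license")
    for line in flagged:
        print("  " + line)
    return 1 if flagged else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "sbom.spdx.json"))
```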
## Important Notes

The policy files, workflow configurations, and code samples in this post are illustrative examples. Your organization's governance requirements may differ. Review and customize all configurations before deploying to production.

The Agent Governance Toolkit is designed to help organizations implement governance controls for AI agents; it does not guarantee compliance with any specific regulatory framework. Always consult your organization's security and legal teams when defining governance policies.

## What Comes Next

Pre-runtime governance is one piece of the puzzle. Combined with the runtime governance capabilities covered in part one of this series (policy engines, zero-trust identity, execution sandboxing, audit logging), it provides coverage across the full lifecycle.

The project continues to grow. Since the initial release, we've added a multi-stage policy pipeline (pre_input, pre_tool, post_tool, pre_output stages), approval workflows with human-in-the-loop gates, DLP attribute ratchets for monotonic session state, and OpenTelemetry instrumentation for governance operations. Over 45 step-by-step tutorials are available in the documentation.

Everything described in this post is available today in the public GitHub repository. The full source, documentation, tutorials, and examples are at aka.ms/agent-governance-toolkit, open source under the MIT license. We welcome contributions, feedback, and issue reports from the community.

# Project Pavilion Presence at KubeCon EU 2026
KubeCon + CloudNativeCon Europe 2026 took place from 23 to 26 March at RAI Amsterdam, and it was a strong one. The themes running through the week reflected where the cloud native community is right now: AI moving from experimentation into production, platform engineering continuing to mature, and security and sovereignty top of mind for organizations across Europe. Microsoft was there throughout, and once again supported a range of open source projects in the Project Pavilion. The Project Pavilion is a dedicated, vendor-neutral space on the show floor reserved for CNCF projects. It is where the work gets talked about honestly. Maintainers and contributors meet directly with end users, share what they are building, get real feedback on what is and is not working, and have the kinds of technical conversations that are hard to have anywhere else. For open source communities, it is one of the most valuable parts of the event. Why Our Presence Matters Microsoft's products and services are built on and alongside many of the technologies represented in the pavilion, and the health of these communities matters to us directly. Showing up means our teams hear firsthand what is working, what is missing, and where these projects need to go next. It also means we get to contribute as community members, not just as a company name on a sponsor board. That distinction matters to us, and to the communities we are part of. Microsoft-Supported Pavilion Projects Confidential Containers Representative: Jeremi Piotrowski The Confidential Containers booth gave attendees a chance to learn more about the project and its approach to protecting workloads using hardware-based trusted execution environments. Jeremi was on hand throughout the kiosk hours, fielding questions from interested users and developers exploring confidential computing in Kubernetes environments. Conversations touched on use cases around data privacy, regulated workloads, and the role Confidential Containers plays in the broader cloud-native security landscape. Drasi Representative: Daniel Gerlag and Nandita Valsan The Drasi team had a busy time in the pavilion, engaging around 40 attendees across two kiosk shifts in focused technical conversations. Most visitors were developers and platform engineers curious about change-driven architectures and real-time data processing. There was strong positive feedback on the newly introduced Drasi Server modes and embeddable library, which complement Drasi for Kubernetes. The team came away with useful validation of current design decisions and good input for the roadmap ahead. Envoy Representative: Mikhail Krinkin The Envoy booth was staffed for the full duration of KubeCon EU by maintainers from Microsoft, Google, Isovalent, and Tetrate, reflecting the broad and healthy contributor base behind the project. The biggest topic at the booth was migration from ingress-nginx to Gateway API implementations. The archival of ingress-nginx pushed a lot of users into making changes they were not quite ready for, and questions ranged from technical specifics like HTTP default differences between Envoy and Nginx, to more foundational questions about what Envoy and Gateway API actually are. The team had anticipated this and invested in the ingress2gateway project to give users a clear migration path. Extensibility was another frequent conversation topic, with dynamic modules increasingly becoming the go-to answer for user-specific requirements. 
Starting with the 1.38 release of Envoy, dynamic modules will have a backward compatible ABI, a sign of real production readiness for that feature. Flatcar Representative: Thilo Fromm and Mathieu Tortuyaux The Flatcar booth had great energy, with maintainers from Microsoft, STACKIT, and CloudBase joining for conversations throughout the pavilion hours. Operational sovereignty came up again and again as a theme, with users and consulting partners sharing how they are building their Kubernetes offerings on Flatcar because of how reliable and secure it is. There were a lot of meaningful conversations. Lambda.ai currently runs Flatcar on their control plane and is looking at extending it to worker and customer clusters, with interest in contributing to the project. ReeVo has built their hosted Kubernetes distro on Flatcar across multi-cloud and bare metal environments and is planning to move hundreds of customer clusters over soon. Users from ClearScore, Avassa, Recorded Future, and several other organizations also stopped by with positive feedback on the project's robustness and security. STACKIT uses Flatcar as the default OS for their hosted Kubernetes offering and sponsors a full-time maintainer for the project. The team also connected with TAG Infrastructure to talk through Flatcar's CNCF graduation progress. Headlamp Representatives: René Dudfield and Santhosh Nagaraj S The Headlamp booth was a busy one, with users, contributors, and partner projects all stopping by throughout the pavilion hours. Conversations covered real-world deployments, federation challenges, multi-tenant namespace visibility, and feature requests like multi-CR data aggregation. There was notable interest from consultancies deploying Headlamp across hundreds of customer clusters, as well as from companies already running it at cloud scale. Several CNCF projects expressed interest in building UIs for their own projects inside Headlamp, with a few even getting started right there at the conference. The team also heard from users getting budget approved to migrate from the deprecated Kubernetes Dashboard, which is a good sign for the project's growing momentum. Demand for air-gapped AI agent support and deeper Azure and AKS integrations for internal developer platforms came up as clear areas to watch. Hyperlight Representative: Ralph Squillace The Hyperlight booth ran as a half-day session on Tuesday, in line with the project's current Sandbox status, but the corner location in the project area made a real difference in visibility. Ralph was fielding questions from the moment the doors opened, with a steady stream of visitors right up until the shift ended. Live and recorded demos were central to the conversations, helping attendees quickly grasp what Hyperlight does and how it fits into their environments. One standout visit came from an engineer at SAP who spent nearly an hour at the booth, pushing the conversation from fundamentals and embedding examples all the way through to agentic protection scenarios in Kubernetes. That conversation continued beyond KubeCon and turned into a scheduled meeting to explore a proof of concept, a good example of the kind of follow-on engagement the pavilion can generate. Inspektor Gadget Representative: Michael Friese and Qasim Sarfraz The Inspektor Gadget booth had a lot of great energy, drawing in contributors, new users, and people just discovering the project for the first time. 
There was genuine excitement around Inspektor Gadget Desktop and its visual troubleshooting experience for Kubernetes and Linux environments. The integration with HolmesGPT, which was also featured in the keynote, came up frequently and was one of the main talking points throughout the event. A theme that surfaced consistently in conversations with platform engineers was multi-tenancy, with teams looking for ways to safely give developers ad-hoc access to troubleshoot issues independently while keeping overall control at the platform level. It was a good set of conversations that reflected both the project's maturity and the growing demand for a flexible observability framework. Istio Representative: Mitch Connors, Mikhail Krinkin, Jackie Maertens and Mike Morris The Istio booth had steady traffic throughout the conference, with a noticeable shift in who was stopping by. More visitors came from teams with existing sidecar-based production deployments looking for guidance on moving to ambient mode, which is a change from previous years when ambient interest was mostly coming from greenfield users. The motivation to make the move was often tied to cost optimization and performance, with teams having read case studies and feeling more confident about the direction. That said, the increased interest also surfaced some real gaps, including requests for clearer migration guidance, more clarity around architectural differences like mTLS egress workflows, and better support for VM-based workloads. The team is planning to prioritize migration guidance over the coming months. The updated Istio Day format, with a half day of sessions at the Cloud Native Theater stage, also drew a strong crowd with standing room only throughout. Notary Project Representative: Toddy Mladenov and Flora Taagen The Notary Project kiosk drew a wide range of visitors, from people learning about container image signing for the first time to experienced engineers asking detailed questions about what is coming next on the roadmap. A major highlight of the week was the project's conference talk on per-layer dm-verity signing, which drew a packed room and over 660 online sign-ups, one of the stronger turnouts for a project-level session at the event. The talk walked through how the new capability moves container security beyond pull-time verification to continuous runtime protection, backed by dm-verity, which generated a lengthy Q&A and a lot of enthusiasm from the audience. The team also sees a real opportunity ahead as AI workloads push organizations to think harder about the integrity of models, datasets, and container images, and the interest at the booth reinforced that Notary Project is well positioned to play a big role in securing those workflows. ORAS Representative: Toddy Mladenov The ORAS kiosk was staffed by maintainers from Microsoft, NVIDIA, and Red Hat, a good reflection of the healthy multi-vendor community the project has built. Attendees engaged with maintainers on ORAS use cases and adoption, with conversations ranging from how artifacts are tagged and packaged to how ORAS fits into broader multi-cloud workflows. One practical takeaway from maintainer conversations was around leveraging the ORAS SDK more often as a substitute for CLI operations when working with container registries, which helps teams build simpler and more robust tooling. 
### Radius

Representatives: Sylvain Niles and Will Tsai

The Radius booth, supported by the Microsoft Azure Incubations team, attracted a good mix of enterprise platform teams, prospective adopters, and fellow open source maintainers throughout the pavilion hours. There was strong interest in the extensible Radius Resource Types feature and how it helps teams abstract infrastructure complexity and move workloads across different environments. Conversations also surfaced useful feedback on where the project should focus next, including agent-driven infrastructure workflows and using the Radius application graph to improve observability and operational visibility for cloud-native applications.

## Conclusion

KubeCon EU 2026 was a good reminder of why this community continues to grow. The conversations in the Project Pavilion were substantive, the feedback was honest, and the connections made there will carry forward into the work. Microsoft will be back for KubeCon NA in Salt Lake City this November, and we are already looking forward to it.

If you are interested in getting involved with any of these projects, the best starting point is each project's community directly. You are also welcome to reach out to Lexi Nadolski at lexinadolski@microsoft.com with any questions.

# Join us at Microsoft Azure Infra Summit 2026 for deep technical Azure infrastructure content
Microsoft Azure Infra Summit 2026 is a free, engineering-led virtual event created for IT professionals, platform engineers, SREs, and infrastructure teams who want to go deeper on how Azure really works in production. It will take place May 19-21, 2026. This event is built for the people responsible for keeping systems running, making sound architecture decisions, and dealing with the operational realities that show up long after deployment day. Over the past year, one message has come through clearly from the community: infrastructure and operations audiences want more in-depth technical content. They want fewer surface-level overviews and more practical guidance from the engineers and experts who build, run, and support these systems every day. That is exactly what Azure Infra Summit aims to deliver. All content is created AND delivered by engineering, targeting folks working with Azure infrastructure and operating production environments. Who is this for: IT professionals, platform engineers, SREs, and infrastructure teams When: May 19-21, 2026 - 8:00 AM–1:00 PM Pacific Time, all 3 days Where: Online Virtual Cost: Free Level: Most sessions are advanced (L300-400). Register here: https://aka.ms/MAIS-Reg Built for the people who run workloads on Azure Azure Infra Summit is for the people who do more than deploy to Azure. It is for the people who run it. If your day involves uptime, patching, governance, monitoring, reliability, networking, identity, storage, or hybrid infrastructure, this event is for you. Whether you are an IT professional managing enterprise environments, a platform engineer designing landing zones, an Azure administrator, an architect, or an SRE responsible for resilience and operational excellence, you will find content built with your needs in mind. We are intentionally shaping this event around peer-to-peer technical learning. That means engineering-led sessions, practical examples, and candid discussion about architecture, failure modes, operational tradeoffs, and what breaks in production. The promise here is straightforward: less fluff, more infrastructure. What to expect Azure Infra Summit will feature deep technical content in the 300 to 400 level range, with sessions designed by engineering to help you build, operate, and optimize Azure infrastructure more effectively. The event will include a mix of live and pre-recorded sessions and live Q&A. Throughout the three days, we will dig into topics such as: Hybrid operations and management Networking at scale Storage, backup, and disaster recovery Observability, SLOs, and day-2 operations Confidential compute Architecture, automation, governance, and optimization in Azure Core environments And more… The goal is simple: to give you practical guidance you can take back to your environment and apply right away. We want attendees to leave with stronger mental models, a better understanding of how Azure behaves in the real world, and clearer patterns for designing and operating infrastructure with confidence. Why this event matters Infrastructure decisions have a long tail. The choices we make around architecture, operations, governance, and resilience show up later in the form of performance issues, outages, cost, complexity, and recovery challenges. That is why deep technical learning matters, and why events like this matter. Join us I hope you will join us for Microsoft Azure Infra Summit 2026, happening May 19-21, 2026. 
If you care about how Azure infrastructure behaves in the real world, and you want practical, engineering-led guidance on how to build, operate, and optimize it, this event was built for you.

Register here: https://aka.ms/MAIS-Reg

Cheers!
Pierre Roman

# Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents
Last week we announced the Agent Governance Toolkit on the Microsoft Open Source Blog, an open-source project that brings runtime security governance to autonomous AI agents. In that announcement, we covered the why: AI agents are making autonomous decisions in production, and the security patterns that kept systems safe for decades need to be applied to this new class of workload. In this post, we'll go deeper into the how: the architecture, the implementation details, and what it takes to run governed agents in production. The Problem: Production Infrastructure Meets Autonomous Agents If you manage production infrastructure, you already know the playbook: least privilege, mandatory access controls, process isolation, audit logging, and circuit breakers for cascading failures. These patterns have kept production systems safe for decades. Now imagine a new class of workload arriving on your infrastructure, AI agents that autonomously execute code, call APIs, read databases, and spawn sub-processes. They reason about what to do, select tools, and act in loops. And in many current deployments, they do all of this without the security controls you'd demand of any other production workload. That gap is what led us to build the Agent Governance Toolkit: an open-source project, that applies proven security concepts from operating systems, service meshes, and SRE to the emerging world of autonomous AI agents. To frame this in familiar terms: most AI agent frameworks today are like running every process as root, no access controls, no isolation, no audit trail. The Agent Governance Toolkit is the kernel, the service mesh, and the SRE platform for AI agents. When an agent calls a tool, say, `DELETE FROM users WHERE created_at < NOW()`, there is typically no policy layer checking whether that action is within scope. There is no identity verification when one agent communicates with another. There is no resource limit preventing an agent from making 10,000 API calls in a minute. And there is no circuit breaker to contain cascading failures when things go wrong. OWASP Agentic Security Initiative In December 2025, OWASP published the Agentic AI Top 10: the first formal taxonomy of risks specific to autonomous AI agents. The list reads like a security engineer's nightmare: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents, and more. If you've ever hardened a production server, these risks will feel both familiar and urgent. The Agent Governance Toolkit is designed to help address all 10 of these risks through deterministic policy enforcement, cryptographic identity, execution isolation, and reliability engineering patterns. Note: The OWASP Agentic Security Initiative has since adopted the ASI 2026 taxonomy (ASI01–ASI10). The toolkit's copilot-governance package now uses these identifiers with backward compatibility for the original AT numbering. 
## Architecture: Nine Packages, One Governance Stack

The toolkit is structured as a v3.0.0 Public Preview monorepo with nine independently installable packages:

| Package | What It Does |
|---|---|
| Agent OS | Stateless policy engine; intercepts agent actions before execution with configurable pattern matching and semantic intent classification |
| Agent Mesh | Cryptographic identity (DIDs with Ed25519), Inter-Agent Trust Protocol (IATP), and trust-gated communication between agents |
| Agent Hypervisor | Execution rings inspired by CPU privilege levels, saga orchestration for multi-step transactions, and shared session management |
| Agent Runtime | Runtime supervision with kill switches, dynamic resource allocation, and execution lifecycle management |
| Agent SRE | SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery; production reliability practices adapted for AI agents |
| Agent Compliance | Automated governance verification with compliance grading and regulatory framework mapping (EU AI Act, NIST AI RMF, HIPAA, SOC 2) |
| Agent Lightning | Reinforcement learning training governance with policy-enforced runners and reward shaping |
| Agent Marketplace | Plugin lifecycle management with Ed25519 signing, trust-tiered capability gating, and SBOM generation |
| Integrations | 20+ framework adapters for LangChain, CrewAI, AutoGen, Semantic Kernel, Google ADK, Microsoft Agent Framework, OpenAI Agents SDK, and more |

## Agent OS: The Policy Engine

Agent OS intercepts agent tool calls before they execute:

```python
from agent_os import StatelessKernel, ExecutionContext, Policy

kernel = StatelessKernel()

ctx = ExecutionContext(
    agent_id="analyst-1",
    policies=[
        Policy.read_only(),            # No write operations
        Policy.rate_limit(100, "1m"),  # Max 100 calls/minute
        Policy.require_approval(
            actions=["delete_*", "write_production_*"],
            min_approvals=2,
            approval_timeout_minutes=30,
        ),
    ],
)

result = await kernel.execute(
    action="delete_user_record",
    params={"user_id": 12345},
    context=ctx,
)
```

The policy engine works in two layers: configurable pattern matching (with sample rule sets for SQL injection, privilege escalation, and prompt injection that users customize for their environment) and a semantic intent classifier that helps detect dangerous goals regardless of phrasing. When an action is classified as `DESTRUCTIVE_DATA`, `DATA_EXFILTRATION`, or `PRIVILEGE_ESCALATION`, the engine blocks it, routes it for human approval, or downgrades the agent's trust level, depending on the configured policy.

Important: All policy rules, detection patterns, and sensitivity thresholds are externalized to YAML configuration files. The toolkit ships with sample configurations in `examples/policies/` that must be reviewed and customized before production deployment. No built-in rule set should be considered exhaustive.

Policy languages supported: YAML, OPA Rego, and Cedar.

The kernel is stateless by design: each request carries its own context. This means you can deploy it behind a load balancer, as a sidecar container in Kubernetes, or in a serverless function, with no shared state to manage. On AKS or any Kubernetes cluster, it fits naturally into existing deployment patterns. Helm charts are available for agent-os, agent-mesh, and agent-sre.
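To give a feel for how a two-layer check of this kind can be structured, here is a small, self-contained sketch. It is not the Agent OS engine: the patterns, intent labels, and keyword-based routing below are deliberately simplified assumptions for the example (the real classifier is semantic, not keyword matching).

```python
"""Illustrative two-layer action check: pattern rules plus coarse intent routing.

A simplified sketch of the idea, not the Agent OS implementation. The patterns,
intent labels, and keyword lists are assumptions made for this example.
"""
import re
from dataclasses import dataclass

# Layer one: deterministic pattern rules over the raw tool call
DENY_PATTERNS = {
    "sql_mass_delete": re.compile(r"\bDELETE\s+FROM\b.*\bWHERE\b", re.IGNORECASE),
    "drop_table": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
}

# Layer two: route coarse intents (a real engine would classify semantically)
INTENT_KEYWORDS = {
    "DESTRUCTIVE_DATA": ("delete", "drop", "truncate", "purge"),
    "DATA_EXFILTRATION": ("export all", "upload", "exfiltrate"),
    "PRIVILEGE_ESCALATION": ("grant admin", "sudo", "chmod 777"),
}


@dataclass
class Decision:
    allowed: bool
    reason: str
    requires_approval: bool = False


def evaluate(action: str, params: str) -> Decision:
    # Hard deny on known-dangerous patterns
    for rule, pattern in DENY_PATTERNS.items():
        if pattern.search(params):
            return Decision(False, f"pattern rule '{rule}' matched")

    # Route risky-looking intents to human approval instead of silently allowing them
    text = f"{action} {params}".lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return Decision(False, f"classified as {intent}", requires_approval=True)

    return Decision(True, "no rule matched")


print(evaluate("run_sql", "DELETE FROM users WHERE created_at < NOW()"))
```

The point of the sketch is the split: cheap deterministic rules run first, and anything that only looks risky at the intent level gets routed to a human rather than silently allowed.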
## Agent Mesh: Zero-Trust Identity for Agents

In service mesh architectures, services prove their identity via mTLS certificates before communicating. AgentMesh applies the same principle to AI agents using decentralized identifiers (DIDs) with Ed25519 cryptography and the Inter-Agent Trust Protocol (IATP):

```python
from agentmesh import AgentIdentity, TrustBridge

identity = AgentIdentity.create(
    name="data-analyst",
    sponsor="alice@company.com",  # Human accountability
    capabilities=["read:data", "write:reports"],
)
# identity.did -> "did:mesh:data-analyst:a7f3b2..."

bridge = TrustBridge()
verification = await bridge.verify_peer(
    peer_id="did:mesh:other-agent",
    required_trust_score=700,  # Must score >= 700/1000
)
```

A critical feature is trust decay: an agent's trust score decreases over time without positive signals. An agent trusted last week but silent since then gradually becomes untrusted, modeling the reality that trust requires ongoing demonstration, not a one-time grant. Delegation chains enforce scope narrowing: a parent agent with read+write permissions can delegate only read access to a child agent, never escalate.

## Agent Hypervisor: Execution Rings

CPU architectures use privilege rings (Ring 0 for kernel, Ring 3 for userspace) to isolate workloads. The Agent Hypervisor applies this model to AI agents:

| Ring | Trust Level | Capabilities |
|---|---|---|
| Ring 0 (Kernel) | Score ≥ 900 | Full system access, can modify policies |
| Ring 1 (Supervisor) | Score ≥ 700 | Cross-agent coordination, elevated tool access |
| Ring 2 (User) | Score ≥ 400 | Standard tool access within assigned scope |
| Ring 3 (Untrusted) | Score < 400 | Read-only, sandboxed execution only |

New and untrusted agents start in Ring 3 and earn their way up, exactly the principle of least privilege that production engineers apply to every other workload. Each ring enforces per-agent resource limits: maximum execution time, memory caps, CPU throttling, and request rate limits. If a Ring 2 agent attempts a Ring 1 operation, it gets blocked, just like a userspace process trying to access kernel memory.

These ring definitions and their associated trust score thresholds are fully configurable via policy. Organizations can define custom ring structures, adjust the number of rings, set different trust score thresholds for transitions, and configure per-ring resource limits to match their security requirements.

The hypervisor also provides saga orchestration for multi-step operations. When an agent executes a sequence (draft email → send → update CRM) and the final step fails, compensating actions fire in reverse. Borrowed from distributed transaction patterns, this ensures multi-agent workflows maintain consistency even when individual steps fail.

## Agent SRE: SLOs and Circuit Breakers for Agents

If you practice SRE, you measure services by SLOs and manage risk through error budgets. Agent SRE extends this to AI agents: when an agent's safety SLI drops below 99 percent, meaning more than 1 percent of its actions violate policy, the system automatically restricts the agent's capabilities until it recovers. This is the same error-budget model that SRE teams use for production services, applied to agent behavior.

We also built nine chaos engineering fault injection templates: network delays, LLM provider failures, tool timeouts, trust score manipulation, memory corruption, and concurrent access races. Because the only way to know if your agent system is resilient is to break it intentionally.
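The error-budget arithmetic behind that rule is simple enough to sketch. The 99 percent target comes from the description above; the rolling window size and the restriction decision are assumptions for the example, not the Agent SRE API.

```python
"""Illustrative safety SLO and error-budget tracking for a single agent.

A sketch of the arithmetic only. The window size and the restriction decision
are assumptions for this example, not Agent SRE's actual implementation.
"""
from collections import deque

SAFETY_SLO = 0.99   # target: at least 99% of actions comply with policy
WINDOW = 1000       # rolling window of recent actions


class SafetyBudget:
    def __init__(self) -> None:
        self.events = deque(maxlen=WINDOW)  # True = compliant action, False = violation

    def record(self, compliant: bool) -> None:
        self.events.append(compliant)

    @property
    def sli(self) -> float:
        """Fraction of recent actions that complied with policy."""
        return sum(self.events) / len(self.events) if self.events else 1.0

    @property
    def budget_remaining(self) -> float:
        """Share of the allowed violations (1 - SLO) not yet consumed."""
        allowed = (1.0 - SAFETY_SLO) * len(self.events)
        violations = len(self.events) - sum(self.events)
        return max(0.0, 1.0 - violations / allowed) if allowed else 1.0

    def should_restrict(self) -> bool:
        return self.sli < SAFETY_SLO


budget = SafetyBudget()
for compliant in [True] * 980 + [False] * 20:  # 2% violations over the window
    budget.record(compliant)

print(f"SLI={budget.sli:.3f}  budget left={budget.budget_remaining:.0%}  restrict={budget.should_restrict()}")
```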
Agent SRE integrates with your existing observability stack through adapters for Datadog, PagerDuty, Prometheus, OpenTelemetry, Langfuse, LangSmith, Arize, MLflow, and more. Message broker adapters support Kafka, Redis, NATS, Azure Service Bus, AWS SQS, and RabbitMQ.

## Compliance and Observability

If your organization already maps to CIS Benchmarks, NIST AI RMF, or other frameworks for infrastructure compliance, the OWASP Agentic Top 10 is the equivalent standard for AI agent workloads. The toolkit's agent-compliance package provides automated governance grading against these frameworks.

The toolkit is framework-agnostic, with 20+ adapters that hook into each framework's native extension points, so adding governance to an existing agent is typically a few lines of configuration, not a rewrite.

The toolkit exports metrics to any OpenTelemetry-compatible platform: Prometheus, Grafana, Datadog, Arize, or Langfuse. If you're already running an observability stack for your infrastructure, agent governance metrics flow through the same pipeline. Key metrics include policy decisions per second, trust score distributions, ring transitions, SLO burn rates, circuit breaker state, and governance workflow latency.

## Getting Started

```bash
# Install all packages
pip install agent-governance-toolkit[full]

# Or individual packages
pip install agent-os-kernel agent-mesh agent-sre
```

The toolkit is available across language ecosystems: Python, TypeScript (`@microsoft/agentmesh-sdk` on npm), Rust, Go, and .NET (`Microsoft.AgentGovernance` on NuGet).

## Azure Integrations

While the toolkit is platform-agnostic, we've included integrations that help enable the fastest path to production on Azure:

- Azure Kubernetes Service (AKS): Deploy the policy engine as a sidecar container alongside your agents. Helm charts provide production-ready manifests for agent-os, agent-mesh, and agent-sre.
- Azure AI Foundry Agent Service: Use the built-in middleware integration for agents deployed through Azure AI Foundry.
- OpenClaw Sidecar: One compelling deployment scenario is running OpenClaw, the open-source autonomous agent, inside a container with the Agent Governance Toolkit deployed as a sidecar. This gives you policy enforcement, identity verification, and SLO monitoring over OpenClaw's autonomous operations. On Azure Kubernetes Service (AKS), the deployment is a standard pod with two containers: OpenClaw as the primary workload and the governance toolkit as the sidecar, communicating over localhost (a rough sketch of that hand-off follows this list). We have a reference architecture and Helm chart available in the repository. The same sidecar pattern works with any containerized agent; OpenClaw is a particularly compelling example because of the interest in autonomous agent safety.
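The sidecar hand-off itself is just a local call before each tool invocation. The sketch below is illustrative only: the port, path, and request/response shape are assumptions for the example and are not the toolkit's actual API.

```python
"""Illustrative agent-side call to a governance sidecar over localhost.

A sketch of the sidecar pattern only. The port, path, and request/response
shape are assumptions for the example, not the toolkit's actual API.
"""
import json
import urllib.request

SIDECAR_URL = "http://127.0.0.1:8181/v1/evaluate"  # assumed sidecar endpoint


def check_with_sidecar(agent_id: str, action: str, params: dict) -> bool:
    """Ask the governance sidecar whether this tool call may proceed."""
    payload = json.dumps({"agent_id": agent_id, "action": action, "params": params})
    request = urllib.request.Request(
        SIDECAR_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        decision = json.load(response)
    return bool(decision.get("allowed", False))  # fail closed if the field is missing


# The agent consults the sidecar before every tool invocation
if check_with_sidecar("openclaw-1", "delete_user_record", {"user_id": 12345}):
    print("proceed with tool call")
else:
    print("blocked by governance policy")
```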
## Tutorials and Resources

34+ step-by-step tutorials covering policy engines, trust, compliance, MCP security, observability, and cross-platform SDK usage are available in the repository:

```bash
git clone https://github.com/microsoft/agent-governance-toolkit
cd agent-governance-toolkit
pip install -e "packages/agent-os[dev]" -e "packages/agent-mesh[dev]" -e "packages/agent-sre[dev]"

# Run the demo
python -m agent_os.demo
```

## What's Next

AI agents are becoming autonomous decision-makers in production infrastructure, executing code, managing databases, and orchestrating services. The security patterns that kept production systems safe for decades (least privilege, mandatory access controls, process isolation, audit logging) are exactly what these new workloads need. We built them. They're open source.

We're building this in the open because agent security is too important for any single organization to solve alone:

- Security research: Adversarial testing, red-team results, and vulnerability reports strengthen the toolkit for everyone.
- Community contributions: Framework adapters, detection rules, and compliance mappings from the community expand coverage across ecosystems.

We are committed to open governance. We're releasing this project under Microsoft today, and we aspire to move it into a foundation home, such as the AI and Data Foundation (AAIF), where it can benefit from cross-industry stewardship. We're actively engaging with foundation partners on this path.

The Agent Governance Toolkit is open source under the MIT license. Contributions welcome at github.com/microsoft/agent-governance-toolkit.

# How Do We Know AI Isn't Lying? The Art of Evaluating LLMs in RAG Systems
## 🔍 1. Why Evaluating LLM Responses is Hard

In classical programming, correctness is binary.

| Input | Expected | Result |
|---|---|---|
| 2 + 2 | 4 | ✔ Correct |
| 2 + 2 | 5 | ✘ Wrong |

Software is deterministic — same input → same output. LLMs are probabilistic. They generate one of many valid word combinations, like forming sentences from multiple possible synonyms and sentence structures.

Example prompt: "Explain gravity like I'm 10". Possible responses:

| Response A | Response B |
|---|---|
| Gravity is a force that pulls everything to Earth. | Gravity bends space-time, causing objects to attract. |

Both are correct. Which is better? It depends on the audience. So evaluation needs to look beyond text similarity. We must check:

- ✔ Is the answer meaningful?
- ✔ Is it correct?
- ✔ Is it easy to understand?
- ✔ Does it follow prompt intent?

Testing LLMs is like grading essays — not checking numeric outputs.

## 🧠 2. Why RAG Evaluation is Even Harder

RAG introduces an additional layer — retrieval. The model no longer answers from memory; it must first read context, then summarise it. Evaluation now has multiple dimensions:

| Evaluation Layer | What we must verify |
|---|---|
| Retrieval | Did we fetch the right documents? |
| Understanding | Did the model interpret context correctly? |
| Grounding | Is the answer based on retrieved data? |
| Generation Quality | Is the final response complete and clear? |

A simple story makes this intuitive: a teacher asks a student to explain photosynthesis. The student goes to the library → selects a book → reads → writes an explanation. We must evaluate:

- Did they pick the right book? → Retrieval
- Did they understand the topic? → Reasoning
- Did they copy facts correctly without inventing? → Faithfulness
- Is the written explanation clear enough for another child to learn from? → Answer Quality

One failure → total failure.

## 🧩 3. Two Types of Evaluation

### 🔹 Intrinsic Evaluation — Quality of the Response Itself

Here we judge the answer, ignoring real-world impact. We check:

- ✔ Grammar & coherence
- ✔ Completeness of explanation
- ✔ No hallucination
- ✔ Logic flow & clarity
- ✔ Semantic correctness

This is similar to checking how well the essay is written. Even if the result did not solve the real problem, the answer could still look good — that's why intrinsic evaluation alone is not enough.

### 🔹 Extrinsic Evaluation — Did It Achieve the Goal?

This measures task success. If a customer support bot writes a beautifully worded paragraph, but the user still doesn't get their refund — it failed extrinsically. Examples:

| System Type | Extrinsic Goal |
|---|---|
| Banking RAG Bot | Did the user get the correct KYC procedure? |
| Medical RAG | Was the advice safe & factual? |
| Legal search assistant | Did it return the right section of the law? |
| Technical summariser | Did the summary capture the key meaning? |

Intrinsic = writing quality. Extrinsic = impact quality. A production-grade RAG system must satisfy both.

## 📏 4. Core RAG Evaluation Metrics (Explained with Very Simple Analogies)

| Metric | Meaning | Analogy |
|---|---|---|
| Relevance | Does the answer match the question? | Ask "who invented C++?" → model talks about Java ❌ |
| Faithfulness | No invented facts | Book says started 2004, response says 1990 ❌ |
| Groundedness | Answer traceable to sources | Claims facts that don't exist in context ❌ |
| Completeness | Covers all parts of the question | User asks Windows vs Linux → only explains Windows |
| Context Recall / Precision | Correct docs retrieved & used | Student opens the wrong chapter |
| Hallucination Rate | Degree of made-up info | "Taj Mahal is in London" 😱 |
| Semantic Similarity | Meaning-level match | "Engine died" = "Car stopped running" |

💡 Good evaluation doesn't check exact wording. It checks meaning + truth + usefulness.
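Of these metrics, semantic similarity is the easiest to make concrete in a few lines of code. Below is a minimal sketch using sentence embeddings; the model name and the 0.7 threshold are assumptions for the example, and real evaluation frameworks combine this signal with the other metrics above.

```python
"""Illustrative semantic similarity check between a reference and a generated answer.

A sketch only: the embedding model and the 0.7 pass threshold are assumptions,
and production evaluation combines this with faithfulness, grounding, and more.
"""
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small, general-purpose embedding model

reference = "The car stopped running because the engine died."
candidate = "The vehicle broke down after its engine failed."

embeddings = model.encode([reference, candidate], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"semantic similarity: {score:.2f}")
if score >= 0.7:  # assumed threshold for a meaning-level match
    print("meaning-level match: the answers convey the same idea")
else:
    print("the answers diverge in meaning")
```

This is the intuition behind the "Engine died" = "Car stopped running" analogy in the table: the strings barely overlap, but their embeddings sit close together.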
🛠 5. Tools for RAG Evaluation

🔹 1. RAGAS — Foundation for RAG Scoring

RAGAS evaluates responses based on:

✔ Faithfulness
✔ Relevance
✔ Context recall
✔ Answer similarity

Think of RAGAS as a teacher grading with a rubric. It reads both answer + source documents, then scores based on truthfulness & alignment.

🔹 2. LangChain Evaluators

LangChain offers multiple evaluation types:

| Type | What it checks |
| --- | --- |
| String or regex | Basic keyword presence |
| Embedding based | Meaning similarity, not text match |
| LLM-as-a-Judge | AI evaluates AI (deep reasoning) |

LangChain = testing toolbox. RAGAS = grading framework. Together they form a complete QA ecosystem.

🔹 3. PyTest + CI for Automated LLM Testing

Instead of manually validating outputs, we automate:

1. Feed preset questions to the RAG system
2. Capture answers
3. Run RAGAS/LangChain scoring
4. Fail the test if hallucination > threshold

This brings AI closer to software-engineering discipline. RAG systems stop being experiments — they become testable, trackable, production-grade products.
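As a minimal sketch of that automation, the test below scores one canned question with RAGAS and fails the build if faithfulness drops below a threshold. It assumes a ragas 0.1-style API and a hypothetical ask_rag() helper that returns the generated answer plus the retrieved context; adapt both to your own pipeline.

```python
# Sketch: CI regression test for a RAG pipeline using RAGAS + PyTest.
# ask_rag() is a hypothetical helper wrapping your retrieval + generation step;
# the metric imports assume a ragas 0.1-style API and may differ in newer versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

from my_rag_app import ask_rag  # hypothetical: returns (answer, retrieved_contexts)

FAITHFULNESS_THRESHOLD = 0.8


def test_refund_question_stays_grounded():
    question = "What is the refund window for annual plans?"
    answer, contexts = ask_rag(question)

    dataset = Dataset.from_dict({
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],  # list of retrieved passages for this question
    })

    scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
    assert scores["faithfulness"] >= FAITHFULNESS_THRESHOLD, (
        f"Faithfulness dropped to {scores['faithfulness']:.2f} - possible hallucination regression"
    )
```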
🚀 6. The Future: LLM-as-a-Judge

The future of evaluation is simple: LLMs will evaluate other LLMs. One model writes an answer. Another model checks:

✔ Was it truthful?
✔ Was it relevant?
✔ Did it follow context?

This enables:

| Benefit | Why it matters |
| --- | --- |
| Scalable evaluation | No humans needed for every query |
| Continuous improvement | Model learns from mistakes |
| Real-time scoring | Detect errors before the user sees them |

This is like autopilot for AI systems — not only navigating, but self-correcting mid-flight. And that is where enterprise AI is headed.

🎯 Final Summary

Evaluating LLM responses is not checking if strings match. It is checking if the machine:

✔ Understood the question
✔ Retrieved relevant knowledge
✔ Avoided hallucination
✔ Provided complete, meaningful reasoning
✔ Grounded the answer in real source text

RAG evaluation demands multi-layer validation — retrieval, reasoning, grounding, semantics, safety. Frameworks like RAGAS + LangChain evaluators + PyTest pipelines are shaping the discipline of measurable, reliable AI — pushing LLM-powered RAG from cool demo → trustworthy enterprise intelligence.

Useful Resources

- What is Retrieval-Augmented Generation (RAG): https://azure.microsoft.com/en-in/resources/cloud-computing-dictionary/what-is-retrieval-augmented-generation-rag/
- Retrieval-Augmented Generation concepts (Azure AI): https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/retrieval-augmented-generation
- RAG with Azure AI Search – Overview: https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview
- Evaluate Generative AI Applications (Microsoft Learn – Learning Path): https://learn.microsoft.com/en-us/training/paths/evaluate-generative-ai-apps/
- Evaluate Generative AI Models in Microsoft Foundry Portal: https://learn.microsoft.com/en-us/training/modules/evaluate-models-azure-ai-studio/
- RAG Evaluation Metrics (Relevance, Groundedness, Faithfulness): https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/rag-evaluators
- RAGAS – Evaluation Framework for RAG Systems: https://docs.ragas.io/

Building Production-Ready, Secure, Observable, AI Agents with Real-Time Voice with Microsoft Foundry

We're excited to announce the general availability of Foundry Agent Service, Observability in Foundry Control Plane, and the Microsoft Foundry portal — plus Voice Live integration with Agent Service in public preview — giving teams a production-ready platform to build, deploy, and operate intelligent AI agents with enterprise-grade security and observability.

Generally Available: Evaluations, Monitoring, and Tracing in Microsoft Foundry
If you've shipped an AI agent to production, you've likely run into the same uncomfortable realization: the hard part isn't getting the agent to work - it's keeping it working. Models get updated, prompts get tweaked, retrieval pipelines drift, and user traffic surfaces edge cases that never appeared in your eval suite. Quality isn't something you establish once. It's something you have to continuously measure.

Today, we're making that continuous measurement a first-class operational capability. Evaluations, Monitoring, and Tracing in Microsoft Foundry are now generally available through Foundry Control Plane. These aren't standalone tools bolted onto the side of the platform - they're deeply integrated with Azure Monitor, which means AI agent observability now lives in the same operational plane as the rest of your infrastructure.

The Problem With Point-in-Time Evaluation

Most evaluation workflows are designed around a pre-deployment gate. You build a test dataset, run your evals, review the scores, and ship. That approach has real value - but it has a hard ceiling. In production, agent behavior is a function of many things that change independently of your code:

- Foundation model updates ship continuously and can shift output style, reasoning patterns, and edge case handling in ways that don't always surface on your benchmark set.
- Prompt changes can have nonlinear effects downstream, especially in multi-step agentic flows.
- Retrieval pipeline drift changes what context your agent actually sees at inference time. A document index fresh last month may have stale or subtly different content today.
- Real-world traffic distribution is never exactly what you sampled for your test set. Production surfaces long-tail inputs that feel obvious in hindsight but were invisible during development.

The implication is straightforward: evaluation has to be continuous, not episodic. You need quality signals at development time, at every CI/CD commit, and continuously against live production traffic - all using the same evaluator definitions so results are comparable across environments. That's the core design principle behind Foundry Observability.

Continuous Evaluation Across the Full AI Lifecycle

Built-In Evaluators

Foundry's built-in evaluators cover the most critical quality and safety dimensions for production agent systems:

- Coherence and Relevance measure whether responses are internally consistent and on-topic relative to the input. These are table-stakes signals for any conversational or task-completion agent.
- Groundedness is particularly important for RAG-based architectures. It measures whether the model's output is actually supported by the retrieved context - as opposed to plausible-sounding content the model generated from its parametric memory. Groundedness failures are a leading indicator of hallucination risk in production, and they're often invisible to human reviewers at scale.
- Retrieval Quality evaluates the retrieval step independently from generation. Groundedness failures can originate in two places: the model may be ignoring good context, or the retrieval pipeline may not be surfacing relevant context in the first place. Splitting these signals makes it much easier to pinpoint root cause.
- Safety and Policy Alignment evaluates whether outputs meet your deployment's policy requirements - content safety, topic restrictions, response format compliance, and similar constraints.

These evaluators are designed to run at every stage of the AI lifecycle:

- Local development - run evals inline as you iterate on prompts, retrieval config, or orchestration logic
- CI/CD pipelines - gate every commit against your quality baselines; catch regressions before they reach production
- Production traffic monitoring - continuously evaluate sampled live traffic and surface trends over time

Because the evaluators are identical across all three contexts, a score in CI means the same thing as a score in production monitoring. See the Practical Guide to Evaluations and the Built-in Evaluators Reference for a deeper walkthrough.
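For a sense of what the local-development loop can look like, here is a minimal sketch using the azure-ai-evaluation Python package to score a single response for groundedness and relevance. The package surface and parameter names reflect the current SDK but may differ slightly by version, so treat this as a sketch rather than the canonical Foundry workflow.

```python
# Minimal sketch: scoring one agent response for groundedness and relevance
# at development time with the azure-ai-evaluation package.
# The judge deployment name and endpoint below are placeholders.
from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "api_key": "<api-key>",
    "azure_deployment": "gpt-4o",  # judge model deployment name (assumption)
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)

sample = {
    "query": "What is the refund window for annual plans?",
    "context": "Annual plans can be refunded in full within 30 days of purchase.",
    "response": "You can get a full refund within 30 days of buying an annual plan.",
}

print(groundedness(**sample))                                   # e.g. {"groundedness": 5.0, ...}
print(relevance(query=sample["query"], response=sample["response"]))
```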
Custom Evaluators - Encoding Your Own Definition of Quality

Built-in evaluators cover common signals well, but production agents often need to satisfy criteria specific to a domain, regulatory environment, or internal standard. Foundry supports two types of custom evaluators (currently in public preview):

LLM-as-a-Judge evaluators let you configure a prompt and grading rubric, then use a language model to apply that rubric to your agent's outputs. This is the right approach for quality dimensions that require reasoning or contextual judgment - whether a response appropriately acknowledges uncertainty, whether a customer-facing message matches your brand tone, or whether a clinical summary meets documentation standards. You write a judge prompt with a scoring scale (e.g., 1–5 with criteria for each level) that evaluates a given {input} / {response} pair. Foundry runs this at scale and aggregates scores into your dashboards alongside built-in results.

Code-based evaluators are Python functions that implement any evaluation logic you can express programmatically - regex matching, schema validation, business rule checks, compliance assertions, or calls to external systems. If your organization has documented policies about what a valid agent response looks like, you can encode those policies directly into your evaluation pipeline.

Custom and built-in evaluators compose naturally - running against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules.
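As an illustration of the code-based flavor, here is a minimal sketch of an evaluator written as a plain Python callable. The policy rules, names, and return schema are hypothetical; align them with the result format your Foundry project expects.

```python
import re


class RefundPolicyEvaluator:
    """Code-based evaluator sketch: checks an agent response against two
    illustrative business rules. The rules and output keys are hypothetical."""

    CARD_PATTERN = re.compile(r"\b\d{13,16}\b")  # crude check for leaked card numbers

    def __call__(self, *, query: str, response: str) -> dict:
        # Rule 1: never echo anything that looks like a payment card number.
        leaks_card = bool(self.CARD_PATTERN.search(response))
        # Rule 2: refund answers must point the user at the escalation channel.
        mentions_escalation = "support ticket" in response.lower()

        passed = (not leaks_card) and mentions_escalation
        return {
            "refund_policy_compliance": 1.0 if passed else 0.0,
            "leaked_card_number": leaks_card,
            "mentions_escalation_path": mentions_escalation,
        }


# Usage sketch:
# evaluator = RefundPolicyEvaluator()
# evaluator(query="How do I get a refund?", response="Open a support ticket and ...")
```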
Monitoring and Alerting - AI Quality as an Operational Signal

All observability data produced by Foundry - evaluation results, traces, latency, token usage, and quality metrics - is published directly to Azure Monitor. This is where the integration pays off for teams already on Azure. What this enables that siloed AI monitoring tools can't:

- Cross-stack correlation. When your groundedness score drops, is it a model update, a retrieval pipeline issue, or an infrastructure problem affecting latency? With AI quality signals and infrastructure telemetry in the same Azure Monitor Application Insights workspace, you can answer that in minutes rather than hours of manual correlation across disconnected systems.
- Unified alerting. Configure Azure Monitor alert rules on any evaluation metric - trigger a PagerDuty incident when groundedness drops below threshold, send a Teams notification when safety violations spike, or create automated runbook responses when retrieval quality degrades. These are the same alert mechanisms your SRE team already uses.
- Enterprise governance by default. Azure Monitor's RBAC, retention policies, diagnostic settings, and audit logging apply automatically to all AI observability data. You inherit the governance framework your organization has already built and approved.
- Grafana and existing dashboards. If your team uses Azure Managed Grafana, evaluation metrics can flow into existing dashboards alongside your other operational metrics - a single pane of glass for application health, infrastructure performance, and AI agent quality.

The Agent Monitoring Dashboard in the Foundry portal provides an AI-native view out of the box - evaluation metric trends, safety threshold status, quality score distributions, and latency breakdowns. Everything in that dashboard is backed by Azure Monitor data, so SRE teams can always drill deeper.

End-to-End Tracing: From Quality Signal to Root Cause

A groundedness score tells you something is wrong. A trace tells you exactly where the failure occurred and what the agent actually did.

Foundry provides OpenTelemetry-based distributed tracing that follows each request through your entire agent system: model calls, tool invocations, retrieval steps, orchestration logic, and cross-agent handoffs. Traces capture the full execution path - inputs, outputs, latency at each step, tool call parameters and responses, and token usage.

The key design decision: evaluation results are linked directly to traces. When you see a low groundedness score in your monitoring dashboard, you navigate directly to the specific trace that produced it - no manual timestamp correlation, no separate trace ID lookup. The connection is made automatically.

Foundry auto-collects traces across the frameworks your agents are likely already built on:

- Microsoft Agent Framework
- Semantic Kernel
- LangChain and LangGraph
- OpenAI Agents SDK

For custom or less common orchestration frameworks, the Azure Monitor OpenTelemetry Distro provides an instrumentation path. Microsoft is also contributing upstream to the OpenTelemetry project - working with Cisco Outshift, we've contributed semantic conventions for multi-agent trace correlation, standardizing how agent identity, task context, and cross-agent handoffs are represented in OTel spans.

Note: Tracing is currently in public preview, with GA shipping by end of March.
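For the custom-framework path mentioned above, here is a hedged sketch of manual instrumentation with the Azure Monitor OpenTelemetry Distro. The span names, attribute keys, and the retrieve_documents/generate_answer helpers are illustrative, not a prescribed schema.

```python
# Sketch: manual OpenTelemetry spans for a custom orchestration loop,
# exported to Azure Monitor via the Azure Monitor OpenTelemetry Distro.
# Span names and attribute keys below are illustrative, not a required schema.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor()  # reads APPLICATIONINSIGHTS_CONNECTION_STRING from the environment

tracer = trace.get_tracer("my_custom_agent")  # hypothetical instrumentation scope name


def handle_request(user_query: str) -> str:
    with tracer.start_as_current_span("agent.invoke") as span:
        span.set_attribute("agent.query", user_query)

        with tracer.start_as_current_span("agent.retrieval"):
            docs = retrieve_documents(user_query)  # hypothetical retrieval step

        with tracer.start_as_current_span("agent.generation") as gen_span:
            answer = generate_answer(user_query, docs)  # hypothetical model call
            gen_span.set_attribute("agent.response_length", len(answer))

        return answer
```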
Prompt Optimizer (Public Preview)

One persistent friction point in agent development is the iteration loop between writing prompts and measuring their effect. You make a change, run your evals, look at the delta, try to infer what about the change mattered, and repeat.

Prompt Optimizer tightens this loop. It analyzes your existing prompt and applies structured prompt engineering techniques - clarifying ambiguous instructions, improving formatting for model comprehension, restructuring few-shot examples, making implicit constraints explicit - with paragraph-level explanations for every change it makes.

The transparency is deliberate. Rather than producing a black-box "optimized" prompt, it shows you exactly what it changed and why. You can add constraints, trigger another optimization pass, and iterate until satisfied. When you're done, apply it with one click.

The value compounds alongside continuous evaluation: run your eval suite against the current prompt, optimize, run evals again, see the measured improvement. That feedback loop - optimize, measure, optimize - is the closest thing to a systematic approach to prompt engineering that currently exists.

What Makes our Approach to Observability Different

There are other evaluation and observability tools in the AI ecosystem. The differentiation in Foundry's approach comes down to specific architectural choices:

- Unified lifecycle coverage, not just pre-deployment testing. Most existing evaluation tools are designed for offline, pre-deployment use. Foundry's evaluators run in the same form at development time, in CI/CD, and against live production traffic. Your quality metrics are actually comparable across the lifecycle - you can tell whether production quality matches what you saw in testing, rather than operating two separate measurement systems that can't be compared.
- No separate observability silo. Publishing all observability data to Azure Monitor means you don't operate a separate system for AI quality alongside your existing infrastructure monitoring. AI incidents route through your existing on-call rotations. AI quality data is subject to the same retention and compliance controls as the rest of your telemetry.
- Framework-agnostic tracing. Auto-instrumentation across Semantic Kernel, LangChain, LangGraph, and the OpenAI Agents SDK means you're not locked into a specific orchestration framework. The OpenTelemetry foundation means trace data is portable to any compatible backend, protecting your investment as the tooling landscape evolves.
- Composable evaluators. Built-in and custom evaluators run in the same pipeline, against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules. You don't choose between generic coverage and domain-specific precision - you get both.
- Evaluation linked to traces. Most systems treat evaluation and tracing as separate concerns. Foundry treats them as two views of the same event - closing the loop between detecting a quality problem and diagnosing it.

Getting Started

If you're building agents on Microsoft Foundry, or using Semantic Kernel, LangChain, LangGraph, or the OpenAI Agents SDK and want to add production observability, the entry point is Foundry Control Plane.

Try it:

1. You'll need a Foundry project with an agent and an Azure OpenAI deployment.
2. Enable observability by navigating to Foundry Control Plane and connecting your Azure Monitor workspace.
3. Then walk through the Practical Guide to Evaluations, explore the Built-in Evaluators Reference, and set up end-to-end tracing for your agents.