Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents
Last week we announced the Agent Governance Toolkit on the Microsoft Open Source Blog, an open-source project that brings runtime security governance to autonomous AI agents. In that announcement, we covered the why: AI agents are making autonomous decisions in production, and the security patterns that kept systems safe for decades need to be applied to this new class of workload. In this post, we'll go deeper into the how: the architecture, the implementation details, and what it takes to run governed agents in production.

The Problem: Production Infrastructure Meets Autonomous Agents

If you manage production infrastructure, you already know the playbook: least privilege, mandatory access controls, process isolation, audit logging, and circuit breakers for cascading failures. These patterns have kept production systems safe for decades.

Now imagine a new class of workload arriving on your infrastructure: AI agents that autonomously execute code, call APIs, read databases, and spawn sub-processes. They reason about what to do, select tools, and act in loops. And in many current deployments, they do all of this without the security controls you'd demand of any other production workload.

That gap is what led us to build the Agent Governance Toolkit: an open-source project that applies proven security concepts from operating systems, service meshes, and SRE to the emerging world of autonomous AI agents. To frame this in familiar terms: most AI agent frameworks today are like running every process as root, with no access controls, no isolation, and no audit trail. The Agent Governance Toolkit is the kernel, the service mesh, and the SRE platform for AI agents.

When an agent calls a tool, say `DELETE FROM users WHERE created_at < NOW()`, there is typically no policy layer checking whether that action is within scope. There is no identity verification when one agent communicates with another. There is no resource limit preventing an agent from making 10,000 API calls in a minute. And there is no circuit breaker to contain cascading failures when things go wrong.

OWASP Agentic Security Initiative

In December 2025, OWASP published the Agentic AI Top 10: the first formal taxonomy of risks specific to autonomous AI agents. The list reads like a security engineer's nightmare: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents, and more. If you've ever hardened a production server, these risks will feel both familiar and urgent. The Agent Governance Toolkit is designed to help address all 10 of these risks through deterministic policy enforcement, cryptographic identity, execution isolation, and reliability engineering patterns.

Note: The OWASP Agentic Security Initiative has since adopted the ASI 2026 taxonomy (ASI01–ASI10). The toolkit's copilot-governance package now uses these identifiers with backward compatibility for the original AT numbering.
Architecture: Nine Packages, One Governance Stack

The toolkit is structured as a v3.0.0 Public Preview monorepo with nine independently installable packages:

| Package | What It Does |
|---|---|
| Agent OS | Stateless policy engine that intercepts agent actions before execution, with configurable pattern matching and semantic intent classification |
| Agent Mesh | Cryptographic identity (DIDs with Ed25519), the Inter-Agent Trust Protocol (IATP), and trust-gated communication between agents |
| Agent Hypervisor | Execution rings inspired by CPU privilege levels, saga orchestration for multi-step transactions, and shared session management |
| Agent Runtime | Runtime supervision with kill switches, dynamic resource allocation, and execution lifecycle management |
| Agent SRE | SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery: production reliability practices adapted for AI agents |
| Agent Compliance | Automated governance verification with compliance grading and regulatory framework mapping (EU AI Act, NIST AI RMF, HIPAA, SOC 2) |
| Agent Lightning | Reinforcement learning training governance with policy-enforced runners and reward shaping |
| Agent Marketplace | Plugin lifecycle management with Ed25519 signing, trust-tiered capability gating, and SBOM generation |
| Integrations | 20+ framework adapters for LangChain, CrewAI, AutoGen, Semantic Kernel, Google ADK, Microsoft Agent Framework, OpenAI Agents SDK, and more |

Agent OS: The Policy Engine

Agent OS intercepts agent tool calls before they execute:

```python
from agent_os import StatelessKernel, ExecutionContext, Policy

kernel = StatelessKernel()

ctx = ExecutionContext(
    agent_id="analyst-1",
    policies=[
        Policy.read_only(),            # No write operations
        Policy.rate_limit(100, "1m"),  # Max 100 calls/minute
        Policy.require_approval(
            actions=["delete_*", "write_production_*"],
            min_approvals=2,
            approval_timeout_minutes=30,
        ),
    ],
)

result = await kernel.execute(
    action="delete_user_record",
    params={"user_id": 12345},
    context=ctx,
)
```

The policy engine works in two layers: configurable pattern matching (with sample rule sets for SQL injection, privilege escalation, and prompt injection that users customize for their environment) and a semantic intent classifier that helps detect dangerous goals regardless of phrasing. When an action is classified as `DESTRUCTIVE_DATA`, `DATA_EXFILTRATION`, or `PRIVILEGE_ESCALATION`, the engine blocks it, routes it for human approval, or downgrades the agent's trust level, depending on the configured policy.

Important: All policy rules, detection patterns, and sensitivity thresholds are externalized to YAML configuration files. The toolkit ships with sample configurations in `examples/policies/` that must be reviewed and customized before production deployment. No built-in rule set should be considered exhaustive. Policy languages supported: YAML, OPA Rego, and Cedar.

The kernel is stateless by design: each request carries its own context. This means you can deploy it behind a load balancer, as a sidecar container in Kubernetes, or in a serverless function, with no shared state to manage. On AKS or any Kubernetes cluster, it fits naturally into existing deployment patterns. Helm charts are available for agent-os, agent-mesh, and agent-sre.
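To make the two-layer design more concrete, here is a minimal, illustrative sketch of how deterministic pattern matching and semantic intent classification can be layered in front of a tool call. This is not the toolkit's actual implementation: the rule patterns, the `classify_intent` stub, and the decision labels are assumptions for illustration only, and a real deployment would load its rules from the YAML policy files described above.

```python
import re

# Assumed, illustrative patterns. Real deployments load and tune these from
# YAML policy files (e.g., the samples under examples/policies/).
BLOCKED_PATTERNS = [
    r"(?i)\bdrop\s+table\b",     # destructive SQL
    r"(?i)\bdelete\s+from\b",    # bulk deletes
    r"(?i)\bchmod\s+777\b",      # crude privilege-escalation signal
]

def classify_intent(action: str, params: dict) -> str:
    """Placeholder for a semantic intent classifier.

    In a real system this would be a model or service that labels the *goal*
    of the action (DESTRUCTIVE_DATA, DATA_EXFILTRATION, ...) regardless of
    phrasing. Here it is a trivial keyword stub so the sketch runs alone.
    """
    text = f"{action} {params}".lower()
    if "delete" in text or "drop" in text:
        return "DESTRUCTIVE_DATA"
    return "BENIGN"

def check_action(action: str, params: dict) -> str:
    """Layer 1: deterministic pattern matching. Layer 2: intent classification."""
    payload = f"{action} {params}"
    if any(re.search(pattern, payload) for pattern in BLOCKED_PATTERNS):
        return "BLOCK"                    # hard rule hit, stop immediately
    dangerous = {"DESTRUCTIVE_DATA", "DATA_EXFILTRATION", "PRIVILEGE_ESCALATION"}
    if classify_intent(action, params) in dangerous:
        return "REQUIRE_APPROVAL"         # route to a human reviewer
    return "ALLOW"

print(check_action("run_sql", {"query": "DELETE FROM users WHERE id = 1"}))
```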
Agent Mesh: Zero-Trust Identity for Agents

In service mesh architectures, services prove their identity via mTLS certificates before communicating. Agent Mesh applies the same principle to AI agents using decentralized identifiers (DIDs) with Ed25519 cryptography and the Inter-Agent Trust Protocol (IATP):

```python
from agentmesh import AgentIdentity, TrustBridge

identity = AgentIdentity.create(
    name="data-analyst",
    sponsor="alice@company.com",   # Human accountability
    capabilities=["read:data", "write:reports"],
)
# identity.did -> "did:mesh:data-analyst:a7f3b2..."

bridge = TrustBridge()
verification = await bridge.verify_peer(
    peer_id="did:mesh:other-agent",
    required_trust_score=700,      # Must score >= 700/1000
)
```

A critical feature is trust decay: an agent's trust score decreases over time without positive signals. An agent trusted last week but silent since then gradually becomes untrusted, modeling the reality that trust requires ongoing demonstration, not a one-time grant. Delegation chains enforce scope narrowing: a parent agent with read+write permissions can delegate only read access to a child agent, never escalate.
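To illustrate the trust decay idea, here is a minimal sketch using exponential decay with a half-life. The decay curve, the half-life value, and the function names are assumptions chosen for illustration; the toolkit's actual decay model and parameters are policy decisions, not shown in this post.

```python
import math
import time

# Illustrative half-life: without positive signals, a trust score halves
# every 7 days. The real decay curve and parameters are configurable policy.
HALF_LIFE_SECONDS = 7 * 24 * 3600

def decayed_trust(score: float, last_positive_signal: float,
                  now: float | None = None) -> float:
    """Exponentially decay a trust score based on time since the last signal."""
    now = now if now is not None else time.time()
    elapsed = max(0.0, now - last_positive_signal)
    return score * math.exp(-math.log(2) * elapsed / HALF_LIFE_SECONDS)

# An agent that scored 850 two weeks ago, with no positive signals since,
# decays to roughly 212 -- below a 700/1000 verification threshold.
two_weeks_ago = time.time() - 14 * 24 * 3600
print(round(decayed_trust(850, two_weeks_ago)))
```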
Agent Hypervisor: Execution Rings

CPU architectures use privilege rings (Ring 0 for kernel, Ring 3 for userspace) to isolate workloads. The Agent Hypervisor applies this model to AI agents:

| Ring | Trust Level | Capabilities |
|---|---|---|
| Ring 0 (Kernel) | Score ≥ 900 | Full system access, can modify policies |
| Ring 1 (Supervisor) | Score ≥ 700 | Cross-agent coordination, elevated tool access |
| Ring 2 (User) | Score ≥ 400 | Standard tool access within assigned scope |
| Ring 3 (Untrusted) | Score < 400 | Read-only, sandboxed execution only |

New and untrusted agents start in Ring 3 and earn their way up, exactly the principle of least privilege that production engineers apply to every other workload. Each ring enforces per-agent resource limits: maximum execution time, memory caps, CPU throttling, and request rate limits. If a Ring 2 agent attempts a Ring 1 operation, it gets blocked, just like a userspace process trying to access kernel memory.

These ring definitions and their associated trust score thresholds are fully configurable via policy. Organizations can define custom ring structures, adjust the number of rings, set different trust score thresholds for transitions, and configure per-ring resource limits to match their security requirements.

The hypervisor also provides saga orchestration for multi-step operations. When an agent executes a sequence such as draft email → send → update CRM and the final step fails, compensating actions fire in reverse. Borrowed from distributed transaction patterns, this ensures multi-agent workflows maintain consistency even when individual steps fail.

Agent SRE: SLOs and Circuit Breakers for Agents

If you practice SRE, you measure services by SLOs and manage risk through error budgets. Agent SRE extends this to AI agents: when an agent's safety SLI drops below 99 percent, meaning more than 1 percent of its actions violate policy, the system automatically restricts the agent's capabilities until it recovers. This is the same error-budget model that SRE teams use for production services, applied to agent behavior.

We also built nine chaos engineering fault injection templates, including network delays, LLM provider failures, tool timeouts, trust score manipulation, memory corruption, and concurrent access races. The only way to know whether your agent system is resilient is to break it intentionally.

Agent SRE integrates with your existing observability stack through adapters for Datadog, PagerDuty, Prometheus, OpenTelemetry, Langfuse, LangSmith, Arize, MLflow, and more. Message broker adapters support Kafka, Redis, NATS, Azure Service Bus, AWS SQS, and RabbitMQ.
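The safety SLI and error-budget mechanics translate naturally into code. Below is a minimal sketch of the idea, assuming a rolling window of action counts; the thresholds, the enforcement modes, and all names here are illustrative assumptions rather than the toolkit's actual API.

```python
from dataclasses import dataclass

@dataclass
class SafetyWindow:
    """Rolling counts of agent actions and policy violations."""
    total_actions: int
    policy_violations: int

SAFETY_SLO = 0.99  # target: at least 99% of actions are policy-compliant

def error_budget_remaining(window: SafetyWindow) -> float:
    """Fraction of the error budget still available (1.0 = untouched)."""
    allowed_failures = (1 - SAFETY_SLO) * window.total_actions
    if allowed_failures == 0:
        return 1.0
    return max(0.0, 1 - window.policy_violations / allowed_failures)

def capability_mode(window: SafetyWindow) -> str:
    """Map error-budget burn to an illustrative enforcement response."""
    remaining = error_budget_remaining(window)
    if remaining <= 0:
        return "restricted"   # e.g., drop to read-only, sandboxed operation
    if remaining < 0.25:
        return "degraded"     # e.g., require approval for sensitive actions
    return "normal"

# 10,000 actions with 130 violations -> safety SLI ~98.7%, budget exhausted.
print(capability_mode(SafetyWindow(total_actions=10_000, policy_violations=130)))
```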
Compliance and Observability

If your organization already maps to CIS Benchmarks, NIST AI RMF, or other frameworks for infrastructure compliance, the OWASP Agentic Top 10 is the equivalent standard for AI agent workloads. The toolkit's agent-compliance package provides automated governance grading against these frameworks.

The toolkit is framework-agnostic, with 20+ adapters that hook into each framework's native extension points, so adding governance to an existing agent is typically a few lines of configuration, not a rewrite.

The toolkit exports metrics to any OpenTelemetry-compatible platform: Prometheus, Grafana, Datadog, Arize, or Langfuse. If you're already running an observability stack for your infrastructure, agent governance metrics flow through the same pipeline. Key metrics include policy decisions per second, trust score distributions, ring transitions, SLO burn rates, circuit breaker state, and governance workflow latency.

Getting Started

```bash
# Install all packages
pip install agent-governance-toolkit[full]

# Or individual packages
pip install agent-os-kernel agent-mesh agent-sre
```

The toolkit is available across language ecosystems: Python, TypeScript (`@microsoft/agentmesh-sdk` on npm), Rust, Go, and .NET (`Microsoft.AgentGovernance` on NuGet).

Azure Integrations

While the toolkit is platform-agnostic, we've included integrations that provide the fastest path to production on Azure:

- Azure Kubernetes Service (AKS): Deploy the policy engine as a sidecar container alongside your agents. Helm charts provide production-ready manifests for agent-os, agent-mesh, and agent-sre.
- Azure AI Foundry Agent Service: Use the built-in middleware integration for agents deployed through Azure AI Foundry.
- OpenClaw Sidecar: One compelling deployment scenario is running OpenClaw, the open-source autonomous agent, inside a container with the Agent Governance Toolkit deployed as a sidecar. This gives you policy enforcement, identity verification, and SLO monitoring over OpenClaw's autonomous operations. On AKS, the deployment is a standard pod with two containers: OpenClaw as the primary workload and the governance toolkit as the sidecar, communicating over localhost. A reference architecture and Helm chart are available in the repository. The same sidecar pattern works with any containerized agent; OpenClaw is a particularly compelling example because of the interest in autonomous agent safety.

Tutorials and Resources

34+ step-by-step tutorials covering policy engines, trust, compliance, MCP security, observability, and cross-platform SDK usage are available in the repository.

```bash
git clone https://github.com/microsoft/agent-governance-toolkit
cd agent-governance-toolkit
pip install -e "packages/agent-os[dev]" -e "packages/agent-mesh[dev]" -e "packages/agent-sre[dev]"

# Run the demo
python -m agent_os.demo
```

What's Next

AI agents are becoming autonomous decision-makers in production infrastructure, executing code, managing databases, and orchestrating services. The security patterns that kept production systems safe for decades (least privilege, mandatory access controls, process isolation, audit logging) are exactly what these new workloads need. We built them. They're open source.

We're building this in the open because agent security is too important for any single organization to solve alone:

- Security research: Adversarial testing, red-team results, and vulnerability reports strengthen the toolkit for everyone.
- Community contributions: Framework adapters, detection rules, and compliance mappings from the community expand coverage across ecosystems.

We are committed to open governance. We're releasing this project under Microsoft today, and we aspire to move it into a foundation home, such as the AI and Data Foundation (AAIF), where it can benefit from cross-industry stewardship. We're actively engaging with foundation partners on this path.

The Agent Governance Toolkit is open source under the MIT license. Contributions welcome at github.com/microsoft/agent-governance-toolkit.

Join us at Microsoft Azure Infra Summit 2026 for deep technical Azure infrastructure content
Microsoft Azure Infra Summit 2026 is a free, engineering-led virtual event created for IT professionals, platform engineers, SREs, and infrastructure teams who want to go deeper on how Azure really works in production. It will take place May 19-21, 2026. This event is built for the people responsible for keeping systems running, making sound architecture decisions, and dealing with the operational realities that show up long after deployment day.

Over the past year, one message has come through clearly from the community: infrastructure and operations audiences want more in-depth technical content. They want fewer surface-level overviews and more practical guidance from the engineers and experts who build, run, and support these systems every day. That is exactly what Azure Infra Summit aims to deliver. All content is created AND delivered by engineering, targeting folks working with Azure infrastructure and operating production environments.

- Who is this for: IT professionals, platform engineers, SREs, and infrastructure teams
- When: May 19-21, 2026, 8:00 AM–1:00 PM Pacific Time, all 3 days
- Where: Online, virtual
- Cost: Free
- Level: Most sessions are advanced (L300-400)
- Register here: https://aka.ms/MAIS-Reg

Built for the people who run workloads on Azure

Azure Infra Summit is for the people who do more than deploy to Azure. It is for the people who run it. If your day involves uptime, patching, governance, monitoring, reliability, networking, identity, storage, or hybrid infrastructure, this event is for you. Whether you are an IT professional managing enterprise environments, a platform engineer designing landing zones, an Azure administrator, an architect, or an SRE responsible for resilience and operational excellence, you will find content built with your needs in mind.

We are intentionally shaping this event around peer-to-peer technical learning. That means engineering-led sessions, practical examples, and candid discussion about architecture, failure modes, operational tradeoffs, and what breaks in production. The promise here is straightforward: less fluff, more infrastructure.

What to expect

Azure Infra Summit will feature deep technical content in the 300 to 400 level range, with sessions designed by engineering to help you build, operate, and optimize Azure infrastructure more effectively. The event will include a mix of live and pre-recorded sessions and live Q&A. Throughout the three days, we will dig into topics such as:

- Hybrid operations and management
- Networking at scale
- Storage, backup, and disaster recovery
- Observability, SLOs, and day-2 operations
- Confidential compute
- Architecture, automation, governance, and optimization in Azure Core environments
- And more

The goal is simple: to give you practical guidance you can take back to your environment and apply right away. We want attendees to leave with stronger mental models, a better understanding of how Azure behaves in the real world, and clearer patterns for designing and operating infrastructure with confidence.

Why this event matters

Infrastructure decisions have a long tail. The choices we make around architecture, operations, governance, and resilience show up later in the form of performance issues, outages, cost, complexity, and recovery challenges. That is why deep technical learning matters, and why events like this matter.

Join us

I hope you will join us for Microsoft Azure Infra Summit 2026, happening May 19-21, 2026.
If you care about how Azure infrastructure behaves in the real world, and you want practical, engineering-led guidance on how to build, operate, and optimize it, this event was built for you.

Register here: https://aka.ms/MAIS-Reg

Cheers!
Pierre Roman

How Do We Know AI Isn’t Lying? The Art of Evaluating LLMs in RAG Systems
🔍 1. Why Evaluating LLM Responses is Hard

In classical programming, correctness is binary.

| Input | Expected | Result |
|---|---|---|
| 2 + 2 | 4 | ✔ Correct |
| 2 + 2 | 5 | ✘ Wrong |

Software is deterministic: the same input produces the same output. LLMs are probabilistic. They generate one of many valid word combinations, like forming sentences from multiple possible synonyms and sentence structures.

Example prompt: "Explain gravity like I'm 10". Possible responses:

| Response A | Response B |
|---|---|
| Gravity is a force that pulls everything to Earth. | Gravity bends space-time, causing objects to attract. |

Both are correct. Which is better? It depends on the audience. So evaluation needs to look beyond text similarity. We must check:

- ✔ Is the answer meaningful?
- ✔ Is it correct?
- ✔ Is it easy to understand?
- ✔ Does it follow prompt intent?

Testing LLMs is like grading essays, not checking numeric outputs.

🧠 2. Why RAG Evaluation is Even Harder

RAG introduces an additional layer: retrieval. The model no longer answers from memory; it must first read context, then summarise it. Evaluation now has multiple dimensions:

| Evaluation Layer | What we must verify |
|---|---|
| Retrieval | Did we fetch the right documents? |
| Understanding | Did the model interpret the context correctly? |
| Grounding | Is the answer based on retrieved data? |
| Generation Quality | Is the final response complete and clear? |

A simple story makes this intuitive: a teacher asks a student to explain photosynthesis. The student goes to the library → selects a book → reads → writes an explanation. We must evaluate:

- Did they pick the right book? → Retrieval
- Did they understand the topic? → Reasoning
- Did they copy facts correctly without inventing? → Faithfulness
- Is the written explanation clear enough for another child to learn from? → Answer Quality

One failure → total failure.

🧩 3. Two Types of Evaluation

🔹 Intrinsic Evaluation — Quality of the Response Itself

Here we judge the answer itself, ignoring real-world impact. We check:

- ✔ Grammar & coherence
- ✔ Completeness of explanation
- ✔ No hallucination
- ✔ Logic flow & clarity
- ✔ Semantic correctness

This is similar to checking how well the essay is written. Even if the result did not solve the real problem, the answer could still look good — that's why intrinsic evaluation alone is not enough.

🔹 Extrinsic Evaluation — Did It Achieve the Goal?

This measures task success. If a customer support bot writes a beautifully worded paragraph, but the user still doesn't get their refund, it failed extrinsically. Examples:

| System Type | Extrinsic Goal |
|---|---|
| Banking RAG bot | Did the user get the correct KYC procedure? |
| Medical RAG | Was the advice safe & factual? |
| Legal search assistant | Did it return the right section of the law? |
| Technical summariser | Did the summary capture the key meaning? |

Intrinsic = writing quality. Extrinsic = impact quality. A production-grade RAG system must satisfy both.

📏 4. Core RAG Evaluation Metrics (Explained with Very Simple Analogies)

| Metric | Meaning | Analogy |
|---|---|---|
| Relevance | Does the answer match the question? | Ask "who invented C++?" → model talks about Java ❌ |
| Faithfulness | No invented facts | Book says it started in 2004, response says 1990 ❌ |
| Groundedness | Answer traceable to sources | Claims facts that don't exist in the context ❌ |
| Completeness | Covers all parts of the question | User asks Windows vs Linux → only explains Windows |
| Context Recall / Precision | Correct docs retrieved & used | Student opens the wrong chapter |
| Hallucination Rate | Degree of made-up info | "Taj Mahal is in London" 😱 |
| Semantic Similarity | Meaning-level match | "Engine died" = "Car stopped running" |

💡 Good evaluation doesn't check exact wording. It checks meaning + truth + usefulness.
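Semantic similarity is the easiest of these metrics to demo in code: instead of comparing strings, you compare meaning vectors. Below is a minimal, illustrative sketch using cosine similarity over embeddings. The `embed` function is a placeholder stub so the sketch runs on its own; in practice you would call a real embedding model, for example an Azure OpenAI embeddings deployment.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function.

    In a real pipeline this would call an embedding model and return its
    vector. Here we derive a pseudo-random vector from the text so the
    sketch is self-contained and runnable.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts' embeddings (closer to 1.0 = closer meaning)."""
    va, vb = embed(a), embed(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# With real embeddings, paraphrases like these score close to 1.0 even though
# they share almost no words, while unrelated sentences score near 0.
print(semantic_similarity("The engine died", "The car stopped running"))
```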
🛠 5. Tools for RAG Evaluation

🔹 1. RAGAS — Foundation for RAG Scoring

RAGAS evaluates responses based on:

- ✔ Faithfulness
- ✔ Relevance
- ✔ Context recall
- ✔ Answer similarity

Think of RAGAS as a teacher grading with a rubric. It reads both the answer and the source documents, then scores based on truthfulness and alignment.

🔹 2. LangChain Evaluators

LangChain offers multiple evaluation types:

| Type | What it checks |
|---|---|
| String or regex | Basic keyword presence |
| Embedding-based | Meaning similarity, not text match |
| LLM-as-a-Judge | AI evaluates AI (deep reasoning) |

LangChain = testing toolbox. RAGAS = grading framework. Together they form a complete QA ecosystem.

🔹 3. PyTest + CI for Automated LLM Testing

Instead of manually validating outputs, we automate:

- Feed preset questions to the RAG system
- Capture answers
- Run RAGAS/LangChain scoring
- Fail the test if hallucination > threshold

This brings AI closer to software-engineering discipline. RAG systems stop being experiments — they become testable, trackable, production-grade products.
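A minimal sketch of what such a CI gate might look like follows. The `ask_rag` and `score_faithfulness` helpers are hypothetical stand-ins for your own RAG pipeline and whichever scorer you choose (RAGAS, a LangChain evaluator, or a custom judge), and the threshold is an assumed quality bar you would tune for your system.

```python
import pytest

# Hypothetical helpers: replace with your RAG pipeline and scorer of choice.
def ask_rag(question: str) -> dict:
    """Return the generated answer plus the retrieved context chunks."""
    raise NotImplementedError("wire this to your RAG system")

def score_faithfulness(answer: str, contexts: list[str]) -> float:
    """Return a 0-1 faithfulness score (e.g., from RAGAS or an LLM judge)."""
    raise NotImplementedError("wire this to your evaluator")

GOLDEN_QUESTIONS = [
    "What is our refund policy for annual plans?",
    "Which regions is the service available in?",
]

FAITHFULNESS_THRESHOLD = 0.8  # assumed quality bar; tune for your system

@pytest.mark.parametrize("question", GOLDEN_QUESTIONS)
def test_answers_are_grounded(question):
    result = ask_rag(question)
    score = score_faithfulness(result["answer"], result["contexts"])
    assert score >= FAITHFULNESS_THRESHOLD, (
        f"Faithfulness {score:.2f} below {FAITHFULNESS_THRESHOLD} for: {question}"
    )
```

Running this suite in CI turns hallucination regressions into failing builds instead of production surprises.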
🚀 6. The Future: LLM-as-a-Judge

The future of evaluation is simple: LLMs will evaluate other LLMs. One model writes an answer. Another model checks:

- ✔ Was it truthful?
- ✔ Was it relevant?
- ✔ Did it follow context?

This enables:

| Benefit | Why it matters |
|---|---|
| Scalable evaluation | No humans needed for every query |
| Continuous improvement | The model learns from mistakes |
| Real-time scoring | Detect errors before the user sees them |

This is like autopilot for AI systems: not only navigating, but self-correcting mid-flight. And that is where enterprise AI is headed.

🎯 Final Summary

Evaluating LLM responses is not checking whether strings match. It is checking whether the machine:

- ✔ Understood the question
- ✔ Retrieved relevant knowledge
- ✔ Avoided hallucination
- ✔ Provided complete, meaningful reasoning
- ✔ Grounded the answer in real source text

RAG evaluation demands multi-layer validation: retrieval, reasoning, grounding, semantics, safety. Frameworks like RAGAS, LangChain evaluators, and PyTest pipelines are shaping the discipline of measurable, reliable AI — pushing LLM-powered RAG from cool demo to trustworthy enterprise intelligence.

Useful Resources

- What is Retrieval-Augmented Generation (RAG): https://azure.microsoft.com/en-in/resources/cloud-computing-dictionary/what-is-retrieval-augmented-generation-rag/
- Retrieval-Augmented Generation concepts (Azure AI): https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/retrieval-augmented-generation
- RAG with Azure AI Search – Overview: https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview
- Evaluate Generative AI Applications (Microsoft Learn – Learning Path): https://learn.microsoft.com/en-us/training/paths/evaluate-generative-ai-apps/
- Evaluate Generative AI Models in Microsoft Foundry Portal: https://learn.microsoft.com/en-us/training/modules/evaluate-models-azure-ai-studio/
- RAG Evaluation Metrics (Relevance, Groundedness, Faithfulness): https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/rag-evaluators
- RAGAS – Evaluation Framework for RAG Systems: https://docs.ragas.io/

Building Production-Ready, Secure, Observable AI Agents with Real-Time Voice with Microsoft Foundry

We're excited to announce the general availability of Foundry Agent Service, Observability in Foundry Control Plane, and the Microsoft Foundry portal, plus Voice Live integration with Agent Service in public preview, giving teams a production-ready platform to build, deploy, and operate intelligent AI agents with enterprise-grade security and observability.

Generally Available: Evaluations, Monitoring, and Tracing in Microsoft Foundry
If you've shipped an AI agent to production, you've likely run into the same uncomfortable realization: the hard part isn't getting the agent to work, it's keeping it working. Models get updated, prompts get tweaked, retrieval pipelines drift, and user traffic surfaces edge cases that never appeared in your eval suite. Quality isn't something you establish once. It's something you have to continuously measure.

Today, we're making that continuous measurement a first-class operational capability. Evaluations, Monitoring, and Tracing in Microsoft Foundry are now generally available through Foundry Control Plane. These aren't standalone tools bolted onto the side of the platform; they're deeply integrated with Azure Monitor, which means AI agent observability now lives in the same operational plane as the rest of your infrastructure.

The Problem With Point-in-Time Evaluation

Most evaluation workflows are designed around a pre-deployment gate. You build a test dataset, run your evals, review the scores, and ship. That approach has real value, but it has a hard ceiling. In production, agent behavior is a function of many things that change independently of your code:

- Foundation model updates ship continuously and can shift output style, reasoning patterns, and edge case handling in ways that don't always surface on your benchmark set.
- Prompt changes can have nonlinear effects downstream, especially in multi-step agentic flows.
- Retrieval pipeline drift changes what context your agent actually sees at inference time. A document index that was fresh last month may have stale or subtly different content today.
- Real-world traffic distribution is never exactly what you sampled for your test set. Production surfaces long-tail inputs that feel obvious in hindsight but were invisible during development.

The implication is straightforward: evaluation has to be continuous, not episodic. You need quality signals at development time, at every CI/CD commit, and continuously against live production traffic, all using the same evaluator definitions so results are comparable across environments. That's the core design principle behind Foundry Observability.

Continuous Evaluation Across the Full AI Lifecycle

Built-In Evaluators

Foundry's built-in evaluators cover the most critical quality and safety dimensions for production agent systems:

- Coherence and Relevance measure whether responses are internally consistent and on-topic relative to the input. These are table-stakes signals for any conversational or task-completion agent.
- Groundedness is particularly important for RAG-based architectures. It measures whether the model's output is actually supported by the retrieved context, as opposed to plausible-sounding content the model generated from its parametric memory. Groundedness failures are a leading indicator of hallucination risk in production, and they're often invisible to human reviewers at scale.
- Retrieval Quality evaluates the retrieval step independently from generation. Groundedness failures can originate in two places: the model may be ignoring good context, or the retrieval pipeline may not be surfacing relevant context in the first place. Splitting these signals makes it much easier to pinpoint root cause.
- Safety and Policy Alignment evaluates whether outputs meet your deployment's policy requirements: content safety, topic restrictions, response format compliance, and similar constraints.

These evaluators are designed to run at every stage of the AI lifecycle:

- Local development: run evals inline as you iterate on prompts, retrieval config, or orchestration logic
- CI/CD pipelines: gate every commit against your quality baselines and catch regressions before they reach production
- Production traffic monitoring: continuously evaluate sampled live traffic and surface trends over time

Because the evaluators are identical across all three contexts, a score in CI means the same thing as a score in production monitoring. See the Practical Guide to Evaluations and the Built-in Evaluators Reference for a deeper walkthrough.
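For local development, a run might look roughly like the sketch below. It assumes the azure-ai-evaluation Python package and an Azure OpenAI deployment to power the judge model; the SDK surface evolves, so treat the exact names, configuration keys, and result shape as assumptions and confirm them against the Practical Guide to Evaluations.

```python
# Hedged sketch: assumes the azure-ai-evaluation package
# (pip install azure-ai-evaluation). Endpoint, key, and deployment values
# are placeholders you must supply.
from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-judge-model-deployment>",
}

groundedness = GroundednessEvaluator(model_config)

# Score a single query/context/response triple produced by your agent.
result = groundedness(
    query="What is the refund window for annual plans?",
    context="Annual plans can be refunded within 30 days of purchase.",
    response="You can request a refund within 30 days of buying an annual plan.",
)
print(result)  # expected: a dict containing a groundedness score
```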
Custom Evaluators: Encoding Your Own Definition of Quality

Built-in evaluators cover common signals well, but production agents often need to satisfy criteria specific to a domain, regulatory environment, or internal standard. Foundry supports two types of custom evaluators (currently in public preview):

LLM-as-a-Judge evaluators let you configure a prompt and grading rubric, then use a language model to apply that rubric to your agent's outputs. This is the right approach for quality dimensions that require reasoning or contextual judgment: whether a response appropriately acknowledges uncertainty, whether a customer-facing message matches your brand tone, or whether a clinical summary meets documentation standards. You write a judge prompt with a scoring scale (e.g., 1 to 5 with criteria for each level) that evaluates a given {input} / {response} pair. Foundry runs this at scale and aggregates scores into your dashboards alongside built-in results.

Code-based evaluators are Python functions that implement any evaluation logic you can express programmatically: regex matching, schema validation, business rule checks, compliance assertions, or calls to external systems. If your organization has documented policies about what a valid agent response looks like, you can encode those policies directly into your evaluation pipeline.

Custom and built-in evaluators compose naturally, running against the same traffic, producing results in the same schema, and feeding into the same dashboards and alert rules.
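As an illustration of the code-based flavor, here is a minimal sketch of an evaluator that enforces two simple, made-up internal policies (responses must not leak email addresses and must stay under a length limit). The function shape and the returned score format are assumptions for illustration, not the exact contract Foundry expects; check the evaluator documentation for the required signature.

```python
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
MAX_RESPONSE_CHARS = 2000  # assumed internal policy limit

def policy_compliance_evaluator(*, response: str, **kwargs) -> dict:
    """Score a response against two illustrative internal policies.

    Returns a 0-1 score plus the individual check results so that failures
    remain explainable when they show up in dashboards.
    """
    checks = {
        "no_email_leak": EMAIL_PATTERN.search(response) is None,
        "within_length_limit": len(response) <= MAX_RESPONSE_CHARS,
    }
    score = sum(checks.values()) / len(checks)
    return {"policy_compliance": score, **checks}

print(policy_compliance_evaluator(
    response="Please contact support and reference ticket 4821."
))
# {'policy_compliance': 1.0, 'no_email_leak': True, 'within_length_limit': True}
```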
Monitoring and Alerting: AI Quality as an Operational Signal

All observability data produced by Foundry - evaluation results, traces, latency, token usage, and quality metrics - is published directly to Azure Monitor. This is where the integration pays off for teams already on Azure. What this enables that siloed AI monitoring tools can't:

- Cross-stack correlation. When your groundedness score drops, is it a model update, a retrieval pipeline issue, or an infrastructure problem affecting latency? With AI quality signals and infrastructure telemetry in the same Azure Monitor Application Insights workspace, you can answer that in minutes rather than hours of manual correlation across disconnected systems.
- Unified alerting. Configure Azure Monitor alert rules on any evaluation metric: trigger a PagerDuty incident when groundedness drops below threshold, send a Teams notification when safety violations spike, or create automated runbook responses when retrieval quality degrades. These are the same alert mechanisms your SRE team already uses.
- Enterprise governance by default. Azure Monitor's RBAC, retention policies, diagnostic settings, and audit logging apply automatically to all AI observability data. You inherit the governance framework your organization has already built and approved.
- Grafana and existing dashboards. If your team uses Azure Managed Grafana, evaluation metrics can flow into existing dashboards alongside your other operational metrics: a single pane of glass for application health, infrastructure performance, and AI agent quality.

The Agent Monitoring Dashboard in the Foundry portal provides an AI-native view out of the box: evaluation metric trends, safety threshold status, quality score distributions, and latency breakdowns. Everything in that dashboard is backed by Azure Monitor data, so SRE teams can always drill deeper.

End-to-End Tracing: From Quality Signal to Root Cause

A groundedness score tells you something is wrong. A trace tells you exactly where the failure occurred and what the agent actually did.

Foundry provides OpenTelemetry-based distributed tracing that follows each request through your entire agent system: model calls, tool invocations, retrieval steps, orchestration logic, and cross-agent handoffs. Traces capture the full execution path, including inputs, outputs, latency at each step, tool call parameters and responses, and token usage.

The key design decision: evaluation results are linked directly to traces. When you see a low groundedness score in your monitoring dashboard, you navigate directly to the specific trace that produced it. There is no manual timestamp correlation and no separate trace ID lookup; the connection is made automatically.

Foundry auto-collects traces across the frameworks your agents are likely already built on:

- Microsoft Agent Framework
- Semantic Kernel
- LangChain and LangGraph
- OpenAI Agents SDK

For custom or less common orchestration frameworks, the Azure Monitor OpenTelemetry Distro provides an instrumentation path. Microsoft is also contributing upstream to the OpenTelemetry project: working with Cisco Outshift, we've contributed semantic conventions for multi-agent trace correlation, standardizing how agent identity, task context, and cross-agent handoffs are represented in OTel spans.

Note: Tracing is currently in public preview, with GA shipping by end of March.
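For a framework that isn't auto-instrumented, manual OpenTelemetry spans are the fallback. The sketch below uses the standard opentelemetry-api package with generic span and attribute names I've chosen for illustration; Foundry's agent semantic conventions define their own attribute keys, so treat these names as placeholders, and note that exporter configuration (for example via the Azure Monitor OpenTelemetry Distro) is omitted.

```python
# Hedged sketch using the standard OpenTelemetry Python API
# (pip install opentelemetry-api opentelemetry-sdk). Without an exporter
# configured, the spans below are no-ops; wire up your chosen distro to ship
# them to Azure Monitor or another backend.
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")  # placeholder instrumentation name

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.question", question)  # placeholder attribute key

        with tracer.start_as_current_span("agent.retrieval") as retrieval_span:
            docs = ["...retrieved chunks..."]                # stand-in for your retriever
            retrieval_span.set_attribute("retrieval.document_count", len(docs))

        with tracer.start_as_current_span("agent.llm_call") as llm_span:
            answer = "...model output..."                    # stand-in for the model call
            llm_span.set_attribute("llm.output_chars", len(answer))

        return answer

print(answer_question("What is the refund window for annual plans?"))
```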
Prompt Optimizer (Public Preview)

One persistent friction point in agent development is the iteration loop between writing prompts and measuring their effect. You make a change, run your evals, look at the delta, try to infer what about the change mattered, and repeat.

Prompt Optimizer tightens this loop. It analyzes your existing prompt and applies structured prompt engineering techniques, such as clarifying ambiguous instructions, improving formatting for model comprehension, restructuring few-shot examples, and making implicit constraints explicit, with paragraph-level explanations for every change it makes.

The transparency is deliberate. Rather than producing a black-box "optimized" prompt, it shows you exactly what it changed and why. You can add constraints, trigger another optimization pass, and iterate until satisfied. When you're done, apply it with one click.

The value compounds alongside continuous evaluation: run your eval suite against the current prompt, optimize, run evals again, and see the measured improvement. That feedback loop of optimize, measure, optimize is the closest thing to a systematic approach to prompt engineering that currently exists.

What Makes Our Approach to Observability Different

There are other evaluation and observability tools in the AI ecosystem. The differentiation in Foundry's approach comes down to specific architectural choices:

- Unified lifecycle coverage, not just pre-deployment testing. Most existing evaluation tools are designed for offline, pre-deployment use. Foundry's evaluators run in the same form at development time, in CI/CD, and against live production traffic. Your quality metrics are actually comparable across the lifecycle: you can tell whether production quality matches what you saw in testing, rather than operating two separate measurement systems that can't be compared.
- No separate observability silo. Publishing all observability data to Azure Monitor means you don't operate a separate system for AI quality alongside your existing infrastructure monitoring. AI incidents route through your existing on-call rotations. AI quality data is subject to the same retention and compliance controls as the rest of your telemetry.
- Framework-agnostic tracing. Auto-instrumentation across Semantic Kernel, LangChain, LangGraph, and the OpenAI Agents SDK means you're not locked into a specific orchestration framework. The OpenTelemetry foundation means trace data is portable to any compatible backend, protecting your investment as the tooling landscape evolves.
- Composable evaluators. Built-in and custom evaluators run in the same pipeline, against the same traffic, producing results in the same schema, feeding into the same dashboards and alert rules. You don't choose between generic coverage and domain-specific precision; you get both.
- Evaluation linked to traces. Most systems treat evaluation and tracing as separate concerns. Foundry treats them as two views of the same event, closing the loop between detecting a quality problem and diagnosing it.

Getting Started

If you're building agents on Microsoft Foundry, or using Semantic Kernel, LangChain, LangGraph, or the OpenAI Agents SDK and want to add production observability, the entry point is Foundry Control Plane.

Try it: You'll need a Foundry project with an agent and an Azure OpenAI deployment. Enable observability by navigating to Foundry Control Plane and connecting your Azure Monitor workspace. Then walk through the Practical Guide to Evaluations, explore the Built-in Evaluators Reference, and set up end-to-end tracing for your agents.

Retina 1.0 Is Now Available
We are excited to announce the first major release of Retina, a significant milestone for the project. This version brings many new features, enhancements, and bug fixes. The Retina maintainer team would like to thank all contributors, community members, and early adopters who helped make this 1.0 release possible.

What is Retina?

Retina is an open-source Kubernetes network observability platform. It enables you to continuously observe and measure network health, and to investigate network issues on demand with integrated Kubernetes-native workflows.

Why Retina?

Kubernetes networking failures are rarely isolated or easy to reproduce. Pods are ephemeral, services span multiple nodes, and network traffic crosses multiple layers (CNI, kube-proxy, node networking, policies), making crucial evidence difficult to capture. Manually connecting to nodes and stitching together logs or packet captures simply does not scale as clusters grow in size and complexity. A modern approach to observability must automate and centralize data collection while exposing rich, actionable insights.

Retina represents a major step forward in solving the complexities of Kubernetes observability by leveraging the power of eBPF. Its cloud-agnostic design, deep integration with Hubble, and support for both real-time metrics and on-demand packet captures make it an invaluable tool for DevOps, SecOps, and compliance teams across diverse environments.

What Does It Do?

Retina can collect two types of telemetry: metrics and packet captures. The Retina shell enables ad-hoc troubleshooting via pre-installed networking tools.

Metrics

Metrics provide continuous observability. They can be exported to multiple storage options such as Prometheus or Azure Monitor, and visualized in a variety of ways, including Grafana or Azure Log Analytics. Retina supports two control planes, Hubble and Standard, and both are supported regardless of the underlying CNI. The choice of control plane determines which metrics are collected (Hubble metrics or Standard metrics), and you can further customize which metrics are collected by enabling or disabling their corresponding plugins. Examples of metrics include:

- Incoming/outgoing traffic
- Dropped packets
- TCP/UDP
- DNS
- API server latency
- Node/interface statistics

Packet Captures

Captures provide on-demand observability. They allow users to perform distributed packet captures across the cluster, based on specified Nodes/Pods and other supported filters. They can be triggered via the CLI or through the capture CRD, and may be output to persistent storage options such as the host filesystem, a PVC, or a storage blob. The result of a capture contains more than just a .pcap file: Retina also captures networking metadata such as iptables rules, socket statistics, kernel network information from /proc/net, and more.

Shell

The Retina shell enables deep ad-hoc troubleshooting by providing a suite of networking tools. The CLI command starts an interactive shell on a Kubernetes node that runs a container image which includes standard tools such as ping or curl, as well as specialized tools like bpftool, pwru, Inspektor Gadget, and more. The Retina shell is currently only available on Linux. Note that some tools require particular capabilities to execute; these can be passed as parameters through the CLI.

Use Cases

- Debugging Pod Connectivity Issues: When services can't communicate, Retina enables rapid, automated distributed packet capture and drop metrics, drastically reducing troubleshooting time. The Retina shell also brings specialized tools for deep manual investigations.
- Continuous Monitoring of Network Health: Operators can set up alerts and dashboards for DNS failures, API server latency, or packet drops, gaining ongoing visibility into cluster networking.
- Security Auditing and Compliance: Flow logs (in Hubble mode) and metrics support security investigations and compliance reporting, enabling quick identification of unexpected connections or data transfers.
- Multi-Cluster / Multi-Cloud Visibility: Retina standardizes network observability across clouds, supporting unified dashboards and processes for SRE teams.

Where Does It Run?

Retina is designed for broad compatibility across Kubernetes distributions, cloud providers, and operating systems. There are no Azure-specific dependencies; Retina runs anywhere Kubernetes does.

- Operating Systems: Both Linux and Windows nodes are supported.
- Kubernetes Distributions: Retina is distribution-agnostic, deployable on managed services (AKS, EKS, GKE) or self-managed clusters.
- CNI / Network Stack: Retina works with any CNI, focusing on kernel-level events rather than CNI-specific logs.
- Cloud Integration: Retina exports metrics to Azure Monitor and Log Analytics, with pre-built Grafana dashboards for AKS. Integration with AWS CloudWatch or GCP Stackdriver is possible via Prometheus.
- Observability Stacks: Retina integrates with Prometheus & Grafana and Cilium Hubble (for flow logs and UI), and can be extended to other exporters.

Design Overview

Retina's architecture consists of two layers: a data collection layer in kernel space, and a processing layer that converts low-level signals into Kubernetes-aware telemetry in user space. When Retina is installed, each node in the cluster runs a Retina agent which collects raw network telemetry from the host kernel, backed by eBPF on Linux and HNS/VFP on Windows. The agent processes the raw network data and enriches it with Kubernetes metadata, which is then exported for consumption by monitoring tools such as Prometheus, Grafana, or Hubble UI.

Modularity and extensibility are central to the design philosophy. Retina's plugin model lets you enable only the telemetry you need, and add new sources by implementing a common plugin interface. Built-in plugins include Drop Reason, DNS, Packet Forward, and more. Check out our architecture docs for a deeper dive into Retina's design.

Get Started

Thanks to Helm charts, deploying Retina is streamlined across all environments and can be done with one configurable command. For complete documentation, visit our installation docs. To install Retina with the Standard control plane and Basic metrics mode:

```bash
VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
    --version $VERSION \
    --namespace kube-system \
    --set image.tag=$VERSION \
    --set operator.tag=$VERSION \
    --set logLevel=info \
    --set operator.enabled=true \
    --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\]"
```

Once Retina is running in your cluster, you can configure Prometheus and Grafana to scrape and visualize your metrics. Install the Retina CLI with Krew:

```bash
kubectl krew install retina
```

Get Involved

Retina is open source under the MIT License and welcomes community contributions. Since its announcement in early 2024, the project has gained significant traction, with contributors from multiple organizations helping to expand its capabilities.
The project is hosted at GitHub · microsoft/retina and documentation is available at retina.sh. If you would like to contribute to Retina, you can follow our contributor guide.

What's Next?

Retina 1.1, of course! We are also discussing the future roadmap and exploring the possibility of moving the project to community ownership. Stay tuned! In the meantime, we welcome you to raise an issue if you find any bugs, or start a discussion if you have any questions or suggestions. You can also reach out to the Retina team via email; we would love to hear from you!

References

- Retina
- Deep Dive into Retina: Open-Source Kubernetes Network Observability
- Troubleshooting Network Issues with Retina
- Retina: Bridging Kubernetes Observability and eBPF Across the Clouds