# Shift-Left Governance for AI Agents: How the Agent Governance Toolkit Helps You Catch Violations
In part one of this series, we covered AGT's runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. That post focused on what happens when an agent acts: policy evaluation at the moment a tool call fires, trust scoring when agents communicate, audit logging when decisions are made.

Runtime governance is essential. But it is the last line of defense.

After that post went live, a pattern emerged in conversations with teams adopting AGT. The same question kept coming up: runtime checks are useful, but what about everything before production? We realized runtime governance was only half the story. So we went back and built tooling for every stage of your software development lifecycle, from the moment a developer saves a file to the moment an artifact ships to users.

## Why Runtime Governance Is Not Enough

AI agents are a new class of workload. They reason about what to do, select tools, call APIs, read databases, and spawn sub-processes, often in loops that run without direct human oversight. The OWASP Agentic AI Top 10 (published December 2025) identifies risks like excessive agency, insecure tool use, privilege escalation, and supply chain compromise. These risks span the entire lifecycle, not just runtime.

Consider a few scenarios that runtime governance alone cannot prevent:

- A developer commits a policy YAML file with a typo that silently disables all deny rules. The agent runs unprotected until someone notices.
- A dependency update introduces a package with a known critical CVE. The agent starts using a vulnerable library before any security team reviews it.
- A contributor adds a raw cryptographic import to an application module, bypassing the security-audited signing library. The code compiles and ships.
- A GitHub Actions workflow uses an expression injection pattern that allows an attacker to execute arbitrary code in CI.
- A release ships without a Software Bill of Materials (SBOM), making it impossible to trace which components are affected when the next log4j-style vulnerability drops.

Each of these is a governance failure, but none of them happens at runtime. They happen at commit time, at PR review time, at build time, or at release time. A comprehensive governance strategy needs coverage at every stage.

## Four Stages of Pre-Runtime Governance

Governance violations can enter a codebase at four distinct stages of the development lifecycle. Each stage has a different class of risk, and each needs a different kind of check:

| Stage | When It Runs | What It Catches | AGT Tooling |
|---|---|---|---|
| Commit-time | Before code leaves the developer machine | Malformed policies, schema violations, secrets, stub code, unauthorized crypto | Pre-commit hooks, quality gates |
| PR-time | When a pull request is opened or updated | Vulnerable dependencies, missing attestation, secrets in history, unpinned versions | GitHub Actions (attestation, dependency review, secret scanning, supply chain checks) |
| CI/Build-time | On every push and pull request to main | Compliance violations, binary security issues, dependency confusion, workflow injection | Governance Verify action, Security Scan action, CodeQL, BinSkim, policy validation |
| Release-time | Before artifacts are published | Missing provenance, unsigned artifacts, incomplete SBOMs | SBOM generation, Sigstore signing, build attestation, OpenSSF Scorecard |

Just as with bugs, the earlier you catch a governance violation, the cheaper it is to fix. A malformed policy file caught at commit time costs zero CI minutes. A secret caught in PR review never reaches the default branch. A dependency confusion attack blocked in CI never reaches production. An unsigned artifact blocked at release time never reaches users.

## Stage 1: Commit-Time Governance with Pre-Commit Hooks

The fastest governance feedback loop is local.
Within the AGT project, we've implemented three pre-commit hooks that run automatically whenever a developer stages files for commit, validating governance artifacts before they ever leave the developer's machine.

### Built-In Hooks

The toolkit's `.pre-commit-hooks.yaml` defines three hooks that any repository can adopt:

| Hook ID | What It Validates | File Pattern |
|---|---|---|
| validate-policy | YAML/JSON policy files against the AGT policy schema, checking for required fields, valid operators, and structural correctness | Files matching `*polic*.yaml`, `*polic*.yml`, `*polic*.json` |
| validate-plugin-manifest | Plugin manifest files for required fields and schema compliance | Files matching `plugin.json`, `plugin.yaml`, `plugin.yml` |
| evaluate-plugin-policy | Plugin manifests against a governance policy file, evaluating whether the plugin would be allowed under the organization's rules | Files matching `plugin.json`, `plugin.yaml`, `plugin.yml` |

To adopt these hooks, add AGT as a pre-commit hook source:

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/microsoft/agent-governance-toolkit
    rev: main  # pin to a release tag in production
    hooks:
      - id: validate-policy
      - id: validate-plugin-manifest
      - id: evaluate-plugin-policy
        args: ['--policy', 'policies/marketplace-policy.yaml']
```

Then install and run:

```shell
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

### Extended Quality Gates

Beyond schema validation, we built a pre-commit rollout template (see the full example in the repository) with additional governance-specific quality gates designed to help prevent common security anti-patterns from entering the codebase:

- Policy validation (`agt-validate`): Runs the full AGT policy CLI in strict mode, catching not just schema errors but semantic issues like conflicting rules.
- Health check (`agt-doctor`): Runs on pre-push (before code leaves the machine entirely), performing a broader health check of the governance configuration.
- Plugin metadata check (`agency-json-required`): Ensures every plugin directory contains the required `agency.json` metadata file.
- Stub detection (`no-stubs`): Blocks TODO, FIXME, HACK, and `raise NotImplementedError` markers in staged production code. Test files are excluded.
- Unauthorized crypto detection (`no-custom-crypto`): Blocks raw cryptographic imports (`hashlib`, `hmac`, `crypto.subtle`, `System.Security.Cryptography`, `ring`, `ed25519-dalek`) outside designated security modules. This helps ensure all cryptographic operations go through the audited AGT signing libraries.
- Secret scanning (`detect-secrets`): Integrates Yelp's detect-secrets for pattern-based secret detection on every commit.

### Phased Rollout for Teams

Adopting pre-commit hooks across a team requires a thoughtful rollout. The AGT documentation includes a phased adoption guide:

- Week 1: Install hooks in permissive mode. Hooks warn on violations but do not block the commit. This lets developers see what would be caught without disrupting workflow.
- Week 2: Switch to strict mode for policy validation only. Policy files must pass schema validation to be committed.
- Week 3: Enable all hooks as blocking. Stubs, unauthorized crypto, and secrets are now blocked at commit time.
- Week 4: Graduate to full blocking mode and remove the permissive fallback.

This approach helps teams build confidence in the governance tooling before it becomes a hard gate.

## Stage 2: PR-Time Gates

Pre-commit hooks catch issues on the developer's machine, but they can be bypassed (force push, direct GitHub edits, hooks not installed). PR-time gates provide the second layer of defense, running in GitHub Actions on every pull request before merge is allowed.

### Governance Attestation

The Governance Attestation action validates that PR authors have completed a structured attestation checklist before their code can merge.
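As an illustration, a checklist validation of this kind can be sketched in a few lines of Python. The section names, the `# ` header convention, and the checkbox regex below are assumptions for the example, not the attestation action's actual implementation:

```python
import re

# Hypothetical sketch of a PR-body attestation check. Section names, the
# '# ' header convention, and the checkbox regex are illustrative only.
REQUIRED_SECTIONS = ["Security review", "Privacy review", "Responsible AI review"]

def validate_attestation(pr_body: str, required=REQUIRED_SECTIONS):
    """Return (ok, errors). Every required section must be present and
    contain at least one checked checkbox ('- [x]')."""
    errors = []
    for section in required:
        # Capture the text between this section header and the next '#' header.
        match = re.search(re.escape(section) + r"(.*?)(?:\n#|\Z)", pr_body, re.DOTALL)
        if not match:
            errors.append(f"missing section: {section}")
        elif not re.search(r"-\s*\[x\]", match.group(1), re.IGNORECASE):
            errors.append(f"no checked items in section: {section}")
    return (not errors, errors)

body = """# Security review
- [x] Threat model updated
# Privacy review
- [x] No new data collected
# Responsible AI review
- [ ] RAI impact assessment completed
"""
ok, errors = validate_attestation(body)
# ok is False: the Responsible AI section has no checked items
```

The real action exposes the same idea through workflow inputs and outputs rather than a Python API.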
The default checklist covers seven sections:

- Security review
- Privacy review
- Legal review
- Responsible AI review
- Accessibility review
- Release Readiness / Safe Deployment
- Org-specific Launch Gates

The action is fully configurable. Organizations can customize the required sections, set a minimum PR body length, and choose their own attestation format. Outputs include the validation status, a list of errors for missing sections, and a JSON mapping of sections to checkbox counts. Here is an example workflow:

```yaml
# .github/workflows/pr-governance.yml
name: PR Governance
on:
  pull_request:
    types: [opened, edited, synchronize]
jobs:
  attestation:
    runs-on: ubuntu-latest
    steps:
      - uses: microsoft/agent-governance-toolkit/action/governance-attestation@main
        with:
          required-sections: |
            1) Security review
            2) Privacy review
            3) Responsible AI review
```

### Dependency Review

The dependency review workflow helps block PRs that introduce dependencies with known CVEs or disallowed licenses. It uses the GitHub dependency-review-action with a curated license allowlist:

```yaml
- uses: actions/dependency-review-action@v4
  with:
    fail-on-severity: moderate
    comment-summary-in-pr: always
    allow-licenses: >
      MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC,
      PSF-2.0, Python-2.0, 0BSD, Unlicense, CC0-1.0,
      CC-BY-4.0, Zlib, BSL-1.0, MPL-2.0
```

This runs on every PR that touches dependency manifests (`package.json`, `Cargo.toml`, `pyproject.toml`, `requirements.txt`). Dependencies with moderate or higher CVEs are flagged, and dependencies with licenses not on the allowlist are blocked.

### Secret Scanning

The secret scanning workflow runs on every PR to the main branch and on a weekly schedule. It combines two complementary approaches:

- Gitleaks: Pattern-based secret detection across the full git history, catching API keys, tokens, and credentials that may have been committed at any point.
- High-entropy string scanning: Regex-based detection of common secret patterns including GitHub tokens (`ghp_`, `gho_`), AWS access keys (`AKIA`), Slack tokens (`xox`), and base64-encoded strings with high entropy.

### Supply Chain Integrity

A dedicated supply chain check workflow triggers when dependency manifest files change. It enforces two rules that help prevent supply chain attacks:

- Exact version pinning: No `^` or `~` version ranges in package.json files. This prevents unexpected minor/patch version updates that could introduce compromised code.
- Lockfile presence: Every package directory with dependencies must have a corresponding lockfile (package-lock.json, pnpm-lock.yaml, or yarn.lock). Lockfiles help ensure reproducible builds with verified integrity hashes.

### Quality Gates

The quality gates workflow mirrors the pre-commit hooks at the PR level, providing defense in depth. It runs four checks on every pull request:

| Gate | Purpose |
|---|---|
| No Stubs/TODOs | Blocks TODO, FIXME, HACK markers in production code (test files excluded) |
| No Unauthorized Crypto | Blocks raw cryptographic imports outside designated security modules |
| Security Audit Required | Changes to security-sensitive paths require accompanying audit documentation |
| Dependency Audit Trail | Vendored patches must have an audit trail explaining the patch and its provenance |

These gates catch anything that bypasses pre-commit hooks: force-pushed commits, direct GitHub web edits, commits from contributors who have not installed the hooks.

## Stage 3: CI/Build-Time Governance

Once a PR passes the gate workflows, the main CI pipeline and specialized workflows perform deeper, more computationally intensive analysis.

### The Governance Verify Action

The Governance Verify action is the primary CI-time governance check. It is a GitHub Actions composite action that installs the toolkit and runs the compliance CLI against your repository.
It supports four modes:

| Command | What It Does |
|---|---|
| governance-verify | Runs the full compliance verification suite, checking governance controls and reporting how many pass |
| marketplace-verify | Validates a plugin manifest against marketplace requirements (required fields, signing, metadata) |
| policy-evaluate | Evaluates a specific policy file against a JSON context, returning the allow/deny decision with the matched rule |
| all | Runs governance-verify, then marketplace-verify and policy-evaluate if the corresponding paths are provided |

Here is an example:

```yaml
# .github/workflows/governance-ci.yml
name: Governance CI
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: all
          policy-path: policies/
          manifest-path: plugin.json
          output-format: json
          fail-on-warning: 'true'
```

The action outputs structured data including `controls-passed`, `controls-total`, `violations` count, and full command output in JSON format. This makes it straightforward to integrate with dashboards, Slack notifications, or downstream decision logic.

### The Security Scan Action

A separate security scan action scans directories for secrets, CVEs, and dangerous code patterns. Unlike the PR-time secret scanning (which focuses on git history), this action performs deep content analysis of the current codebase:

```yaml
- uses: microsoft/agent-governance-toolkit/action/security-scan@main
  with:
    paths: 'plugins/ scripts/'
    min-severity: high
    exemptions-file: .security-exemptions.json
```

The action supports configurable severity thresholds (critical, high, medium, low), an exemptions file for acknowledged findings, and structured JSON output with `findings-count`, `blocking-count`, and detailed findings.

### Policy Validation Workflow

A dedicated policy validation workflow triggers whenever YAML files or the policy engine source code changes.
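To make the policy-evaluate idea from the table above concrete, here is a toy sketch of first-match policy evaluation against a JSON context. The rule schema, operators, and field names are invented for this example; AGT's real policy schema differs:

```python
# Toy illustration of first-match policy evaluation against a JSON context.
# The rule schema, operators, and field names are invented for the example.
def evaluate(policy: dict, context: dict):
    """Return (decision, matched_rule_id). The first matching rule wins;
    the policy's default_action applies when nothing matches."""
    for rule in policy["rules"]:
        field, op, value = rule["field"], rule["operator"], rule["value"]
        actual = context.get(field)
        matched = (
            (op == "equals" and actual == value)
            or (op == "in" and actual in value)
        )
        if matched:
            return rule["action"], rule["id"]
    return policy.get("default_action", "deny"), None

policy = {
    "default_action": "deny",
    "rules": [
        {"id": "allow-readonly", "field": "tool", "operator": "in",
         "value": ["search", "read_file"], "action": "allow"},
        {"id": "block-shell", "field": "tool", "operator": "equals",
         "value": "shell", "action": "deny"},
    ],
}
decision, rule_id = evaluate(policy, {"tool": "shell"})
# decision == "deny", rule_id == "block-shell"
```

The actual CLI performs the same decision flow (`policy-evaluate` with a policy file and a JSON context) but against the validated AGT schema.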
The policy validation workflow performs two jobs in sequence:

1. Validate policies: Discovers all policy files matching the `*policy*` naming convention, then validates each file using the AGT policy CLI.
2. Test policies: Runs the policy CLI unit tests to verify that policy evaluation behavior is correct after the changes.

This ensures that policy file edits do not break the policy engine and that policy semantics are preserved.

### CodeQL and Static Analysis

AGT uses GitHub's CodeQL for semantic static analysis of Python and TypeScript code. The CodeQL workflow runs on pushes and PRs, performing deep dataflow analysis that goes beyond pattern matching. Results are uploaded as SARIF to GitHub's Security tab, providing a centralized view of code quality issues.

### Dependency Confusion Scanning

A dedicated CI job runs a dependency confusion scanner on every build. This is a targeted defense against a specific supply chain attack vector where an attacker registers a public package with the same name as an internal package. The scanner checks that:

- Internal package names do not collide with public PyPI or npm packages
- Notebook pip install commands only reference packages that are registered and expected

### Workflow Security Auditing

When GitHub Actions workflow files change, a workflow security job scans for common CI/CD security issues:

- Expression injection: Detects patterns like `${{ github.event.pull_request.title }}` used directly in `run:` blocks, which can allow arbitrary code execution.
- Overly permissive permissions: Flags workflows that request more permissions than necessary.
- Unpinned action references: Detects actions referenced by branch name instead of commit SHA, which is a supply chain risk.

### .NET Binary Analysis with BinSkim

For the .NET SDK (Microsoft.AgentGovernance), the CI pipeline runs Microsoft BinSkim binary security analysis on compiled assemblies.
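The expression-injection audit described above can be approximated with a simple line scan. This is a minimal, conservative sketch: the list of untrusted event fields is illustrative, not the audit's full rule set, and it flags any line that interpolates those fields rather than only `run:` blocks:

```python
import re

# Illustrative scan for GitHub Actions expression injection: untrusted
# event fields interpolated directly into workflow text. The field list
# here is an example, not an exhaustive rule set.
UNTRUSTED = re.compile(
    r"\$\{\{\s*github\.event\."
    r"(?:pull_request\.(?:title|body)|issue\.title|comment\.body|head_commit\.message)"
    r"\s*\}\}"
)

def find_injections(workflow_text: str) -> list[int]:
    """Return the 1-based line numbers that interpolate untrusted event data."""
    return [
        lineno
        for lineno, line in enumerate(workflow_text.splitlines(), start=1)
        if UNTRUSTED.search(line)
    ]

workflow = """on: pull_request_target
jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - run: echo "${{ github.event.pull_request.title }}"
"""
find_injections(workflow)  # -> [6]
```

The usual remediation is to pass the untrusted value into the step through an environment variable rather than interpolating it into the script body.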
BinSkim checks for security-relevant compiler and linker settings in compiled binaries, such as DEP (Data Execution Prevention), ASLR (Address Space Layout Randomization), and stack protection. Results are uploaded as SARIF to GitHub code scanning alongside the CodeQL results.

### The ci-complete Gate Pattern

With many CI jobs that conditionally run based on path filters, AGT uses a pattern called ci-complete: a single gate job that is configured as the sole required status check in branch protection. This job runs unconditionally (`if: always()`), depends on all other CI jobs, and checks that none of them failed. Jobs that were skipped (because no relevant files changed) are acceptable.

This pattern ensures that branch protection works correctly with conditional CI jobs, preventing the common issue where skipped jobs report as "skipped" and fail required status checks.

## Language-Specific Compile-Time Enforcement

Beyond the language-agnostic CI checks, each AGT SDK uses its language's native compiler and tooling to enforce governance standards at compile time.
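For reference, a ci-complete gate of the shape described above might look like the following. The job names in `needs` are placeholders for your repository's actual CI jobs, not AGT's configuration:

```yaml
# Illustrative ci-complete gate; job names are placeholders.
ci-complete:
  if: always()
  needs: [lint, test, governance-verify, security-scan]
  runs-on: ubuntu-latest
  steps:
    - name: Fail if any upstream job failed or was cancelled
      run: |
        if [ "${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}" = "true" ]; then
          echo "A required CI job failed or was cancelled"
          exit 1
        fi
        echo "All CI jobs passed or were skipped"
```

Because skipped jobs report `skipped` rather than `failure`, path-filtered jobs that did not run still let the gate pass.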
### .NET: The Strictest Compile-Time Checks

The .NET SDK (Microsoft.AgentGovernance) enforces compile-time governance through MSBuild properties in `Directory.Build.props` and `Directory.Build.targets`, which apply automatically to every project in the SDK:

| Feature | MSBuild Property | Effect |
|---|---|---|
| Nullable reference types | `<Nullable>enable</Nullable>` | The compiler warns on every possible null dereference, helping prevent NullReferenceException at compile time |
| Warnings as errors | `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` | All compiler warnings become build errors for packable projects; no warnings can be shipped to consumers |
| Strong-name signing | `<SignAssembly>true</SignAssembly>` | Assemblies are signed with a strong-name key (AgentGovernance.snk), enabling identity verification |
| Deterministic builds | `<ContinuousIntegrationBuild>true</ContinuousIntegrationBuild>` | Identical source code produces bit-for-bit identical binaries in CI, enabling build verification |
| SourceLink | Microsoft.SourceLink.GitHub package | Users can step into AGT source code when debugging, supporting transparency and auditability |
| Symbol packages | `<IncludeSymbols>true</IncludeSymbols>` | .snupkg symbol packages are published alongside NuGet packages for debugging support |

### TypeScript: Strict Compilation and Linting

The TypeScript SDK (@microsoft/agentmesh-sdk) uses strict compiler settings and ESLint for build-time governance:

- Strict mode (`"strict": true` in tsconfig.json) enables all strict type-checking options, including `noImplicitAny`, `strictNullChecks`, `strictFunctionTypes`, and `strictBindCallApply`.
- Consistent file naming (`forceConsistentCasingInFileNames`) prevents cross-platform issues where imports work on case-insensitive file systems (Windows, macOS) but fail on case-sensitive ones (Linux CI).
- Declaration generation (`declaration: true` with `declarationMap: true`) produces .d.ts files for consumers, enabling downstream type checking.
- ESLint with @typescript-eslint provides static analysis during the build process, catching issues beyond what the TypeScript compiler checks.
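Taken together, those TypeScript settings correspond to a tsconfig.json along these lines (a minimal sketch, not the SDK's actual configuration file):

```json
{
  "compilerOptions": {
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "declaration": true,
    "declarationMap": true
  }
}
```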
### Python: Type Safety and Fast Linting

Python packages in AGT use typed package markers and static analysis tooling configured in pyproject.toml:

- `py.typed` marker: Each package includes a `py.typed` file, signaling to type checkers (mypy, pyright, Pylance) that the package supports type checking. Consumers get type errors if they misuse the AGT API.
- mypy: Configured as a dev dependency with project-specific settings in pyproject.toml. Provides static type checking that catches type mismatches before runtime.
- ruff: A fast Python linter written in Rust, configured in pyproject.toml and enforced in CI. Ruff checks for hundreds of code quality rules at build time.

## Stage 4: Release-Time Gates

Before artifacts reach users, the release pipeline adds a final layer of verification. These gates help ensure that what ships is exactly what was built, is signed by the expected publisher, and has a complete inventory of its components.

| Gate | Tool | What It Produces |
|---|---|---|
| SBOM generation | Anchore/Syft | SPDX and CycloneDX software bills of materials listing every component, dependency, and license |
| Python signing | Sigstore | Cryptographic signature using OpenID Connect identity, verifiable without manual key distribution |
| .NET signing | Release pipeline | Microsoft Authenticode and NuGet signing through the release pipeline |
| Build provenance | actions/attest-build-provenance | SLSA provenance attestation linking the artifact to its source commit and build environment |
| SBOM attestation | actions/attest-sbom | Binds the SBOM to the specific release artifact, creating a verifiable link between the inventory and the binary |

Additionally, the OpenSSF Scorecard runs on a schedule, providing an automated security posture assessment that covers branch protection, dependency management, CI/CD practices, and more. The score is published to the OpenSSF Scorecard website, giving consumers a transparent view of the project's security practices.
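Circling back to the Python tooling described earlier: a pyproject.toml excerpt wiring up mypy and ruff might look like this. The rule selection and settings are illustrative examples, not AGT's actual configuration:

```toml
# Illustrative pyproject.toml excerpt; settings are examples only.
[tool.mypy]
strict = true
warn_unused_ignores = true

[tool.ruff]
line-length = 100

[tool.ruff.lint]
# Example rule families: pycodestyle, pyflakes, import sorting, bugbear
select = ["E", "F", "I", "B"]
```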
## How It All Fits Together: Defense in Depth

This approach follows a defense-in-depth principle: every check exists at multiple layers, so that bypassing one layer does not compromise the whole system.

Secret scanning, for example, runs at three levels: detect-secrets at commit time (pre-commit hook), Gitleaks at PR time (secret scanning workflow), and the Security Scan action at CI time (content analysis). A developer who bypasses pre-commit hooks will still be caught by the PR-time gate. A contributor who force-pushes past the PR gate will still be caught by the CI pipeline.

Similarly, policy validation runs at commit time (validate-policy hook), at PR time (quality gates), and at CI time (policy validation workflow). Each layer adds depth: the commit-time hook catches schema errors, the CI pipeline catches semantic issues and runs regression tests.

The ci-complete gate job ties everything together. By depending on every CI job and serving as the single required status check, it ensures that no code merges to the main branch unless every applicable check has passed.

## Getting Started

You can adopt AGT's shift-left governance incrementally. Here are three starting points, from lowest to highest effort:

1. Add the Governance Verify Action (5 minutes)

Add a single GitHub Actions workflow that runs the compliance check on every PR:

```yaml
# .github/workflows/governance.yml
name: Governance
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: microsoft/agent-governance-toolkit/action@main
        with:
          command: governance-verify
```

2. Enable Pre-Commit Hooks (15 minutes)

Add a .pre-commit-config.yaml referencing AGT's hooks, install them, and run against all existing files to establish a baseline. Start in permissive mode and graduate to strict over four weeks.

3.
Full Pipeline Integration (1-2 hours)

Add the complete set of PR-time gates (attestation, dependency review, secret scanning, supply chain checks, quality gates), configure the Security Scan action for your plugin directories, and enable SBOM generation and signing in your release workflow.

The AGT repository itself serves as a reference implementation: every workflow described in this post is running in production at aka.ms/agent-governance-toolkit.

## Important Notes

The policy files, workflow configurations, and code samples in this post are illustrative examples. Your organization's governance requirements may differ. Review and customize all configurations before deploying to production.

The Agent Governance Toolkit is designed to help organizations implement governance controls for AI agents; it does not guarantee compliance with any specific regulatory framework. Always consult your organization's security and legal teams when defining governance policies.

## What Comes Next

Pre-runtime governance is one piece of the puzzle. Combined with the runtime governance capabilities covered in part one of this series (policy engines, zero-trust identity, execution sandboxing, audit logging), it provides coverage across the full lifecycle.

The project continues to grow. Since the initial release, we've added a multi-stage policy pipeline (pre_input, pre_tool, post_tool, pre_output stages), approval workflows with human-in-the-loop gates, DLP attribute ratchets for monotonic session state, and OpenTelemetry instrumentation for governance operations. Over 45 step-by-step tutorials are available in the documentation.

Everything described in this post is available today in the public GitHub repository. The full source, documentation, tutorials, and examples are at aka.ms/agent-governance-toolkit, open source under the MIT license. We welcome contributions, feedback, and issue reports from the community.

# Project Pavilion Presence at KubeCon EU 2026
KubeCon + CloudNativeCon Europe 2026 took place from 23 to 26 March at RAI Amsterdam, and it was a strong one. The themes running through the week reflected where the cloud native community is right now: AI moving from experimentation into production, platform engineering continuing to mature, and security and sovereignty top of mind for organizations across Europe. Microsoft was there throughout, and once again supported a range of open source projects in the Project Pavilion.

The Project Pavilion is a dedicated, vendor-neutral space on the show floor reserved for CNCF projects. It is where the work gets talked about honestly. Maintainers and contributors meet directly with end users, share what they are building, get real feedback on what is and is not working, and have the kinds of technical conversations that are hard to have anywhere else. For open source communities, it is one of the most valuable parts of the event.

## Why Our Presence Matters

Microsoft's products and services are built on and alongside many of the technologies represented in the pavilion, and the health of these communities matters to us directly. Showing up means our teams hear firsthand what is working, what is missing, and where these projects need to go next. It also means we get to contribute as community members, not just as a company name on a sponsor board. That distinction matters to us, and to the communities we are part of.

## Microsoft-Supported Pavilion Projects

### Confidential Containers

Representative: Jeremi Piotrowski

The Confidential Containers booth gave attendees a chance to learn more about the project and its approach to protecting workloads using hardware-based trusted execution environments. Jeremi was on hand throughout the kiosk hours, fielding questions from interested users and developers exploring confidential computing in Kubernetes environments.
Conversations touched on use cases around data privacy, regulated workloads, and the role Confidential Containers plays in the broader cloud-native security landscape.

### Drasi

Representatives: Daniel Gerlag and Nandita Valsan

The Drasi team had a busy time in the pavilion, engaging around 40 attendees across two kiosk shifts in focused technical conversations. Most visitors were developers and platform engineers curious about change-driven architectures and real-time data processing. There was strong positive feedback on the newly introduced Drasi Server modes and embeddable library, which complement Drasi for Kubernetes. The team came away with useful validation of current design decisions and good input for the roadmap ahead.

### Envoy

Representative: Mikhail Krinkin

The Envoy booth was staffed for the full duration of KubeCon EU by maintainers from Microsoft, Google, Isovalent, and Tetrate, reflecting the broad and healthy contributor base behind the project. The biggest topic at the booth was migration from ingress-nginx to Gateway API implementations. The archival of ingress-nginx pushed a lot of users into making changes they were not quite ready for, and questions ranged from technical specifics like HTTP default differences between Envoy and Nginx, to more foundational questions about what Envoy and Gateway API actually are. The team had anticipated this and invested in the ingress2gateway project to give users a clear migration path.

Extensibility was another frequent conversation topic, with dynamic modules increasingly becoming the go-to answer for user-specific requirements. Starting with the 1.38 release of Envoy, dynamic modules will have a backward-compatible ABI, a sign of real production readiness for that feature.

### Flatcar

Representatives: Thilo Fromm and Mathieu Tortuyaux

The Flatcar booth had great energy, with maintainers from Microsoft, STACKIT, and CloudBase joining for conversations throughout the pavilion hours.
Operational sovereignty came up again and again as a theme, with users and consulting partners sharing how they are building their Kubernetes offerings on Flatcar because of how reliable and secure it is.

There were a lot of meaningful conversations. Lambda.ai currently runs Flatcar on their control plane and is looking at extending it to worker and customer clusters, with interest in contributing to the project. ReeVo has built their hosted Kubernetes distro on Flatcar across multi-cloud and bare metal environments and is planning to move hundreds of customer clusters over soon. Users from ClearScore, Avassa, Recorded Future, and several other organizations also stopped by with positive feedback on the project's robustness and security. STACKIT uses Flatcar as the default OS for their hosted Kubernetes offering and sponsors a full-time maintainer for the project. The team also connected with TAG Infrastructure to talk through Flatcar's CNCF graduation progress.

### Headlamp

Representatives: René Dudfield and Santhosh Nagaraj S

The Headlamp booth was a busy one, with users, contributors, and partner projects all stopping by throughout the pavilion hours. Conversations covered real-world deployments, federation challenges, multi-tenant namespace visibility, and feature requests like multi-CR data aggregation. There was notable interest from consultancies deploying Headlamp across hundreds of customer clusters, as well as from companies already running it at cloud scale.

Several CNCF projects expressed interest in building UIs for their own projects inside Headlamp, with a few even getting started right there at the conference. The team also heard from users getting budget approved to migrate from the deprecated Kubernetes Dashboard, which is a good sign for the project's growing momentum. Demand for air-gapped AI agent support and deeper Azure and AKS integrations for internal developer platforms came up as clear areas to watch.
### Hyperlight

Representative: Ralph Squillace

The Hyperlight booth ran as a half-day session on Tuesday, in line with the project's current Sandbox status, but the corner location in the project area made a real difference in visibility. Ralph was fielding questions from the moment the doors opened, with a steady stream of visitors right up until the shift ended. Live and recorded demos were central to the conversations, helping attendees quickly grasp what Hyperlight does and how it fits into their environments.

One standout visit came from an engineer at SAP who spent nearly an hour at the booth, pushing the conversation from fundamentals and embedding examples all the way through to agentic protection scenarios in Kubernetes. That conversation continued beyond KubeCon and turned into a scheduled meeting to explore a proof of concept, a good example of the kind of follow-on engagement the pavilion can generate.

### Inspektor Gadget

Representatives: Michael Friese and Qasim Sarfraz

The Inspektor Gadget booth had a lot of great energy, drawing in contributors, new users, and people just discovering the project for the first time. There was genuine excitement around Inspektor Gadget Desktop and its visual troubleshooting experience for Kubernetes and Linux environments. The integration with HolmesGPT, which was also featured in the keynote, came up frequently and was one of the main talking points throughout the event.

A theme that surfaced consistently in conversations with platform engineers was multi-tenancy, with teams looking for ways to safely give developers ad-hoc access to troubleshoot issues independently while keeping overall control at the platform level. It was a good set of conversations that reflected both the project's maturity and the growing demand for a flexible observability framework.
### Istio

Representatives: Mitch Connors, Mikhail Krinkin, Jackie Maertens, and Mike Morris

The Istio booth had steady traffic throughout the conference, with a noticeable shift in who was stopping by. More visitors came from teams with existing sidecar-based production deployments looking for guidance on moving to ambient mode, a change from previous years when ambient interest came mostly from greenfield users. The motivation to make the move was often tied to cost optimization and performance, with teams having read case studies and feeling more confident about the direction.

That said, the increased interest also surfaced some real gaps, including requests for clearer migration guidance, more clarity around architectural differences like mTLS egress workflows, and better support for VM-based workloads. The team is planning to prioritize migration guidance over the coming months. The updated Istio Day format, with a half day of sessions at the Cloud Native Theater stage, also drew a strong crowd with standing room only throughout.

### Notary Project

Representatives: Toddy Mladenov and Flora Taagen

The Notary Project kiosk drew a wide range of visitors, from people learning about container image signing for the first time to experienced engineers asking detailed questions about what is coming next on the roadmap. A major highlight of the week was the project's conference talk on per-layer dm-verity signing, which drew a packed room and over 660 online sign-ups, one of the stronger turnouts for a project-level session at the event. The talk walked through how the new capability moves container security beyond pull-time verification to continuous runtime protection, backed by dm-verity, and generated a lengthy Q&A and a lot of enthusiasm from the audience.
The team also sees a real opportunity ahead as AI workloads push organizations to think harder about the integrity of models, datasets, and container images, and the interest at the booth reinforced that Notary Project is well positioned to play a big role in securing those workflows.

ORAS

Representative: Toddy Mladenov

The ORAS kiosk was staffed by maintainers from Microsoft, NVIDIA, and Red Hat, a good reflection of the healthy multi-vendor community the project has built. Attendees engaged with maintainers on ORAS use cases and adoption, with conversations ranging from how artifacts are tagged and packaged to how ORAS fits into broader multi-cloud workflows. One practical takeaway from maintainer conversations was around leveraging the ORAS SDK more often as a substitute for CLI operations when working with container registries, which helps teams build simpler and more robust tooling.

Radius

Representatives: Sylvain Niles and Will Tsai

The Radius booth, supported by the Microsoft Azure Incubations team, attracted a good mix of enterprise platform teams, prospective adopters, and fellow open source maintainers throughout the pavilion hours. There was strong interest in the extensible Radius Resource Types feature and how it helps teams abstract infrastructure complexity and move workloads across different environments. Conversations also surfaced useful feedback on where the project should focus next, including agent-driven infrastructure workflows and using the Radius application graph to improve observability and operational visibility for cloud-native applications.

Conclusion

KubeCon EU 2026 was a good reminder of why this community continues to grow. The conversations in the Project Pavilion were substantive, the feedback was honest, and the connections made there will carry forward into the work. Microsoft will be back for KubeCon NA in Salt Lake City this November, and we are already looking forward to it.
If you are interested in getting involved with any of these projects, the best starting point is each project's community directly. You are also welcome to reach out to Lexi Nadolski at lexinadolski@microsoft.com with any questions.

Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents
Last week we announced the Agent Governance Toolkit on the Microsoft Open Source Blog, an open-source project that brings runtime security governance to autonomous AI agents. In that announcement, we covered the why: AI agents are making autonomous decisions in production, and the security patterns that kept systems safe for decades need to be applied to this new class of workload. In this post, we'll go deeper into the how: the architecture, the implementation details, and what it takes to run governed agents in production.

The Problem: Production Infrastructure Meets Autonomous Agents

If you manage production infrastructure, you already know the playbook: least privilege, mandatory access controls, process isolation, audit logging, and circuit breakers for cascading failures. These patterns have kept production systems safe for decades.

Now imagine a new class of workload arriving on your infrastructure: AI agents that autonomously execute code, call APIs, read databases, and spawn sub-processes. They reason about what to do, select tools, and act in loops. And in many current deployments, they do all of this without the security controls you'd demand of any other production workload.

That gap is what led us to build the Agent Governance Toolkit: an open-source project that applies proven security concepts from operating systems, service meshes, and SRE to the emerging world of autonomous AI agents.

To frame this in familiar terms: most AI agent frameworks today are like running every process as root: no access controls, no isolation, no audit trail. The Agent Governance Toolkit is the kernel, the service mesh, and the SRE platform for AI agents.

When an agent calls a tool, say `DELETE FROM users WHERE created_at < NOW()`, there is typically no policy layer checking whether that action is within scope. There is no identity verification when one agent communicates with another.
There is no resource limit preventing an agent from making 10,000 API calls in a minute. And there is no circuit breaker to contain cascading failures when things go wrong.

OWASP Agentic Security Initiative

In December 2025, OWASP published the Agentic AI Top 10: the first formal taxonomy of risks specific to autonomous AI agents. The list reads like a security engineer's nightmare: goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, rogue agents, and more. If you've ever hardened a production server, these risks will feel both familiar and urgent.

The Agent Governance Toolkit is designed to help address all 10 of these risks through deterministic policy enforcement, cryptographic identity, execution isolation, and reliability engineering patterns.

Note: The OWASP Agentic Security Initiative has since adopted the ASI 2026 taxonomy (ASI01–ASI10). The toolkit's copilot-governance package now uses these identifiers with backward compatibility for the original AT numbering.
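Because both identifier schemes can show up in existing configurations, a small normalization shim keeps rule files portable across the renaming. The sketch below is illustrative only: it assumes legacy IDs use a bare `AT` prefix and current ones use `ASI`; the authoritative mapping lives in the toolkit's copilot-governance package, and the exact identifier formats there may differ.

```python
# Hypothetical helper for normalizing legacy AT-style risk identifiers to the
# ASI 2026 taxonomy. The prefix convention here is an assumption for
# illustration, not the toolkit's actual mapping table.
LEGACY_PREFIX = "AT"
CURRENT_PREFIX = "ASI"


def normalize_risk_id(risk_id: str) -> str:
    """Map an old 'AT03'-style identifier to its 'ASI03' equivalent."""
    risk_id = risk_id.strip().upper()
    if risk_id.startswith(LEGACY_PREFIX) and not risk_id.startswith(CURRENT_PREFIX):
        return CURRENT_PREFIX + risk_id[len(LEGACY_PREFIX):]
    return risk_id


print(normalize_risk_id("at03"))   # legacy ID, normalized
print(normalize_risk_id("ASI07"))  # already current, unchanged
```

A shim like this lets older policy files keep working while new tooling standardizes on the ASI numbering.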
Architecture: Nine Packages, One Governance Stack

The toolkit is structured as a v3.0.0 Public Preview monorepo with nine independently installable packages:

| Package | What It Does |
|---------|--------------|
| Agent OS | Stateless policy engine; intercepts agent actions before execution with configurable pattern matching and semantic intent classification |
| Agent Mesh | Cryptographic identity (DIDs with Ed25519), Inter-Agent Trust Protocol (IATP), and trust-gated communication between agents |
| Agent Hypervisor | Execution rings inspired by CPU privilege levels, saga orchestration for multi-step transactions, and shared session management |
| Agent Runtime | Runtime supervision with kill switches, dynamic resource allocation, and execution lifecycle management |
| Agent SRE | SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery: production reliability practices adapted for AI agents |
| Agent Compliance | Automated governance verification with compliance grading and regulatory framework mapping (EU AI Act, NIST AI RMF, HIPAA, SOC 2) |
| Agent Lightning | Reinforcement learning training governance with policy-enforced runners and reward shaping |
| Agent Marketplace | Plugin lifecycle management with Ed25519 signing, trust-tiered capability gating, and SBOM generation |
| Integrations | 20+ framework adapters for LangChain, CrewAI, AutoGen, Semantic Kernel, Google ADK, Microsoft Agent Framework, OpenAI Agents SDK, and more |

Agent OS: The Policy Engine

Agent OS intercepts agent tool calls before they execute:

```python
from agent_os import StatelessKernel, ExecutionContext, Policy

kernel = StatelessKernel()

ctx = ExecutionContext(
    agent_id="analyst-1",
    policies=[
        Policy.read_only(),            # No write operations
        Policy.rate_limit(100, "1m"),  # Max 100 calls/minute
        Policy.require_approval(
            actions=["delete_*", "write_production_*"],
            min_approvals=2,
            approval_timeout_minutes=30,
        ),
    ],
)

result = await kernel.execute(
    action="delete_user_record",
    params={"user_id": 12345},
    context=ctx,
)
```

The policy engine works in two layers:
configurable pattern matching (with sample rule sets for SQL injection, privilege escalation, and prompt injection that users customize for their environment) and a semantic intent classifier that helps detect dangerous goals regardless of phrasing. When an action is classified as `DESTRUCTIVE_DATA`, `DATA_EXFILTRATION`, or `PRIVILEGE_ESCALATION`, the engine blocks it, routes it for human approval, or downgrades the agent's trust level, depending on the configured policy.

Important: All policy rules, detection patterns, and sensitivity thresholds are externalized to YAML configuration files. The toolkit ships with sample configurations in `examples/policies/` that must be reviewed and customized before production deployment. No built-in rule set should be considered exhaustive.

Policy languages supported: YAML, OPA Rego, and Cedar.

The kernel is stateless by design: each request carries its own context. This means you can deploy it behind a load balancer, as a sidecar container in Kubernetes, or in a serverless function, with no shared state to manage. On AKS or any Kubernetes cluster, it fits naturally into existing deployment patterns. Helm charts are available for agent-os, agent-mesh, and agent-sre.

Agent Mesh: Zero-Trust Identity for Agents

In service mesh architectures, services prove their identity via mTLS certificates before communicating. AgentMesh applies the same principle to AI agents using decentralized identifiers (DIDs) with Ed25519 cryptography and the Inter-Agent Trust Protocol (IATP):

```python
from agentmesh import AgentIdentity, TrustBridge

identity = AgentIdentity.create(
    name="data-analyst",
    sponsor="alice@company.com",  # Human accountability
    capabilities=["read:data", "write:reports"],
)
# identity.did -> "did:mesh:data-analyst:a7f3b2..."
```
```python
bridge = TrustBridge()

verification = await bridge.verify_peer(
    peer_id="did:mesh:other-agent",
    required_trust_score=700,  # Must score >= 700/1000
)
```

A critical feature is trust decay: an agent's trust score decreases over time without positive signals. An agent trusted last week but silent since then gradually becomes untrusted, modeling the reality that trust requires ongoing demonstration, not a one-time grant. Delegation chains enforce scope narrowing: a parent agent with read+write permissions can delegate only read access to a child agent, never escalate.

Agent Hypervisor: Execution Rings

CPU architectures use privilege rings (Ring 0 for kernel, Ring 3 for userspace) to isolate workloads. The Agent Hypervisor applies this model to AI agents:

| Ring | Trust Level | Capabilities |
|------|-------------|--------------|
| Ring 0 (Kernel) | Score ≥ 900 | Full system access, can modify policies |
| Ring 1 (Supervisor) | Score ≥ 700 | Cross-agent coordination, elevated tool access |
| Ring 2 (User) | Score ≥ 400 | Standard tool access within assigned scope |
| Ring 3 (Untrusted) | Score < 400 | Read-only, sandboxed execution only |

New and untrusted agents start in Ring 3 and earn their way up: exactly the principle of least privilege that production engineers apply to every other workload. Each ring enforces per-agent resource limits: maximum execution time, memory caps, CPU throttling, and request rate limits. If a Ring 2 agent attempts a Ring 1 operation, it gets blocked, just like a userspace process trying to access kernel memory.

These ring definitions and their associated trust score thresholds are fully configurable via policy. Organizations can define custom ring structures, adjust the number of rings, set different trust score thresholds for transitions, and configure per-ring resource limits to match their security requirements.

The hypervisor also provides saga orchestration for multi-step operations. When an agent executes a sequence (draft email → send → update CRM) and the final step fails, compensating actions fire in reverse.
Borrowed from distributed transaction patterns, this ensures multi-agent workflows maintain consistency even when individual steps fail.

Agent SRE: SLOs and Circuit Breakers for Agents

If you practice SRE, you measure services by SLOs and manage risk through error budgets. Agent SRE extends this to AI agents: when an agent's safety SLI drops below 99 percent, meaning more than 1 percent of its actions violate policy, the system automatically restricts the agent's capabilities until it recovers. This is the same error-budget model that SRE teams use for production services, applied to agent behavior.

We also built nine chaos engineering fault injection templates: network delays, LLM provider failures, tool timeouts, trust score manipulation, memory corruption, and concurrent access races. Because the only way to know if your agent system is resilient is to break it intentionally.

Agent SRE integrates with your existing observability stack through adapters for Datadog, PagerDuty, Prometheus, OpenTelemetry, Langfuse, LangSmith, Arize, MLflow, and more. Message broker adapters support Kafka, Redis, NATS, Azure Service Bus, AWS SQS, and RabbitMQ.

Compliance and Observability

If your organization already maps to CIS Benchmarks, NIST AI RMF, or other frameworks for infrastructure compliance, the OWASP Agentic Top 10 is the equivalent standard for AI agent workloads. The toolkit's agent-compliance package provides automated governance grading against these frameworks.

The toolkit is framework-agnostic, with 20+ adapters that hook into each framework's native extension points, so adding governance to an existing agent is typically a few lines of configuration, not a rewrite.

The toolkit exports metrics to any OpenTelemetry-compatible platform: Prometheus, Grafana, Datadog, Arize, or Langfuse. If you're already running an observability stack for your infrastructure, agent governance metrics flow through the same pipeline.
Key metrics include: policy decisions per second, trust score distributions, ring transitions, SLO burn rates, circuit breaker state, and governance workflow latency.

Getting Started

```bash
# Install all packages
pip install agent-governance-toolkit[full]

# Or individual packages
pip install agent-os-kernel agent-mesh agent-sre
```

The toolkit is available across language ecosystems: Python, TypeScript (`@microsoft/agentmesh-sdk` on npm), Rust, Go, and .NET (`Microsoft.AgentGovernance` on NuGet).

Azure Integrations

While the toolkit is platform-agnostic, we've included integrations that provide the fastest path to production on Azure:

Azure Kubernetes Service (AKS): Deploy the policy engine as a sidecar container alongside your agents. Helm charts provide production-ready manifests for agent-os, agent-mesh, and agent-sre.

Azure AI Foundry Agent Service: Use the built-in middleware integration for agents deployed through Azure AI Foundry.

OpenClaw Sidecar: One compelling deployment scenario is running OpenClaw, the open-source autonomous agent, inside a container with the Agent Governance Toolkit deployed as a sidecar. This gives you policy enforcement, identity verification, and SLO monitoring over OpenClaw's autonomous operations. On Azure Kubernetes Service (AKS), the deployment is a standard pod with two containers: OpenClaw as the primary workload and the governance toolkit as the sidecar, communicating over localhost. We have a reference architecture and Helm chart available in the repository. The same sidecar pattern works with any containerized agent; OpenClaw is a particularly compelling example because of the interest in autonomous agent safety.

Tutorials and Resources

34+ step-by-step tutorials covering policy engines, trust, compliance, MCP security, observability, and cross-platform SDK usage are available in the repository.
```bash
git clone https://github.com/microsoft/agent-governance-toolkit
cd agent-governance-toolkit
pip install -e "packages/agent-os[dev]" -e "packages/agent-mesh[dev]" -e "packages/agent-sre[dev]"

# Run the demo
python -m agent_os.demo
```

What's Next

AI agents are becoming autonomous decision-makers in production infrastructure, executing code, managing databases, and orchestrating services. The security patterns that kept production systems safe for decades (least privilege, mandatory access controls, process isolation, audit logging) are exactly what these new workloads need. We built them. They're open source.

We're building this in the open because agent security is too important for any single organization to solve alone:

Security research: Adversarial testing, red-team results, and vulnerability reports strengthen the toolkit for everyone.

Community contributions: Framework adapters, detection rules, and compliance mappings from the community expand coverage across ecosystems.

We are committed to open governance. We're releasing this project under Microsoft today, and we aspire to move it into a foundation home, such as the AI and Data Foundation (AAIF), where it can benefit from cross-industry stewardship. We're actively engaging with foundation partners on this path.

The Agent Governance Toolkit is open source under the MIT license. Contributions welcome at github.com/microsoft/agent-governance-toolkit.

Building Multi-Agent Orchestration Using Microsoft Semantic Kernel: A Complete Step-by-Step Guide
What You Will Build

By the end of this guide, you will have a working multi-agent system where 4 specialist AI agents collaborate to diagnose production issues:

ClientAnalyst — Analyzes browser, JavaScript, CORS, uploads, and UI symptoms
NetworkAnalyst — Analyzes DNS, TCP/IP, TLS, load balancers, and firewalls
ServerAnalyst — Analyzes backend logs, database, deployments, and resource limits
Coordinator — Synthesizes all findings into a root cause report with a prioritized action plan

These agents don't just run in sequence — they debate, cross-examine, and challenge each other's findings through a shared conversation, producing a diagnosis that's better than any single agent could achieve alone.

Table of Contents

1. Why Multi-Agent? The Problem with Single Agents
2. Architecture Overview
3. Understanding the Key SK Components
4. The Actor Model — How InProcessRuntime Works
5. Setting Up Your Development Environment
6. Step-by-Step: Building the Multi-Agent Analyzer
7. The Agent Interaction Flow — Round by Round
8. Bugs I Found & Fixed — Lessons Learned
9. Running with Different AI Providers
10. What to Build Next

1. Why Multi-Agent? The Problem with Single Agents

A single AI agent analyzing a production issue is like having one doctor diagnose everything — they'll catch issues in their specialty but miss cross-domain connections.

Consider this problem: "Users report 504 Gateway Timeout errors when uploading files larger than 10MB. Started after Friday's deployment. Worse during peak hours."

A single agent might say "it's a server timeout" and stop.
But the real root cause often spans multiple layers:

The client is sending chunked uploads with an incorrect Content-Length header (client-side bug)
The load balancer has a 30-second timeout that's too short for large uploads (network config)
The server recently deployed a new request body parser that's 3x slower (server-side regression)
The combination only fails during peak hours because connection pool saturation amplifies the latency

No single perspective catches this. You need specialists who analyze independently, then debate to find the cross-layer causal chain. That's what multi-agent orchestration gives you.

The 5 Orchestration Patterns in SK

Semantic Kernel provides 5 built-in patterns for agent collaboration:

```
SEQUENTIAL:  A → B → C → Done                  (pipeline — each builds on previous)

CONCURRENT:        ↗ A ↘
             Task  → B →  Aggregate            (parallel — results merged)
                   ↘ C ↗

GROUP CHAT:  A ↔ B ↔ C ↔ D   ← We use this one (rounds, shared history, debate)

HANDOFF:     A → (stuck?) → B → (complex?) → Human
                                               (escalation with human-in-the-loop)

MAGENTIC:    LLM picks who speaks next dynamically
                                               (AI-driven speaker selection)
```

We use GroupChatOrchestration with RoundRobinGroupChatManager because our problem requires agents to see each other's work, challenge assumptions, and build on each other's analysis across two rounds.

2. Architecture Overview

Here's the complete architecture of what we're building:

3. Understanding the Key SK Components

Before we write code, let's understand the 5 components we'll use and the design pattern each implements:

ChatCompletionAgent — Strategy Pattern

The agent definition. Each agent is a combination of:
Each agent is a combination of: name — unique identifier (used in round-robin ordering) instructions — the persona and rules (this is the prompt engineering) service — which AI provider to call (Strategy Pattern — swap providers without changing agent logic) description — what other agents/tools understand about this agent agent = ChatCompletionAgent( name="ClientAnalyst", instructions="You are ONLY ClientAnalyst...", service=gemini_service, # ← Strategy: swap to OpenAI with zero changes description="Analyzes client-side issues", ) GroupChatOrchestration — Mediator Pattern The orchestration defines HOW agents interact. It's the Mediator — agents don't talk to each other directly. Instead, the orchestration manages a shared ChatHistory and routes messages through the Manager. RoundRobinGroupChatManager — Strategy Pattern The Manager decides WHO speaks next. RoundRobinGroupChatManager cycles through agents in a fixed order. SK also provides AutomaticGroupChatManager where the LLM decides who speaks next. max_rounds is the total number of messages per agent or cycle. With 4 agents and max_rounds=8, each agent speaks exactly twice. InProcessRuntime — Actor Model Abstraction The execution engine. Every agent becomes an "actor" with its own kind of mailbox (message queue). The runtime delivers messages between actors. Key properties: No shared state — agents communicate only through messages Sequential processing — each agent processes one message at a time Location transparency — same code works in-process today, distributed tomorrow agent_response_callback — Observer Pattern A function that fires after EVERY agent response. We use it to display each agent's output in real-time with emoji labels and round numbers. 4. The Actor Model — How InProcessRuntime Works The Actor Model is a concurrency pattern where each entity is an isolated "actor" with a private mailbox. 
Here's what happens inside InProcessRuntime when we run our demo:

```
runtime.start()
│
├── Creates internal message loop (asyncio event loop)
│
orchestration.invoke(task="504 timeout...", runtime=runtime)
│
├── Creates Actor[Orchestrator]   → manages overall flow
├── Creates Actor[Manager]        → RoundRobinGroupChatManager
├── Creates Actor[ClientAnalyst]  → mailbox created, waiting
├── Creates Actor[NetworkAnalyst] → mailbox created, waiting
├── Creates Actor[ServerAnalyst]  → mailbox created, waiting
└── Creates Actor[Coordinator]    → mailbox created, waiting

Manager receives "start" message
│
├── Checks turn order: [Client, Network, Server, Coordinator]
├── Sends task to ClientAnalyst mailbox
│     → ClientAnalyst processes: calls LLM → response
│     → Response added to shared ChatHistory
│     → callback fires (displayed in Notebook UI)
│     → Sends "done" back to Manager
│
├── Manager updates: turn_index=1
├── Sends to NetworkAnalyst mailbox
│     → Same flow...
│
├── ... (ServerAnalyst, Coordinator for Round 1)
│
├── Manager checks: messages=4, max_rounds=8 → continue
│
├── Round 2: same cycle with cross-examination
│
└── After message 8: Manager sends "complete"
      → OrchestrationResult resolves
      → result.get() returns final answer

runtime.stop_when_idle()
      → All mailboxes empty → clean shutdown
```

The Actor Model guarantees:

No race conditions (each actor processes one message at a time)
No deadlocks (no shared locks to contend for)
No shared mutable state (agents communicate only via messages)

5. Setting Up Your Development Environment

Prerequisites

Python 3.11 or 3.12 (3.13+ may have compatibility issues with some SK connectors)
Visual Studio Code with the Python and Jupyter extensions
An API key from one of: Google AI Studio (free), OpenAI

Step 1: Install Python

Download from python.org. During installation, check "Add Python to PATH".
Verify:

```bash
python --version
# Python 3.12.x
```

Step 2: Install VS Code Extensions

Open VS Code, go to Extensions (Ctrl+Shift+X), and install:

Python (by Microsoft) — Python language support
Jupyter (by Microsoft) — Notebook support
Pylance (by Microsoft) — IntelliSense and type checking

Step 3: Create Project Folder

```bash
mkdir sk-multiagent-demo
cd sk-multiagent-demo
```

Open in VS Code: `code .`

Step 4: Create Virtual Environment

Open the VS Code terminal (Ctrl+`) and run:

```bash
# Create virtual environment
python -m venv sk-env

# Activate it
# Windows:
sk-env\Scripts\activate
# macOS/Linux:
source sk-env/bin/activate
```

You should see (sk-env) in your terminal prompt.

Step 5: Install Semantic Kernel

For Google Gemini (free tier — recommended for getting started):

```bash
pip install semantic-kernel[google] python-dotenv ipykernel
```

For OpenAI (paid API key):

```bash
pip install semantic-kernel openai python-dotenv ipykernel
```

For Azure AI Foundry (enterprise, Entra ID auth):

```bash
pip install semantic-kernel azure-identity python-dotenv ipykernel
```

Step 6: Register the Jupyter Kernel

```bash
python -m ipykernel install --user --name=sk-env --display-name="Semantic Kernel (Python 3.12)"
```

You can also select the kernel directly in VS Code if it already appears in your environment, as shown below:

Step 7: Get Your API Key

Option A — Google Gemini (FREE, recommended for demo):

Go to https://aistudio.google.com/apikey
Click "Create API Key"
Copy the key

Free tier limits: 15 requests/minute, 1 million tokens/minute — more than enough for this demo.
Option B — OpenAI: Go to https://platform.openai.com/api-keys Create a new key Copy the key Option C — Azure AI Foundry: Deploy a model in Azure AI Foundry portal Note the endpoint URL and deployment name If key-based auth is disabled, you'll need Entra ID with permissions Step 8: Create the .env File In your project root, create a file named .env: For Gemini: GOOGLE_AI_API_KEY=AIzaSy...your-key-here GOOGLE_AI_GEMINI_MODEL_ID=gemini-2.5-flash For OpenAI: OPENAI_API_KEY=sk-...your-key-here OPENAI_CHAT_MODEL_ID=gpt-4o For Azure AI Foundry: AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=gpt-4o AZURE_OPENAI_API_KEY=your-key Step 9: Create the Notebook In VS Code: Click File > New File Save as multi_agent_analyzer.ipynb In the top-right of the notebook, click Select Kernel Choose Semantic Kernel (Python 3.12) (or your sk-env) Your environment is ready. Let's build. 6. Step-by-Step: Building the Multi-Agent Analyzer Cell 1: Verify Setup import semantic_kernel print(f"Semantic Kernel version: {semantic_kernel.__version__}") from semantic_kernel.agents import ( ChatCompletionAgent, GroupChatOrchestration, RoundRobinGroupChatManager, ) from semantic_kernel.agents.runtime import InProcessRuntime from semantic_kernel.contents import ChatMessageContent print("All imports successful") Cell 2: Load API Key and Create Service For Gemini: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.google.google_ai import ( GoogleAIChatCompletion, GoogleAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory GEMINI_API_KEY = os.getenv("GOOGLE_AI_API_KEY") GEMINI_MODEL = os.getenv("GOOGLE_AI_GEMINI_MODEL_ID", "gemini-2.5-flash") service = GoogleAIChatCompletion( gemini_model_id=GEMINI_MODEL, api_key=GEMINI_API_KEY, ) print(f"Service created: Gemini {GEMINI_MODEL}") # Smoke test settings = GoogleAIChatPromptExecutionSettings() test_history = 
ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") For OpenAI: import os from dotenv import load_dotenv load_dotenv() from semantic_kernel.connectors.ai.open_ai import ( OpenAIChatCompletion, OpenAIChatPromptExecutionSettings, ) from semantic_kernel.contents import ChatHistory service = OpenAIChatCompletion( ai_model_id=os.getenv("OPENAI_CHAT_MODEL_ID", "gpt-4o"), ) print(f"Service created: OpenAI {os.getenv('OPENAI_CHAT_MODEL_ID', 'gpt-4o')}") # Smoke test settings = OpenAIChatPromptExecutionSettings() test_history = ChatHistory(system_message="You are a helpful assistant.") test_history.add_user_message("Say 'Connected!' and nothing else.") response = await service.get_chat_message_content( chat_history=test_history, settings=settings ) print(f"Model says: {response.content}") Cell 3: Define All 4 Agents This is the most important cell — the prompt engineering that makes the demo work: from semantic_kernel.agents import ChatCompletionAgent # ═══════════════════════════════════════════════════ # AGENT 1: Client-Side Analyst # ═══════════════════════════════════════════════════ client_agent = ChatCompletionAgent( name="ClientAnalyst", description="Analyzes problems from the client-side: browser, JS, CORS, caching, UI symptoms", instructions="""You are ONLY **ClientAnalyst**. You must NEVER speak as NetworkAnalyst, ServerAnalyst, or Coordinator. Every word you write is from ClientAnalyst's perspective only. You are a senior front-end and client-side diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the client side: 1. **Browser & Rendering**: DOM issues, JavaScript errors, CSS rendering, browser compatibility, memory leaks, console errors. 2. 
**Client-Side Caching**: Stale cache, service worker issues, local storage corruption. 3. **Network from Client View**: CORS errors, preflight failures, request timeouts, client-side retry storms, fetch/XHR configuration. 4. **Upload Handling**: File API usage, chunk upload implementation, progress tracking, FormData construction, content-type headers. 5. **UI/UX Symptoms**: What the user sees, error messages displayed, loading states. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference NetworkAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the client perspective - Do NOT just say 'I agree' — provide substantive technical reasoning Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 2: Network Analyst # ═══════════════════════════════════════════════════ network_agent = ChatCompletionAgent( name="NetworkAnalyst", description="Analyzes problems from the network side: DNS, TCP, TLS, firewalls, load balancers, latency", instructions="""You are ONLY **NetworkAnalyst**. You must NEVER speak as ClientAnalyst, ServerAnalyst, or Coordinator. Every word you write is from NetworkAnalyst's perspective only. You are a senior network infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the network layer: 1. **DNS & Resolution**: DNS TTL, propagation delays, record misconfigurations. 2. **TCP/IP & Connections**: Connection pooling, keep-alive, TCP window scaling, connection resets, SYN floods. 3. **TLS/SSL**: Certificate issues, handshake failures, protocol version mismatches. 4. 
**Load Balancers & Proxies**: Sticky sessions, health checks, timeout configs, request body size limits, proxy buffering. 5. **Firewall & WAF**: Rule blocks, rate limiting, request inspection delays, geo-blocking, DDoS protection interference. ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words. ROUND 2: You MUST: - Reference ClientAnalyst and ServerAnalyst BY NAME - State specifically where you AGREE or DISAGREE with their findings - Answer the Coordinator's questions from your perspective - Add NEW cross-layer insights you see from the network perspective - Do NOT just say 'I am ready to proceed' — provide substantive technical analysis Be specific, evidence-based, and prioritize findings by likelihood.""", service=service, ) # ═══════════════════════════════════════════════════ # AGENT 3: Server-Side Analyst # ═══════════════════════════════════════════════════ server_agent = ChatCompletionAgent( name="ServerAnalyst", description="Analyzes problems from the server side: backend app, database, logs, resources, deployments", instructions="""You are ONLY **ServerAnalyst**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator. Every word you write is from ServerAnalyst's perspective only. You are a senior backend and infrastructure diagnostics expert. When given a problem statement, analyze it EXCLUSIVELY from the server side: 1. **Application Server**: Error logs, exception traces, thread pool exhaustion, memory leaks, CPU spikes, garbage collection pauses. 2. **Database**: Slow queries, connection pool saturation, lock contention, deadlocks, replication lag, query plan changes. 3. **Deployment & Config**: Recent deployments, configuration changes, feature flags, environment variable mismatches, rollback candidates. 4. **Resource Limits**: File upload size limits, request body limits, disk space, temporary file cleanup, storage quotas. 
5. **External Dependencies**: Upstream API timeouts, third-party service degradation, queue backlogs, cache (Redis/Memcached) issues.

ROUND 1: Provide your independent analysis. Do NOT reference other agents. List your top 3 most likely causes with evidence. Every response MUST be at least 200 words.

ROUND 2: You MUST:
- Reference ClientAnalyst and NetworkAnalyst BY NAME
- State specifically where you AGREE or DISAGREE with their findings
- Answer the Coordinator's questions from your perspective
- Add NEW cross-layer insights you see from the server perspective
- Do NOT just say 'I agree' — provide substantive technical reasoning

Be specific, evidence-based, and prioritize findings by likelihood.""",
    service=service,
)

# ═══════════════════════════════════════════════════
# AGENT 4: Coordinator
# ═══════════════════════════════════════════════════
coordinator_agent = ChatCompletionAgent(
    name="Coordinator",
    description="Synthesizes all specialist analyses into a final root cause report with prioritized action plan",
    instructions="""You are ONLY **Coordinator**. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or ServerAnalyst. You synthesize — you do NOT do domain-specific analysis.

You are the lead engineer who synthesizes the team's findings.

═══ ROUND 1 BEHAVIOR (your first turn, message 4) ═══
Keep this SHORT — maximum 300 words.
- Note 2-3 KEY PATTERNS across the three analyses
- Identify where specialists AGREE (high-confidence)
- Identify where they CONTRADICT (needs resolution)
- Ask 2-3 SPECIFIC QUESTIONS for Round 2
Round 1 MUST NOT: assign tasks, create action plans, write reports, or tell agents what to take lead on. Observation + questions ONLY.

═══ ROUND 2 BEHAVIOR (your final turn, message 8) ═══
Keep this FOCUSED — maximum 800 words. Produce a structured report:
1. **Root Cause** (1 paragraph): The #1 most likely cause with causal chain across layers. Reference specific findings from each specialist.
2. **Confidence** (short list):
   - HIGH: Areas where all 3 agreed
   - MEDIUM: Areas where 2 of 3 agreed
   - LOW: Disagreements needing investigation
3. **Action Plan** (numbered, max 6 items): For each:
   - What to do (specific)
   - Owner (Client/Network/Server team)
   - Time estimate
4. **Quick Wins vs Long-term** (2 short lists)

Do NOT repeat what specialists already said verbatim. Synthesize, don't echo.""",
    service=service,
)

# ═══════════════════════════════════════════════════
# All 4 agents — order = RoundRobin order
# ═══════════════════════════════════════════════════
agents = [client_agent, network_agent, server_agent, coordinator_agent]

print(f"{len(agents)} agents created:")
for i, a in enumerate(agents, 1):
    print(f"  {i}. {a.name}: {a.description[:60]}...")
print(f"\nRoundRobin order: {' → '.join(a.name for a in agents)}")

Cell 4: Run the Analysis

from semantic_kernel.agents import GroupChatOrchestration, RoundRobinGroupChatManager
from semantic_kernel.agents.runtime import InProcessRuntime
from semantic_kernel.contents import ChatMessageContent
from IPython.display import display, Markdown

# ╔══════════════════════════════════════════════════════════╗
# ║          EDIT YOUR PROBLEM STATEMENT HERE                ║
# ╚══════════════════════════════════════════════════════════╝
PROBLEM = """
Users are reporting intermittent 504 Gateway Timeout errors when trying to upload files larger than 10MB through our web application. The issue started after last Friday's deployment and seems worse during peak hours (2-5 PM EST). Some users also report that smaller file uploads work fine but the progress bar freezes at 85% for large files before timing out.
""" # ════════════════════════════════════════════════════════════ agent_responses = [] def agent_response_callback(message: ChatMessageContent) -> None: name = message.name or "Unknown" content = message.content or "" agent_responses.append({"agent": name, "content": content}) emoji = { "ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯" }.get(name, "🔹") round_num = (len(agent_responses) - 1) // len(agents) + 1 display(Markdown( f"---\n### {emoji} {name} (Message {len(agent_responses)}, Round {round_num})\n\n{content}" )) MAX_ROUNDS = 8 # 4 agents × 2 rounds = 8 messages exactly task = f"""## Problem Statement {PROBLEM.strip()} ## Discussion Rules You are in a GROUP DISCUSSION with 4 members. You can see ALL previous messages. There are exactly 2 rounds. ### ROUND 1 (Messages 1-4): Independent Analysis - ClientAnalyst, NetworkAnalyst, ServerAnalyst: Analyze from YOUR domain only. Give your top 3 most likely causes with evidence and reasoning. - Coordinator: Note patterns across the 3 analyses. Ask 2-3 specific questions. Do NOT assign tasks yet. ### ROUND 2 (Messages 5-8): Cross-Examination & Final Report - ClientAnalyst, NetworkAnalyst, ServerAnalyst: You MUST reference the OTHER specialists BY NAME. State where you agree, disagree, or have new insights. Answer the Coordinator's questions. Provide SUBSTANTIVE analysis. - Coordinator: Produce the FINAL structured report: root cause, confidence levels, prioritized action plan with owners and time estimates. IMPORTANT: Each agent speaks as THEMSELVES only. 
Never impersonate another agent."""

display(Markdown(f"## Problem Statement\n\n{PROBLEM.strip()}"))
display(Markdown(f"---\n## Discussion Starting — {len(agents)} agents, {MAX_ROUNDS} rounds\n"))

# Build and run
orchestration = GroupChatOrchestration(
    members=agents,
    manager=RoundRobinGroupChatManager(max_rounds=MAX_ROUNDS),
    agent_response_callback=agent_response_callback,
)

runtime = InProcessRuntime()
runtime.start()

result = await orchestration.invoke(task=task, runtime=runtime)
final_result = await result.get(timeout=300)
await runtime.stop_when_idle()

display(Markdown(f"---\n## FINAL CONCLUSION\n\n{final_result}"))

Cell 5: Statistics and Validation

print("═" * 55)
print(" ANALYSIS STATISTICS")
print("═" * 55)

emojis = {"ClientAnalyst": "🖥️", "NetworkAnalyst": "🌐", "ServerAnalyst": "⚙️", "Coordinator": "🎯"}

agent_counts = {}
agent_chars = {}
for r in agent_responses:
    agent_counts[r["agent"]] = agent_counts.get(r["agent"], 0) + 1
    agent_chars[r["agent"]] = agent_chars.get(r["agent"], 0) + len(r["content"])

for agent, count in agent_counts.items():
    em = emojis.get(agent, "🔹")
    chars = agent_chars.get(agent, 0)
    avg = chars // count if count else 0
    print(f" {em} {agent}: {count} msg(s), ~{chars:,} chars (avg {avg:,}/msg)")

print(f"\n Total messages: {len(agent_responses)}")
total_chars = sum(len(r['content']) for r in agent_responses)
print(f" Total analysis: ~{total_chars:,} characters")

# Validation
print(f"\n Validation:")
import re

identity_issues = []
for r in agent_responses:
    other_agents = [a.name for a in agents if a.name != r["agent"]]
    for other in other_agents:
        pattern = rf'(?i)as {re.escape(other)}[,:]?\s+I\b'
        if re.search(pattern, r["content"][:300]):
            identity_issues.append(f"{r['agent']} impersonated {other}")

if identity_issues:
    print(f" Identity confusion: {identity_issues}")
else:
    print(f" No identity confusion detected")

thin = [r for r in agent_responses if len(r["content"].strip()) < 100]
if thin:
    for t in thin:
        print(f" Thin response from {t['agent']}")
else:
    print(f" All responses are substantive")

Cell 6: Save Report

from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"analysis_report_{timestamp}.md"

with open(filename, "w", encoding="utf-8") as f:
    f.write(f"# Problem Analysis Report\n\n")
    f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"**Agents:** {', '.join(a.name for a in agents)}\n")
    f.write(f"**Rounds:** {MAX_ROUNDS}\n\n---\n\n")
    f.write(f"## Problem Statement\n\n{PROBLEM.strip()}\n\n---\n\n")
    for i, r in enumerate(agent_responses, 1):
        em = emojis.get(r['agent'], '🔹')
        round_num = (i - 1) // len(agents) + 1
        f.write(f"### {em} {r['agent']} (Message {i}, Round {round_num})\n\n")
        f.write(f"{r['content']}\n\n---\n\n")
    f.write(f"## Final Conclusion\n\n{final_result}\n")

print(f"Report saved to: {filename}")

7. The Agent Interaction Flow — Round by Round

Here's what actually happens during the 8-message orchestration:

Round 1: Independent Analysis (Messages 1-4)

Msg 1, ClientAnalyst (sees: problem statement only): Analyzes from the client perspective: upload chunking, progress bar freezing at 85%, CORS, content-type headers.
Msg 2, NetworkAnalyst (sees: problem + ClientAnalyst's analysis): Gives an INDEPENDENT analysis despite seeing msg 1: load balancer timeouts, proxy body size limits, TCP window scaling.
Msg 3, ServerAnalyst (sees: problem + msgs 1-2): Gives an INDEPENDENT analysis: recent deployment regression, request body parser, thread pool exhaustion, disk space.
Msg 4, Coordinator (sees: problem + msgs 1-3): Observes patterns: "All three mention timeout configuration. ClientAnalyst and NetworkAnalyst both point to body size. Question: Was the deployment a backend-only change or did it include infra?"
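The message numbering above follows mechanically from the round-robin order. The small helper below is illustrative (not part of the notebook) and makes the mapping that `agent_response_callback` computes explicit:

```python
# Illustrative helper: with N agents in round-robin order, 1-based message i
# falls in round ((i - 1) // N) + 1 and is spoken by agents[(i - 1) % N].
AGENTS = ["ClientAnalyst", "NetworkAnalyst", "ServerAnalyst", "Coordinator"]

def schedule(num_messages, agents):
    """Return (message, round, speaker) tuples for a round-robin group chat."""
    return [
        (i, (i - 1) // len(agents) + 1, agents[(i - 1) % len(agents)])
        for i in range(1, num_messages + 1)
    ]

for msg, rnd, speaker in schedule(8, AGENTS):
    print(f"Message {msg} -> Round {rnd}: {speaker}")
```

With 8 messages, messages 4 and 8 both land on the Coordinator (rounds 1 and 2); setting 9 messages would hand the 9th turn back to ClientAnalyst after the final report, which is exactly the MAX_ROUNDS math bug described later.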
Round 2: Cross-Examination (Messages 5-8)

Msg 5, ClientAnalyst: "I agree with NetworkAnalyst that the load balancer timeout is likely a factor — the 85% freeze point matches the 30-second LB timeout for a 10MB upload on our average upload speed. However, I disagree with ServerAnalyst about thread pool exhaustion because the UI shows a clean 504, not a connection reset."
Msg 6, NetworkAnalyst: "ServerAnalyst's point about the recent deployment is critical — if the new request parser is 3x slower, that would push uploads past the LB timeout. I can confirm the LB has a 30s idle timeout. The fix is both: increase the LB timeout AND optimize the parser."
Msg 7, ServerAnalyst: "Responding to Coordinator's question: The deployment was backend-only — a new multipart parser using streaming instead of buffered reads. ClientAnalyst is correct that the 504 is from the LB, not the app. The app itself returns 200 after 45 seconds, but the LB kills the connection at 30."
Msg 8, Coordinator: Produces the final structured report with root cause: "The backend deployment introduced a slower multipart parser (45s vs 15s for 10MB). The load balancer's 30s timeout kills the connection at ~85% progress. Fix: immediate — increase the LB timeout to 120s. Short-term — optimize the parser. Long-term — implement chunked uploads with progress resumption."

Notice: the Round 2 analysis is dramatically better than Round 1. Agents reference each other by name, build on each other's findings, and the Coordinator can synthesize a cross-layer causal chain that no single agent could have produced.

Along the way I also made a small adjustment for an issue with Azure Web Apps; the details below come from testing carried out using Google Gemini.

8. Bugs I Found & Fixed — Lessons Learned

Building this demo taught me several important lessons about multi-agent systems:

Bug 1: Agents Speaking Only Once
Symptom: Only 4 messages instead of 8.
Root cause: The agents list was missing the Coordinator.
It was defined in a separate cell and wasn't included in the members list.
Fix: All 4 agents must be in the same list passed to GroupChatOrchestration.

Bug 2: NetworkAnalyst Says "I'm Ready to Proceed"
Symptom: NetworkAnalyst's Round 2 response was just "I'm ready to proceed with the analysis" — no actual content.
Root cause: The Coordinator's Round 1 message was assigning tasks ("NetworkAnalyst, please check the load balancer config"), and the agent was acknowledging the assignment instead of analyzing.
Fix: Added an explicit constraint to the Coordinator: "Round 1 MUST NOT assign tasks — observation + questions ONLY."

Bug 3: ServerAnalyst Says "As NetworkAnalyst, I..."
Symptom: ServerAnalyst's response started with "As NetworkAnalyst, I believe..."
Root cause: LLM identity bleeding. When agents share ChatHistory, the LLM sometimes loses track of which agent it's currently playing. This is especially common with Gemini.
Fix: Identity anchoring at the very top of every agent's instructions: "You are ONLY ServerAnalyst. You must NEVER speak as ClientAnalyst, NetworkAnalyst, or Coordinator."

Bug 4: Gemini Gives Thin/Empty Responses
Symptom: Some agents responded with just one sentence or "I concur."
Root cause: Gemini 2.5 Flash is more concise than GPT-4o by default. Without explicit length requirements, it takes shortcuts.
Fix: Added "Every response MUST be at least 200 words" and "Answer the Coordinator's questions" to every specialist's instructions.

Bug 5: Coordinator's Report Is 18K Characters
Symptom: The Coordinator's Round 2 response was absurdly long — repeating everything every specialist said.
Fix: Added word limits: "Round 1 max 300 words, Round 2 max 800 words" and "Synthesize, don't echo."

Bug 6: MAX_ROUNDS Math
Symptom: With MAX_ROUNDS=9, ClientAnalyst spoke a 3rd time after the Coordinator's final report — breaking the clean 2-round structure.
Fix: MAX_ROUNDS must equal (number of agents × number of rounds). For 4 agents × 2 rounds = 8.

9. Running with Different AI Providers

The beauty of SK's Strategy Pattern is that you change ONE LINE to switch providers. Everything else — agents, orchestration, callbacks, validation — stays identical.

Gemini setup:

from semantic_kernel.connectors.ai.google.google_ai import GoogleAIChatCompletion

service = GoogleAIChatCompletion(
    gemini_model_id="gemini-2.5-flash",
    api_key=os.getenv("GOOGLE_AI_API_KEY"),
)

OpenAI setup:

from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion

service = OpenAIChatCompletion(
    ai_model_id="gpt-4o",
    api_key=os.getenv("OPEN_AI_API_KEY"),
)

10. What to Build Next

Add Plugins to Agents

Give agents real tools, not just LLM reasoning — looks exciting, right? ;)

import subprocess
from semantic_kernel.functions import kernel_function

class NetworkDiagnosticPlugin:
    @kernel_function(description="Pings a host and returns latency")
    def ping(self, host: str) -> str:
        result = subprocess.run(["ping", "-c", "3", host], capture_output=True, text=True)
        return result.stdout

class LogSearchPlugin:
    @kernel_function(description="Searches server logs for error patterns")
    def search_logs(self, pattern: str, hours: int = 1) -> str:
        # Query your log aggregator (Splunk, ELK, Azure Monitor)
        return query_logs(pattern, hours)

Add Filters for Governance

Intercept every agent call for PII redaction and audit logging:

from semantic_kernel.filters import FilterTypes

@kernel.filter(filter_type=FilterTypes.FUNCTION_INVOCATION)
async def audit_filter(context, next):
    print(f"[AUDIT] {context.function.name} called by agent")
    await next(context)
    print(f"[AUDIT] {context.function.name} returned")

Try Different Orchestration Patterns

Replace GroupChat with Sequential for a pipeline approach:

# Instead of debate, each agent builds on the previous
orchestration = SequentialOrchestration(
    members=[client_agent, network_agent, server_agent, coordinator_agent]
)

Or Concurrent for parallel analysis:

# All specialists analyze simultaneously, Coordinator aggregates
orchestration = ConcurrentOrchestration(
    members=[client_agent, network_agent, server_agent]
)

Deploy to Azure

Move from InProcessRuntime to Azure Container Apps for production scaling. The agent code doesn't change — only the runtime.

Summary

The key insight from building this demo: multi-agent systems produce better results than single agents not because each agent is smarter, but because the debate structure forces cross-domain thinking that a single prompt can never achieve. The Coordinator's final report consistently identifies causal chains that span client, network, and server layers — exactly the kind of insight that production incident response teams need.

Semantic Kernel makes this possible with clean separation of concerns: agents define WHAT to analyze, orchestration defines HOW they interact, the manager defines WHO speaks when, the runtime handles WHERE it executes, and callbacks let you OBSERVE everything. Each piece is independently swappable — that's the power of SK from Microsoft.

Resources:
GitHub: github.com/microsoft/semantic-kernel
Docs: learn.microsoft.com/semantic-kernel
Orchestration Patterns: learn.microsoft.com/semantic-kernel/frameworks/agent/agent-orchestration
Discord: aka.ms/sk/discord

Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information.

Thanks for reading this blog! I hope you found it helpful and informative for building AI agents with SK (Semantic Kernel) 😀

How Netstar Streamlined Fleet Monitoring and Reduced Custom Integrations with Drasi
When a high-value container goes silent between waystations, logistics teams lose critical visibility, risking delays that can cascade into port congestion and missed connections. Netstar, a connected fleet solutions provider supporting customers like Maersk, faced this challenge as its operations scaled. Timely notifications of delays, arrivals, and status changes became critical to keeping cargo moving efficiently through port systems.

To address growing integration complexity and the need for real-time responsiveness, Netstar adopted Drasi. Drasi, built for change-driven solutions, provides continuously updated query results and automated reactions to data changes, enabling systems to detect and respond to critical changes as they happen. This shift to Drasi became foundational to how Netstar unified its fleet data, reduced engineering overhead, and improved monitoring workflows.

The Fragmentation Challenge

Growing operational complexity made an underlying challenge increasingly apparent. Tracking a container's journey from pickup to port terminal required reconciling data such as vehicle identifiers, waypoints, GPS location feeds, and IoT telemetry signals from siloed systems. With each new operational or business requirement, whether monitoring vehicle health or detecting route deviations, development teams found themselves repeatedly rebuilding similar patterns.

"We were essentially rebuilding the same integration architecture for every use case," explains Daniel Joubert, General Manager and technical lead at Netstar. "One week we'd build a dashboard for location tracking. The next week, we'd build another one for breakdown detection. The engineering overhead was unsustainable."

Batch-based processing compounded the issue. Critical signals such as missed health reports or route delays can surface long after they occur, potentially limiting Netstar's ability to take timely action.
Introducing Drasi for Change-driven Architecture

Rather than continue building point solutions, Netstar adopted Drasi as the backbone of its real-time data architecture. Drasi simplifies systems that must detect, evaluate, and react to data changes quickly and efficiently at scale, aligning directly with Netstar's needs.

A Unified, Continuously Updated View

Drasi connected directly to Netstar's existing data sources: Azure SQL databases for information such as vehicle identifiers and waypoints, and Azure Event Hubs for GPS location data and IoT telemetry. Drasi Continuous Queries joined this information into a single, always-current operational picture. Instead of multiple custom-built pipelines, Netstar gained a single source of truth for its fleet.

Using Drasi Reactions, Netstar defined actions that trigger when specific events occur. When a truck fails to send a health signal within its expected window, or when a delay notification indicates potential supply chain disruption, the system responds immediately without human intervention, reducing the likelihood of missed events.

Improvements Enabled by Drasi

Using the Drasi plugin for Grafana, Netstar consolidated results from Continuous Queries into one monitoring interface. Operators no longer reconciled conflicting views across separate tools; they now track vehicle health, location, alerts, and route deviations in real time from a single dashboard.

"The transformation was remarkable," says Dustyn Lightfoot, Solution Architect. "We were able to use a single Drasi instance to support multiple business use cases without building new infrastructure or writing additional code, for example, to stand up Blazor websites. More importantly, it eliminated the ongoing maintenance burden of managing dozens of custom pipelines."

Drasi's flexibility also extended beyond fleet tracking.
By attaching an additional data source and defining new Continuous Queries, the same Drasi instance now surfaces changes in customer billing status and legal contracts. This work required no new infrastructure, just connecting the source and writing queries (leveraging Drasi's custom Delta functions), providing business teams with up-to-date information without a separate integration effort.

Measurable Impact

Netstar reports tangible improvements across engineering operations and real-time responsiveness:

- Faster incident response: Missing health signals now trigger alerts immediately rather than being discovered later through manual checks, improving the speed and reliability of operational response.
- Improved logistics coordination: Real-time visibility into container movement through waystations and toward port terminals has enabled Netstar and partners like Maersk to coordinate shipments more efficiently, with automated alerts keeping all stakeholders informed as conditions change.
- Reduced development overhead: Using Drasi has reduced the amount of custom development previously needed to support fleet monitoring capabilities. The same Drasi-driven architecture now supports multiple business cases, from tracking and health monitoring to route optimization.
- Streamlined operator experience: Teams moved from several monitoring tools to a single Drasi-powered Grafana interface, simplifying daily operations and eliminating time spent reconciling conflicting data from different systems.

Industry Context and What's Next

Demand for real-time supply chain visibility has intensified as global logistics disruptions highlight the risks of delayed reporting. "Our customers don't just want historical reports anymore. They need to know what's happening right now and be alerted the moment something changes," Daniel Joubert explains. "That shift from batch processing to continuous monitoring is becoming table stakes in fleet management."
Building on this foundation, Netstar is now investigating how Drasi can support predictive maintenance: spotting patterns in vehicle health data early enough to prevent failures altogether. The same change-driven architecture could also streamline coordination across broader supply chain workflows.

The Broader Architectural Shift

Netstar's implementation reflects a wider architectural move emerging across operational solutions: from systems that store and query data to platforms that detect and react to changes as they happen. In fleet logistics, financial systems, and industrial operations, the competitive advantage increasingly lies in eliminating the lag between event and response.

"Building custom integrations for every use case was slowing us down and limiting what we could deliver to customers," Dustyn Lightfoot reflects. "Drasi gave us a reusable foundation that handles the hard parts, integrating disparate data sources and detecting meaningful changes, so we can focus on solving business problems rather than rebuilding infrastructure."

The collaboration between Drasi and Netstar demonstrates how open source change-driven platforms can simplify complex operational challenges whilst providing actionable insights across distributed systems. As logistics operations evolve, architectures like Drasi's may define the next era of competitive advantage: one where actionable insight arrives the moment conditions change.

To learn more about Drasi, visit Drasi.io.

Slow response times in different regions
I have a website which is primarily for people in Asia and uses Front Door. Microsoft say that content served through Front Door is hosted in POPs all over the world, but Grafana checks show consistently bad performance in Asia. Ping response times are consistently low from London but around 150ms from Singapore, frequently spiking to over 500ms. While London is closer to where the origin is hosted, I wouldn't expect pings to go to the origin but to be handled by Front Door. Is there any way I can verify that the site is being propagated to regional POPs in the APAC area?

How to Fix Azure Event Grid Entra Authentication issue for ACS and Dynamics 365 integrated Webhooks
Introduction:

Azure Event Grid is a powerful event routing service that enables event-driven architectures in Azure. When delivering events to webhook endpoints, security becomes paramount. Microsoft provides a secure webhook delivery mechanism using Microsoft Entra ID (formerly Azure Active Directory) authentication through the AzureEventGridSecureWebhookSubscriber role.

Problem Statement:

When integrating Azure Communication Services with Dynamics 365 Contact Center using Microsoft Entra ID-authenticated Event Grid webhooks, the Event Grid subscription deployment fails with the error "HTTP POST request failed with unknown error code", with an empty HTTP status and code.

Important Note: Before moving forward, please verify that you have the Owner role assigned on the app to create the event subscription. Refer to the Microsoft guidelines below to validate the required prerequisites before proceeding: Set up incoming calls, call recording, and SMS services | Microsoft Learn

Why This Happens:

This happens because the AzureEventGridSecureWebhookSubscriber role is NOT properly configured on the Microsoft EventGrid SP (Service Principal) and on the Entra ID user or application that is trying to create the Event Grid subscription.
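To see why the missing role assignment breaks delivery, it helps to look at what the webhook side ultimately checks. The sketch below is illustrative, not Microsoft's implementation: the audience value is a made-up placeholder, and real code would first validate the token's signature against Entra's signing keys. It only shows the authorization decision made on the already-decoded claims:

```python
# Illustrative sketch (assumptions: claim names follow standard Entra app-role
# tokens; signature validation is omitted). The webhook authorizes a caller by
# checking the audience plus the app-role claim described in this article.
REQUIRED_ROLE = "AzureEventGridSecureWebhookSubscriber"

def is_authorized(claims: dict, expected_audience: str) -> bool:
    """Authorize a decoded Entra token: right audience + required app role."""
    if claims.get("aud") != expected_audience:
        return False  # token was issued for a different application
    return REQUIRED_ROLE in claims.get("roles", [])

# Event Grid's delivery token carries the role only if its service principal
# was actually assigned it — which is exactly what is missing in this scenario.
assigned = {"aud": "api://my-webhook-app", "roles": [REQUIRED_ROLE]}
unassigned = {"aud": "api://my-webhook-app", "roles": []}

print(is_authorized(assigned, "api://my-webhook-app"))    # True
print(is_authorized(unassigned, "api://my-webhook-app"))  # False
```

When the role assignment in the steps below is absent, the token either lacks the roles claim or is never issued, and the subscription handshake fails.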
What is the AzureEventGridSecureWebhookSubscriber Role:

The AzureEventGridSecureWebhookSubscriber is an Azure Entra application role that:
- Enables your application to verify the identity of event senders
- Allows specific users/applications to create event subscriptions
- Authorizes Event Grid to deliver events to your webhook

How It Works:
1. Role Creation: You create this app role in your destination webhook application's Azure Entra registration.
2. Role Assignment: You assign this role to:
   - The Microsoft Event Grid service principal (so it can deliver events)
   - The Entra ID user or the event-subscription-creator application (so they can create Event Grid subscriptions)
3. Token Validation: When Event Grid delivers events, it includes an Azure Entra token with this role claim.
4. Authorization Check: Your webhook validates the token and checks for the role.

Key Participants:

1. Webhook Application (Your App)
   - Purpose: Receives and processes events
   - App Registration: Created in Azure Entra
   - Contains: The AzureEventGridSecureWebhookSubscriber app role
   - Validates: Incoming tokens from Event Grid

2. Microsoft Event Grid Service Principal
   - Purpose: Delivers events to webhooks
   - App ID: Different per Azure cloud (Public, Government, etc.); Public Azure: 4962773b-9cdb-44cf-a8bf-237846a00ab7
   - Needs: The AzureEventGridSecureWebhookSubscriber role assigned

3. Event Subscription Creator (Entra User or Application)
   - Purpose: Creates event subscriptions
   - Could be: You, your deployment pipeline, an admin tool, or another application
   - Needs: The AzureEventGridSecureWebhookSubscriber role assigned

Although the full PowerShell script is documented in the Event Grid documentation below, it may be complex to interpret and troubleshoot.
Azure PowerShell - Secure WebHook delivery with Microsoft Entra Application in Azure Event Grid - Azure Event Grid | Microsoft Learn

To improve accessibility, the following section provides a simplified, step-by-step, tested solution along with verification steps suitable for all users, including non-technical ones:

STEP 1: Verify/Create the Microsoft.EventGrid Service Principal
1. Azure Portal → Microsoft Entra ID → Enterprise applications
2. Change the filter to Application type: Microsoft Applications
3. Search for: Microsoft.EventGrid

Ideally, your Azure subscription should include this application ID, which is common across all Azure subscriptions: 4962773b-9cdb-44cf-a8bf-237846a00ab7. If this application ID is not present, please contact your Azure Cloud Administrator.

STEP 2: Create the App Role "AzureEventGridSecureWebhookSubscriber"

Using the Azure Portal:
1. Navigate to your Webhook App Registration: Azure Portal → Microsoft Entra ID → App registrations
2. Click All applications
3. Find your app by searching OR use the Object ID you have, then click on your app
4. Create the App Role: in the left menu, click App roles, then + Create app role, and fill in the form:
   - Display name: AzureEventGridSecureWebhookSubscriber
   - Allowed member types: Both (Users/Groups + Applications)
   - Value: AzureEventGridSecureWebhookSubscriber
   - Description: Azure Event Grid Role
   - Do you want to enable this app role?: Yes
5. Click Apply

STEP 3: Assign YOUR USER to the Role

Using the Azure Portal:
1. Switch to the Enterprise Application view: Azure Portal → Microsoft Entra ID → Enterprise applications
2. Search for your webhook app (by name) and click on it
3. Assign yourself:
   - In the left menu, click Users and groups
   - Click + Add user/group
   - Under Users, click None Selected
   - Search for your user account (use your email), select yourself, and click Select
   - Under Select a role, click None Selected
   - Select AzureEventGridSecureWebhookSubscriber and click Select
   - Click Assign

STEP 4: Assign the Microsoft.EventGrid Service Principal to the Role

This step MUST be done via PowerShell or Azure CLI (the Portal doesn't support this directly, as we have seen), so PowerShell is recommended. You will need to execute this step with the help of your Entra admin.

# Connect to Microsoft Graph
Connect-MgGraph -Scopes "AppRoleAssignment.ReadWrite.All"

# Replace this with your webhook app's Application (client) ID
$webhookAppId = "YOUR-WEBHOOK-APP-ID-HERE" # starting with c5

# Get your webhook app's service principal
$webhookSP = Get-MgServicePrincipal -Filter "appId eq '$webhookAppId'"
Write-Host " Found webhook app: $($webhookSP.DisplayName)"

# Get the Event Grid service principal
$eventGridSP = Get-MgServicePrincipal -Filter "appId eq '4962773b-9cdb-44cf-a8bf-237846a00ab7'"
Write-Host " Found Event Grid service principal"

# Get the app role
$appRole = $webhookSP.AppRoles | Where-Object {$_.Value -eq "AzureEventGridSecureWebhookSubscriber"}
Write-Host " Found app role: $($appRole.DisplayName)"

# Create the assignment
New-MgServicePrincipalAppRoleAssignment `
    -ServicePrincipalId $eventGridSP.Id `
    -PrincipalId $eventGridSP.Id `
    -ResourceId $webhookSP.Id `
    -AppRoleId $appRole.Id

Write-Host "Successfully assigned Event Grid to your webhook app!"

Verification Steps:
1. Verify the App Role was created: Your App Registration → App roles. You should see: AzureEventGridSecureWebhookSubscriber
2. Verify your user assignment: Enterprise application (your webhook app) → Users and groups. You should see your user with the role AzureEventGridSecureWebhookSubscriber
3. Verify the Event Grid assignment: Same location → Users and groups. You should see Microsoft.EventGrid with the role AzureEventGridSecureWebhookSubscriber

Sample Flow: Analogy for Simplification

Let's think of it like a construction site where you are the owner of the building.
Building = Azure Entra app (webhook app)

Building (Azure Entra App Registration for Webhook)
├─ Building Name: "MyWebhook-App"
├─ Building Address: Application ID
├─ Building Owner: You
├─ Security System: App Roles (the security badges you create)
└─ Security Team: Azure Entra and your actual webhook auth code (which validates tokens), like a doorman

Step 1: Create the badge (App role)

You (the building owner) create a special badge:
- Badge name: "AzureEventGridSecureWebhookSubscriber"
- Badge color: Let's say it's GOLD
- Who can have it: Companies (Applications) and People (Users)

This badge is stored in your building's system (Webhook App Registration).

Step 2: Give the badge to the Event Grid service

Event Grid: "Hey, I need to deliver messages to your building"
You: "Okay, here's a GOLD badge for your SP"
Event Grid: *wears the badge*

Now Event Grid can:
- Show the badge to Azure Entra
- Get tokens that say "I have the GOLD badge"
- Deliver messages to your webhook

Step 3: Give the badge to yourself (or your deployment tool)

You also need a GOLD badge because:
- You want to create Event Grid event subscriptions
- Entra checks: "Does this person have a GOLD badge?"
- If yes: You can create subscriptions
- If no: "Access denied"

Your deployment pipeline also gets a GOLD badge:
- So it can automatically set up event subscriptions during CI/CD deployments

Disclaimer: The sample scripts provided in this article are provided AS IS without warranty of any kind. The author is not responsible for any issues, damages, or problems that may arise from using these scripts. Users should thoroughly test any implementation in their environment before deploying to production. Azure services and APIs may change over time, which could affect the functionality of the provided scripts. Always refer to the latest Azure documentation for the most up-to-date information.

Thanks for reading this blog!
I hope you found it helpful and informative for this specific integration use case 😀

Change AVD Insights Workbooks to use VM Insights Queries
Currently the AVD Insights workbooks rely on the Microsoft Monitoring Agent, which is being retired. Please release an AVD Insights solution that works with and utilises the VM Insights capabilities of Azure Monitor. This will allow us to decommission the use of the MMA.

Event-Driven to Change-Driven: Low-cost dependency inversion
Event-driven architectures tout scalability, loose coupling, and eventual consistency. The architectural patterns are sound, the theory is compelling, and the blog posts make it look straightforward.

Then you implement it. Suddenly you're maintaining separate event stores, implementing transactional outboxes, debugging projection rebuilds, versioning events across a dozen microservices, and writing mountains of boilerplate to handle what should be simple queries. Your domain events that were supposed to capture rich business meaning have devolved into glorified database change notifications. Downstream services diff field values to extract intent from "OrderUpdated" events because developers just don't get what constitutes a proper domain event.

The complexity tax is real. Don't get me wrong, it's all very elegant, but for many systems it's unjustified. Drasi offers an alternative: change-driven architecture that delivers reactive, real-time capabilities across multiple data sources without requiring you to rewrite your application or overcomplicate your architecture.

What do we mean by "event-driven" architecture?

As Martin Fowler notes, event-driven architecture isn't a single pattern; it's at least four distinct patterns that are often confused, each with its own benefits and traps.

Event Notification is the simplest form. Here, events act as signals that something has happened, but carry minimal data, often just an identifier. The recipient must query the source system for more details if needed. For example, a service emits an OrderPlaced event with just the order ID. Downstream consumers must query the order service to retrieve full order details.

Event-Carried State Transfer broadcasts full state changes through events. When an order ships, you publish an OrderShipped event containing all the order details. Downstream services maintain their own materialized views or projections by consuming these events.
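To make the payload difference between these two patterns concrete, here is a minimal sketch in Python; the class and field names are illustrative assumptions, not from any specific framework:

```python
from dataclasses import dataclass, field

# Event Notification: a thin signal carrying little more than an identifier.
# Consumers must call back to the order service for the details they need.
@dataclass
class OrderPlacedNotification:
    order_id: str

# Event-Carried State Transfer: the event carries the full state change,
# so consumers can maintain their own projection without calling back.
@dataclass
class OrderShippedEvent:
    order_id: str
    customer_id: str
    items: list = field(default_factory=list)
    shipping_address: str = ""

thin = OrderPlacedNotification(order_id="ord-42")
fat = OrderShippedEvent(order_id="ord-42", customer_id="cust-7",
                        items=["sku-1", "sku-2"],
                        shipping_address="221B Baker St")
```

The trade-off shows up immediately: thin events keep payloads small but create runtime dependencies on the source service, while fat events let consumers stay autonomous at the cost of larger, schema-coupled payloads.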
Event Sourcing goes further: events become your source of truth. Instead of storing current state, you store the sequence of events that led to that state. Your order isn't a row in a database; it's the sum of OrderPlaced, ItemAdded, PaymentProcessed, and OrderShipped events.

CQRS (Command Query Responsibility Segregation) separates write operations (commands) from read operations (queries). While not inherently event-driven, CQRS is often paired with event sourcing or event-carried state transfer to optimize for scalability and maintainability. Originally derived from Bertrand Meyer's Command-Query Separation principle and popularized by Greg Young, CQRS addresses a specific architectural challenge: the tension between optimizing for writes versus optimizing for reads.

The pattern promises several benefits:

- Optimized data models: your write model can focus on transactional consistency while read models optimize for query performance
- Scalability: read and write sides can scale independently
- Temporal queries: with event sourcing, you get time travel for free—reconstruct state at any point in history
- Audit trail: every change is captured as an immutable event

While CQRS isn't inherently tied to Domain-Driven Design (DDD), the pattern complements DDD well. In DDD contexts, CQRS enables different bounded contexts to maintain their own read models tailored to their specific ubiquitous language, while the write model protects domain invariants. This is why you'll often see them discussed together, though each can be applied independently.

The core motivation for these patterns is often to invert the dependency between systems, so that your downstream services do not need to know about your upstream services.
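The command/query split described above can be sketched in a few lines. All names below are illustrative assumptions rather than any particular framework's API: the write side records events as its source of truth, and a read-side projection is rebuilt from them.

```python
from dataclasses import dataclass

# Write side: commands mutate the transactional model and emit events.
@dataclass
class PlaceOrder:
    order_id: str
    items: list

class WriteModel:
    def __init__(self):
        self.events = []  # append-only event log (the source of truth)

    def handle(self, cmd: PlaceOrder):
        # Protect domain invariants here, then record the fact as an event.
        assert cmd.items, "an order must contain at least one item"
        self.events.append(("OrderPlaced", cmd.order_id, tuple(cmd.items)))

# Read side: a projection optimized for queries, rebuilt from the events.
class OpenOrdersProjection:
    def __init__(self):
        self.open_orders = {}

    def apply(self, event):
        kind, order_id, items = event
        if kind == "OrderPlaced":
            self.open_orders[order_id] = items

write = WriteModel()
write.handle(PlaceOrder("ord-1", ["sku-9"]))

read = OpenOrdersProjection()
for e in write.events:
    read.apply(e)

print(read.open_orders)  # {'ord-1': ('sku-9',)}
```

Even this toy version hints at the plumbing cost the article describes: a command class, a handler, an event, and a projection, all to answer "which orders are open?".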
The Developer's Struggle: When Domain Events Become Database Events

Chris Kiehl puts it bluntly in his article "Don't Let the Internet Dupe You, Event Sourcing is Hard": "The sheer volume of plumbing code involved is staggering—instead of a friendly N-tier setup, you now have classes for commands, command handlers, command validators, events, aggregates, and then projections, model classes, access classes, custom materialization code, and so on."

But the real tragedy isn't the boilerplate; it's what happens to those carefully crafted domain events. When developers are disconnected from the real-world business, they struggle to understand the nuances of domain events, and a dangerous pattern emerges. Instead of modeling meaningful business processes, teams default to what they know: CRUD. Your event stream starts looking like this:

- OrderCreated
- OrderUpdated
- OrderUpdated (again)
- OrderUpdated (wait, what changed?)
- OrderDeleted

As one developer noted on LinkedIn, these "CRUD events" are really just "leaky events that lack clarity and should not be used to replicate databases as this leaks implementation details and couples services to a shared data model."

Dennis Doomen, reflecting on real-world production issues, observes: "It's only once you have a living, breathing machine, users which depend on you, consumers which you can't break, and all the other real-world complexities that plague software projects that the hard problems in event sourcing will rear their heads."

The result? Your elegant event-driven architecture devolves into an expensive, brittle form of self-maintained Change Data Capture (CDC). You're not modeling business processes; you're just broadcasting database mutations with extra steps.

The Anti-Corruption Layer: Your Defense Against the Outside World

In DDD, an Anti-Corruption Layer (ACL) protects your bounded context from external models that would corrupt your domain.
Think of it as a translator that speaks both languages: the messy external model and your clean internal model. The ACL ensures that changes to the external system don't ripple through your domain. If the legacy system changes its schema, you update the translator, not your entire domain model.

When Event Taxonomies Become Your ACL (And Why They Fail)

In most event-driven architectures, your event taxonomy is supposed to serve as the shared contract between services. Each service publishes events using its own ubiquitous language, and consumers translate these into their own models; this translation is the ACL. The theory looks beautiful.

But reality? Most teams end up with this: instead of OrderPaid events that carry business meaning, we get OrderUpdated events that force every consumer to reconstruct intent by diffing fields. When you change your database schema, say splitting the orders table or switching from SQL to NoSQL, every downstream service breaks because they're all coupled to your internal data model.

You haven't built an anti-corruption layer. You've built a corruption pipeline that efficiently distributes your internal implementation details across the entire system, forcing you to deploy all services in lock step and eroding the decoupling benefits you were supposed to get.

Enter Drasi: Continuous Queries

This is where Drasi changes the game. Instead of publishing events and hoping downstream services can make sense of them, Drasi tails the changelog of the data source itself and derives meaning through continuous queries. A continuous query in Drasi isn't just a query that runs repeatedly; it's a living, breathing projection that reacts to changes in real-time.

Here's the key insight: instead of imperative code that processes events ("when this happens, do that"), you write declarative queries that describe the state you care about ("I want to know about orders that are ready and have drivers waiting").
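As an illustrative contrast (plain Python, not Drasi's actual Cypher-based query language), compare hand-rolled imperative event handling with declaring the state you care about. All names and data shapes below are assumptions made up for the sketch:

```python
# Imperative: react to each event, maintain caches, correlate by hand.
order_status, vehicle_loc, matches = {}, {}, []

def on_order_event(order_id, status):
    order_status[order_id] = status
    check(order_id)

def on_vehicle_event(order_id, location):
    vehicle_loc[order_id] = location
    check(order_id)

def check(order_id):
    # Correlation logic lives in the consumer and must handle arrival order.
    if order_status.get(order_id) == "ready" and \
       vehicle_loc.get(order_id) == "curbside":
        matches.append(order_id)

on_order_event("o1", "ready")
on_vehicle_event("o1", "curbside")

# Declarative: describe the state you care about as a query over current data.
orders = [{"id": oid, "status": s} for oid, s in order_status.items()]
vehicles = [{"order_id": oid, "location": l} for oid, l in vehicle_loc.items()]
ready_at_curbside = [o["id"] for o in orders for v in vehicles
                     if o["status"] == "ready"
                     and v["order_id"] == o["id"]
                     and v["location"] == "curbside"]

print(matches, ready_at_curbside)  # ['o1'] ['o1']
```

Both produce the same answer, but the imperative version owns caches, handlers, and ordering concerns forever; the declarative version only states the condition and leaves change detection to the engine.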
Let's break down what makes this powerful.

Declarative vs. Imperative

Traditional event processing:

Drasi continuous query:

Semantic Mapping from Low-Level Changes

Drasi excels at transforming database-level changes into business-meaningful events. You're not reacting to "row updated in orders table"; you're reacting to "order ready for curbside pickup." This enables the same core benefits of dependency inversion we get from event-driven architectures, but at a fraction of the effort.

Advanced Temporal Features

Remember those developers struggling with "OrderUpdated" events, trying to figure out if something just happened or has been true for a while? Drasi handles this elegantly:

This query only fires when a driver has been waiting for more than 10 minutes: no timestamp tracking, no state machines, no complex event correlation logic. Imagine trying to manually implement this in a downstream event consumer. 😱

Cross-Source Aggregation Without Code

With Drasi, you can have live projections across PostgreSQL, MySQL, SQL Server, and Cosmos DB as if they were a single graph:

No custom aggregation service. No event stitching logic. No custom downstream datastore to track the sum or keep a materialized projection. Just a query.

Continuous Queries as Your Shared Contract

Drasi's continuous queries, combined with pre-processing middleware, can form the shared contract that your anti-corruption layer can depend on. The continuous query becomes your contract. Downstream systems don't know or care whether orders come from PostgreSQL, MongoDB, or a CSV file. They don't know if you normalized your database, denormalized it, or moved to event sourcing. They just consume the query results. Clean, semantic, and stable.

Reactions as Your Declarative Consumers

Drasi does not simply output a stream of raw change diffs; instead it has a library of interchangeable Reactions that can act on the output of continuous queries.
These are declared using YAML and can do anything from hosting a WebSocket endpoint that provides a live projection to your UI, to calling an HTTP endpoint or publishing a message on a queue.

Example: The Curbside Pickup System

Let's see how this works in Drasi's curbside pickup tutorial. This example has two independent databases and serves as a great illustration of a real-time projection built from multiple upstream services.

The Business Problem

A retail system needs to:
- Match ready orders with drivers who've arrived at pickup zones
- Alert staff when drivers wait more than 10 minutes without their order being ready
- Coordinate data from two different systems (retail ops in PostgreSQL, physical ops in MySQL)

Traditional Event-Driven Approach

In this architecture, you'd need something like:

That's just the happy path. We haven't handled:
- Event ordering issues
- Partial failures
- Cache invalidation
- Service restarts and replay
- Duplicate events
- Transactional outboxing

The Drasi Approach

With Drasi, the entire aggregation service above becomes two queries.

Delivery Dashboard Query:

Wait Detection Query:

That's it. No event handlers. No caching. No timers. No state management. Drasi handles:
- Change detection across both databases
- Correlation between orders and vehicles
- Temporal logic for wait detection
- Pushing updates to dashboards via SignalR

The queries define your business logic declaratively. When data changes in either database, Drasi automatically re-evaluates the queries and triggers reactions for any changes in the result set.

Drasi: The Non-Invasive Alternative to Legacy System Rewrites

Here's perhaps the most compelling argument for Drasi: it doesn't require you to rewrite anything.
Traditional event sourcing means:
- Redesigning your application around events
- Rewriting your persistence layer
- Implementing transactional outboxes
- Managing snapshots and replays
- Training your team on new patterns with a steep learning curve
- Migrating existing data to event streams
- Building projection infrastructure
- Updating all consumers to handle events

As one developer noted about their event sourcing journey: "Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about."

Drasi's approach:
- Keep your existing databases
- Keep your existing services
- Keep your existing deployment model
- Add continuous queries where you need reactive behavior
- Get the benefits of dependency inversion
- Gradually migrate complexity from code to queries

You can start with a single query on a single table and expand from there. No big bang. No feature freeze. No three-month architecture sprint or large multi-year investments full of risk.

Migration Example: From Polling to Reactive

Let's say you have a legacy order system where a scheduled job polls for ready orders every 30 seconds:

With Drasi, you:
1. Point Drasi at your existing database
2. Write the continuous query
3. Update your dashboard to receive pushes instead of polls
4. Turn off the polling job

Your database hasn't changed. Your order service hasn't changed. You've just added a reactive layer on top that eliminates polling overhead and reduces notification latency from 30 seconds to milliseconds.

The intellectually satisfying complexity of event sourcing often obscures a simple truth: most systems don't need it. They need to know when interesting things change in their data and react accordingly. They need to combine data from multiple sources without writing bespoke aggregation services. They need to transform low-level changes into business-meaningful events. Drasi delivers these capabilities without the ceremony.
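The mechanic the article describes can be shown in miniature: re-evaluate a declarative query whenever source data changes, and surface only the diff in the result set. This is a toy sketch under that assumption, not Drasi's actual implementation:

```python
# Toy sketch (assumption: not Drasi's implementation): re-run a declarative
# query on every source-data change and emit only the result-set diff, which
# is what reactions would fire on.

def ready_orders(orders):
    """The continuous query: IDs of orders that are ready for pickup."""
    return {o["id"] for o in orders if o["status"] == "ready"}

class ContinuousQuery:
    def __init__(self, query):
        self.query = query
        self.last = set()

    def on_change(self, orders):
        now = self.query(orders)
        added, removed = now - self.last, self.last - now
        self.last = now
        return added, removed  # reactions fire only on these diffs

cq = ContinuousQuery(ready_orders)
orders = [{"id": "o1", "status": "preparing"}]

first = cq.on_change(orders)   # nothing ready yet, so no diff
print(first)                   # (set(), set())

orders[0]["status"] = "ready"  # a source-database change arrives
second = cq.on_change(orders)
print(second)                  # ({'o1'}, set())
```

The consumer never sees "row updated"; it only sees "o1 entered the ready set", which is the business-meaningful change.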
Where Do We Go from Here?

If you're building a new system and your team has deep event sourcing experience, embrace the pattern. Event sourcing shines for certain domains. But if you're like many teams, trying to add reactive capabilities to existing systems, struggling with data synchronization across services, or finding that your "events" are just CRUD operations in disguise, consider the change-driven approach.

Start small:
1. Identify one painful polling loop or batch job
2. Set up Drasi to monitor those same data sources
3. Write a continuous query that captures the business condition
4. Replace the polling with push-based reactions
5. Measure the reduction in latency, overhead, and code complexity

The best architecture isn't the most sophisticated one; it's the one your team can understand, maintain, and evolve. Sometimes that means acknowledging that we've been mid-curving it with overly complex event-driven architectures. Drasi and change-driven architecture offer the power of reactive systems without the complexity tax. Your data changes. Your queries notice. Your systems react. It makes it a non-event.

Want to explore Drasi further? Check out the official documentation and try the curbside pickup tutorial to see change-driven architecture in action.

PAAS resource metrics using Azure Data Collection Rule to Log Analytics Workspace
Hi Team,

I want to build a use case to pull Azure PaaS resource metrics using an Azure DCR and push those metrics to a Log Analytics workspace, which will eventually push the data to Azure Event Hubs via streaming, with Azure Postgres as the final destination to store all the resource metrics in a centralized table and build KPIs and dashboards that help clients utilize their resources better.

I have not used the option of enabling diagnostic settings, since it has its cons: we would need to manually enable the settings on each resource, and the information extracted from diagnostic settings is limited. But while implementing this, I saw multiple articles stating that DCRs are not used for pulling PaaS metrics and are only compatible with VM metrics.

I want to understand: is it possible to use a DCR for PaaS metrics? Thanks in advance for any inputs.