Three tiers of Agentic AI - and when to use none of them
Every enterprise has an AI agent. Almost none of them work in production.

Walk into any enterprise technology review right now and you will find the same thing. Pilots running. Demos recorded. Steering committees impressed. And somewhere in the background, a quiet acknowledgment that the thing does not actually work at scale yet. OutSystems surveyed nearly 1,900 global IT leaders and found that 96% of organizations are already running AI agents in some capacity. Yet only one in nine has those agents operating in production at scale. The experiments are everywhere. The production systems are not.

That gap is not a capability problem. The infrastructure has matured. Tool calling is standard across all major models. Frameworks like LangGraph, CrewAI, and Microsoft Agent Framework abstract orchestration logic. Model Context Protocol standardizes how agents access external tools and data sources. Google's Agent-to-Agent protocol, now under Linux Foundation governance with over 50 enterprise technology partners including Salesforce, SAP, ServiceNow, and Workday, standardizes how agents coordinate with each other. The protocols are in place. The frameworks are production ready.

The gap is a selection and governance problem. Teams are building agents on problems that do not need them, choosing the wrong tier for the ones that do, and treating governance as a compliance checkbox to add after launch rather than an architectural input to design in from the start. The same OutSystems research found that 94% of organizations are concerned that AI sprawl is increasing complexity, technical debt, and security risk, and only 12% have a centralized approach to managing it. Teams are deploying agents the way shadow IT spread through enterprises a decade ago: fast, fragmented, and without a shared definition of what production-ready actually means.

I've built agentic systems across enterprise clients in logistics, retail, and B2B services. The failures I keep seeing are not technology failures. They are architecture and judgment failures: problems that existed before the first line of code was written, in the conversation where nobody asked the prior question. This article is the framework I use before any platform conversation starts.

What has genuinely shifted in the agentic landscape

Three changes are shaping how enterprise agent architecture should be designed today, and they are not incremental improvements on what existed before.

The first is the move from single agents to multi-agent systems. Databricks' State of AI Agents report, drawing on data from over 20,000 organizations including more than 60% of the Fortune 500, found that multi-agent workflows on their platform grew 327% in just four months. This is not experimentation. It is production architecture shifting. A single agent handling everything (routing, retrieval, reasoning, execution) is being replaced by specialized agents coordinating through defined interfaces. A financial organization, for example, might run separate agents for intent classification, document retrieval, and compliance checking, each narrow in scope, each connected to the next through a standardized protocol rather than tightly coupled code.

The second is protocol standardization. MCP handles vertical connectivity: how agents access tools, data sources, and APIs through a typed manifest and standardized invocation pattern. A2A handles horizontal connectivity: how agents discover peer agents, delegate subtasks, and coordinate workflows. Production systems today use both.
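To make that division of labor concrete, the sketch below shows the kind of artifact each protocol standardizes. These are simplified, illustrative shapes rather than verbatim spec schemas, and every name and URL in them is invented for the example.

# Vertical connectivity (MCP): a tool a server advertises in its typed manifest,
# discoverable by any compliant agent runtime at startup.
mcp_tool = {
    "name": "get_shipment_status",
    "description": "Look up the current status of a shipment by ID.",
    "inputSchema": {  # JSON Schema typing the tool's parameters
        "type": "object",
        "properties": {"shipment_id": {"type": "string"}},
        "required": ["shipment_id"],
    },
}

# Horizontal connectivity (A2A): a capability card an agent publishes so that
# peer agents can discover it and delegate subtasks to it.
agent_card = {
    "name": "compliance-checker",
    "description": "Evaluates documents against regulatory checklists.",
    "url": "https://agents.example.com/compliance",
    "skills": [
        {"id": "doc-compliance-check",
         "description": "Check a document against a named rule set."},
    ],
}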
The practical consequence is that multi-agent architectures can be composed and governed as a platform rather than managed as a collection of one-off integrations.

The third is governance as the differentiating factor between teams that ship and teams that stall. Databricks found that companies using AI governance tools get over 12 times more AI projects into production compared to those without. The teams running production agents are not running more sophisticated models. They built evaluation pipelines, audit trails, and human oversight gates before scaling, not after the first incident.

Tier 1 - Low-code agents: fast delivery with a defined ceiling

The low-code tier is more capable than it was eighteen months ago. Copilot Studio, Salesforce Agentforce, and equivalent platforms now support richer connector libraries, better generative orchestration, and more flexible topic models. The ceiling is higher than it was. It is still a ceiling.

The core pattern remains: a visual topic model drives a platform-managed LLM that classifies intent and routes to named execution branches. Connectors abstract credential management and API surface. A business team (analyst, citizen developer, IT operations) can build, deploy, and iterate without engineering involvement on every change. For bounded conversational problems, this is the fastest path from requirement to production.

The production reality is documented clearly. Gartner data found that only 5% of Copilot Studio pilots moved to larger-scale deployment. A European telecom with dedicated IT resources and a full Microsoft enterprise agreement spent six months and did not deliver a single production agent. The visual builder works. The path from prototype to production (production-grade integrations, error handling, compliance logging, exception routing) is where most enterprises get stuck, because it requires Power Platform expertise that most business teams do not have.

The platform ceiling shows up predictably at four points. Async processing: anything beyond a synchronous connector call, including approval chains, document pipelines, or batch operations, cannot be handled natively. Full payload audit logs: platform logs give conversation transcripts and connector summaries, not structured records of every API call and its parameters. Production volume: concurrency limits and message throughput budgets bind faster than planning assumptions suggest. Root cause analysis in production: you cannot inspect the LLM's confidence score or the alternatives it considered, which makes diagnosing misbehavior significantly harder than it should be.

The correct diagnostic: can this use case be owned end-to-end by a business team, covered by standard connectors, with no latency SLA below three seconds and no payload-level compliance requirement? If yes, low code is the correct tier, not a compromise. If no on any point, continue.

If low-code is the right call for your use case: Copilot Studio quickstart

Tier 2 - Pro-code agents: the architecture the current landscape demands

The defining pattern in production pro-code architecture today is multi-agent: specialized agents per domain, coordinating through MCP for tool access and A2A for peer-to-peer delegation, with a governance layer spanning the entire system. What this looks like in practice: a financial organization handling incoming compliance queries runs separate agents for intent classification, document retrieval, and the compliance check itself. None of these agents tries to do all three jobs.
Each has a narrow responsibility, a defined input/output contract typed against a JSON Schema, and a clear handoff boundary. The 327% growth in multi-agent workflows reflects production teams discovering that the failure modes of monolithic agents (topic collision, context overflow, degraded classification as scope expands) are solved by specialization, not by making a single agent more capable.

The discipline that makes multi-agent systems reliable is identical to what makes single-agent systems reliable, just enforced across more boundaries: the LLM layer reasons and coordinates; deterministic tool functions enforce. In a compliance pipeline, no LLM decides whether a document satisfies a regulatory requirement. That evaluation runs in a deterministic tool with a versioned rule set, testable outputs, and an immutable audit log. The LLM orchestrates the sequence. The tool produces the compliance record. Mixing these (letting an LLM evaluate whether a rule passes) collapses the audit trail and introduces probabilistic outputs on questions that have regulatory answers.

MCP is the tool interface standard today. An MCP server exposes a typed manifest any compliant agent runtime can discover at startup. Tools are versioned, independently deployable, and reusable across agents without bespoke integration code. A2A extends this horizontally: agents advertise capability cards, discover peers, and delegate subtasks through a standardized protocol. The practical consequence is that multi-agent systems built on both protocols can be composed and governed as a platform rather than managed as a collection of one-off integrations.

Observability is the architectural element that separates teams shipping production agents from teams perpetually in pilot. Build evaluation pipelines, distributed traces across all agent boundaries, and human review gates before scaling. The teams that add these after the first production incident spend months retrofitting what should have been designed in.

If pro-code is the right call for your use case: Foundry Agent Service

The hybrid pattern: still where production deployments land

The shift to multi-agent architecture does not change the hybrid pattern; it deepens it. Low-code at the conversational surface, pro-code multi-agent systems behind it, with a governance layer spanning both.

On a logistics client engagement, the brief was a sales assistant for account managers: shipment status, account health, and competitive context inside Teams. The business team wanted everything in Copilot Studio. Engineering wanted a custom agent runtime. Both were wrong.

What we built: Copilot Studio handled all high-frequency, low-complexity queries (shipment tracking, account status, open cases) through Power Platform connectors. Zero custom code. That covered roughly 78% of actual interaction volume. Requests requiring multi-source reasoning (competitive positioning on a specific lane, churn risk across an account portfolio, contract renewal analysis) were delegated via authenticated HTTP action to a pro-code multi-agent service on Azure. A retrieval agent pulled deal history and market intelligence through MCP-exposed tools. A synthesis agent composed the recommendation with confidence scoring. Structured JSON went back to the low-code layer, rendered as an adaptive card in Teams, as sketched below.

The HITL gate was non-negotiable and designed before deployment, not added after the first incident. No output reached a customer without a manager approval step. The agent drafts. A human sends.
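For illustration, the structured payload handed back to the low-code layer might look like the following; the field names are hypothetical, not the client's actual contract. The point is that the pro-code service returns typed, renderable data rather than free text.

# Hypothetical response contract from the pro-code service to Copilot Studio.
# The low-code layer renders this as an adaptive card and routes it into
# the manager-approval step before anything reaches a customer.
recommendation = {
    "summary": "Churn risk on this account is elevated; prioritize renewal outreach.",
    "confidence": 0.72,  # synthesis agent's confidence score
    "evidence": ["deal history", "market intelligence"],  # pulled via MCP-exposed tools
    "requires_approval": True,  # HITL gate: the agent drafts, a human sends
}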
This boundary (low-code for conversational volume, pro-code for reasoning depth) maps directly to what the research shows separates teams that ship from teams that stall. The organizations running agents in production drew the line correctly between what the platform can own and what engineering needs to own. Then they built governance into both sides before scaling.

The four gates - the prior question that still gets skipped

Run every candidate use case through these four checks before the platform conversation begins. None of the recent infrastructure improvements change what they are checking, because none of them change the fundamental cost structure of agentic reasoning.

Gate 1 - is the logic fully deterministic? If every valid output for every valid input can be enumerated in unit tests, the problem does not need an LLM. A rules engine executes in microseconds at zero inference cost and cannot produce a plausible-but-wrong answer. NeuBird AI's production ops agents, which have resolved over a million alerts and saved enterprises over $2 million in engineering hours, work because alert triage logic that can be expressed as rules runs in deterministic code, and the LLM only handles cases where pattern-matching is insufficient. That boundary is not incidental to the system's reliability. It is the reason for it.

Gate 2 - is zero hallucination tolerance required? With over 80% of databases now being built by AI agents, per Databricks' State of AI Agents report, the surface area for hallucination-induced data errors has grown significantly. In domains where a wrong answer is a compliance event (financial calculation, medical logic, regulatory determinations), irreducible LLM output uncertainty is disqualifying regardless of model version or prompt engineering effort. Exit to deterministic code or classical ML with bounded output spaces.

Gate 3 - is a sub-100ms latency SLA required? LLM inference is faster than it was eighteen months ago. It is not fast enough for payment transaction processing, real-time fraud scoring, or live inventory management. A three-agent system with MCP tool calls has a P50 latency measured in seconds. These problems need purpose-built transactional architecture.

Gate 4 - is regulatory explainability required? A2A enables complex agent coordination and delegation. It does not make LLM reasoning reproducible in a regulatory sense. Temperature above zero means the same input produces different outputs across invocations. Regulators in financial services, healthcare, and consumer credit require deterministic, auditable decision rationale. Exit to deterministic workflow with structured audit logging at every decision point.

Five production failure modes - one of them new

The four original anti-patterns are still showing up in production. A fifth has been added by scale.

Routing data retrieval through a reasoning loop. A direct API call returns account status in under 10ms. Routing the same request through an LLM reasoning step adds hundreds of milliseconds, consumes tokens on every call, and introduces output parsing on data that is already structured. The agent calls a structured tool. The tool calls the API. The agent never acts as the integration layer.

Encoding business rules in prompts. Rules expressed in prompt text drift as models update. They produce probabilistic output across invocations and fail in ways that are difficult to reproduce and diagnose.
A rule that must evaluate correctly every time belongs in a deterministic tool function: unit-tested, version-controlled, independently deployable via MCP.

No approval gate on CRUD operations. CRUD operations without a human approval step will eventually misfire on the input that testing did not cover. The gate needs to be designed before deployment, not added after the first incident involving a financial posting, a customer-facing communication, or a data deletion.

Monolithic agent for all domains. A single agent accumulating every domain leads predictably to topic collision, context overflow, and maintenance that becomes impossible as scope expands. Specialized agents per domain, coordinating through A2A, is the architecture that scales.

Ungoverned agent sprawl. This is the new one and currently the most prevalent. OutSystems found 94% of organizations concerned about it, with only 12% having a centralized response. Teams building agents independently across fragmented stacks, without shared governance, evaluation standards, or audit infrastructure, produce exactly the same organizational debt that shadow IT created, but with higher stakes, because these systems make autonomous decisions rather than just storing and retrieving data. The fix is treating governance as an architectural input before deployment, not a compliance requirement after something breaks.

The infrastructure is ready. The judgment is not.

The tier decision sequence has not changed. Does the problem need natural language understanding or dynamic generation? No: deterministic system, stop. Can a business team own it through standard connectors with no sub-3-second latency SLA and no payload-level compliance requirement? Yes: low-code. Does it need custom orchestration, multi-agent coordination, or audit-grade observability? Yes: pro-code with MCP and A2A. Does it need both a conversational surface and deep backend reasoning? Hybrid, with a governance layer spanning both.

What has changed is that governance is no longer optional infrastructure to add when you have time. The data is unambiguous. Companies with governance tools get over 12 times more AI projects into production than those without. Evaluation pipelines, distributed tracing across agent boundaries, human oversight gates, and centralized agent lifecycle management are not overhead. They are what converts experiments into production systems. The teams still stuck in pilot are not stuck because the technology failed them. They are stuck because they skipped this layer.

The protocols are standardized. The frameworks are mature. The infrastructure exists. None of that is what is holding most enterprise agent programs back. What is holding them back is a selection problem disguised as a technology problem: teams building agents before asking whether agents are warranted, choosing platforms before running the four gates, and treating governance as a checkpoint rather than an architectural input.

I have built agents that should have been workflow engines. Not because the technology was wrong, but because nobody stopped early enough to ask whether it was necessary. The four gates in this article exist because I learned those lessons at clients' expense, not mine. The most useful thing I can offer any team starting an agentic AI project is not a framework selection guide. It is permission to say no, and a clear basis for saying it. Take the four gates framework to your next architecture review.
If you have already shipped agents to production, I would like to hear what worked and what did not: comment below.

What to do next

Three concrete steps depending on where you are right now.

If you have pilots that have not reached production: Run them through the four gates in this article before the next sprint. Gate 1 alone will eliminate a meaningful percentage of them. The ones that survive all four are your real candidates for production investment. Download the attached four-gates checklist and take it into your next architecture review.

If you are starting a new agent project: Do not open a platform before you have answered the gate questions. Once you have confirmed an agent is warranted and identified the tier, start here: Copilot Studio guided setup for low-code scenarios, or Foundry Agent Service for pro-code patterns with MCP and multi-agent coordination built in. Build governance infrastructure (evaluation pipeline, distributed tracing, HITL gates) before you scale, not after.

If you have already shipped agents to production: Share what worked and what did not in the Azure AI Tech Community, tagged #AgentArchitecture. The most useful signal for teams still in pilot is hearing from practitioners who have been through production, not vendors describing what production should look like.

References

- OutSystems - State of AI Development Report - https://www.outsystems.com/1/state-ai-development-report
- Databricks - State of AI Agents Report - https://www.databricks.com/resources/ebook/state-of-ai-agents
- Gartner - 2025 Microsoft 365 and Copilot Survey - https://www.gartner.com/en/documents/6548002 (paywalled primary source; publicly reported via techpartner.news: https://www.techpartner.news/news/gartner-microsoft-copilot-hype-offset-by-roi-and-readiness-realities-618118)
- Anthropic - Model Context Protocol (MCP) - https://modelcontextprotocol.io
- Google Cloud - Agent-to-Agent Protocol (A2A) - https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability
- NeuBird AI - Production Operations Deployment Announcement - NeuBird AI Closes $19.3M Funding Round to Scale Agentic AI Across Enterprise Production Operations
- Yao et al. - ReAct: Synergizing Reasoning and Acting in Language Models - https://arxiv.org/abs/2210.03629
- Gregor Hohpe & Bobby Woolf - Enterprise Integration Patterns, Addison-Wesley - https://www.enterpriseintegrationpatterns.com

Automate Prior Authorization with AI Agents - Now Available as a Foundry Template
By Amit Mukherjee · Principal Solutions Engineer, Microsoft Health & Life Sciences
Lindsey Craft-Goins · Technology Leader - Cloud & AI Platforms, Health & Life Sciences
Joel Borellis · Director Solutions Engineering - Cloud & AI Platforms, Health & Life Sciences

Prior authorization (PA) is one of the most expensive bottlenecks in U.S. healthcare. Physicians complete an average of 39 PA requests per week, spending roughly 13 hours of physician-and-staff time on PA-related work (AMA 2024 Prior Authorization Physician Survey). Turnaround averages 5–14 business days, and PA alone accounts for an estimated $35 billion in annual administrative spending (Sahni et al., Health Affairs Scholar, 2024).

The regulatory clock is now ticking. CMS-0057-F mandates electronic PA with 72-hour urgent response starting in 2026. Forty-nine states plus DC already have PA laws on the books, and at least half of all U.S. state legislatures introduced new PA reform bills this year, including laws specifically targeting AI use in PA decisions (KFF Health News, April 2026).

Today we're making the Prior Authorization Multi-Agent Solution Accelerator available as a Microsoft Foundry template. Health plan payers can deploy a working, four-agent PA review pipeline to Azure using the Azure Developer CLI ("azd") with a single command in supported environments, then customize it to their policies, workflows, and EHR environment.

Try it now: Find the template in the Foundry template gallery, or clone directly from github.com/microsoft/Prior-Authorization-Multi-Agent-Solution-Accelerator

What the template delivers

The accelerator deploys four specialist Foundry hosted agents (Compliance, Clinical Reviewer, Coverage, and Synthesis), each independently containerized and managed by Foundry. In internal testing with synthetic demo cases, the pipeline completed the review workflow, start to finish, in under 5 minutes per case.

Agent | Role | Key capability
Compliance | Documentation check | 10-item checklist with blocking/non-blocking flags
Clinical Reviewer | Clinical evidence | ICD-10 validation, PubMed + ClinicalTrials.gov search
Coverage | Policy matching | CMS NCD/LCD lookup, per-criterion MET/NOT_MET mapping
Synthesis | Decision rubric | 3-gate APPROVE/PEND with weighted confidence scoring

Compliance and Clinical run in parallel. Coverage runs after clinical findings are ready. Synthesis evaluates all three outputs through a three-gate rubric. The result is a structured recommendation with per-criterion confidence scores and a full audit trail, not a black-box answer.

Solution architecture

The accelerator runs entirely on Azure. The frontend and backend deploy as Azure Container Apps. The four specialist agents are hosted by Microsoft Foundry. Real-time healthcare data flows through third-party MCP servers.

Figure 1: Azure solution architecture

How the pipeline works

The four agents execute in a structured parallel-then-sequential pipeline. Compliance and Clinical run simultaneously in Phase 1. Coverage runs after clinical findings are ready. The Synthesis agent applies a three-gate decision rubric over all prior outputs.

Figure 2: Agentic architecture, hosted agent pipeline

Compliance and Clinical run in parallel via asyncio.gather, since neither depends on the other. Coverage runs sequentially after Clinical because it needs the structured clinical profile for criterion mapping.
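A minimal sketch of that parallel-then-sequential dispatch pattern follows. The helper name and payloads are illustrative, not the accelerator's actual code, and the HTTP call to each hosted agent is simulated with a sleep.

import asyncio

async def run_agent(name: str, payload: dict) -> dict:
    # Stand-in for the FastAPI backend's HTTP dispatch to a Foundry-hosted agent.
    await asyncio.sleep(0.1)  # simulate agent latency
    return {"agent": name, "findings": f"{name} output"}

async def review_case(case: dict) -> dict:
    # Phase 1: Compliance and Clinical are independent, so run them concurrently.
    compliance, clinical = await asyncio.gather(
        run_agent("compliance", case),
        run_agent("clinical", case),
    )
    # Phase 2: Coverage needs the structured clinical profile for criterion mapping.
    coverage = await run_agent("coverage", {**case, "clinical_profile": clinical})
    # Phase 3: Synthesis applies the three-gate rubric over all prior outputs.
    # Wall time = slowest Phase 1 agent + Coverage + Synthesis, not the sum of all four.
    return await run_agent(
        "synthesis",
        {"compliance": compliance, "clinical": clinical, "coverage": coverage},
    )

result = asyncio.run(review_case({"case_id": "demo-001"}))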
Synthesis evaluates all three outputs through a three-gate rubric (Provider, Codes, Medical Necessity) with weighted confidence scoring: 40% coverage criteria + 30% clinical extraction + 20% compliance + 10% policy match. The total pipeline time is bound by the slowest parallel agent plus the sequential agents, not the sum. In internal testing with synthetic demo cases, this architecture showed materially reduced processing time compared to sequential manual workflows.

Under the hood

For the architect in the room, here are four design decisions worth knowing about:

Foundry hosted agents: Each agent is independently containerized, versioned, and managed by Foundry's runtime. The FastAPI backend is a pure HTTP dispatcher. All reasoning happens inside the agent containers, and there are no code changes between local (Docker Compose) and production (Foundry); the environment variable is the only switch.

Structured output: Every agent uses MAF's response_format enforcement to produce typed Pydantic schemas at the token level. No JSON parsing, no malformed fences, no free-form text. The orchestrator receives typed Python objects; the frontend receives a stable API contract. (See the sketch after this list.)

Keyless security: DefaultAzureCredential throughout, so no API keys are stored anywhere. Managed Identity handles production; azd tokens handle local development. Role assignments are provisioned automatically by Bicep at deploy time.

Observability: All agents emit OpenTelemetry traces to Azure Application Insights. The Foundry portal shows per-agent spans correlated by case ID. End-to-end latency, per-agent contribution, and error rates are visible from day one with no additional configuration.

For the full architecture documentation, agent specifications, Pydantic schemas, and extension guides, see the GitHub repository.
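To make the structured-output idea concrete, here is a minimal Pydantic sketch of the kind of typed contract an agent might return. The model names, fields, and policy ID below are illustrative, not the accelerator's actual schemas (those are in the repository).

from typing import Literal
from pydantic import BaseModel

class CriterionResult(BaseModel):
    criterion: str
    status: Literal["MET", "NOT_MET"]
    confidence: float  # per-criterion confidence score

class CoverageFindings(BaseModel):
    policy_id: str
    criteria: list[CriterionResult]

# The orchestrator validates agent output into a typed object; malformed or
# free-form output fails loudly here instead of propagating downstream.
raw = '{"policy_id": "NCD-20.4", "criteria": [{"criterion": "documented diagnosis", "status": "MET", "confidence": 0.93}]}'
findings = CoverageFindings.model_validate_json(raw)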
Why this matters now

Human-in-the-loop by design. The system runs in LENIENT mode by default: it produces only APPROVE or PEND and is not designed to produce automated DENY outcomes in its default configuration. Every recommendation requires a clinician to Accept or Override with documented rationale before the decision is finalized. Override records flow to the audit PDF, notification letters, and downstream systems. This directly addresses the emerging wave of state legislation governing AI use in PA decisions.

Domain experts own the rules. Agent behavior is defined in markdown skill files, not Python code. When CMS updates a coverage determination or a plan changes its commercial policy, a clinician or compliance officer edits a text file and redeploys. No engineering PR required.

Real-time healthcare data via MCP. Agents connect to five MCP servers for real-time data: ICD-10 codes, NPI Registry, CMS Coverage policies, PubMed, and ClinicalTrials.gov. This incorporates real-time clinical reference data sources to inform agent recommendations. Third-party MCP servers are included for demonstration with synthetic data only; their inclusion does not constitute an endorsement by Microsoft. See the GitHub repository for production migration guidance.

Audit-ready from day one. Every case generates an 8-section audit justification PDF with per-criterion evidence, data source attribution, timestamps, and confidence breakdowns. Clinician overrides are recorded in Section 9. Notification letters (approval and pend) are generated automatically. These artifacts are designed to support CMS-0057-F documentation requirements.

Deploy in under 15 minutes

From the Foundry template gallery or from the command line:

git clone https://github.com/microsoft/Prior-Authorization-Multi-Agent-Solution-Accelerator
cd Prior-Authorization-Multi-Agent-Solution-Accelerator
azd up

That single command provisions Foundry, Azure Container Registry, Container Apps, builds all Docker images, registers the four agents, and runs health checks. The demo is live with a synthetic sample case as soon as deployment completes.

What's included:
- 4 Foundry hosted agents
- FastAPI orchestrator + Next.js frontend
- 5 MCP healthcare data connections
- Audit PDF + notification letter generation
- Full Bicep infrastructure-as-code
- OpenTelemetry + App Insights observability

What you customize:
- Payer-specific coverage policies
- EHR/FHIR integration for clinical notes
- Self-hosted MCP servers for production PHI
- Authentication (Microsoft Entra ID)
- Persistent storage (Cosmos DB / PostgreSQL)
- Additional agents (Pharmacy, Financial)

Built on Microsoft Foundry + Foundry hosted agents · Microsoft Agent Framework (MAF) · Azure OpenAI gpt-5.4 · Azure Container Apps · Azure Developer CLI + Bicep · OpenTelemetry + Azure Application Insights · DefaultAzureCredential (keyless, no secrets)

Full architecture documentation, agent specifications, and extension guides are in the GitHub repository.

Get started

Foundry template gallery: Search "AI-Powered Prior Authorization for Healthcare" in the Foundry template section
GitHub: github.com/microsoft/Prior-Authorization-Multi-Agent-Solution-Accelerator

Disclaimers

Not a medical device. This solution accelerator is not a medical device, is not FDA-cleared, and is not intended for autonomous clinical decision-making. All AI recommendations require qualified clinical review before any authorization decision is finalized.

Not production-ready software. This is an open-source reference architecture (MIT License), not a supported Microsoft product. Customers are solely responsible for testing, validation, regulatory compliance, security hardening, and production deployment.

Performance figures are illustrative. Metrics cited (including processing time reductions) are based on internal testing with synthetic demo data. Actual results will vary based on case complexity, infrastructure, and configuration.

Third-party services are included for demonstration only and are not endorsed by Microsoft. Customers should evaluate providers against their compliance and data residency requirements.

The demo uses synthetic data only. Customers deploying real patient data are responsible for HIPAA compliance and establishing appropriate Business Associate Agreements. This accelerator is intended to help customers align documentation workflows with CMS-0057-F requirements but has not been independently validated or certified for regulatory compliance.

Troubleshooting Microsoft Foundry Accessing On-Premises APIs Over Private Networking
Audience: Azure solution architects, network engineers, and AI practitioners deploying Microsoft Foundry in enterprise, network-isolated environments.

Scenario: Foundry Agent Service must call an on-premises (or privately hosted) API over VPN or ExpressRoute using on-premises corporate DNS. Connectivity works from a virtual machine (VM) in the virtual network (VNet), but fails when the same call is made from a Foundry agent.

This post consolidates common field patterns observed across customer engagements and maps them directly to official Microsoft guidance. It highlights a frequently missed prerequisite for private connectivity: Project and Agent Capability Hosts.

The Repeating Enterprise Pattern

The architecture is familiar:
- Microsoft Foundry account and project
- Customer-managed VNet with VPN or ExpressRoute to on-premises
- Corporate (on-premises) DNS used for API name resolution
- A VM in the VNet can successfully resolve and call the on-prem API
- Foundry agents are configured to call the same API (often via an OpenAPI tool)

Despite this, the agent fails with one or more of the following:
- DNS resolution failures
- Connection timeouts
- HTTP 401 or 403 responses
- Unexpected backend or proxy URLs appearing in logs

The consistent question is: why does this work from a VM in the VNet but not from the Foundry agents?

Key Principle: Foundry Agents Do Not Automatically Run in Your VNet

This is the most important mental model to reset. Creating a private endpoint for a Foundry Agent Service does not place agent runtime traffic into your VNet. Private endpoints are inbound constructs. Outbound connectivity from an agent only flows through your VNet when specific requirements are met. The most critical of those requirements is a Capability Host.

What Is a Capability Host?

A Capability Host defines where Foundry capabilities are allowed to execute. In private networking scenarios, the capability host:
- Binds a Foundry project or agent to a customer-managed subnet
- Enables platform-managed container injection into that subnet
- Ensures outbound traffic follows VNet routing, security controls, and DNS configuration

Capability Host Scope

Project Capability Host
- Associated at the project level
- Applies to all agents in the project
- Defines the customer subnet the project is allowed to use

Agent Capability Host
- Associated at the individual agent level
- Can explicitly bind or override subnet placement
- Useful when agents require different isolation boundaries

Key field insight: if no capability host is associated, the agent runtime is not injected into the VNet, even if VPN, ExpressRoute, private endpoints, and on-prem DNS are correctly configured.

Capability Host Lifecycle (Conceptual)

Microsoft Foundry Account
  ↓
Project
  ↓
Capability Host
  ↓
Delegated Agent Subnet (VNet)
  ↓
Agent Runtime (Container Injection)

Without the capability host step, the chain breaks and the agent executes outside the customer network boundary.

Why On-Premises DNS Appears Correct, but Still Fails

DNS is where most investigations stall. Teams typically confirm:
- On-premises DNS resolves the API hostname
- VNet DNS settings forward queries to on-prem DNS
- A VM in the subnet resolves and reaches the API

Yet the agent still fails. The reason is simple: the VM is unquestionably inside the VNet; the agent may not be. Without a capability host, the agent runtime does not inherit:
- VNet DNS server settings
- Corporate DNS forwarding rules
- On-premises name resolution paths

As a result, DNS fails even though the DNS design itself is correct.
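A quick way to establish that baseline is to run the same lookup from a test VM in the delegated agent subnet. The sketch below uses only the Python standard library; the hostname is a placeholder for your on-premises API.

import socket

hostname = "api.corp.example.com"  # placeholder name served by corporate DNS

try:
    infos = socket.getaddrinfo(hostname, 443)
    print(f"{hostname} resolves to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as err:
    # If this fails on a VM in the agent subnet, fix VNet DNS forwarding first.
    # If it succeeds on the VM but the agent still fails, suspect a missing
    # capability host rather than the DNS design.
    print(f"Resolution failed: {err}")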
Secondary Symptom: HTTP 401 Errors

After subnet injection and DNS are corrected, some customers encounter HTTP 401 responses. This typically means:
- The API is now reachable
- The request is successfully routed through the private path
- Authentication or authorization is failing

At this stage, troubleshooting moves from networking to identity:
- Validate the credential or token configured in the Foundry connection
- Confirm expected headers, audience, or auth flow
- Account for managed proxy hops in the request path

A 401 at this point is progress: it confirms private connectivity is working.

Permissions and Required Services for Capability Hosts

Creating capability hosts requires both the correct Azure role-based access control (RBAC) permissions and the presence of specific dependent services. These prerequisites are frequently overlooked and can silently block capability host creation or leave hosts in a failed provisioning state.

Required Permissions (RBAC)

Microsoft documentation explicitly calls out the following minimum permissions:
- Contributor role on the Microsoft Foundry account to create capability hosts.
- User Access Administrator or Owner role on the subscription or resource group to grant the Foundry project's managed identity access to dependent Azure resources when using standard agent setup.

Without these roles, capability host creation may fail, or the host may be created without access to required downstream services. For details, see Role-based access control (RBAC) for Microsoft Foundry.

Required Azure Services and Connections

For standard and network-secured agent setups, capability hosts reference customer-owned Azure resources. The following services must exist and be connected to the Foundry project before creating the capability host:
- Azure Storage - for file uploads and artifacts
- Azure AI Search - for vector stores and retrieval
- Azure Cosmos DB - for thread and conversation storage
- Azure AI Services / Azure OpenAI - for model execution

These services must be deployed in supported regions and, for private networking scenarios, in the same region as the virtual network. Capability hosts reference these resources through project connections, not raw resource IDs.

Networking-Specific Requirements

When using private networking:
- The agent subnet must already exist and be delegated appropriately.
- The capability host must reference the correct customer subnet at creation time.
- Required private endpoints and DNS resolution must be in place for dependent services.

If networking or connections change, capability hosts cannot be updated in-place and must be deleted and recreated with the corrected configuration.

Practical Fix Patterns

1. Create and Associate a Project Capability Host
- Bind the project to the intended delegated agent subnet
- Verify the customerSubnet reference
- Redeploy the agent after association
This aligns directly with the Standard Setup with Private Networking model documented by Microsoft.

2. Validate Agent Placement and Network Inheritance
- Confirm the agent is associated with the expected capability host
- Verify the capability host references the correct subnet
- Ensure network and DNS settings are applied at the subnet level
The agent inherits routing and DNS behavior only after successful subnet injection.

3. Validate DNS From the Agent Subnet
- Confirm VNet DNS settings point to on-prem DNS
- Test name resolution from a VM in the same subnet
Once injected, the agent uses the same DNS behavior as other resources in that subnet.
4. Use a Supported Foundry Experience
Be aware of documented constraints:
- End-to-end network isolation is not supported in the newer Foundry portal experience
- Network-isolated agent scenarios require the classic Foundry experience, SDK, or CLI
- Hosted agents do not support full isolation
A mismatch here can make a correct network design appear broken.

A Simple Checklist

When a Foundry agent cannot reach an on-prem API:
1. Is a Project Capability Host associated?
2. Is the capability host bound to the correct subnet?
3. Is on-prem DNS reachable from that subnet?
4. Is a supported Foundry experience in use?
5. If reachable, is authentication configured correctly?

If items 1 and 2 are not satisfied, all other troubleshooting is premature.

Closing

Most private networking issues with Microsoft Foundry are not caused by VPNs or DNS infrastructure. They result from an incomplete understanding of where the agent runtime actually executes. Capability Hosts are the control point. When they are correctly configured, Foundry agents behave exactly as described in Microsoft guidance: they inherit VNet routing, DNS, and security controls and can securely access on-premises systems over private connectivity.

No capability host = no VNet injection = no on-prem connectivity.

References

The following Microsoft-published resources were referenced and aligned with throughout this article:
- Network-secured agent setup (GitHub) - Reference implementation demonstrating a network-secured Foundry Agent Service deployment with a customer-managed virtual network, delegated agent subnet, and private connectivity patterns. This notebook illustrates how agent runtimes inherit network behavior only after subnet injection: ralacher/network-secured-agent
- Set up private networking for Foundry Agent Service - Microsoft Foundry | Microsoft Learn

Vector Drift in Azure AI Search: Three Hidden Reasons Your RAG Accuracy Degrades After Deployment
What Is Vector Drift?

Vector drift occurs when embeddings stored in a vector index no longer accurately represent the semantic intent of incoming queries. Because vector similarity search depends on relative semantic positioning, even small changes in models, data distribution, or preprocessing logic can significantly affect retrieval quality over time. Unlike schema drift or data corruption, vector drift is subtle:
- The system continues to function
- Queries return results
- But relevance steadily declines

Cause 1: Embedding Model Version Mismatch

What Happens
Documents are indexed using one embedding model, while query embeddings are generated using another. This typically happens due to:
- Model upgrades
- Shared Azure OpenAI resources across teams
- Inconsistent configuration between environments

Why This Matters
Embeddings generated by different models:
- Exist in different vector spaces
- Are not mathematically comparable
- Produce misleading similarity scores
As a result, documents that were previously relevant may no longer rank correctly.

Recommended Practice
A single vector index should be bound to one embedding model and one dimension size for its entire lifecycle. If the embedding model changes, the index must be fully re-embedded and rebuilt.

Cause 2: Incremental Content Updates Without Re-Embedding

What Happens
New documents are continuously added to the index, while existing embeddings remain unchanged. Over time, new content introduces:
- Updated terminology
- Policy changes
- New product or domain concepts
Because semantic meaning is relative, the vector space shifts, but older vectors do not.

Observable Impact
- Recently indexed documents dominate retrieval results
- Older but still valid content becomes harder to retrieve
- Recall degrades without obvious system errors

Practical Guidance
Treat embeddings as living assets, not static artifacts:
- Schedule periodic re-embedding for stable corpora
- Re-embed high-impact or frequently accessed documents
- Trigger re-embedding when domain vocabulary changes meaningfully
Declining similarity scores or reduced citation coverage are often early signals of drift.

Cause 3: Inconsistent Chunking Strategies

What Happens
Chunk size, overlap, or parsing logic is adjusted over time, but previously indexed content is not updated. The index ends up containing chunks created using different strategies.

Why This Causes Drift
Different chunking strategies produce:
- Different semantic density
- Different contextual boundaries
- Different retrieval behavior
This inconsistency reduces ranking stability and makes retrieval outcomes unpredictable.

Governance Recommendation
Chunking strategy should be treated as part of the index contract:
- Use one chunking strategy per index
- Store chunk metadata (for example, chunk_version)
- Rebuild the index when chunking logic changes
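A sketch of what that contract can look like in an index definition follows, assuming the azure-search-documents Python SDK (11.4+) and 1,536-dimension embeddings as an example; the index and field names are illustrative.

from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType,
    SearchIndex, SearchableField, SimpleField, VectorSearch, VectorSearchProfile,
)

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    # Record the index contract as filterable metadata so drift is detectable.
    SimpleField(name="embedding_model", type=SearchFieldDataType.String, filterable=True),
    SimpleField(name="chunk_version", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,  # one model, one dimension, for the index lifetime
        vector_search_profile_name="default-profile",
    ),
]

index = SearchIndex(
    name="docs-v1",  # version the index name: a re-embed is a rebuild, not a mutation
    fields=fields,
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw")],
        profiles=[VectorSearchProfile(name="default-profile",
                                      algorithm_configuration_name="hnsw")],
    ),
)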
Design Principles
- Versioned embedding deployments
- Scheduled or event-driven re-embedding pipelines
- Standardized chunking strategy
- Retrieval quality observability
- Prompt and response evaluation

Key Takeaways
- Vector drift is an architectural concern, not a service defect
- It emerges from model changes, evolving data, and preprocessing inconsistencies
- Long-lived RAG systems require embedding lifecycle management
- Azure AI Search provides the controls needed to mitigate drift effectively

Conclusion

Vector drift is an expected characteristic of production RAG systems. Teams that proactively manage embedding models, chunking strategies, and retrieval observability can maintain reliable relevance as their data and usage evolve. Recognizing and addressing vector drift is essential to building and operating robust AI solutions on Azure.

Further Reading

The following Microsoft resources provide additional guidance on vector search, embeddings, and production-grade RAG architectures on Azure.
- Azure AI Search - Vector Search Overview - https://learn.microsoft.com/azure/search/vector-search-overview
- Azure OpenAI - Embeddings Concepts - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/embeddings?view=foundry-classic&tabs=csharp
- Retrieval-Augmented Generation (RAG) Pattern on Azure - https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=videos
- Azure Monitor - Observability Overview - https://learn.microsoft.com/azure/azure-monitor/overview

Simplifying Image Classification with Azure AutoML for Images: A Practical Guide
1. The Challenge of Traditional Image Classification

Anyone who has worked with computer vision knows the drill: you need to classify images, so you dive into TensorFlow or PyTorch, spend days architecting a convolutional neural network, experiment with dozens of hyperparameters, and hope your model generalizes well. It's time-consuming, requires deep expertise, and often feels like searching for a needle in a haystack. What if there was a better way?

2. Enter Azure AutoML for Images

Azure AutoML for Images is a game-changer in the computer vision space. It's a feature within Azure Machine Learning that automatically builds high-quality vision models from your image data with minimal code. Think of it as having an experienced ML engineer working alongside you, handling all the heavy lifting while you focus on your business problem.

What Makes AutoML for Images Special?

1. Automatic Model Selection
Instead of manually choosing between ResNet, EfficientNet, or dozens of other architectures, AutoML for Images (Azure ML) evaluates multiple state-of-the-art deep learning models and selects the best one for your specific dataset. It's like having access to an entire model zoo with an intelligent curator.

2. Intelligent Hyperparameter Tuning
The system doesn't just pick a model; it optimizes it. Learning rates, batch sizes, augmentation strategies, and more are automatically tuned to squeeze out the best possible performance. What would take weeks of manual experimentation happens in hours.

3. Built-in Best Practices
Data preprocessing, augmentation techniques, and training strategies that would require extensive domain knowledge are pre-configured and applied automatically. You get enterprise-grade ML without needing to be an ML expert.

Key Capabilities

The repository demonstrates several powerful features:
- Multi-class and Multi-label Classification: Whether you need to classify an image into a single category or tag it with multiple labels, AutoML manages both scenarios seamlessly.
- Format Flexibility: Works with standard image formats including JPEG and PNG, making it easy to integrate with existing datasets.
- Full Transparency: Unlike black-box solutions, you maintain complete visibility and control over the training process. You can monitor metrics, understand model decisions, and fine-tune as needed.
- Production-Ready Deployment: Once trained, models can be easily deployed to Azure endpoints, ready to serve predictions at scale.

Real-World Applications

The practical applications are vast:
- E-commerce: Automatically categorize product images for better search and recommendations.
- Healthcare: Classify medical images for diagnostic support.
- Manufacturing: Detect defects in production line images.
- Agriculture: Identify crop diseases or estimate yield from aerial imagery.
- Content Moderation: Automatically flag inappropriate visual content.

3. A Practical Example: Metal Defect Detection

The repository includes a complete end-to-end example of detecting defects in metal surfaces, a critical quality control task in manufacturing.
The notebooks demonstrate how to:
- Download and organize image data from sources like Kaggle
- Create training and validation splits with proper directory structure
- Upload data to Azure ML as versioned datasets
- Configure GPU compute that scales based on demand
- Train multiple models with automated hyperparameter tuning
- Evaluate results with comprehensive metrics and visualizations
- Deploy the best model as a production-ready REST API
- Export to ONNX for edge deployment scenarios

The metal defect use case is particularly instructive because it mirrors real industrial applications where quality control is critical but expertise is scarce. The notebooks show how a small team can build production-grade computer vision systems without a dedicated ML research team.

Getting Started: What You Need

The prerequisites are straightforward:
- An Azure subscription (free tier available for experimentation)
- An Azure Machine Learning workspace
- Python 3.7 or later

That's it. No local GPU clusters to configure, no complex deep learning frameworks to master.

Repository Structure

The repository is thoughtfully organized into three progressive notebooks:

1. Downloading images.ipynb
- Shows how to acquire and prepare image datasets
- Demonstrates proper directory structure for classification tasks
- Includes data exploration and visualization techniques
image-classification-azure-automl-for-images/1. Downloading images.ipynb at main · retkowsky/image-classification-azure-automl-for-images

2. Azure ML AutoML for Images.ipynb
- The core workflow: connect to Azure ML, upload data, configure training
- Covers both simple model training and advanced hyperparameter tuning
- Shows how to evaluate models and select the best performing one
- Demonstrates deployment to managed online endpoints
image-classification-azure-automl-for-images/2. Azure ML AutoML for Images.ipynb at main · retkowsky/image-classification-azure-automl-for-images

3. Edge with ONNX local model.ipynb
- Exports trained models to ONNX format
- Shows how to run inference locally without cloud connectivity
- Perfect for edge computing and IoT scenarios
image-classification-azure-automl-for-images/3. Edge with ONNX local model.ipynb at main · retkowsky/image-classification-azure-automl-for-images

Each Python notebook is self-contained with clear explanations, making it easy to understand each step of the process. You can run them sequentially to build a complete solution, or jump to specific sections relevant to your use case.

The Developer Experience

What sets this approach apart is the developer experience. The repository provides Python notebooks that guide you through the entire workflow. You're not just reading documentation; you're working with practical, runnable examples that demonstrate real scenarios. Let's walk through the code to see how straightforward this actually is.

Use-case description

This image classification model is designed to identify and classify defects on metal surfaces in a manufacturing context. We want to classify defective images into Crazing, Inclusion, Patches, Pitted, Rolled & Scratches.
All code and images are available here: retkowsky/image-classification-azure-automl-for-images: Azure AutoML for images - Image classification

Step 1: Connect to Azure ML Workspace

First, establish a connection to your Azure ML workspace using Azure credentials:

import os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

print("Connection to the Azure ML workspace…")
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential,
    os.getenv("subscription_id"),
    os.getenv("resource_group"),
    os.getenv("workspace")
)
print("✅ Done")

That's it.

Step 2: Upload Your Dataset

Upload your image dataset to Azure ML. The code handles this elegantly:

my_images = Data(
    path=TRAIN_DIR,
    type=AssetTypes.URI_FOLDER,
    description="Metal defects images for images classification",
    name="metaldefectimagesds",
)
uri_folder_data_asset = ml_client.data.create_or_update(my_images)

print("🖼️ Informations:")
print(uri_folder_data_asset)
print("\n🖼️ Path to folder in Blob Storage:")
print(uri_folder_data_asset.path)

Your local images are now versioned data assets in Azure, ready for training.

Step 3: Create GPU Compute Cluster

AutoML needs compute power. Here's how you create a GPU cluster that auto-scales:

compute_name = "gpucluster"

try:
    _ = ml_client.compute.get(compute_name)
    print("✅ Found existing Azure ML compute target.")
except ResourceNotFoundError:
    print(f"🛠️ Creating a new Azure ML compute cluster '{compute_name}'…")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="Standard_NC16as_T4_v3",  # GPU VM
        idle_time_before_scale_down=1200,
        min_instances=0,  # Scale to zero when idle
        max_instances=4,
    )
    ml_client.begin_create_or_update(compute_config).result()
    print("✅ Done")

The cluster scales from 0 to 4 instances based on workload, so you only pay for what you use.

Step 4: Configure AutoML Training

Now comes the magic. Here's the entire configuration for an AutoML image classification job using a specific model (here a resnet34). It is also possible to access all the available models from the image classification AutoML library:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models?view=azureml-api-2&tabs=python#supported-model-architectures

image_classification_job = automl.image_classification(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_validation_data_input,
    target_column_name="label",
)

# Set training parameters
image_classification_job.set_limits(timeout_minutes=60)
image_classification_job.set_training_parameters(model_name="resnet34")

That's approximately 10 lines of code to configure what would traditionally require hundreds of lines and deep expertise.

Step 5: Hyperparameter Tuning (Optional)

Want to explore multiple models and configurations?
image_classification_job = automl.image_classification(
    compute=compute_name,                          # Compute cluster
    experiment_name=exp_name,                      # Azure ML job
    training_data=my_training_data_input,          # Training
    validation_data=my_validation_data_input,      # Validation
    target_column_name="label",                    # Target
    primary_metric=ClassificationPrimaryMetrics.ACCURACY,  # Metric
    tags={"usecase": "metal defect",
          "type": "computer vision",
          "product": "azure ML",
          "ai": "image classification",
          "hyper": "YES"},
)

image_classification_job.set_limits(
    timeout_minutes=60,       # Timeout in min
    max_trials=5,             # Max model number
    max_concurrent_trials=2,  # Concurrent training
)

image_classification_job.extend_search_space([
    SearchSpace(
        model_name=Choice(["vitb16r224", "vits16r224"]),
        learning_rate=Uniform(0.001, 0.01),  # LR
        number_of_epochs=Choice([15, 30]),   # Epoch
    ),
    SearchSpace(
        model_name=Choice(["resnet50"]),
        learning_rate=Uniform(0.001, 0.01),  # LR
        layers_to_freeze=Choice([0, 2]),     # Layers to freeze
    ),
])

image_classification_job.set_sweep(
    # Random sampling to select combinations of hyperparameters.
    sampling_algorithm="Random",
    early_termination=BanditPolicy(
        evaluation_interval=2,  # The model is evaluated every 2 iterations.
        slack_factor=0.2,       # A run 20% worse than the best so far may be terminated.
        delay_evaluation=6,     # Wait 6 iterations before evaluating and terminating runs.
    ),
)

AutoML will now automatically try different model architectures, learning rates, and augmentation strategies to find the best configuration.

Step 6: Launch Training

Submit the job and monitor progress:

# Submit the job
returned_job = ml_client.jobs.create_or_update(image_classification_job)
print(f"✅ Created job: {returned_job}")

# Stream the logs in real-time
ml_client.jobs.stream(returned_job.name)

While training runs, you can monitor metrics, view logs, and track progress through the Azure ML Studio UI or programmatically.

Step 7: Results

Step 8: Deploy to Production

Once training completes, deploy the best model as a REST endpoint:

# Create endpoint configuration
online_endpoint_name = "metal-defects-classification"

endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Metal defects image classification",
    auth_mode="key",
    tags={
        "usecase": "metal defect",
        "type": "computer vision"
    },
)

# Deploy the endpoint
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Your model is now a production API endpoint, ready to classify images at scale.

Beyond the Cloud: Edge Deployment with ONNX

One of the most powerful aspects of this approach is flexibility in deployment. The repository includes a third notebook demonstrating how to export your trained model to ONNX (Open Neural Network Exchange) format for edge deployment. This means you can:
- Deploy models on IoT devices for real-time inference without cloud connectivity
- Reduce latency by processing images locally on edge hardware
- Lower costs by eliminating constant cloud API calls
- Ensure privacy by keeping sensitive images on-premises

The ONNX export process is straightforward and integrates seamlessly with the AutoML workflow. Your cloud-trained model can run anywhere ONNX Runtime is supported, from Raspberry Pi devices to industrial controllers.
import onnxruntime

# Load the ONNX model
session = onnxruntime.InferenceSession("model.onnx")

# Run inference locally
input_name = session.get_inputs()[0].name
results = session.run(None, {input_name: image_data})

This cloud-to-edge workflow is particularly valuable for manufacturing, retail, and remote monitoring scenarios where edge processing is essential.

Interactive webapp for image classification

Interpreting model predictions

The deployed endpoint returns a base64-encoded image string if both model_explainability and visualizations are set to True.

Why This Matters

In the AI era, the competitive advantage isn't about who can build the most complex models; it's about who can deploy effective solutions fastest. Azure AutoML for Images democratizes computer vision by making sophisticated ML accessible to a broader audience. Small teams can now accomplish what previously required dedicated ML specialists. Prototypes that took months can be built in days. And the quality? Often on par with or better than manually crafted solutions, thanks to AutoML's systematic approach and access to cutting-edge techniques.

What the Code Reveals

Looking at the actual implementation reveals several important insights:

Minimal Boilerplate: The entire training pipeline, from data upload to model deployment, requires less than 50 lines of meaningful code. Compare this to traditional PyTorch or TensorFlow implementations that often exceed several hundred lines.

Built-in Best Practices: Notice how the code automatically manages concerns like data versioning, experiment tracking, and compute auto-scaling. These aren't afterthoughts; they're integral to the platform.

Production-Ready from Day One: The deployed endpoint isn't a prototype. It includes authentication, scaling, monitoring, and all the infrastructure needed for production workloads. You're building production systems, not demos.

Flexibility Without Complexity: The simple API hides complexity without sacrificing control. Need to specify a particular model architecture? One parameter. Want hyperparameter tuning? Add a few lines. The abstraction level is perfectly calibrated.

Observable and Debuggable: The `.stream()` method and comprehensive logging mean you're never in the dark about what's happening. You can monitor training progress, inspect metrics, and debug issues, all critical for real projects.

The Cost of Complexity

Traditional ML projects fail not because of technology limitations but because of complexity. The learning curve is steep, the iteration cycles are long, and the resource requirements are high. By abstracting away this complexity, AutoML for Images changes the economics of computer vision projects. You can now:
- Validate ideas quickly: Test whether image classification solves your problem before committing significant resources
- Iterate faster: Experiment with different approaches in hours rather than weeks
- Scale expertise: Enable more team members to work with computer vision, not just ML specialists

Conclusion

Image classification is a fundamental building block for countless AI applications. Azure AutoML for Images makes it accessible, practical, and production-ready. Whether you're a seasoned data scientist looking to accelerate your workflow or a developer taking your first steps into computer vision, this approach offers a compelling path forward. The future of ML isn't about writing more complex code; it's about writing smarter code that leverages powerful platforms to deliver business value faster. This repository shows you exactly how to do that.
Practical Tips from the Code

After reviewing the notebooks, here are some key takeaways for your own projects:

Start with a Single Model: The basic configuration with `model_name="resnet34"` is perfect for initial experiments. Only move to hyperparameter sweeps once you've validated your data and use case.

Use Tags Strategically: The code demonstrates adding tags to jobs and endpoints (e.g., `"usecase": "metal defect"`). This becomes invaluable when managing multiple experiments and models in production.

Leverage Auto-Scaling: The compute configuration with `min_instances=0` means you're not paying for idle resources. The cluster scales up when needed and scales down to zero when idle (see the compute sketch after the resource list below).

Monitor Training Live: The `ml_client.jobs.stream()` method is your best friend during development. You see exactly what's happening and can catch issues early.

Version Your Data: Creating named data assets (`name="metaldefectimagesds"`) means your experiments are reproducible. You can always trace back which data version produced which model.

Think Cloud-to-Edge: Even if you're deploying to the cloud initially, the ONNX export capability gives you flexibility for future edge scenarios without retraining.

Resources

Azure ML: https://azure.microsoft.com/en-us/products/machine-learning
Demo notebooks: https://github.com/retkowsky/image-classification-azure-automl-for-images
AutoML for Images documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-image-models
Available models: Set up AutoML for computer vision — Azure Machine Learning | Microsoft Learn
Connect with the author: https://www.linkedin.com/in/serger/
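On the auto-scaling tip above, here is a minimal sketch of what a scale-to-zero cluster definition can look like with the `azure-ai-ml` SDK. The cluster name, VM size, and idle timeout are illustrative values, and `ml_client` is the authenticated `MLClient` from the notebooks:

from azure.ai.ml.entities import AmlCompute

compute = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",         # GPU SKU; pick per quota and region
    min_instances=0,                 # scale to zero when idle -> no idle cost
    max_instances=4,                 # cap for concurrent trials
    idle_time_before_scale_down=600, # seconds before scale-down (illustrative)
)
ml_client.begin_create_or_update(compute).result()

Because `min_instances=0`, the cluster costs nothing between runs; AutoML spins nodes up on job submission and releases them after the idle timeout.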
As organizations scale their use of AI agents, one question keeps surfacing: how much is each agent actually costing us? Not at the subscription level. Not at the resource group level. Per agent, per model, per request.

This post walks through a solution that answers that question by combining three Azure services: Microsoft AI Foundry, Azure API Management (APIM), and Application Insights. Together they form an observable, metered AI gateway with granular token-level telemetry, including custom date ranges longer than a month for deeper analysis.

The Problem: AI Costs Can Be a Black Box

Foundry's built-in monitoring and cost views are ultimately powered by telemetry stored in Application Insights, but the out-of-the-box dashboards don't always provide the exact per-request, per-caller token breakdown or the custom aggregations and joins teams may want for bespoke dashboards (for example, breaking down tokens by APIM subscription, product, tenant, user, route, or agent step).

The pattern here uses APIM to stamp consistent caller/context metadata (headers and claims), Foundry to generate the agent and model run telemetry, and App Insights as the queryable store, letting you correlate gateway, agent-run, and tool/model calls and then build custom KQL-driven dashboards.

With data captured in App Insights, custom KQL queries can answer questions such as:

Which agent consumed the most tokens last week?
What's the average cost per request for a specific agent?
How do prompt tokens vs. completion tokens break down per model?
Is one agent disproportionately expensive compared to others?

Why This Solution Was Built

This solution was built to close the observability gap between "we deployed agents" and "we understand what those agents cost." The goals were straightforward:

Per-agent, per-model cost attribution - Know exactly which agent is consuming what, down to the token.
Real-time telemetry, not batch reports - Metrics flow into Application Insights within minutes; query via KQL.
Zero agent modification - The agents themselves don't need to know about telemetry. The tracking happens at the gateway layer.
Extensibility - Any agent hosted in Microsoft Foundry and exposed through APIM can be added with a single function call.

How It Works

The architecture is intentionally simple: three services, one data flow. The notebook serves as a testing and prototyping environment, but the same `call_agent()` and `track_llm_usage()` code can be lifted directly into any production Python application that calls Foundry agents.

Azure API Management acts as the AI Gateway. Every request to a Foundry-hosted agent flows through APIM, which handles routing, rate limiting, authentication, and tracing. APIM adds its own trace headers (`Ocp-Apim-Trace-Location`) so you can correlate gateway-level diagnostics with your application telemetry. After the API request completes successfully, the necessary data can be extracted from the response headers.

The notebook is designed for testing and rapid iteration: call an agent, inspect the response, verify that telemetry lands in App Insights. It uses `httpx` to call agents through APIM, authenticating with `DefaultAzureCredential` and an APIM subscription key. After each response, it extracts the `usage` object (`input_tokens`, `output_tokens`, `total_tokens`) and calculates an estimated cost based on built-in per-model pricing.

Application Insights receives this telemetry via OpenTelemetry.
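Before looking at the tables, it helps to see the emission side. The sketch below is an approximation of what `track_llm_usage()` can look like, not the repository's exact code: the meter and logger names are illustrative, but the `llm.usage` message and the `custom_dimensions` payload mirror what the KQL queries in the next section parse.

import logging
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

configure_azure_monitor(connection_string=APP_INSIGHTS_CONN)  # once at startup

meter = metrics.get_meter("llm.usage")
token_counter = meter.create_counter("total_tokens")  # lands in customMetrics
cost_counter = meter.create_counter("cost_usd")       # lands in customMetrics
logger = logging.getLogger("llm.usage")

def track_llm_usage(agent_name, model, usage, cost_usd, operation_id):
    attrs = {"agent_name": agent_name, "model": model}
    token_counter.add(usage["total_tokens"], attrs)
    cost_counter.add(cost_usd, attrs)
    # Lands in the `traces` table; the dict value is stringified, which is why
    # the KQL normalizes single quotes before parse_json().
    logger.info("llm.usage", extra={"custom_dimensions": {
        "agent_name": agent_name,
        "model": model,
        "operation_id": operation_id,
        "prompt_tokens": usage["input_tokens"],
        "completion_tokens": usage["output_tokens"],
        "total_tokens": usage["total_tokens"],
        "cost_usd": cost_usd,
    }})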
The solution sends data to two tables:

customMetrics - Cumulative counters for prompt tokens, completion tokens, total tokens, and cost in USD. These power dashboards and alerts.
traces - Structured log entries with `custom_dimensions` containing agent name, model, operation ID, token counts, and cost per request, stored as queryable records in Azure Monitor Logs. These power ad-hoc KQL queries.

Demonstrating Granular Cost and Usage Metrics

This is where the solution shines. Once telemetry is flowing, you can answer detailed questions with simple KQL queries.

Per-Request Detail

Query the `traces` table to see every individual agent call with full token and cost breakdown:

traces
| where message == "llm.usage"
| extend cd = parse_json(replace_string(
    tostring(customDimensions["custom_dimensions"]), "'", "\""))
| extend agent_name = tostring(cd["agent_name"]),
         model = tostring(cd["model"]),
         prompt_tokens = toint(cd["prompt_tokens"]),
         completion_tokens = toint(cd["completion_tokens"]),
         total_tokens = toint(cd["total_tokens"]),
         cost_usd = todouble(cd["cost_usd"])
| project timestamp, agent_name, model, prompt_tokens, completion_tokens, total_tokens, cost_usd
| order by timestamp desc

This gives you a line-item audit trail: every request, every agent, every token.

Aggregated Metrics Per Agent

Summarize across all requests to see averages and totals grouped by agent and model:

traces
| where message == "llm.usage"
| extend cd = parse_json(replace_string(
    tostring(customDimensions["custom_dimensions"]), "'", "\""))
| extend agent_name = tostring(cd["agent_name"]),
         model = tostring(cd["model"]),
         prompt_tokens = toint(cd["prompt_tokens"]),
         completion_tokens = toint(cd["completion_tokens"]),
         total_tokens = toint(cd["total_tokens"]),
         cost_usd = todouble(cd["cost_usd"])
| summarize calls = count(),
            avg_prompt = avg(prompt_tokens),
            avg_completion = avg(completion_tokens),
            avg_total = avg(total_tokens),
            avg_cost = avg(cost_usd),
            total_cost = sum(cost_usd)
  by agent_name, model
| order by total_cost desc

Now you can see at a glance:

Which agent is the most expensive across all calls
Average token consumption per request (useful for prompt optimization)
Prompt-to-completion ratio (a high ratio may indicate verbose system prompts that could be trimmed)
Cost trends by model (is GPT-4.1 worth the premium over GPT-4o-mini for a particular agent?)
The same can be done in code with your custom solution: from datetime import timedelta from azure.identity import DefaultAzureCredential from azure.monitor.query import LogsQueryClient KQL = r""" traces | where message == "llm.usage" | extend cd_raw = tostring(customDimensions["custom_dimensions"]) | extend cd = parse_json(replace_string(cd_raw, "'", "\"")) | extend agent_name = tostring(cd["agent_name"]), model = tostring(cd["model"]), operation_id = tostring(cd["operation_id"]), prompt_tokens = toint(cd["prompt_tokens"]), completion_tokens = toint(cd["completion_tokens"]), total_tokens = toint(cd["total_tokens"]), cost_usd = todouble(cd["cost_usd"]) | project timestamp, agent_name, model, operation_id, prompt_tokens, completion_tokens, total_tokens, cost_usd | order by timestamp desc """ def query_logs(): credential = DefaultAzureCredential() client = LogsQueryClient(credential) resp = client.query_resource( resource_id=APP_INSIGHTS_RESOURCE_ID, # defined in config cell query=KQL, timespan=None, # No time filter — returns all available data (up to 90-day retention) ) if resp.status != "Success": raise RuntimeError(f"Query failed: {resp.status} - {getattr(resp, 'error', None)}") table = resp.tables[0] rows = [dict(zip(table.columns, r)) for r in table.rows] return rows if __name__ == "__main__": rows = query_logs() if not rows: print("No telemetry found. Wait 2-5 min after running the agent cell and try again.") else: print(f"Found {len(rows)} records\n") print(f"{'Timestamp':<28} {'Agent':<16} {'Model':<12} {'Op ID':<12} " f"{'Prompt':>8} {'Completion':>11} {'Total':>8} {'Cost ($)':>10}") print("-" * 110) for r in rows[:20]: ts = str(r.get("timestamp", ""))[:19] print(f"{ts:<28} {r.get('agent_name',''):<16} {r.get('model',''):<12} " f"{r.get('operation_id',''):<12} {r.get('prompt_tokens',0):>8} " f"{r.get('completion_tokens',0):>11} {r.get('total_tokens',0):>8} " f"{r.get('cost_usd',0):>10.6f}") What You Can Build on Top Azure Workbooks - Build interactive dashboards showing cost trends over time, agent comparison charts, and token distribution heatmaps. Alerts - Trigger notifications when a single agent exceeds a cost threshold or when token consumption spikes unexpectedly. Azure Dashboard pinning - Pin KQL query results directly to a shared Azure Dashboard for team visibility. Power BI integration - Export telemetry data for executive-level cost reporting across all AI agents. Extensibility: Add Any Agent in One Line The solution is designed to scale with your agent portfolio. Any agent hosted in Microsoft Foundry and exposed through APIM can be integrated without modifying the telemetry pipeline. Adding a new agent is a single function call: response = call_agent("YourNewAgent", "Your prompt here") Token tracking, cost estimation, and telemetry export happen automatically. No additional configuration, no new infrastructure. From Notebook to Production The notebook is a testing harness, a fast way to validate agent connectivity, inspect raw responses, and confirm that telemetry arrives in App Insights. But the code isn't limited to notebooks. The core functions `call_agent()`, `track_llm_usage()`, and the OpenTelemetry configuration are plain Python. They can be dropped directly into any production application that calls Foundry agents through APIM: FastAPI / Flask web service - Wrap `call_agent()` in an endpoint and get per-request cost tracking out of the box. Azure Functions - Call agents from a serverless function with the same telemetry pipeline. 
Background workers or batch pipelines - Process multiple agent calls and aggregate cost data across runs.
CLI tools or scheduled jobs - Run agent evaluations on a schedule with automatic cost logging.

The pattern stays the same regardless of where the code runs:

# 1. Configure OpenTelemetry + App Insights (once at startup)
configure_azure_monitor(connection_string=APP_INSIGHTS_CONN)

# 2. Call any agent through APIM
response = call_agent("FinanceAgent", "Summarize Q4 earnings")

# 3. Token usage and cost are tracked automatically
#    → customMetrics and traces tables in App Insights

Start with the notebook to prove the pattern works. Then move the same code into your production codebase; the telemetry travels with it.

Key Takeaways

AI cost observability matters. As agent counts grow, per-agent cost attribution becomes essential for budgeting and optimization.
APIM as an AI Gateway gives you routing, rate limiting, and tracing in one place without touching agent code.
OpenTelemetry + Application Insights provides a battle-tested telemetry pipeline that scales from a single notebook to production workloads.
KQL makes the data actionable. Per-request audits, per-agent summaries, and cost trending are all a query away.
The solution is additive, not invasive. Agents don't need modification. The telemetry layer wraps around them.

This approach gives developers the ability to view metrics per user, API key, agent, request or tool call, or business dimensions (cost center, app, environment). If you're running AI agents in Microsoft Foundry and want to understand what they cost at a granular level, this pattern gives you the visibility to make informed decisions about model selection, prompt design, and budget allocation.

The full solution is available on GitHub: https://github.com/ccoellomsft/foundry-agents-apim-appinsights
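As one concrete sketch of the "FastAPI / Flask web service" option listed above, a minimal wrapper might look like this. `call_agent()` and the telemetry setup are the solution's own functions; the route path and request model are illustrative:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AgentRequest(BaseModel):
    agent_name: str
    prompt: str

@app.post("/agents/invoke")
def invoke_agent(req: AgentRequest):
    # call_agent() already records tokens and cost via track_llm_usage(),
    # so every request served here shows up in App Insights automatically.
    response = call_agent(req.agent_name, req.prompt)
    return {"output": response}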
From Knowledgeable to Personalized: Why Memory Matters

Most AI agents today are knowledgeable — they ground responses in enterprise data sources and rely on short‑term, session‑based memory to maintain conversational coherence. This works well within a single interaction. But once the session ends, the context disappears. The agent starts fresh, unable to recall prior interactions, user preferences, or previously established context.

In reality, enterprise users don't interact with agents exclusively in one‑off sessions. Conversations can span days or weeks, evolving across multiple interactions rather than remaining confined to isolated sessions. Without a way to persist and safely reuse relevant context across interactions, AI agents remain efficient in the short term by being stateful within a session, but lose continuity over time due to their statelessness across sessions.

Bridging this gap between short-term efficiency and long‑term adaptation exposes a deeper challenge. Persisting memory across sessions is not just a technical decision; in enterprise environments, it introduces legitimate concerns around privacy, data isolation, governance, and compliance — especially when multiple users interact with the same agent. What seems like an obvious next step quickly becomes a complex architectural problem, requiring organizations to balance the ability for agents to learn and adapt over time with the need to preserve trust, enforce isolation boundaries, and meet enterprise compliance requirements.

In this post, I'll walk through a practical design pattern for user‑scoped persistent memory, including a reference architecture and a deployable sample implementation that demonstrates how to apply this pattern in a real enterprise setting while preserving isolation, governance, and compliance.

The Challenge of Persistent Memory in Enterprise AI Agents

Extending memory beyond a single session seems like a natural way to make AI agents more adaptive. Retaining relevant context over time — such as preferences, prior decisions, or recurring patterns — would allow an agent to progressively tailor its behavior to each user, moving from simple responsiveness toward genuine adaptation.

In enterprise environments, however, persistence introduces a different class of risk. Storing and reusing user context across interactions raises questions of privacy, data isolation, governance, and compliance — particularly when multiple users interact with shared systems. Without clear ownership and isolation boundaries, naïvely persisted memory can lead to cross‑user data leakage, policy violations, or unclear retention guarantees.

As a result, many systems default to ephemeral, session‑only memory. This approach prioritizes safety and simplicity — but does so at the cost of long‑term personalization and continuity. The challenge, then, is not whether agents should remember, but how memory can be introduced without violating enterprise trust boundaries.

Persistent Memory: Trade‑offs Between Abstraction and Control

As AI agents evolve toward more adaptive behavior, several approaches to agent memory are emerging across the ecosystem. Each reflects a different set of trade-offs between abstraction, flexibility, and control — making it useful to briefly acknowledge these patterns before introducing the design presented here.

Microsoft Foundry Agent Service includes a built‑in memory capability (currently in Preview) that enables agents to retain context beyond a single interaction.
This approach integrates tightly with the Foundry runtime and abstracts much of the underlying memory management, making it well suited for scenarios that align closely with the managed agent lifecycle. Another notable approach combines Mem0 with Azure AI Search, where memory entries are stored and retrieved through vector search. In this model, memory is treated as an embedding‑centric store that emphasizes semantic recall and relevance. Mem0 is intentionally opinionated, defining how memory is structured, summarized, and retrieved to optimize for ease of use and rapid iteration. Both approaches represent meaningful progress. At the same time, some enterprises require an approach where user memory is explicitly owned, scoped, and governed within their existing data architecture — rather than implicitly managed by an agent framework or memory library. These requirements often stem from stricter expectations around data isolation, compliance, and long‑term control. User-Scoped Persistent Memory with Azure Cosmos DB The solution presented in this post provides a practical reference implementation for organizations that require explicit control over how user memory is stored, scoped, and governed. Rather than embedding long‑term memory implicitly within the agent runtime, this design models memory as a first‑class system component built on Azure Cosmos DB. At a high level, the architecture introduces user‑scoped persistent memory: a durable memory layer in which each user’s context is isolated and managed independently. Persistent memory is stored in Azure Cosmos DB containers partitioned by user identity and consists of curated, long‑lived signals — such as preferences, recurring intent, or summarized outcomes from prior interactions — rather than raw conversational transcripts. This keeps memory intentional, auditable, and easy to evolve over time. Short‑term, in‑session conversation state remains managed by Microsoft Foundry on the server side through its built‑in conversation and thread model. By separating ephemeral session context from durable user memory, the system preserves conversational coherence while avoiding uncontrolled accumulation of long‑term state within the agent runtime. This design enables continuity and personalization across sessions while deliberately avoiding the risks associated with shared or global memory models, including cross‑user data leakage, unclear ownership, and unintended reuse of context. Azure Cosmos DB provides enterprises with direct control over memory isolation, data residency, retention policies, and operational characteristics such as consistency, availability, and scale. In this architecture, knowledge grounding and memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data sources. User‑scoped persistent memory ensures relevance by tailoring interactions to the individual user over time. Together, they enable trustworthy, adaptive AI agents that improve with use — without compromising enterprise boundaries. Architecture Components and Responsibilities Identity and User Scoping Microsoft Entra ID (App Registrations) — provides the frontend a client ID and tenant ID so the Microsoft Authentication Library (MSAL) can authenticate users via browser redirect. The oid (Object ID) claim from the ID token is used as the user identifier throughout the system. 
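To illustrate the user-scoping piece concretely, the backend can derive the partition identity from the token's `oid` claim. A deliberately minimal sketch using PyJWT; it skips signature validation for brevity, which production code must not do (validate the signature, issuer, and audience against Entra ID's signing keys):

import jwt  # PyJWT

def get_user_id(bearer_token: str) -> str:
    # NOTE: verification disabled only to keep the sketch short. In production,
    # validate the token against Entra ID before trusting any claims.
    claims = jwt.decode(bearer_token, options={"verify_signature": False})
    return claims["oid"]  # stable per-user Object ID -> memory partition key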
Agent Runtime and Orchestration Microsoft Foundry — serves as the unified AI platform for hosting models, managing agents, and maintaining conversation state. Foundry manages in‑session and thread‑level memory on the server side, preserving conversational continuity while keeping ephemeral context separate from long‑term user memory. Backend Agent Service — implements the AI agent using Microsoft Foundry’s agent and conversation APIs. The agent is responsible for reasoning, tool‑calling decisions, and response generation, delegating memory and search operations to external MCP servers. Memory and Knowledge Services MCP‑Memory — MCP server that hosts tools for extracting structured memory signals from conversations, generating embeddings, and persisting user‑scoped memories. Memories are written to and retrieved from Azure Cosmos DB, enforcing strict per‑user isolation. MCP‑Search — MCP server exposing tools for querying enterprise knowledge sources via Azure AI Search. This separation ensures that knowledge grounding and memory retrieval remain distinct concerns. Azure Cosmos DB for NoSQL — provides the durable, serverless document store for user‑scoped persistent memory. Memory containers are partitioned by user ID, enabling isolation, auditable access, configurable retention policies, and predictable scalability. Vector search is used to support semantic recall over stored memory entries. Azure AI Search — supplies hybrid retrieval (keyword and vector) with semantic reranking over the enterprise knowledge index. An integrated vectorizer backed by an embedding model is used for query‑time vectorization. Models text‑embedding‑3‑large — used for generating vector embeddings for both user‑scoped memories and enterprise knowledge search. gpt‑5‑mini — used for lightweight analysis tasks, such as extracting structured memory facts from conversational context. gpt‑5.1 — powers the AI agent, handling multi‑turn conversations, tool invocation, and response synthesis. Application and Hosting Infrastructure Frontend Web Application — a React‑based web UI that handles user authentication and presents a conversational chat interface. Azure Container Apps Environment — provides a shared execution environment for all services, including networking, scaling, and observability. Azure Container Apps — hosts the frontend, backend agent service, and MCP servers as independently scalable containers. Azure Container Registry — stores container images for all application components. Try It Yourself Demonstration of user‑scoped persistent memory across sessions. To make these concepts concrete, I’ve published a working reference implementation that demonstrates the architecture and patterns described above. The complete solution is available in the Agent-Memory GitHub repository. The repository README includes prerequisites, environment setup notes, and configuration details. Start by cloning the repository and moving into the project directory: git clone https://github.com/mardianto-msft/azure-agent-memory.git cd azure-agent-memory Next, sign in to Azure using the Azure CLI: az login Then authenticate the Azure Developer CLI: azd auth login Once authenticated, deploy the solution: azd up After deployment is complete, sign in using the provided demo users and interact with the agent across multiple sessions. 
Each user’s preferences and prior context are retained independently, the interaction continues seamlessly after signing out and returning later, and user context remains fully isolated with no cross‑identity leakage. The solution also includes a knowledge index initialized with selected Microsoft Outlook Help documentation, which the agent uses for knowledge grounding. This index can be easily replaced or extended with your own publicly accessible URLs to adapt the solution to different domains.

Looking Ahead: Personalized Memory as a Foundation for Adaptive Agents

As enterprise AI agents evolve, many teams are looking beyond larger models and improved retrieval toward human‑centered personalization at scale — building agents that adapt to individual users while operating within clearly defined trust boundaries.

User‑scoped persistent memory enables this shift. By treating memory as a first‑class, user‑owned component, agents can maintain continuity across sessions while preserving isolation, governance, and compliance. Personalization becomes an intentional design choice, aligning with Microsoft’s human‑centered approach to AI, where users retain control over how systems adapt to them.

This solution demonstrates how knowledge grounding and personalized memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data. Personalized memory ensures relevance by tailoring interactions to the individual user. Together, they enable context‑aware, adaptive, and personalized agents — without compromising enterprise trust.

Finally, this solution is intentionally presented as a reference design pattern, not a prescriptive architecture. It offers a practical starting point for enterprises designing adaptive, personalized agents, illustrating how user‑scoped memory can be modeled, governed, and integrated as a foundational capability for scalable enterprise AI.
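To make the user-scoped partitioning concrete, here is a minimal sketch of a memory write and read with the `azure-cosmos` SDK. The database, container, and document field names are assumptions for illustration; the key point is that `/userId` is the partition key, so every operation is scoped to one identity:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)  # assumed config values
db = client.create_database_if_not_exists("agent_memory")
container = db.create_container_if_not_exists(
    id="user_memories",
    partition_key=PartitionKey(path="/userId"),  # isolation boundary per user
)

def save_memory(user_id: str, memory_text: str, embedding: list[float]):
    container.upsert_item({
        "id": f"{user_id}-{abs(hash(memory_text))}",  # illustrative ID scheme
        "userId": user_id,       # partition key: enforces per-user scoping
        "kind": "preference",
        "text": memory_text,
        "embedding": embedding,  # used for vector search at recall time
    })

def list_memories(user_id: str):
    # Reads are partition-scoped: only this user's memories are visible.
    return list(container.query_items(
        query="SELECT * FROM c WHERE c.userId = @uid",
        parameters=[{"name": "@uid", "value": user_id}],
        partition_key=user_id,
    ))

Because reads and writes always carry the partition key, cross-user access requires an explicit, auditable decision rather than being possible by accident.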
We're excited to announce the general availability of Foundry Agent Service, Observability in Foundry Control Plane, and the Microsoft Foundry portal — plus Voice Live integration with Agent Service in public preview — giving teams a production-ready platform to build, deploy, and operate intelligent AI agents with enterprise-grade security and observability.
Learn why many Azure AI Search developers unknowingly miss sensitivity‑labeled data—and how integrating Microsoft Purview sensitivity labels enables more secure, complete retrieval for RAG, copilot-style apps, and enterprise search. This post explains what sensitivity labels are, why they matter for Azure AI Search, what happens when the integration is not enabled, and how label-aware indexing and query‑time enforcement ensure label-protected documents are indexed, governed, and retrieved correctly in Azure AI Search when content comes from SharePoint, OneLake, ADLS Gen2, and Azure Blob Storage. Includes supported sources, end‑to‑end flow, and links to documentation, demo apps, and video resources.
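As a small illustration of the query-time enforcement idea, here is a sketch of security trimming with the `azure-search-documents` SDK. The `group_ids` field name is an assumption for this sketch; the filter pattern itself — `search.in()` against the caller's group claims — is the documented trimming approach for Azure AI Search:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(SEARCH_ENDPOINT, "docs-index", AzureKeyCredential(SEARCH_KEY))

def secure_search(query: str, user_group_ids: list[str]):
    # Only return documents whose ACL field intersects the caller's groups;
    # label-protected documents the user cannot access are trimmed out.
    groups = ",".join(user_group_ids)
    return search_client.search(
        search_text=query,
        filter=f"group_ids/any(g: search.in(g, '{groups}'))",
    )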
At Build 2025, we announced post-retirement, extended deployment and inference support for fine-tuned models. Today, we're excited to announce we're extending fine-tuning training for current customers of our most popular Azure OpenAI models: gpt-4o (2024-08-06) and gpt-4o-mini (2024-07-18). Hundreds of customers have pushed trillions of tokens through fine-tuned versions of these models, and we're happy to provide even more runway for your AI agents and applications.

Already using these models in Foundry? We have you covered: come April, we will be the only provider offering fine-tuning for gpt-4o and gpt-4o-mini. Keep fine tuning!

Not yet using Microsoft Foundry? Get started today by migrating your training data to Microsoft Foundry and fine tune using Global or Standard Training for gpt-4o and gpt-4o-mini using your existing OpenAI code. You'll have the runway to continuously fine tune or update your models. You have until March 31st, 2026, to become a fine-tuning customer of these models.

| Model | Version | Training retirement date | Deployment retirement date |
| --- | --- | --- | --- |
| gpt-4o | 2024-08-06 | No earlier than 2026-09-30 ¹ | 2027-03-31 |
| gpt-4o-mini | 2024-07-18 | No earlier than 2026-09-30 ¹ | 2027-03-31 |
| gpt-4.1 | 2025-04-14 | At base model retirement | One year after training retirement |
| gpt-4.1-mini | 2025-04-14 | At base model retirement | One year after training retirement |
| gpt-4.1-nano | 2025-04-14 | At base model retirement | One year after training retirement |
| o4-mini | 2025-04-16 | At base model retirement | One year after training retirement |

¹ For existing customers only; otherwise, training retirement occurs at base model retirement.
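For teams planning that migration, kicking off a fine-tuning job uses the same OpenAI-style client. A minimal sketch, assuming an Azure OpenAI resource with fine-tuning access and a chat-formatted JSONL training file; the endpoint, key variable, and file name are placeholders:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key=AZURE_OPENAI_KEY,                                    # placeholder
    api_version="2024-10-21",
)

# Upload training data (JSONL of chat-formatted examples)
training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)

# Create the fine-tuning job against the base model snapshot
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)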