Latest Discussions
APIM within Foundry
Dear Azure AI Foundry team at Microsoft,

Please reconsider the current architecture and developer experience around AI observability and token analytics. As it stands today, customers are expected to assemble an entire distributed system (APIM, Azure Functions, Static Web Apps, App Insights, Log Analytics, custom SSE parsing, and additional infrastructure) just to answer very basic operational questions:
- Which users are consuming the most tokens?
- Which models are being used the most?
- What are our real-time streaming costs?
- Which subscriptions/projects are generating spend?

Even worse, many of these solutions break down when using streamed/SSE AI responses because APIM policies are not designed to reliably process chunked AI streams and partial JSON bodies. So customers end up building increasingly complicated middleware pipelines for functionality that should already exist natively inside the platform.

At the same time:
- Azure clearly has access to token and billing telemetry internally
- customers are still billed for usage
- yet customers themselves are not given equivalent real-time visibility or tooling

That creates a frustrating disconnect, making it feel like a money grab. It's like paying for groceries and not being allowed to receive a receipt.

Another major issue is API key management. Providing effectively a single project-level credential for enterprise AI workloads creates operational and governance limitations that make multi-user auditing unnecessarily difficult. Why in the world would the Foundry team design this with only one API key per project? Is there a secret reason for this, other than annoying customers?

To be blunt: the current system design feels massively overengineered for customers while simultaneously underdelivering on the core metrics enterprises actually need. AI platform teams should not need to build 10+ supporting Azure services just to approximate token analytics for a single Foundry project.

Azure has excellent infrastructure capabilities overall, which is exactly why this experience is so surprising. But if the platform architecture and observability story for AI workloads do not improve soon, many organizations, including ours, will seriously evaluate moving to alternative cloud providers and AI gateway solutions that provide simpler and more transparent operational tooling.

Please prioritize:
- native streaming token telemetry
- first-class SSE observability
- proper per-user/per-model analytics
- better API credential management
- simpler AI cost governance workflows

Right now, the operational overhead compared to the value delivered is far too high.

Hetacrit | May 26, 2026 | Copper Contributor | 15 Views | 1 like | 0 Comments

New AI Foundry not sending refresh tokens to MCP (401 after access token expiration)
Hello,

When connecting an MCP server hosted as an Azure Function using OAuth Passthrough in New AI Foundry Playground, the connection is established successfully, the Microsoft login popup appears, authentication succeeds, and the first MCP request returns data correctly. However, once the access token expires, the Playground and AI Foundry agents deployed to Copilot/Teams do not appear to refresh the token, despite offline_access being included in the requested scopes and a refresh URL being configured. All subsequent MCP calls fail with 401 Unauthorized until the connection is manually recreated. For testing, we reduced the token lifetime to 10 minutes to make the issue easier to reproduce.

Impact
This prevents long-lived or repeated MCP usage in AI Foundry Playground because the connection becomes unusable after token expiry and requires manual reconnection.

MCP server host: Azure Function

MCP server configuration in New AI Foundry:
- Endpoint: https://mcp-test-obo-rls-fabric.azurewebsites.net/mcp
- Client ID: <client_id> (redacted)
- Auth URL: https://login.microsoftonline.com/tenant_id(redacted)/oauth2/v2.0/authorize
- Token URL: https://login.microsoftonline.com/tenant_id(redacted)/oauth2/v2.0/token
- Refresh URL: https://login.microsoftonline.com/tenant_id(redacted)/oauth2/v2.0/token
- Scopes: openid profile offline_access api://(App ID)/user_impersonation
- Redirect URI: https://global.consent.azure-apim.net/redirect/(redacted)-fabric-rls-mcp

Error returned
tool_user_error: Authentication failed when connecting to the MCP server: https://mcp-test-obo-rls-fabric.azurewebsites.net:443/mcp : Response status code does not indicate success: 401 (Unauthorized). Response body: {"code":401,"message":"IDX10223: Lifetime validation failed. The token is expired. ValidTo (UTC): '03/30/2026 08:19:28', Current time (UTC): '03/30/2026 08:29:55'."}. Verify your authentication headers. Suggestions: First verify the required permissions. If the access token is expired or revoked, recreate the connection. If this connection is shared by other users or workflows, recreate it carefully to avoid disruption.

Function App / Identity Provider (Entra)
The Azure Function authentication configuration:
- Identity provider: Microsoft (MCP-Fabric-RLS-Server)
- App registration: MCP-Fabric-RLS-Server
- Supported account types: Single tenant
- Application (client) ID: App ID
- Client secret setting name: MICROSOFT_PROVIDER_AUTHENTICATION_SECRET
- Issuer URL: https://login.microsoftonline.com/tenant_id(redacted)/v2.0
- Redirect URI is added to App Authentication Redirect URI configuration as "Web".

Allowed token audiences
- api://App ID
- App ID
- https://mcp-test-obo-rls-fabric.azurewebsites.net

Additional checks enabled
- Allow requests from any application
- Allow requests from any identity
- Allow requests only from issuer tenant tenant_id (redacted)

Notes
The first authenticated call succeeds, so the initial OAuth flow appears to be working. The failure only occurs after the access token expires. Because offline_access is requested and the refresh URL is configured, our expectation is that the client should refresh the token automatically. Our working hypothesis is that either:
- the refresh token is not being issued,
- the refresh token is not being stored/used by AI Foundry Playground, or
- OAuth Passthrough for MCP connections in this scenario does not currently support automatic refresh as expected.

Thank you for any assistance provided.

wwilcz | May 26, 2026 | Copper Contributor | 137 Views | 2 likes | 1 Comment

Aidemos Microsoft site doesn't work https://aidemos.microsoft.com/
Hello MS team,

I am studying for AI-900 on Coursera. The course guides me to try the AI demos on https://aidemos.microsoft.com/. But the site seems to have been broken for weeks. According to the error message, it could be a backend issue. Could the MS team fix it, please?

Best Regards,
Dale

Dale_Cui | May 25, 2026 | Copper Contributor | 7.1K Views | 1 like | 16 Comments

Azure AI Foundry HTTP 403 "unusual behavior" block on Elevate-grant resource
SUMMARY
Azure AI Foundry resource is returning HTTP 403 with the message "Your resource has been temporarily blocked because we detected unusual behavior" and has remained blocked for 24+ hours with no auto-clear, even at less than 2 RPM with a 10-token probe. The bigger issue: every standard support path is closed because the subscription sits on a Developer support plan. Both the Azure Portal ticket form AND the az support REST/CLI API gate on plan tier. Posting here as one of the few remaining surfaces a Microsoft engineer can pick up.

RESOURCE DETAILS
Resource: bda-ai-foundry
Subscription: f60e8dd3-ec1a-42bc-ba90-f79e7e835505
Organization: Bharat Dharma Academy Limited (Australian registered charitable NFP, Sanatan Dharma educational content)
Grant: Microsoft Elevate

AFFECTED DEPLOYMENTS
- gpt-5-mini (chat/completions, primary workhorse)
- gpt-5-pro (responses endpoint, premium tier)
- o4-mini (fast fallback)
- text-embedding-3-large
- Cohere-embed-v3-multilingual
- FLUX-1.1-pro

SUSPECTED TRIGGER
Per Microsoft Q&A guidance from a volunteer moderator (May 22 2026), the most likely trigger is anomaly detection from: (a) the same API key being used from two geographically distant origins, Render in US East (Virginia) and a Mac workstation in Australia (Sydney); (b) bursty smoke-test calls after long idle periods. That diagnosis is consistent with the symptom: a 403 that does NOT auto-clear, and is NOT a traditional 429 quota error.

SUPPORT PATHS ATTEMPTED (ALL BLOCKED)
1. Azure Portal -> Help + Support -> New Support Request -> redirects to Q&A / support-plan gate.
2. Direct support-request URL (typed manually in browser) -> same redirect / gate.
3. Microsoft Support virtual agent (contact page) -> loops back to Q&A.
4. Azure CLI in Cloud Shell: full az support in-subscription tickets create command with all required parameters populated (service classification, severity, contact details). Returned: (InvalidSupportPlan) Your support plan type is Developer. To create and update support tickets, and add communication operations, you need access to our high tier-support plans.
5. Microsoft for Nonprofits contact form: submitted in parallel.
6. Microsoft Q&A: root cause confirmed by volunteer moderator, awaiting MS engineer pickup.

MITIGATIONS ALREADY IMPLEMENTED ON OUR SIDE
To demonstrate this is not a runaway script:
- All Azure traffic PAUSED. We are NOT retrying (we understand retries worsen the anomaly signal).
- Async rate limiter + circuit breaker shipped on every Azure caller; any 403 immediately opens a 6-hour circuit so the resource cannot be hammered.
- Going forward:
  * separate API keys per origin (Render in Virginia and Mac in Sydney)
  * separate deployments per workload (web vs batch)
  * Diagnostic Settings routed to Log Analytics

WHAT I AM ASKING FOR
1. Manual review and clearance of the block on bda-ai-foundry.
2. The specific trigger reason if visible internally, so we can confirm our preventive mitigations are correct.
3. Guidance for the community: should grant-funded Microsoft Elevate / Nonprofit subscriptions have an alternative support escalation path that does not require a paid plan upgrade?

The current state, with every standard channel closed for Developer-plan subscriptions, is a significant gap for charitable projects that, by design, run on Microsoft's grant credits and do not maintain paid support contracts. This is a live charitable educational workload (Sanatan Dharma content serving the global Hindu community). Happy to provide any further diagnostic information.

Thank you.
Parag Srivastava
Founder, Bharat Dharma Academy Limited
Contact email available via my Tech Community profile.

BharatDharmaAcademyLimited | May 23, 2026 | Copper Contributor | 6 Views | 0 likes | 0 Comments

The Cloud Foundation for Safe Agentic AI
Why enterprise agents need more than a working prototype

Most AI conversations start with the model. Which model should we use? Which framework? Which agent platform? Which demo can we build quickly enough to make the idea feel real? Those questions are not wrong, but they are rarely the first questions that matter in an enterprise environment.

In real projects, the hard part usually appears after the first prototype works. The demo can answer a question, call a tool, retrieve a document, or update a record. Then someone asks whether it can be connected to production data, used by more teams, or allowed to trigger real actions. That is where the conversation changes.

In the first part of this series, I looked at why many companies are less ready for agentic AI than they think. The blockers were practical and familiar: unclear business problems, immature processes, weak data foundations, and no clear owner when an AI system makes a poor recommendation or takes a wrong action. The message was simple: Before a company asks what agents can do, it needs to understand what it is ready to delegate.

But business readiness is only the first layer. Even when the use case is clear, the process is understood, and leadership is aligned, another question appears. Is the platform ready to support agents safely? This is where Part 2 begins.

Agentic AI does not behave like a normal application workload. A traditional application usually follows predefined paths. It receives a request, processes logic, returns a response, writes to a database, or calls an API. Agents introduce a different pattern. They reason over context, retrieve information, choose tools, trigger actions, interact with other services, and sometimes operate across multiple systems at once. That makes the surrounding cloud platform much more important.

There is also a shadow AI angle to this.
In many organizations, agent-like capabilities are already entering through SaaS platforms, vendor copilots, browser extensions, and productivity tools. These systems may not run inside the organization’s governed Azure subscriptions, but they can still interact with enterprise data and business workflows. If the official platform is not ready, teams will often find less governed ways to experiment anyway. That is not always malicious. Sometimes it is just people trying to solve their work with the tools available to them. The marketing analyst pasting customer data into a public chatbot because the official AI platform is six months away. The support team using a browser extension that summarizes tickets, without anyone realizing those tickets are also being sent to a third-party service. From a governance point of view, the effect is the same.

Cloud readiness for agentic AI is not defined by access to cloud services or model endpoints alone. The real question is whether the platform can support controlled autonomy. Before enterprises can trust agents to act, the platform must be able to identify them, observe their behavior, restrict their permissions, enforce policy, and contain failure. Without that, an organization is not really deploying an intelligent assistant in a controlled way. It is introducing a workload that can interact with enterprise systems without anyone clearly watching what it does or being able to stop it.

From business readiness to cloud readiness

After the business foundation is clear, the next layer is the cloud foundation. A company may have a strong use case, executive support, and even a working prototype. But that does not mean it is ready to deploy agents in production. A prototype can run with broad access, manual supervision, loose logging, and a small group of test users. Production requires more discipline. It requires clear identity, controlled access, traceable activity, enforceable policy, and operational ownership.
Cloud readiness for agentic AI comes down to four pillars, in this order:
1. Identity-first architecture
2. Observability
3. Policy controls
4. Platform constraints

The order matters.

1. Identity-first architecture

Identity comes first because nothing can be governed properly if it cannot be identified. In traditional cloud systems, we already learned this lesson with users, applications, service principals, managed identities, and workloads. Agents add another layer of non-human actors into the enterprise environment. If an agent can retrieve data, call tools, trigger workflows, or interact with business systems, it needs a clear identity. Without that foundation, governance becomes fragile. Teams may struggle to control what the agent can access, understand what it did, or determine who is accountable when something goes wrong.

I have seen agents running in production where nobody could clearly say who owned them. They worked. Until they did not.

Identity-first architecture means each agent or agentic workload should have a defined identity, ownership model, permission scope, and lifecycle. It should be clear whether the agent is acting on behalf of a user, acting as a service, or operating within a delegated boundary. This matters because permissions are not an implementation detail. They define the blast radius and accountability model of the system.

In Azure environments, this is where Microsoft Entra ID and newer agent identity capabilities become important. As agents become more common across Azure AI Foundry, Copilot Studio, Microsoft 365, and custom frameworks, organizations need a way to understand which agents exist, who owns them, what they can access, and how their lifecycle is managed.

Identity is not only about authentication. It is also about visibility, traceability, ownership, permission boundaries, and accountability. Agents should not remain hidden inside application logic or operate through shared identities.
If they can retrieve data, call tools, or trigger actions, they need to be managed with the same care as any other production workload.

2. Observability

Once identity is established, observability becomes the next pillar. Knowing that an agent exists is not enough. The platform must be able to show what the agent did. For normal applications, observability often focuses on service health, latency, failures, and resource usage. For agents, those signals still matter, but they are incomplete. Agent observability also needs to capture the execution path across model calls, retrieved context, orchestration steps, tool calls, policy decisions, approvals, denials, and final actions.

This changes how we think about monitoring. With agentic systems, the question is not only whether a request succeeded or failed. Teams also need to understand the path that led to the outcome, the context used, the tools called, the policies applied, and the point where behavior changed. Without that visibility, it is difficult to investigate failures and improve reliability.

This is also where observability starts to support governance, not just troubleshooting. Once teams can measure how agents behave, they can move toward KPI-based governance. That may include reliability, escalation rates, policy denials, grounding quality, tool-call failures, cost per interaction, latency, and business outcome metrics. Without this measurement layer, maturity remains mostly opinion-based. With it, governance becomes evidence-based.

In Azure, Azure Monitor is the obvious starting point. Together with services such as Application Insights and Log Analytics, it provides the telemetry foundation needed to understand how AI workloads behave in production. For agentic systems, this usually requires combining platform telemetry with application-level traces from orchestration, retrieval, model calls, policy decisions, and tool execution.

This visibility is what makes continuous improvement possible.
It is also what allows governance to mature from “we think the agent is behaving correctly” to “we can measure how the agent behaves over time.” Small difference. Large consequence.

3. Policy controls

The third pillar is policy controls. This comes after identity and observability because policy needs both. Identity defines who or what the rule applies to. Observability helps teams understand whether the rule is effective, bypassed, misconfigured, or too restrictive.

Policy controls define the boundaries for what agents are allowed to do. They determine how agents access data, which tools they can use, which environments are in scope, when approval is required, and when an action or response should be blocked. The key point is simple: Prompts can guide behavior, but they are not a reliable enforcement layer. For enterprise systems, policy needs to be external, testable, auditable, and enforceable.

This becomes especially important because agents may operate across multiple systems. An agent may retrieve information from one source, reason over the result, call a tool, update a ticket, send a message, or trigger a workflow. Each step may appear safe in isolation, while the full chain creates risk. Policy controls provide boundaries around that chain.

In Azure, this starts at the cloud governance layer. Azure landing zones, management group structures, and Azure Policy can help define where AI workloads are deployed, how environments are separated, and which rules apply consistently across subscriptions. At runtime, Azure AI Content Safety can help detect harmful content, prompt attacks, unsafe interactions, or outputs that drift away from the intended task.

For tool and API access, Azure API Management can also be used as a controlled gateway between agents and downstream systems. This can support centralized authentication, throttling, mediation, logging, and policy enforcement.
It is not mandatory in every design, but it is a useful option when agents need governed access to APIs instead of direct backend connectivity.

The goal is not to create friction for the sake of control. The goal is to make sure the agent operates inside boundaries that are defined outside the prompt and outside the model response.

4. Platform constraints

The fourth pillar is platform constraints. This area often receives less attention early in the project, but it strongly shapes whether an agentic system can operate safely and reliably in production. These constraints include network isolation, private connectivity, data residency, regional availability, quota limits, model throughput, latency, logging retention, integration boundaries, cost behavior, and operational ownership. They may seem like implementation details during early design discussions, but they often determine whether the system can actually run in production.

For agentic workloads, these constraints also shape where experimentation happens. Sandboxed environments, isolated subscriptions, limited tool access, and controlled test data can help teams evaluate agent behavior before exposing it to production systems. This becomes even more important when agents are allowed to generate code, call external tools, or execute actions that may not be fully trusted at design time.

Platform constraints are where the earlier pillars meet implementation reality. Identity affects how agents connect to services. Observability affects logging cost, retention, and investigation capability. Policy affects routing, network design, tool exposure, and user experience. By the time an agentic system reaches production, these constraints are no longer background details. They become design boundaries.

In Azure, this is where landing zone design, private networking, regional planning, quota management, cost management, and operational runbooks matter.
Azure landing zones, private endpoints, private DNS, Azure Firewall, NSGs, and controlled network paths all influence whether the agent architecture can move from prototype to production without being redesigned halfway through. And yes, that redesign usually happens at the least convenient moment. Architecture has a sense of humor. Not a kind one.

From principles to Azure capabilities

The four pillars are not only architectural principles. They need to be translated into platform capabilities, operating practices, and governance controls. In practice, controlled agent deployment is rarely achieved by a single product or service. It requires multiple layers working together. Identity, monitoring, policy, networking, runtime safety, API exposure, and operational controls all play a part.

Azure provides several services and patterns that can help implement these controls, but there is no fixed blueprint that applies to every organization. The right combination depends on the use case, regulatory requirements, existing landing zone design, integration landscape, and the level of autonomy expected from the agent. The examples below should be seen as a practical toolset, not as a mandatory checklist.

Pillar | Goal | Example Azure capabilities
Identity-first architecture | Make agents visible, owned, permissioned, and governable as enterprise workloads. | Microsoft Entra ID, Microsoft Entra Agent ID, managed identities, service principals, workload identities, access reviews, Conditional Access, Privileged Identity Management
Observability | Understand runtime behavior, trace execution paths, investigate failures, and improve reliability. | Azure Monitor, Application Insights, Log Analytics, Azure AI Foundry tracing, diagnostic settings, distributed tracing, correlation IDs, application-level telemetry
Policy controls | Enforce boundaries around access, actions, content safety, APIs, and governance. | Azure landing zones, management groups, Azure Policy, Azure AI Content Safety, Prompt Shields, Microsoft Purview, Azure API Management, RBAC, approval flows
Platform constraints | Operate within real cloud boundaries such as networking, region, quota, compliance, and operations. | Azure landing zones, private endpoints, private DNS, private networking, Azure Firewall, NSGs, quota planning, regional architecture, cost management

The purpose of this mapping is not to suggest that Azure has one single service for each pillar. It does not. The practical goal is to combine the right services and patterns so the platform can identify agents, monitor their behavior, enforce boundaries, and operate within known cloud constraints.

Conclusion

Agentic AI does not become enterprise-ready simply because a model is available, a prototype works, or a business sponsor is excited. The real question is whether the surrounding cloud foundation can support agents that act within boundaries the platform actually enforces. Together, these pillars move the discussion from building an agent to preparing the environment in which the agent can operate responsibly. That distinction is important. A prototype can rely on broad access, limited logging, and close manual supervision. A production system needs clearer boundaries around ownership, access, traceability, and control.

This is also where the series moves naturally into Part 3. Once the business foundation is clear and the cloud foundation is in place, the next challenge is the design of the agent itself. The cloud foundation matters here because it provides the controlled environment in which agents can be tested, limited, and observed before they are trusted with broader enterprise access. For more advanced scenarios, that also includes sandboxing patterns for generated code, tool execution, and untrusted actions.

In Part 3, I will move closer to implementation and look at how to design an enterprise-ready agent.
That means defining the agent’s scope, grounding it with reliable knowledge, deciding which tools it can use, designing safe execution loops, adding human oversight where it matters, and thinking carefully about when a single agent is enough versus when multi-agent coordination is justified. That is where agentic AI starts becoming more than an idea. And, as usual, that is also where the architecture starts to matter.

This article is part of my Agentic AI readiness series and was also published on Medium.

19 Views | 0 likes | 0 Comments

8 Architectural Pillars to Boost GenAI LLM Accuracy and Performance in Low Cost
Smarter AI architecture, not bigger LLMs: how engineering teams push LLM accuracy and performance up at low cost.

Enterprises using LLMs (Large Language Models) hit the same ceiling and pay a high price for it. A raw API call to a frontier model (GPT-4, Claude, Gemini) delivers only 35-40% accuracy on structured output tasks such as code generation, natural-language-to-DAX query generation, and domain-specific reasoning. Prompt engineering pushes that to ~60%. But the final 35+ percentage points? Those come from system architecture, not model upgrades.

This guide presents 8 architectural pillars, distilled from production Gen AI systems, that compound to close the accuracy gap. These patterns are model-agnostic and domain-agnostic: they apply equally to chatbots, coding assistants, content/query generators, automation agents, and any application where an LLM produces structured or semi-structured output. The guide is based on my recent Gen AI projects.

The key takeaway: use the LLM as one component in a larger system, not as the system itself. Surround it with deterministic guardrails, verified knowledge, and feedback loops.

Pillar 1: Enhance Prompts with Verified Knowledge Context
Impact: +35-40% accuracy (based on production use cases; may vary by domain)

The top source of LLM errors in production is hallucinated identifiers: the model invents names, references, or structures that don't exist in the target system. This happens because LLMs are trained on general knowledge but deployed against specific, private enterprise systems (local databases and knowledge bases) that they have never seen. The fix is straightforward: inject verified, system-specific context (type definitions, API specs, ontologies, configuration schemas, entity catalogues) directly into the prompt so the model composes from known-good elements rather than recalling from training data. Use a knowledge graph for richer semantic knowledge.
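As a rough illustration of injecting verified context, the sketch below filters a small entity catalogue down to the entries relevant to a request and writes them verbatim into the prompt. Everything named here is an assumption for illustration: the `Entity` type, `select_context`, `build_prompt`, and the sample identifiers are hypothetical, and plain token overlap stands in for whatever graph or index retrieval a real system would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One verified element of the target system (hypothetical shape)."""
    name: str
    type: str
    description: str


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, so "Sales[Amount]" yields {"sales", "amount"}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(question: str, catalogue: list[Entity], k: int = 5) -> list[Entity]:
    """Keep at most k entities that share tokens with the question.

    Token overlap is a stand-in for a real retriever (graph or search index).
    """
    q = _tokens(question)
    scored = [(len(q & _tokens(f"{e.name} {e.description}")), e) for e in catalogue]
    relevant = [(score, e) for score, e in scored if score > 0]
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [e for _, e in relevant[:k]]


def build_prompt(question: str, catalogue: list[Entity], k: int = 5) -> str:
    """Put the selected identifiers verbatim into the prompt, with metadata."""
    lines = [
        f"- {e.name} ({e.type}): {e.description}"
        for e in select_context(question, catalogue, k)
    ]
    return (
        "Use ONLY the identifiers listed below. Do not invent names.\n"
        "Known entities:\n" + "\n".join(lines) + "\n\n"
        f"Request: {question}"
    )


# Hypothetical catalogue entries, for illustration only.
CATALOGUE = [
    Entity("Sales[Amount]", "column", "Net sales amount in USD"),
    Entity("Geography[Region]", "column", "Sales region name"),
    Entity("HR[HeadCount]", "column", "Number of employees"),
]
```

Calling `build_prompt("total sales amount by region", CATALOGUE)` would list the two sales-related entities and leave the unrelated HR column out, which is the filtering behaviour the pillar argues for.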
How to Implement
- Provide explicit context, never implicit. Whatever the LLM needs to reference (identifiers, valid values, semantic knowledge, structures) must appear verbatim in the prompt or the retrieved context window.
- Filter aggressively. A full knowledge base with thousands of entities overwhelms the context window and confuses the model. Use intelligent filtering to surface only the 5-10 most relevant elements per request.
- Store structured semantic knowledge in a graph or searchable index. This enables relationship-aware retrieval: "given entity X, what related entities, attributes, and constraints are also needed?"
- Include rich semantic metadata. Names alone are insufficient. Include types, constraints, valid value ranges, relationships, and usage notes to minimize ambiguity.
- Keep context fresh. Stale context causes a different class of hallucination: the model generates valid-looking output that references outdated structures. Sync your knowledge store with your source of truth.

Why This Works
LLMs excel at composition and reasoning: combining elements, applying logic, following patterns. They are unreliable at recalling specific identifiers: exact names, valid values, structural constraints. By offloading recall to a deterministic retrieval system and giving the LLM only composition tasks, you play to each system's strengths.

Pillar 2: Tiered LLM Approach: Route Deterministically First, Use LLMs Last
Impact: 80% cost reduction, 85% latency reduction, eliminates non-deterministic errors for most traffic.

The most impactful architectural insight: most production requests don't need an LLM at all. A well-designed system handles 60-70% of traffic with deterministic logic (templates, composition rules, cached results) and reserves expensive, non-deterministic LLM calls for genuinely novel inputs.

The Three-Tier Model
These metrics are from a real use case that converts natural-language requests to Power BI DAX queries.

Tier | Strategy | Uses LLM? | Latency | Accuracy
Tier 0 | Template slot-filling | No | ~50ms | 95-98%
Tier 1 | Compose from pre-validated fragments | No | ~200ms | 90-95%
Tier 2 | Full LLM generation with enriched context | Yes (1 call) | 2-5s | 88-93%

- Tier 0 handles requests that match known patterns exactly: the system fills slots in a pre-built template with extracted parameters. No LLM, no non-determinism, near-perfect accuracy, sub-100ms response.
- Tier 1 handles requests that combine known patterns in new ways. The system retrieves pre-validated building blocks via search, composes them using deterministic rules, and validates the result. Still no LLM call.
- Tier 2 is reserved for genuinely novel requests that can't be served deterministically. Even here, the LLM receives maximum support: filtered context, relevant examples, explicit rules, and structured planning.

Complexity-Based Routing
A lightweight scoring function (evaluated in <1ms) routes each incoming request.
Factors: reasoning depth, number of components, cross-references, constraints, nesting depth, novelty (distance from known patterns)
- Score 0-39: Tier 0 (deterministic template)
- Score 40-59: Tier 1 if confidence ≥ 85%, else Tier 2
- Score 60+: Tier 2 (LLM generation)
This routing achieves 96%+ accuracy in tier assignment and ensures the expensive path is only taken when necessary.

Why This Matters
- Cost: 70-80% of requests cost zero LLM tokens
- Latency: the majority of responses arrive in <200ms instead of 2-5s
- Reliability: deterministic tiers produce identical output for identical input
- Scalability: deterministic tiers scale horizontally with trivial compute

Pillar 3: Encode Prompt Anti-Patterns as Explicit Rules
Impact: +8-10% accuracy, ~80% reduction in common structural errors

LLM mistakes are patterned, not random. In any domain, 80% of errors cluster around a small set of 6-13 recurring structural mistakes.
Instead of hoping the model avoids them through general instruction-following, compile these mistakes into explicit WRONG => CORRECT rules embedded directly in the system prompt.

How to Implement
- Collect error data. Run 100+ requests through your system and categorize the failures. You'll find the same 6-13 patterns appearing repeatedly.
- Write concrete rules. For each pattern, show the exact wrong output and the exact correct alternative, with a one-line explanation of why.
- Embed in system prompt. Place rules prominently after the task description, before examples. Use formatting that's hard to ignore (headers, bold, explicit "NEVER" language).
- Keep the list short. 6-13 rules maximum. Beyond that, attention dilutes and the model starts ignoring rules. Prioritize by frequency.
- Refresh continuously. As the system improves (via other pillars), some errors disappear and new error types emerge. Update the rule set quarterly.

Why This Works
LLMs respond strongly to explicit negative examples. A generic instruction like "be careful with X" has minimal impact. But showing the exact wrong output the model tends to produce, paired with the correction, creates a strong avoidance signal. It's analogous to unit tests.

Pillar 4: Retrieve Few-Shot Examples Dynamically
Impact: +5-15% accuracy depending on domain complexity

Static examples hardcoded in a prompt become stale and irrelevant, wasting context tokens. Dynamic few-shot retrieval selects the 3-5 most relevant examples for each specific request, maximizing the signal-to-noise ratio in the prompt.
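A minimal sketch of per-request example selection, to make the idea concrete. Everything here is illustrative: `Example`, `select_examples`, and `build_prompt` are hypothetical names, and the token-overlap (Jaccard) score is only a stand-in for whatever retriever a real system would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    request: str  # a natural-language request seen before
    output: str   # the validated output for that request


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a real retriever."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens the two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(request: str, store: list[Example], k: int = 3) -> list[Example]:
    """Return the k stored examples whose requests overlap most with this one."""
    query = tokenize(request)
    ranked = sorted(
        store,
        key=lambda ex: jaccard(query, tokenize(ex.request)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(request: str, examples: list[Example]) -> str:
    """Place the selected examples before the new request."""
    shots = "\n\n".join(
        f"Request: {ex.request}\nOutput: {ex.output}" for ex in examples
    )
    return f"{shots}\n\nRequest: {request}\nOutput:"
```

Because the scoring function is isolated in one place, it can later be swapped for a stronger retriever without touching prompt assembly.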
Hybrid Retrieval Architecture
The most effective approach combines two search strategies to capture the intent of a natural language (NL) request, then fuses their results:
- Keyword search (BM25): finds examples with exact matching terms, identifiers, and domain vocabulary.
- Vector search (semantic similarity): finds examples with similar intent and structure, even if wording differs.
- Rank fusion: merges results from both strategies, re-ranking by combined relevance.
This hybrid approach outperforms either strategy alone because keyword search catches exact identifier matches that vector search dilutes, while vector search captures semantic similarity that keyword search misses entirely.

Best Gen AI Architectural Practices
- Match complexity to complexity. Simple requests should see simple examples. Complex requests should see complex examples. Mismatched examples confuse the model.
- Include negative examples. For the detected request type, include 1-2 "wrong => correct" pairs alongside positive examples. This reinforces Pillar 3's anti-pattern rules with concrete, contextually relevant demonstrations.
- Pre-compute embeddings. Generate vector embeddings at indexing time, not at query time. Cache retrieval results for repeated patterns.
- Curate quality over quantity. 3 excellent, diverse examples beat 10 mediocre ones. Each example should demonstrate a distinct pattern or edge case.
- Keep examples current. As your system evolves, old examples may demonstrate outdated patterns. Review and refresh the example store periodically.

Pillar 5: Feedback Loop: Validate and Auto-Fix Every Output Deterministically
Impact: +3-5% accuracy as a safety net, plus continuous improvement via feedback

No matter how well-prompted, LLMs will occasionally produce outputs with minor structural errors: wrong casing, missing delimiters, references to slightly incorrect identifiers, or subtle format violations. A deterministic post-processing pipeline catches and fixes these without any additional LLM calls.
The Validation Pipeline
LLM Output => Parse (grammar/AST) => Rule-Based Fixes => Compliance Check/Validation => Final Output

Each stage is fully deterministic:
- Parsing: Use a formal grammar or AST parser (ANTLR, tree-sitter, language-native parsers) to structurally analyse the output. Never regex-parse structured output: it's fragile and misses edge cases.
- Rule-based fixes: 10-20 deterministic transformation rules that correct known error patterns, such as name normalization, casing fixes, missing delimiters, and structural repairs.
- Compliance check: Verify that every identifier referenced in the output actually exists in the provided context. Flag unknown references.

Design Principles
- Zero LLM calls in the fix pipeline. Every fix is a regex, an AST transformation, or a lookup table operation. Instant, free, deterministic, 100% reliable.
- Fail safe. If a fix is ambiguous (multiple valid corrections possible), pass the output through rather than corrupt it. A minor error is better than a confident wrong "fix."
- Log everything. Track every fix applied, categorized by type. This data drives the feedback loop.

The Critical Feedback Loop
The validation pipeline's most important function isn't fixing outputs, it's generating improvement signals. This creates a feedback loop: the auto-fix catches errors → the errors get promoted to upstream prevention → fewer errors reach the auto-fix → the system continuously tightens.

Pillar 6: Multi-Agent Orchestration with Fewer Agents and Clear Contracts
Impact: Reduced latency, clearer debugging, fewer failure modes

The multi-agent pattern is powerful but commonly over-applied. The counter-intuitive lesson from production systems: fewer agents with well-defined responsibilities outperform many fine-grained agents.
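One way to picture a "clear contract" is a typed, validated object that is the only thing allowed to cross an agent boundary. The sketch below is an assumption-laden illustration: `RetrievalResult` and `render_prompt` are hypothetical names, and the two checks shown are examples of what a contract might enforce, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalResult:
    """Hypothetical output contract of a retrieval agent."""

    request: str
    context_items: tuple[str, ...]
    examples: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Reject malformed hand-offs at the boundary, not downstream.
        if not self.request.strip():
            raise ValueError("request must be non-empty")
        if not self.context_items:
            raise ValueError("at least one context item is required")


def render_prompt(result: RetrievalResult) -> str:
    """Deterministic first step of a hypothetical generation agent.

    It accepts only the contract type, so nothing implicit crosses over.
    """
    context = "\n".join(f"- {item}" for item in result.context_items)
    examples = "\n".join(result.examples)
    return f"Context:\n{context}\n\nExamples:\n{examples}\n\nRequest: {result.request}"
```

Making the contract immutable (`frozen=True`) means a downstream agent cannot quietly alter what the upstream agent handed over.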
Why Fewer Is Better
Each agent handoff introduces:
- Latency: serialization, network calls, context assembly
- Context loss: information dropped between boundaries
- Failure modes: each handoff is a potential error point
- Debugging complexity: tracing issues across many agents is exponentially harder

Multi-Agent Orchestration Principles
- Merge agents that always run sequentially. If Agent A always feeds into Agent B with no branching or conditional logic, they should be one agent with two internal steps.
- Parallelize independent operations. Context retrieval and example lookup are independent, so run them concurrently to halve retrieval latency.
- Route sub-tasks to cheaper models. Decomposed sub-problems are simpler by design. Use a smaller, faster, cheaper model (3x cost savings, 2x speed improvement).
- Define strict contracts. Each agent boundary should have an explicit schema defining inputs and outputs. No implicit assumptions about what crosses the boundary.
- Only 2 of 4 agents should call an LLM. The rest are purely deterministic. This minimizes non-deterministic behavior and cost.

Pillar 7: Multi-Agent Cache at Multiple Hierarchical Levels
Impact: 40-50% faster responses, 85%+ combined hit rate, significant cost reduction

A single cache layer captures only one type of repetition. Production systems need hierarchical caching where multiple levels catch different repetition patterns, from exact duplicates to semantic near-misses.

Pillar 8: Measure Everything, Learn Continuously
Impact: Enables data-driven iteration and prevents accuracy regressions.

Architecture without observability is guesswork.
The final pillar ensures every other pillar stays effective over time through comprehensive metrics and automated feedback loops. This isn't a one-time setup; it's a perpetual feedback loop. Every week, the top error patterns shift slightly. The auto-fix metrics tell you exactly where to focus next. Over months, this flywheel compounds into dramatic accuracy gains that no single prompt rewrite could achieve.

Auto-Learning for New Domains
When extending your system to new domains or knowledge areas:
- Auto-classify elements using naming conventions, type analysis, and structural patterns.
- Auto-generate templates from universal patterns (transformations, comparisons, compositions, sequences).
- Bootstrap few-shot examples from successful template outputs.
- Monitor the first 100 requests, then curate only the edge cases manually.
This reduces domain onboarding from days of manual work to minutes of automated bootstrapping plus focused human review of outliers.

Key Takeaways
- Architecture beats model size. A well-architected system with a smaller model outperforms a raw frontier model call on structured tasks at a fraction of the cost.
- Deterministic systems should do the heavy lifting. Reserve LLMs for genuinely novel, creative tasks. 70-80% of production requests should never touch an LLM.
- Verified knowledge is your top accuracy lever. Ground every prompt in context the model can trust.
- Errors are patterned, not random. Track them, compile them, and explicitly forbid them.
- Build feedback loops, not static systems. Every auto-fix, every cache miss, every routing decision is a signal for improvement.
- Fewer agents, done well. Fewer agents with strict contracts outperform 9 agents with fuzzy boundaries in accuracy, latency, and debuggability.
- Measure what matters and iterate. The system that wins isn't the one with the best day-one prompt, it's the one that improves fastest over time.
Production-grade GenAI isn't about finding the perfect prompt or waiting for the next LLM release. It's about building architectural guardrails that make failure nearly impossible, so that when failure does occur, the system learns from it automatically. These 8 pillars, applied together, transform any LLM from an unreliable black box into a precise, efficient, and continuously improving production system.
256 Views, 1 like, 1 Comment

Foundry Toolbox preview not working for hosted agent
I tried calling a hosted Foundry agent with calls to Toolbox, using both web search and code interpreter. Neither of them works.

If I use session.call_tool, I get an error like " meta={'tool_configuration': {'type': 'web_search'}} content=[TextContent(type='text', text='NotFound[404, user=The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.]', annotations=None, meta=None)] structuredContent=None isError=True".

If I try agent.run asking for the latest news on a topic, I get either a generic response based on pretrained knowledge (without reference to the web search tool) or a generic error like " I wasn't able to retrieve the latest news at the moment due to a technical issue."

I have verified that the code uses the appropriate headers, like " headers={"Foundry-Features": "Toolboxes=V1Preview"}". I have also verified that a Foundry portal agent calling the web search tool works as expected.

However, when I create a custom tool using MCP Server, providing the URL of the Foundry toolbox, and then try to use this tool in a portal-created agent, I always get an access issue. This happens even if I use the project identity for Entra authentication, and even though the Project Identity has the Foundry User privilege on the Foundry Project.

I have also tried the GitHub samples for deploying hosted agents with Foundry toolbox, without luck. The version of agent-framework I have tried as of this date is 1.4.0.

Please advise on a resolution. Thanks!
kundurmutt, May 19, 2026, Copper Contributor
20 Views, 0 likes, 0 Comments

GPT-5.5-Pro not listed in foundry?
The model is mentioned in this blog post: https://azure.microsoft.com/en-us/blog/openais-gpt-5-5-in-microsoft-foundry-frontier-intelligence-on-an-enterprise-ready-platform/ But it is currently not listed on Foundry. The latest pro model listed is 5.4-pro. When will the 5.5-pro model be available on Azure Foundry?
112 Views, 0 likes, 0 Comments

Unable to add SharePoint site as a tool in Foundry Agent (403 – User does not have valid license)
Hi, I’m very new to Foundry and I’m trying to add a SharePoint site as a tool (SharePoint grounding) in a Foundry Agent, but it fails with:

HTTP 403 – Forbidden
Authorization Failed – User does not have valid license
Tool: sharepoint_grounding
Error: {"error": "Tool_User_Error", "message": "[Sharepoint-tool] Request to Graph API failed with HTTP status 403, error-code: Forbidden and error-message: Authorization Failed - User does not have valid license. Client Request Id: 0000000000000000000000. Find out more troubleshooting details here - https://aka.ms/foundrysharepointtroubleshooting", "code": "sharepoint_grounding_tool_user_error", "tool": "sharepoint_grounding", "allow_retry": false, "extra_info": null}

Azure roles, Graph permissions, and SharePoint access are all correctly configured (Owner, Azure AI Admin/Developer/User), and the SharePoint site is accessible outside Foundry. Despite this, Foundry blocks the tool with a license error.

Any help or guidance would be very much appreciated.
Regards, Angela
Angela2, Apr 25, 2026, Copper Contributor
145 Views, 0 likes, 1 Comment

Need Guidance on cost breakdown of Microsoft Foundry Agent portal I created
I have developed a complaint handling portal for customers and employees using Azure AI Foundry. The solution is built with Foundry agents, models from the catalog, input/output caching, agent logging/tracing, and other Foundry capabilities. The frontend and orchestration layer are deployed on Azure Container Apps.

While Azure Cost Analysis provides an overview of spending, several parts remain unclear or act as a black box for accurate estimation, including:
- Token consumption assumptions (input/output tokens across different models and agents)
- User concurrency, sessions, and behavior patterns
- Agent logging and observability costs
- Impact of input/output caching
- Detailed resource consumption and billing in Azure Container Apps

What is the best way to accurately calculate or estimate the total running cost for such an Azure AI Foundry-based platform with a Container Apps frontend? Are there official Microsoft documentation, pricing guides, or reference architectures for cost breakdown? How do companies typically present costs for such AI platforms to attract customers (e.g., TCO models or per-user pricing)? I want to know how the platform costs are shown to customers.

Thank you.
Tasmia_Monzoor, Apr 25, 2026, Copper Contributor
119 Views, 0 likes, 1 Comment
Tags
- AMA: 74 Topics
- AI Platform: 57 Topics
- TTS: 50 Topics
- azure ai foundry: 28 Topics
- azure ai: 27 Topics
- azure ai services: 22 Topics
- azure: 15 Topics
- azure machine learning: 13 Topics
- azureai: 13 Topics
- machine learning: 10 Topics