Building an Auditable Security Layer for Agentic AI
Most agent failures do not look like breaches. They look like a normal chat, a normal answer, and a normal tool call. Until the next morning, when a single question collapses the whole story: who authorized that action. You think you deployed an agent. In reality, you deployed an unbounded automation pipeline that happens to speak English.

I'm Hazem Ali — Microsoft AI MVP, Distinguished AI & ML Architect, Founder & CEO at Skytells. For over 20 years, I've built secure, scalable enterprise AI across cloud and edge, with a focus on agent security and sovereign, governed AI architectures. My work on these systems is widely referenced by practitioners across multiple regions. I was honored to receive an official speaker invitation, under the patronage of H.H. Sheikh Dr. Sultan bin Muhammad Al Qasimi, Member of the UAE Supreme Council and Ruler of Sharjah, to speak at the Sharjah International Conference on Linguistic Intelligence (SICLI), organized by the American University of Sharjah (AUS) and the Emirates Scholar Center for Research and Studies.

This piece is a collaboration with Hammad Atta, Practice Lead – AI Security & Cloud Strategy, together with Dr. Yasir Mehmood, Dr. Muhammad Zeeshan Baig, Dr. Muhammad Aatif, and Dr. Muhammad Aziz ul Haq. We align on one core idea: agent security is not about making the model behave. It is about building enforceable boundaries around the model and proving every privileged step. This article is meant to sit next to my earlier Tech Community piece, Zero-Trust Agent Architecture: How To Actually Secure Your Agents, and go one level deeper into the mechanics you can implement on Azure today.

The Principle: The model is not your boundary

Let me break it down the way I'd explain it in a design review. A boundary is something that still holds when the component on the other side is adversarial, confused, or simply wrong. An LLM is none of those reliably.

In an agent, the model is not just a generator. It becomes a planner and scheduler. It decides when to retrieve, which tool to call, how to shape arguments, and when to loop. That means your real attack surface is not "bad output." It is the control-flow graph the model is allowed to traverse. So if your "security" lives inside the prompt, you are putting policy in the same token stream the attacker can influence. That is not a boundary. That is a suggestion.

The only stable design is to treat the model as an untrusted proposer and the runtime as the verifier. Here is the chain I use. Each gate is external to the model and survives manipulation.

- Context Gate: Everything that enters the model is treated as executable influence, not "text."
- Capability Gate: Tools are invoked as constrained capabilities, not free-form function calls.
- Evidence Gate: Every privileged step produces a verifiable artifact, not a story.
- Retrieval Control Plane: What the agent can see is governed by labels and identity, not prompt etiquette.
- Detection Layer: Drift and probing become alerts, not surprises.

Now the rare part, the part most people miss: the boundary is not "block or allow." The boundary is stateful. Once the runtime sees a suspicious signal, the entire session must transition into a degraded capability state, and every downstream gate must enforce that state.

1. Treat context as executable influence, and preserve provenance

If you do RAG, your documents are not "supporting info." They are an input channel. That makes the biggest prompt-injection risk not the user. It is your documents.
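Before getting into the gates, here is a minimal sketch of what "stateful" means in practice: a session risk state that can only be downgraded, never quietly restored, so every later gate inherits the degradation. The type names and the monotonic-downgrade rule are my illustration of the idea, not code from a shipping runtime:

```typescript
type RiskState = "NORMAL" | "SUSPECT" | "BLOCK";

// Higher rank = less authority. Transitions are monotonic: a session
// can become more restricted within a turn, never less.
const RANK: Record<RiskState, number> = { NORMAL: 0, SUSPECT: 1, BLOCK: 2 };

class SessionRisk {
  private state: RiskState = "NORMAL";

  // Every gate reports signals here; the session keeps the worst one seen.
  observe(signal: RiskState): RiskState {
    if (RANK[signal] > RANK[this.state]) this.state = signal;
    return this.state;
  }

  // Downstream gates read the session state, not the per-turn signal.
  current(): RiskState {
    return this.state;
  }
}
```

The point of the monotonic rule is that a clean-looking turn after a suspicious one does not restore authority; only an explicit, out-of-band reset should.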
Microsoft's Prompt Shields covers user prompt attacks (scanned at the user input intervention point) and document attacks (scanned at the user input and tool response intervention points). When enabled, each request returns annotation results with detected and filtered values that your runtime can translate into a policy decision: block, degrade, or allow.

Provenance Collapse. Most teams concatenate prompt + policy + retrieved chunks into one blob. The moment you do that, you lose the one thing you need for a defensible boundary: you can no longer reliably tell which tokens came from where. That is how "context" becomes "authority."

For indirect/document attacks, Microsoft guidance recommends delimiting context documents inside the prompt using """<documents> ... </documents>""". That delimiter is not formatting. It is a provenance marker that improves indirect attack detection through Prompt Shields.

Minimal, practical pattern:

```typescript
// Provenance-preserving prompt construction for indirect/document attack detection
function buildPrompt(system: string, user: string, retrievedDocs: string[]): string {
  const docs = retrievedDocs.map((d) => `- ${d}`).join("\n");
  return [
    system,
    "",
    `User: ${user}`,
    "",
    `""" <documents>\n${docs}\n</documents> """`,
  ].join("\n");
}
```

Then treat Prompt Shields output as a session security event, not a banner:

```typescript
type RiskState = "NORMAL" | "SUSPECT" | "BLOCK";
type FilterPolicy = "BLOCK_ON_FILTERED" | "DEGRADE_ON_FILTERED";

function computeRiskState(
  shields: { detected: boolean; filtered?: boolean },
  labels: string[],
  policy: FilterPolicy = "DEGRADE_ON_FILTERED",
): RiskState {
  // detected => hard stop
  if (shields.detected) return "BLOCK";

  // filtered is an annotation signal: block or degrade by policy
  if (shields.filtered) {
    return policy === "BLOCK_ON_FILTERED" ? "BLOCK" : "SUSPECT";
  }

  // example: sensitivity-based degradation independent of shield hits
  const sensitive = labels.some((l) =>
    ["Confidential", "HighlyConfidential", "Regulated"].includes(l),
  );
  return sensitive ? "SUSPECT" : "NORMAL";
}
```

When the signal is clear, you block and log. When it is suspicious, you do not warn. You downgrade authority.

QSAF Alignment:

- Prompt Injection Protection (Domain 1): QSAF-PI-001 (static pattern blacklist), QSAF-PI-002 (dynamic LLM analysis), QSAF-PI-003 (semantic embedding comparison) – all addressed by Prompt Shields and provenance marking.
- Context Manipulation (Domain 2): QSAF-RC-004 (context drift), QSAF-RC-007 (nested prompt injection) – mitigated by stateful risk calculation.

2. Tools are capabilities with constraints, not functions

When the model proposes a tool call, your runtime should re-derive what is allowed from identity plus risk state, then enforce it at the gateway.

```typescript
type ToolRequest = {
  tool: string;
  args: unknown;
};

type Capabilities = {
  allowWrite: boolean;
  allowedTools: Set<string>;
};

function deriveCapabilities(risk: RiskState, roles: string[]): Capabilities {
  const baseAllowed = new Set(["search_kb", "get_profile", "summarize"]);
  const isAdmin = roles.includes("Admin");

  if (risk === "SUSPECT") {
    return { allowWrite: false, allowedTools: baseAllowed };
  }
  if (risk === "BLOCK") {
    return { allowWrite: false, allowedTools: new Set() };
  }

  // NORMAL
  const tools = new Set([
    ...baseAllowed,
    ...(isAdmin ? ["update_record", "issue_refund"] : []),
  ]);
  return { allowWrite: isAdmin, allowedTools: tools };
}

function authorizeTool(req: ToolRequest, caps: Capabilities): void {
  if (!caps.allowedTools.has(req.tool)) throw new Error("ToolNotAllowed");
  if (!caps.allowWrite && req.tool.startsWith("update_")) {
    throw new Error("WriteDenied");
  }
}
```

The model can ask. It cannot grant itself permission.

QSAF Alignment:

- Plugin Abuse Monitoring (Domain 3): QSAF-PL-001 (whitelist enforcement), QSAF-PL-003 (restrict sensitive plugins), QSAF-PL-006 (rate-limiting) – implemented via capability derivation and gateway policies.
- Behavioral Anomaly Detection (Domain 5): QSAF-BA-006 (plugin execution pattern deviance) – detected by comparing actual calls against derived capabilities.

The Integrity Gate: Hash-chain the authority, not the output

Let me add the part that makes investigations clean. Most teams treat integrity like an audit log problem. That is not enough. Logs explain. Integrity proves.

The hard truth is that agent authority is assembled out of pieces: the system instruction, the user prompt, retrieved chunks, risk annotations, and finally the tool intent. If you do not bind those pieces together cryptographically, an incident review becomes a story-telling session. This is why QSAF has an entire domain for payload integrity and signing, including prompt hash signing, nonce or replay protection, and a hash chain lineage that tracks how a session evolved.

Here is how you can map that onto what the runtime verifies. You build a canonical "authority envelope" for every privileged hop, compute a digest, and then:

- link it to the previous hop (hash chain)
- include a nonce (replay control)
- sign the digest with Azure Key Vault (Key Vault signs digests; it does not hash your content for you)

```typescript
import crypto from "crypto";

type AuthorityEnvelope = {
  sessionId: string;
  turnId: number;
  policyVersion: string;

  // provenance-preserved components
  systemHash: string;
  userHash: string;
  documentsHash: string; // hash of structured retrieved chunks (not just rendered text)

  shields: {
    detected: boolean;
    filtered: boolean;
  };
  riskState: "NORMAL" | "SUSPECT" | "BLOCK";

  // proposed action (if any)
  tool?: {
    name: string;
    argsHash: string;
  };

  // anti-replay + lineage
  nonce: string;
  prevDigest?: string;
  ts: string;
};

function sha256(bytes: string): string {
  return crypto.createHash("sha256").update(bytes).digest("hex");
}

// Canonicalization matters. JSON.stringify is OK if you control key order.
// For cross-language use, prefer RFC 8785 (JCS) canonical JSON.
function canonicalJson(x: unknown): string {
  return JSON.stringify(x);
}

function buildEnvelope(
  input: Omit<AuthorityEnvelope, "nonce" | "ts">,
): AuthorityEnvelope {
  return {
    ...input,
    nonce: crypto.randomUUID(),
    ts: new Date().toISOString(),
  };
}

function digestEnvelope(env: AuthorityEnvelope): string {
  return sha256(canonicalJson(env));
}
```

Then you call Key Vault to sign that digest (REST sign), and optionally verify it later (REST verify); a sketch of that call follows below.

The rare failure mode this blocks is subtle: authority splicing. Without a hash chain, it is possible for the runtime to correctly validate a tool call, but later be unable to prove which retrieved chunk, which Prompt Shields result, and which policy version were in force when that call was authorized. With the chain, every privileged hop becomes tamper-evident.

This is the point: Prompt Shields tells you "this looks dangerous." Document delimiters preserve provenance.
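To make the signing hop concrete, here is a minimal sketch using the CryptographyClient from the @azure/keyvault-keys SDK. The vault URL, key name, and the choice of RS256 are assumptions for illustration; the one load-bearing detail, which the article stresses, is that you pass a locally computed digest, never raw content:

```typescript
import { DefaultAzureCredential } from "@azure/identity";
import { KeyClient, CryptographyClient } from "@azure/keyvault-keys";

// Assumed names, for illustration only.
const vaultUrl = "https://<your-vault-name>.vault.azure.net";
const keyName = "agent-authority-signing";

async function signEnvelopeDigest(digestHex: string): Promise<Uint8Array> {
  const credential = new DefaultAzureCredential();
  const keyClient = new KeyClient(vaultUrl, credential);
  const key = await keyClient.getKey(keyName);

  // Key Vault signs a digest you computed locally; it never hashes content for you.
  const cryptoClient = new CryptographyClient(key.id!, credential);
  const digestBytes = Buffer.from(digestHex, "hex"); // 32 bytes for SHA-256

  const result = await cryptoClient.sign("RS256", digestBytes);
  return result.result; // store alongside the envelope for later verification
}
```

Verification works the same way in reverse: recompute the digest from the stored envelope and call the verify operation against the stored signature.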
The integrity gate makes the runtime able to say, later, with evidence: "This is exactly what I accepted as authority."

QSAF Alignment:

- Payload Integrity & Signing (Domain 6): QSAF-PY-001 (prompt hash signing), QSAF-PY-005 (nonce/replay control), QSAF-PY-006 (hash chain lineage) – directly implemented via the envelope and chaining.

Tools must sit behind a wall that can say "no"

Tool calls are where language becomes authority. If an agent can call APIs that mutate state, your security story is not about the response text. It is about whether the tool call is allowed under explicit policy. This is exactly where Azure API Management belongs: as the tool gateway that enforces authentication and authorization before any tool request reaches your backend. The validate-jwt policy is the canonical enforcement mechanism for validating JWTs at the gateway.

The design goal is simple: the model can request a tool call; the gateway decides if it is permitted. A capability token approach keeps it clean:

```xml
<!-- APIM inbound policy sketch -->
<validate-jwt header-name="Authorization" failed-validation-httpcode="401">
  <required-claims>
    <claim name="scp">
      <value>tools.read</value>
    </claim>
  </required-claims>
</validate-jwt>
```

The claim name (scp, roles, or custom claims) depends on your token issuer; the point is enforcing authorization at the gateway, not inside model text. Now you can enforce "read-only mode" by issuing tokens that simply do not carry write scopes. The model can try to call a write tool. It still gets denied by policy.

Evidence is not logs. Evidence is a signed chain.

Logs help you debug. Evidence helps you prove. So you hash the session envelope and the tool intent, then sign the digest using Azure Key Vault Keys. Key Vault sign creates a signature from a digest, and verify checks a signature against a digest. The Key Vault documentation is explicit that signing is sign-hash, not "sign arbitrary content": Key Vault does not hash your content for you. You hash locally, then ask Key Vault to sign the hash.

```typescript
import crypto from "crypto";

const sha256 = (x: unknown): string =>
  crypto.createHash("sha256").update(JSON.stringify(x)).digest("hex");

type IntentEnvelope = {
  sessionId: string;
  userId: string;
  promptHash: string;
  documentsHash: string;
  tool: string;
  argsHash: string;
  nonce: string;
  ts: string;
  policyVersion: string;
};

function buildIntent(
  sessionId: string,
  userId: string,
  prompt: string,
  docs: unknown,
  tool: string,
  args: unknown,
  policyVersion: string,
): IntentEnvelope {
  return {
    sessionId,
    userId,
    promptHash: sha256(prompt),
    documentsHash: sha256(docs),
    tool,
    argsHash: sha256(args),
    nonce: crypto.randomUUID(),
    ts: new Date().toISOString(),
    policyVersion,
  };
}
```

Once you do this, your system stops "explaining." It starts proving.

Govern what the agent can see, not only what it can say

RAG without governance eventually becomes a data exposure feature. This is why I treat retrieval as a governed operation. Microsoft Purview sensitivity labels give you a practical way to classify content and build retrieval rules on top of that classification. Microsoft documents creating and configuring sensitivity labels in Purview. The pattern is simple:

- Label the corpus.
- Filter retrieval by label and identity policy.
- Log label distribution per completion.
- Alert when a low-privilege identity retrieves high-sensitivity labels.

This is how you keep sovereignty real. Not in a slide deck. In the retrieval path.
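To make the "filter retrieval by label and identity policy" step concrete, here is a minimal sketch that assumes the corpus is indexed in Azure AI Search with a sensitivity-label field propagated at ingestion time. The index name, the sensitivityLabel field, and the role-to-label mapping are all assumptions for illustration:

```typescript
import { DefaultAzureCredential } from "@azure/identity";
import { SearchClient } from "@azure/search-documents";

// Assumed endpoint, index name, and field name for illustration only.
const searchClient = new SearchClient<Record<string, unknown>>(
  "https://<your-search-service>.search.windows.net",
  "agent-corpus",
  new DefaultAzureCredential(),
);

// Map the caller's clearance to the labels they may retrieve.
function allowedLabelsFor(roles: string[]): string[] {
  const base = ["Public", "General"];
  return roles.includes("Admin") ? [...base, "Confidential"] : base;
}

async function governedRetrieve(query: string, roles: string[]) {
  const labels = allowedLabelsFor(roles);
  const filter = labels.map((l) => `sensitivityLabel eq '${l}'`).join(" or ");

  const results = await searchClient.search(query, { filter, top: 5 });
  const hits: Record<string, unknown>[] = [];
  for (await const r of results.results) hits.push(r.document);
  return hits; // also log the label distribution per completion for the detection layer
}
```

The filter is the enforcement point: the identity's clearance is resolved before retrieval, so a prompt can never talk its way into a higher-sensitivity tier.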
Operate it like a security system: posture and detection

Inline gates reduce risk. They do not eliminate it. Systems drift. People add tools. Policies get loosened. Attacks evolve.

Microsoft Defender for Cloud's Defender CSPM plan includes AI security posture management for generative AI apps and AI agents (Preview), including discovery and inventory of AI agents deployed with Azure AI Foundry. Then you use Microsoft Sentinel to turn your telemetry into incidents, with scheduled analytics rules. Your detections should match the gates you built:

- Repeated Prompt Shields detections from the same identity or session.
- Tool-call spikes after a suspicious document signal.
- APIM denials for write endpoints from sessions in read-only mode.
- High-sensitivity label retrieval by identities that should never touch that tier.

QSAF Alignment:

- Behavioral Anomaly Detection (Domain 5): QSAF-BA-001 (session entropy), QSAF-BA-004 (repeated intent mutation), QSAF-BA-007 (unified risk score) – detected via Sentinel rules.
- Cross-Environment Defense (Domain 9): QSAF-CE-006 (coordinated alert response) – using Sentinel incidents and playbooks.

Where the reference checklist fits, quietly

Behind the scenes, we use a control checklist lens to ensure we cover prompt/context attacks, tool misuse, integrity, governance, and operational monitoring. The point is not to rename Microsoft features into framework terms. The point is to make the system enforceable and auditable using Azure-native gates.

Closing

Zero trust for agents is not a slogan. It is a build.

- Prompt Shields gives you a front gate for both user prompt attacks and document attacks, with clear annotations like detected and filtered.
- API Management gives you a tool boundary that can say "no" regardless of what the model tries, using validate-jwt.
- Signed intent gives you evidence, using Key Vault's sign-hash semantics.
- Purview labels give you governed retrieval.
- Sentinel and Defender give you an operating model, not wishful thinking.

If you want the conceptual spine and the architectural principles that frame this pipeline, start with my earlier Tech Community pieces, then come back here and implement the gates. Thanks for reading — Hazem Ali

Published agent from Foundry doesn't work at all in Teams and M365
I've switched to the new version of Azure AI Foundry (New) and created a project there. Within this project, I created an Agent and connected two custom MCP servers to it. The agent works correctly inside Foundry Playground and responds to all test queries as expected. My goal was to make this agent available for my organization in Microsoft Teams / Microsoft 365 Copilot, so I followed all the steps described in the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/publish-copilot?view=foundry

Issue description

The first problems started at Step 8 (Publishing the agent).

Organization scope publishing. I published the agent using Organization scope. The agent appeared in Microsoft Admin Center in the list of agents. However, when an administrator from my organization attempted to approve it, the approval always failed with a generic error: "Sorry, something went wrong." No diagnostic information, error codes, or logs were provided. We tried recreating and republishing the agent multiple times, but the result was always the same.

Shared scope publishing. As a workaround, I published the agent using Shared scope. In this case, the agent finally appeared in Microsoft Teams and Microsoft 365 Copilot. I can now see the agent here:

- Microsoft Teams → Copilot
- Microsoft Teams → Applications → Manage applications

However, this revealed the main issue.

Main problem

The published agent cannot complete any query in Teams, despite the fact that the agent works perfectly in Foundry Playground and responds correctly to the same prompts before publishing. In Teams, every query results in messages such as: "Sorry, something went wrong. Try to complete a query later."

Simplification test

To exclude MCP or instruction-related issues, I performed the following:

- Disabled all MCP tools
- Removed all complex instructions
- Left only a minimal system prompt: "When the user types 123, return 456"

I then republished the agent. The agent appeared in Teams again, but the behavior did not change — it does not respond at all.

Permissions warning in Teams

When I go to Teams → Applications → Manage Applications → My agent → View details, I see a red warning label: "Permissions needed. Ask your IT admin to add InfoConnect Agent to this team/chat/meeting."

This message is confusing because:

- The administrator has already added all required permissions
- All relevant permissions were granted in Microsoft Entra ID
- Admin consent was provided

Because of this warning, I also cannot properly share the agent with my colleagues.

Additional observation

I have a similar agent configured in Copilot Studio. It shows the same permissions warning, but that agent still responds correctly in Teams and can also successfully call some MCP tools. This suggests that the issue is specific to Azure AI Foundry agents, not to Teams or tenant-wide permissions in general.

Steps already taken to resolve the issue

- Configured all required RBAC roles in Azure Portal according to: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/rbac-foundry?view=foundry-classic
- During publishing, an agent-bot application was automatically created. I added my account to this bot with the Azure AI User role.
- Also assigned Azure AI User to the project's Managed Identity and to the project resource itself.
- Verified all permissions related to AI agent publishing in Microsoft Admin Center and Microsoft Teams Admin Center.
- Simplified and republished the agent multiple times.
- Deleted the automatically created agent-bot and allowed Foundry to recreate it.
- Created a new Foundry project, configured several simple agents, and published them — the same issue occurs.
- Tried publishing with different models: gpt-4.1, o4-mini.
- Manually configured permissions in Microsoft Entra ID → App registrations / Enterprise applications → API permissions; added both Delegated and Application permissions and granted Admin consent.
- Added myself and my colleagues as Azure AI User in Foundry → Project → Project users.
- Followed all steps mentioned in this related discussion: https://techcommunity.microsoft.com/discussions/azure-ai-foundry-discussions/unable-to-publish-foundry-agent-to-m365-copilot-or-teams/4481420

Questions

- How can I make a Foundry agent work correctly in Microsoft Teams?
- Why does the agent fail to process requests in Teams while working correctly in Foundry?
- What does the "Permissions needed" warning actually mean for Foundry agents?
- How can I properly share the agent with other users in my organization?

Any guidance, diagnostics, or clarification on the correct publishing and permission model for Foundry agents in Teams would be greatly appreciated.

Build an Offline Hybrid RAG Stack with ONNX and Foundry Local
If you are building local AI applications, basic retrieval augmented generation is often only the starting point. This sample shows a more practical pattern: combine lexical retrieval, ONNX based semantic embeddings, and a Foundry Local chat model so the assistant stays grounded, remains offline, and degrades cleanly when the semantic path is unavailable.

Why this sample is worth studying

Many local RAG samples rely on a single retrieval strategy. That is usually enough for a proof of concept, but it breaks down quickly in production. Exact keywords, acronyms, and document codes behave differently from natural language questions and paraphrased requests. This repository keeps the original lexical retrieval path, adds local ONNX embeddings for semantic search, and fuses both signals in a hybrid ranking mode. The generation step runs through Foundry Local, so the entire assistant can remain on device.

- Lexical mode handles exact terms and structured vocabulary.
- Semantic mode handles paraphrases and more natural language phrasing.
- Hybrid mode combines both and is usually the best default.
- Lexical fallback protects the user experience if the embedding pipeline cannot start.

Architectural overview

The sample has two main flows: an offline ingestion pipeline and a local query pipeline. The architecture splits cleanly into offline ingestion at the top and runtime query handling at the bottom.

Offline ingestion pipeline

- Read Markdown files from docs/.
- Parse front matter and split each document into overlapping chunks.
- Generate dense embeddings when the ONNX model is available.
- Store chunks in SQLite with both sparse lexical features and optional dense vectors.

Local query pipeline

- The browser posts a question to the Express API.
- ChatEngine resolves the requested retrieval mode.
- VectorStore retrieves lexical, semantic, or hybrid results.
- The prompt is assembled with the retrieved context and sent to a Foundry Local chat model.
- The answer is returned with source references and retrieval metadata.

The sequence diagram shows the difference between lexical retrieval and hybrid retrieval. In hybrid mode, the query is embedded first, then lexical and semantic scores are fused before prompt assembly.

Repository structure and core components

The implementation is compact and readable. The main files to understand are listed below.

- src/config.js: retrieval defaults, paths, and model settings.
- src/embeddingEngine.js: local ONNX embedding generation through Transformers.js.
- src/vectorStore.js: SQLite storage plus lexical, semantic, and hybrid ranking.
- src/chatEngine.js: retrieval mode resolution, prompt assembly, and Foundry Local model execution.
- src/ingest.js: document ingestion and embedding generation during indexing.
- src/server.js: REST endpoints, streaming endpoints, upload support, and health reporting.

Getting started

To run the sample, you need Node.js 20 or newer, Foundry Local, and a local ONNX embedding model. The default model path is models/embeddings/bge-small-en-v1.5.

```
cd c:\Users\leestott\local-hybrid-retrival-onnx
npm install
huggingface-cli download BAAI/bge-small-en-v1.5 --local-dir models/embeddings/bge-small-en-v1.5
npm run ingest
npm start
```

Ingestion writes the local SQLite database to data/rag.db. If the embedding model is available, each chunk gets a dense vector as well as lexical features. If the embedding model is missing, ingestion still succeeds and the application remains usable in lexical mode.
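The ingestion step's overlapping chunker is not shown in the excerpts below; here is a plausible sketch of that step, using the chunkSize and chunkOverlap values from src/config.js. The function name and the whitespace-token simplification are my assumptions, not the repository's actual code:

```typescript
// Overlapping chunker sketch: split on whitespace "tokens" and slide a
// window of chunkSize tokens forward by (chunkSize - chunkOverlap).
function chunkDocument(
  text: string,
  chunkSize = 200,
  chunkOverlap = 25,
): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // last window reached
  }
  return chunks;
}
```

The overlap is what prevents a fact that straddles a chunk boundary from being invisible to both lexical and semantic retrieval.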
Best practice: local AI applications should treat model files, SQLite data, and native runtime compatibility as part of the deployable system, not as optional developer conveniences.

Code walkthrough

1. Retrieval configuration

The sample makes its retrieval behaviour explicit in configuration. That is useful for testing and for operator visibility.

```javascript
export const config = {
  model: "phi-3.5-mini",
  docsDir: path.join(ROOT, "docs"),
  dbPath: path.join(ROOT, "data", "rag.db"),
  chunkSize: 200,
  chunkOverlap: 25,
  topK: 3,
  retrievalMode: process.env.RETRIEVAL_MODE || "hybrid",
  retrievalModes: ["lexical", "semantic", "hybrid"],
  fallbackRetrievalMode: "lexical",
  retrievalWeights: {
    lexical: 0.45,
    semantic: 0.55,
  },
};
```

Those defaults tell you a lot about the intended operating profile. Chunks are small, the number of returned chunks is low, and the fallback path is explicit.

2. Local ONNX embeddings

The embedding engine disables remote model loading and only uses local files. That matters for privacy, repeatability, and air gapped operation.

```javascript
env.allowLocalModels = true;
env.allowRemoteModels = false;

this.extractor = await pipeline("feature-extraction", resolvedPath, {
  local_files_only: true,
});

const output = await this.extractor(text, {
  pooling: "mean",
  normalize: true,
});
```

The mean pooling and normalisation step make the vectors suitable for cosine similarity based ranking.

3. Hybrid storage and ranking in SQLite

Instead of adding a separate vector database, the sample stores lexical and semantic representations in the same SQLite table. That keeps the local footprint low and the implementation easy to debug.

```javascript
searchHybrid(query, queryEmbedding, topK = 5, weights = { lexical: 0.45, semantic: 0.55 }) {
  const lexicalResults = this.searchLexical(query, topK * 3);
  const semanticResults = this.searchSemantic(queryEmbedding, topK * 3);

  if (semanticResults.length === 0) {
    return lexicalResults.slice(0, topK).map((row) => ({
      ...row,
      retrievalMode: "lexical",
    }));
  }

  const fused = [...combined.values()].map((row) => ({
    ...row,
    score: (row.lexicalScore * lexicalWeight) + (row.semanticScore * semanticWeight),
  }));
  fused.sort((a, b) => b.score - a.score);
  return fused.slice(0, topK);
}
```

The important point is not just the weighted fusion. It is the fallback behaviour. If semantic retrieval cannot provide results, the user still gets lexical grounding instead of an empty context window.

4. Retrieval mode resolution in ChatEngine

ChatEngine keeps the runtime behaviour predictable. It validates the requested mode and falls back to lexical search when semantic retrieval is unavailable.

```javascript
resolveRetrievalMode(requestedMode) {
  const desiredMode = config.retrievalModes.includes(requestedMode)
    ? requestedMode
    : config.retrievalMode;
  if ((desiredMode === "semantic" || desiredMode === "hybrid") && !this.semanticAvailable) {
    return config.fallbackRetrievalMode;
  }
  return desiredMode;
}
```

This is a sensible production design because local runtime failures are common. Missing model files or native dependency mismatches should reduce quality, not crash the entire assistant.

5. Foundry Local model management

The sample uses FoundryLocalManager to discover, download, cache, and load the configured chat model.
```javascript
const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" });
const catalog = manager.catalog;
this.model = await catalog.getModel(config.model);

if (!this.model.isCached) {
  await this.model.download((progress) => {
    const pct = Math.round(progress * 100);
    this._emitStatus("download", `Downloading ${this.modelAlias}... ${pct}%`, progress);
  });
}

await this.model.load();
this.chatClient = this.model.createChatClient();
this.chatClient.settings.temperature = 0.1;
```

This gives the app a better local startup experience. The server can expose a status stream while the model initialises in the background.

User experience and screenshots

The client is intentionally simple, which makes it useful during evaluation. You can switch retrieval mode, test questions quickly, and inspect the retrieved sources. The landing page exposes retrieval mode directly in the UI, which makes it easy to compare lexical, semantic, and hybrid behaviour during testing. The sources panel shows grounding evidence and retrieval scores, which is useful when validating whether better answers are coming from better retrieval or just model phrasing.

Best practices for ONNX RAG and Foundry Local

- Keep lexical fallback alive. Exact identifiers and runtime failures both make this necessary.
- Persist sparse and dense features together where possible. It simplifies debugging and operational reasoning.
- Use small chunks and conservative topK values for local context budgets.
- Expose health and status endpoints so users can see when the model is still loading or embeddings are unavailable.
- Test retrieval quality separately from generation quality.
- Pin and validate native runtime dependencies, especially ONNX Runtime, before tuning prompts.

Practical warning: this repository already shows why runtime validation matters. A local app can ingest documents successfully and still fail at model initialisation if the native runtime stack is misaligned.

How this compares with RAG and CAG

The strongest value in this sample comes from where it sits between a basic local RAG baseline and a curated CAG design.

| Dimension | Classic local RAG | This hybrid ONNX RAG sample | CAG |
| --- | --- | --- | --- |
| Context assembly | Retrieve chunks at query time, often lexically, then inject them into the prompt. | Retrieve chunks at query time with lexical, semantic, or fused scoring, then inject the strongest results into the prompt. | Use a prepared or cached context pack instead of fresh retrieval for every request. |
| Main strength | Easy to implement and easy to explain. | Better recall for paraphrases without giving up exact match behaviour or offline execution. | Predictable prompts and low query time overhead. |
| Main weakness | Misses synonyms and natural language reformulations. | More moving parts, larger local asset footprint, and native runtime compatibility to manage. | Coverage depends on curation quality and goes stale more easily. |
| Failure behaviour | Weak retrieval leads to weak grounding. | Semantic failure can degrade to lexical retrieval if designed properly, which this sample does. | Prepared context can be too narrow for new or unexpected questions. |
| Best fit | Simple local assistants and proof of concept systems. | Offline copilots and technical assistants that need stronger recall across varied phrasing. | Stable workflows with tightly bounded, curated knowledge. |
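If you want to compare the three modes empirically rather than conceptually, a small harness like the one below works well. The endpoint path and payload shape here are assumptions based on the described Express API, so adjust them to match what src/server.js actually exposes:

```typescript
// Hypothetical harness: POST the same question under each retrieval mode
// and print the top sources, so you can see how fusion changes ranking.
const QUESTIONS = ["How do I reset the pressure sensor?", "PRV-2100 fault code"];
const MODES = ["lexical", "semantic", "hybrid"] as const;

async function compareModes(baseUrl = "http://127.0.0.1:3000") {
  for (const question of QUESTIONS) {
    for (const mode of MODES) {
      // Endpoint and body fields are assumed, not taken from the repo.
      const res = await fetch(`${baseUrl}/api/chat`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ message: question, retrievalMode: mode }),
      });
      const data = await res.json();
      console.log(`${mode} → ${question}`);
      console.log("  sources:", data.sources?.map((s: { title: string }) => s.title));
    }
  }
}

compareModes().catch(console.error);
```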
Samples

Related samples:

- Foundry Local RAG — https://github.com/leestott/local-rag
- Foundry Local CAG — https://github.com/leestott/local-cag
- Foundry Local hybrid-retrival-onnx — https://github.com/leestott/local-hybrid-retrival-onnx

Specific benefits of this hybrid approach over classic RAG

- It captures paraphrased questions that lexical search would often miss.
- It still preserves exact match performance for codes, terms, and product names.
- It gives operators a controlled degradation path when the semantic stack is unavailable.
- It stays local and inspectable without introducing a separate hosted vector service.

Specific differences from CAG

- CAG shifts effort into context curation before the request. This sample retrieves evidence dynamically at runtime.
- CAG can be faster for fixed workflows, but it is usually less flexible when the document set changes.
- This hybrid RAG design is better suited to open ended knowledge search and growing document collections.

What to validate before shipping

- Measure retrieval quality in each mode using exact term, acronym, and paraphrase queries.
- Check that sources shown in the UI reflect genuinely distinct evidence, not repeated chunks.
- Confirm the application remains usable when semantic retrieval is unavailable.
- Verify ONNX Runtime compatibility on the real target machines, not only on the development laptop.
- Test model download, cache, and startup behaviour with a clean environment.

Final take

For developers getting started with ONNX RAG and Foundry Local, this sample is a good technical reference because it demonstrates a realistic local architecture rather than a minimal demo. It shows how to build a grounded assistant that remains offline, supports multiple retrieval modes, and fails gracefully. Compared with classic local RAG, the hybrid design provides better recall and better resilience. Compared with CAG, it remains more flexible for changing document sets and less dependent on pre curated context packs. If you want a practical starting point for offline grounded AI on developer workstations or edge devices, this is the most balanced pattern in the repository set.

GraphRAG and PostgreSQL integration in docker with Cypher query and AI agents (Version 2*)
This is an update from the previous blog (version 1): GraphRAG and PostgreSQL integration in docker with Cypher query and AI agents | Microsoft Community Hub. Review the business needs of this solution in version 1.

What's new in version 2?

MCP tools for GraphRAG and PostgreSQL with Apache AGE

This solution now includes MCP tools for GraphRAG and PostgreSQL. There are five MCP tools exposed:

- [graphrag_search] Used to run a query (local or global) with runtime-tunable API parameters. One important aspect is that query behavior can be tuned at runtime, without changing the underlying index.
- [age_get_schema_cached] Used for schema inspection and diagnostics. It returns the graph schema (node labels and relationship types) from cache by default, and can optionally refresh the cache by re-querying the database. This tool is typically used for introspection or debugging, not for answering user questions about data.
- [age_entity_lookup] Used for quick entity discovery and disambiguation. It performs a simple substring match on entity names or titles and is especially useful for questions like "Who is X?" or as a preliminary step before issuing more complex graph queries.
- [age_cypher_query] Executes a user-provided Cypher query directly against the AGE graph. This is intended for advanced users who already know the graph structure and want full control over traversal logic and filters.
- [age_nl2cypher_query] Bridges natural language and Cypher. This tool converts a natural-language question into a Cypher query (using only Entity nodes and RELATED_TO edges), executes it, and returns the results. It is most effective for multi-hop or structurally complex questions where semantic interpretation is needed first, but execution must remain deterministic.

Besides that, this solution now uses the Microsoft Agent Framework. It enables clean orchestration over MCP tools, allowing the agent to dynamically select between GraphRAG and graph query capabilities at runtime, with looser coupling and a clearer execution model than traditional Semantic Kernel function plugins.

The new Docker image includes GraphRAG 3.0.5. This version stabilizes the 3.x configuration-driven, API-based architecture and improves indexing reliability, making graph construction more predictable and easier to integrate into real workflows.

New architecture

Updated Step 7 - run query in Jupyter notebook

This step runs a Jupyter notebook in docker, the same as stated in the previous blog.

```
> docker compose up query-notebook
```

After clicking the link highlighted in the above screen shot, you can explore all files within the project in the docker container, then find query-notebook.ipynb: https://github.com/Azure-Samples/postgreSQL-graphRAG-docker/blob/main/project_folder/query-notebook.ipynb

In this new version of the notebook, GraphRAG 3.0.5 uses a different library for local search and global search.

New Step 8 - run agent and MCP tools in Jupyter notebook

This step runs a Jupyter notebook in docker.

```
> docker compose up mcp-agent
```

Click on the highlighted URL, and you can start working on agent-notebook.ipynb: https://github.com/Azure-Samples/postgreSQL-graphRAG-docker/blob/main/project_folder/agent-notebook...

Multiple scenarios of agents with MCP tools are included in the notebook:

- GraphRAG search: local search and global search examples with a direct MCP call.
- GraphRAG search: local search and global search examples with an agent, with MCP tools included.
- Cypher query via a direct MCP call.
- An agent that queries in natural language, with an MCP tool included to convert NL to Cypher.
- An agent with unified MCP (all five MCP tools) that routes to the corresponding tool based on the question: ['graphrag_search', 'age_get_schema_cached', 'age_cypher_query', 'age_entity_lookup', 'age_nl2cypher_query']

Router agent: selecting the right MCP tool

The notebook also includes a router agent that has access to all five MCP tools and decides which one to invoke based on the user's question. Rather than hard-coding execution paths, the agent reasons about intent and selects the most appropriate capability at runtime.

General routing guidance used in this solution

Use [graphrag_search] when the question requires: full dataset understanding; themes, patterns, or trends across documents; exploratory or open-ended analysis; or global understanding or evaluation over a corpus of many tokens. In these cases, GraphRAG's semantic retrieval and aggregation are a better fit than explicit graph traversal.

Use AGE-based tools [age_get_schema_cached, age_entity_lookup, age_cypher_query, age_nl2cypher_query] when the question involves: specific entities or explicit relationships; deterministic graph traversal or filtering; questions that depend on graph structure rather than document semantics; or complex graph queries involving multiple entities or multi-hop paths.

Within the AGE toolset:

- [age_entity_lookup] is typically used for quick entity discovery or disambiguation.
- [age_cypher_query] is used when a precise Cypher query is already known.
- [age_nl2cypher_query] is used when the question is expressed in natural language but requires a non-trivial Cypher query to answer.
- [age_get_schema_cached] is reserved for schema inspection and diagnostics.

The router agent dynamically selects between semantic search and deterministic graph tools based on question intent, keeping retrieval, graph execution, and orchestration clearly separated and extensible.

Note: The repository also includes [age_get_schema] and [age_get_schema_details] MCP tools for debugging and development purposes. These are not exposed to agents by default and are superseded by [age_get_schema_cached] for normal use.

Key takeaways

- GraphRAG and PostgreSQL AGE querying serve different purposes, and each has its advantages.
- MCP tools provide a uniform interface to both semantic search and deterministic graph operations.
- Microsoft Agent Framework enables tool-centric orchestration, where agents select the right capability at runtime instead of hard-coding logic in prompts.
- The Jupyter-based agent workflow makes it easy to experiment with different interaction patterns, from direct tool calls to fully routed agent execution.

What's next

In this solution, the MCP server and agent runtime are architecturally separated but deployed together in a single Docker container, to demonstrate how MCP tools work and to keep local experimentation simple. There are other deployment options, such as running MCP servers remotely, where tools can be hosted and operated independently of the agent runtime. Contributions and enhancements are welcome.

Build a Fully Offline RAG App with Foundry Local: No Cloud Required
A practical guide to building an on-device AI support agent using Retrieval-Augmented Generation, JavaScript, and Microsoft Foundry Local.

The Problem: AI That Can't Go Offline

Most AI-powered applications today are firmly tethered to the cloud. They assume stable internet, low-latency API calls, and the comfort of a managed endpoint. But what happens when your users are in an environment with zero connectivity — a gas pipeline in a remote field, a factory floor, an underground facility?

That's exactly the scenario that motivated this project: a fully offline RAG-powered support agent that runs entirely on a laptop. No cloud. No API keys. No outbound network calls. Just a local model, a local vector store, and domain-specific documents — all accessible from a browser on any device.

The Gas Field Support Agent — running entirely on-device.

What is RAG and Why Should You Care?

Retrieval-Augmented Generation (RAG) is a pattern that makes language models genuinely useful for domain-specific tasks. Instead of hoping the model "knows" the answer from pre-training, you:

- Retrieve relevant chunks from your own documents
- Augment the model's prompt with those chunks as context
- Generate a response grounded in your actual data

The result: fewer hallucinations, traceable answers, and an AI that works with your content. If you're building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want.

Why fully offline? Data sovereignty, air-gapped environments, field operations, latency-sensitive workflows, and regulatory constraints all demand AI that doesn't phone home. Running everything locally gives you complete control over your data and eliminates any external dependency.

The Tech Stack

This project is deliberately simple — no frameworks, no build steps, no Docker:

| Layer | Technology | Why |
| --- | --- | --- |
| AI Model | Foundry Local + Phi-3.5 Mini | Runs locally, OpenAI-compatible API, no GPU needed |
| Backend | Node.js + Express | Lightweight, fast, universally known |
| Vector Store | SQLite via better-sqlite3 | Zero infrastructure, single file on disk |
| Retrieval | TF-IDF + cosine similarity | No embedding model required, fully offline |
| Frontend | Single HTML file with inline CSS | No build step, mobile-responsive, field-ready |

The total dependency footprint is just four npm packages: express, openai, foundry-local-sdk, and better-sqlite3.

Architecture Overview

The system has five layers — all running on a single machine:

Five-layer architecture: Client → Server → RAG Pipeline → Data → AI Model

- Client Layer — A single HTML file served by Express, with quick-action buttons and responsive chat
- Server Layer — Express.js handles API routes for chat (streaming + non-streaming), document upload, and health checks
- RAG Pipeline — The chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorization
- Data Layer — SQLite stores document chunks and their TF-IDF vectors; source docs live as .md files
- AI Layer — Foundry Local runs Phi-3.5 Mini Instruct on CPU/NPU, exposing an OpenAI-compatible API

Getting Started in 5 Minutes

You need two prerequisites:

- Node.js 20+ — nodejs.org
- Foundry Local — Microsoft's on-device AI runtime:

```
winget install Microsoft.FoundryLocal
```

Then clone, install, ingest, and run:

```
git clone https://github.com/leestott/local-rag.git
cd local-rag
npm install
npm run ingest   # Index the 20 gas engineering documents
npm start        # Start the server + Foundry Local
```

Open http://127.0.0.1:3000 and start chatting. Foundry Local auto-downloads Phi-3.5 Mini (~2 GB) on first run.
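The pipeline walkthrough below calls two helpers, termFrequency and cosineSimilarity, that the repo excerpts don't show. Here is a plausible sketch of what they might look like, reconstructed from the call sites rather than copied from the actual source:

```typescript
// Term frequency: map each lowercase word to its count in the text.
function termFrequency(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(word, (tf.get(word) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term-frequency vectors.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  for (const [term, weight] of a) dot += weight * (b.get(term) ?? 0);

  const norm = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));

  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}
```

Strictly speaking this sketch is plain term-frequency cosine; a full TF-IDF weighting would also multiply each term by its inverse document frequency computed at ingestion time.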
How the RAG Pipeline Works

Let's trace what happens when a user asks: "How do I detect a gas leak?"

RAG query flow: Browser → Server → Vector Store → Model → Streaming response

Step 1: Document Ingestion

Before any queries happen, npm run ingest reads every .md file from the docs/ folder, splits each into overlapping chunks (~200 tokens, 25-token overlap), computes a TF-IDF vector for each chunk, and stores everything in SQLite.

Chunking example:

```
docs/01-gas-leak-detection.md
  → Chunk 1: "Gas Leak Detection – Safety Warnings: Ensure all ignition..."
  → Chunk 2: "...sources are eliminated. Step-by-step: 1. Perform visual..."
  → Chunk 3: "...inspection of all joints. 2. Check calibration date..."
```

The overlap ensures no information falls between chunk boundaries — a critical detail in any RAG system.

Step 2: Query → Retrieval

When the user sends a question, the server converts it into a TF-IDF vector, compares it against every stored chunk using cosine similarity, and returns the top-K most relevant results. For 20 documents (~200 chunks), this executes in under 10ms.

```javascript
// src/vectorStore.js
/** Retrieve top-K most relevant chunks for a query. */
search(query, topK = 5) {
  const queryTf = termFrequency(query);
  const rows = this.db.prepare("SELECT * FROM chunks").all();
  const scored = rows.map((row) => {
    const chunkTf = new Map(JSON.parse(row.tf_json));
    const score = cosineSimilarity(queryTf, chunkTf);
    return { ...row, score };
  });
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, topK).filter((r) => r.score > 0);
}
```

Step 3: Prompt Construction

The retrieved chunks are injected into the prompt alongside system instructions:

```
System: You are an offline gas field support agent. Safety-first...
Context:
  [Chunk 1: Gas Leak Detection – Safety Warnings...]
  [Chunk 2: Gas Leak Detection – Step-by-step...]
  [Chunk 3: Purging Procedures – Related safety...]
User: How do I detect a gas leak?
```

Step 4: Generation + Streaming

The prompt is sent to Foundry Local via the OpenAI-compatible API. The response streams back token-by-token through Server-Sent Events (SSE) to the browser:

- Safety-first response with structured guidance
- Expandable sources with relevance scores

Foundry Local: Your Local AI Runtime

Foundry Local is what makes the "offline" part possible. It's a runtime from Microsoft that runs small language models (SLMs) on CPU or NPU — no GPU required. It exposes an OpenAI-compatible API and manages model downloads, caching, and lifecycle automatically.

The integration code is minimal — if you've used the OpenAI SDK before, this will feel instantly familiar:

```javascript
// src/chatEngine.js
import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";

// Start Foundry Local and load the model
const manager = new FoundryLocalManager();
const modelInfo = await manager.init("phi-3.5-mini");

// Use the standard OpenAI client — pointed at the local endpoint
const client = new OpenAI({
  baseURL: manager.endpoint,
  apiKey: manager.apiKey,
});

// Chat completions work exactly like the cloud API
const stream = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "How do I detect a gas leak?" }
  ],
  stream: true,
});
```

Portability matters: because Foundry Local uses the OpenAI API format, any code you write here can be ported to Azure OpenAI or OpenAI's cloud API with a single config change. You're not locked in.

Why TF-IDF Instead of Embeddings?
Most RAG tutorials use embedding models for retrieval. We chose TF-IDF for this project because:

- Fully offline — no embedding model to download or run
- Zero latency — vectorization is instantaneous (just math on word frequencies)
- Good enough — for a curated collection of 20 domain-specific documents, TF-IDF retrieves the right chunks reliably
- Transparent — you can inspect the vocabulary and weights, unlike neural embeddings

For larger collections (thousands of documents), or when semantic similarity matters more than keyword overlap, you'd swap in an embedding model. But for this use case, TF-IDF keeps the stack simple and dependency-free.

Mobile-Responsive Field UI

Field engineers use this app on phones and tablets — often wearing gloves. The UI is designed for harsh conditions with a dark, high-contrast theme, large touch targets (minimum 48px), and horizontally scrollable quick-action buttons. The entire frontend is a single index.html file — no React, no build step, no bundler. This keeps the project accessible and easy to deploy anywhere.

Runtime Document Upload

Users can upload new documents without restarting the server, with drag-and-drop upload and instant indexing. The upload endpoint receives markdown content, chunks it, computes TF-IDF vectors, and inserts the chunks into SQLite — all in memory, immediately available for retrieval.

Adapt This for Your Own Domain

This project is a scenario sample designed to be forked and customized. Here's the three-step process:

1. Replace the Documents

Delete the gas engineering docs in docs/ and add your own .md files with optional YAML front-matter:

```markdown
---
title: Troubleshooting Widget Errors
category: Support
id: KB-001
---

# Troubleshooting Widget Errors

...your content here...
```

2. Edit the System Prompt

Open src/prompts.js and rewrite the instructions for your domain:

```javascript
// src/prompts.js
export const SYSTEM_PROMPT = `You are an offline support agent for [YOUR DOMAIN].
Rules:
- Only answer using the retrieved context
- If the answer isn't in the context, say so
- Use structured responses: Summary → Details → Reference
`;
```

3. Tune the Retrieval

Adjust chunking and retrieval parameters in src/config.js:

```javascript
// src/config.js
export const config = {
  model: "phi-3.5-mini",
  chunkSize: 200,   // smaller = more precise, less context per chunk
  chunkOverlap: 25, // prevents info from falling between chunks
  topK: 3,          // chunks per query (more = richer context, slower)
};
```

Extending to Multi-Agent Architectures

Once you have a working RAG agent, the natural next step is multi-agent orchestration — where specialized agents collaborate to handle complex workflows.
With Foundry Local's OpenAI-compatible API, you can compose multiple agent roles on the same machine:

```javascript
// Multi-agent concept
// Each agent is just a different system prompt + RAG scope
const agents = {
  safety: { prompt: safetyPrompt, docs: "safety/*.md" },
  diagnosis: { prompt: diagnosisPrompt, docs: "faults/*.md" },
  procedure: { prompt: procedurePrompt, docs: "procedures/*.md" },
};

// Router determines which agent handles the query
function route(query) {
  if (query.match(/safety|warning|hazard/i)) return agents.safety;
  if (query.match(/fault|error|code/i)) return agents.diagnosis;
  return agents.procedure;
}

// Each agent uses the same Foundry Local model endpoint
const response = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [
    { role: "system", content: selectedAgent.prompt },
    { role: "system", content: `Context:\n${retrievedChunks}` },
    { role: "user", content: userQuery }
  ],
  stream: true,
});
```

This pattern lets you build specialized agent pipelines: a triage agent routes to the right specialist, each with its own document scope and system prompt, all running on the same local Foundry instance. For production multi-agent systems, explore Microsoft Foundry for cloud-scale orchestration when connectivity is available.

Local-first, cloud-ready: start with Foundry Local for development and offline scenarios. When your agents need cloud scale, swap to Azure AI Foundry with the same OpenAI-compatible API — your agent code stays the same.

Key Takeaways

1. RAG = Retrieve + Augment + Generate. Ground your AI in real documents — dramatically reducing hallucination and making answers traceable.
2. Foundry Local makes local AI accessible. OpenAI-compatible API running on CPU/NPU. No GPU required. No cloud dependency.
3. TF-IDF + SQLite is viable. For small-to-medium document collections, you don't need a dedicated vector database.
4. Same API, local or cloud. Build locally with Foundry Local, deploy with Azure OpenAI — zero code changes.

What's Next?

- Embedding-based retrieval — swap TF-IDF for a local embedding model for better semantic matching
- Conversation memory — persist chat history across sessions
- Multi-agent routing — specialized agents for safety, diagnostics, and procedures
- PWA packaging — make it installable as a standalone app on mobile devices
- Hybrid retrieval — combine keyword search with semantic embeddings for best results

Get the code

Clone the repo, swap in your own documents, and start building:

```
git clone https://github.com/leestott/local-rag.git
```

github.com/leestott/local-rag — MIT licensed, contributions welcome. Open source under the MIT License. Built with Foundry Local and Node.js.

Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration
Step 1: Deploying Models on Microsoft Foundry

Let us kick things off in the Azure portal. To get our OpenClaw agent thinking like a genius, we need to deploy our models in Microsoft Foundry. For this guide, we are going to focus on deploying gpt-5.2-codex on Microsoft Foundry with OpenClaw. Navigate to your AI Hub, head over to the model catalog, choose the model you wish to use with OpenClaw, and hit deploy. Once your deployment is successful, head to the endpoints section.

Important: Grab your Endpoint URL and your API Keys right now and save them in a secure note. We will need these exact values to connect OpenClaw in a few minutes.

Step 2: Installing and Initializing OpenClaw

Next up, we need to get OpenClaw running on your machine. Open up your terminal and run the official installation script:

```
curl -fsSL https://openclaw.ai/install.sh | bash
```

The wizard will walk you through a few prompts. Here is exactly how to answer them to link up with our Azure setup:

- First Page (Model Selection): Choose "Skip for now".
- Second Page (Provider): Select azure-openai-responses.
- Model Selection: Select gpt-5.2-codex. For now, only the models listed (hosted on Microsoft Foundry) in the picture below are available to be used with OpenClaw.
- Follow the rest of the standard prompts to finish the initial setup.

Step 3: Editing the OpenClaw Configuration File

Now for the fun part. We need to manually configure OpenClaw to talk to Microsoft Foundry. Open your configuration file located at ~/.openclaw/openclaw.json in your favorite text editor. Replace the contents of the models and agents sections with the following code block:

```json
{
  "models": {
    "providers": {
      "azure-openai-responses": {
        "baseUrl": "https://<YOUR_RESOURCE_NAME>.openai.azure.com/openai/v1",
        "apiKey": "<YOUR_AZURE_OPENAI_API_KEY>",
        "api": "openai-responses",
        "authHeader": false,
        "headers": {
          "api-key": "<YOUR_AZURE_OPENAI_API_KEY>"
        },
        "models": [
          {
            "id": "gpt-5.2-codex",
            "name": "GPT-5.2-Codex (Azure)",
            "reasoning": true,
            "input": ["text", "image"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 400000,
            "maxTokens": 16384,
            "compat": { "supportsStore": false }
          },
          {
            "id": "gpt-5.2",
            "name": "GPT-5.2 (Azure)",
            "reasoning": false,
            "input": ["text", "image"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 272000,
            "maxTokens": 16384,
            "compat": { "supportsStore": false }
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "azure-openai-responses/gpt-5.2-codex" },
      "models": { "azure-openai-responses/gpt-5.2-codex": {} },
      "workspace": "/home/<USERNAME>/.openclaw/workspace",
      "compaction": { "mode": "safeguard" },
      "maxConcurrent": 4,
      "subagents": { "maxConcurrent": 8 }
    }
  }
}
```

You will notice a few placeholders in that JSON. Here is exactly what you need to swap out:

| Placeholder Variable | What It Is | Where to Find It |
| --- | --- | --- |
| <YOUR_RESOURCE_NAME> | The unique name of your Azure OpenAI resource. | Found in your Azure Portal under the Azure OpenAI resource overview. |
| <YOUR_AZURE_OPENAI_API_KEY> | The secret key required to authenticate your requests. | Found in Microsoft Foundry under your project endpoints, or in the Azure Portal keys section. |
| <USERNAME> | Your local computer's user profile name. | Open your terminal and type whoami to find this. |

Step 4: Restart the Gateway

After saving the configuration file, you must restart the OpenClaw gateway for the new Foundry settings to take effect.
Run this simple command:

```
openclaw gateway restart
```

Configuration Notes & Deep Dive

If you are curious about why we configured the JSON that way, here is a quick breakdown of the technical details.

Authentication Differences

Azure OpenAI uses the api-key HTTP header for authentication. This is entirely different from the standard OpenAI Authorization: Bearer header. Our configuration file addresses this in two ways:

- Setting "authHeader": false completely disables the default Bearer header.
- Adding "headers": { "api-key": "<key>" } forces OpenClaw to send the API key via Azure's native header format.

Important Note: Your API key must appear in both the apiKey field AND the headers.api-key field within the JSON for this to work correctly.

The Base URL

Azure OpenAI's v1-compatible endpoint follows this specific format:

```
https://<your_resource_name>.openai.azure.com/openai/v1
```

The beautiful thing about this v1 endpoint is that it is largely compatible with the standard OpenAI API and does not require you to manually pass an api-version query parameter.

Model Compatibility Settings

- "compat": { "supportsStore": false } disables the store parameter, since Azure OpenAI does not currently support it.
- "reasoning": true enables the thinking mode for GPT-5.2-Codex. This supports low, medium, high, and xhigh levels.
- "reasoning": false is set for GPT-5.2 because it is a standard, non-reasoning model.

Model Specifications & Cost Tracking

If you want OpenClaw to accurately track your token usage costs, you can update the cost fields from 0 to the current Azure pricing. Here are the specs and costs for the models we just deployed:

Model Specifications

| Model | Context Window | Max Output Tokens | Image Input | Reasoning |
| --- | --- | --- | --- | --- |
| gpt-5.2-codex | 400,000 tokens | 16,384 tokens | Yes | Yes |
| gpt-5.2 | 272,000 tokens | 16,384 tokens | Yes | No |

Current Cost (Adjust in JSON)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input (per 1M tokens) |
| --- | --- | --- | --- |
| gpt-5.2-codex | $1.75 | $14.00 | $0.175 |
| gpt-5.2 | $2.00 | $8.00 | $0.50 |

Conclusion

And there you have it! You have successfully bridged the gap between the enterprise-grade infrastructure of Microsoft Foundry and the local autonomy of OpenClaw. By following these steps, you are not just running a chatbot; you are running a sophisticated agent capable of reasoning, coding, and executing tasks with the full power of GPT-5.2-Codex behind it.

The combination of Azure's reliability and OpenClaw's flexibility opens up a world of possibilities. Whether you are building an automated DevOps assistant, a research agent, or just exploring the bleeding edge of AI, you now have a robust foundation to build upon. Now it is time to let your agent loose on some real tasks. Go forth, experiment with different system prompts, and see what you can build. If you run into any interesting edge cases or come up with a unique configuration, let me know in the comments below. Happy coding!

Conditional Access for Agent Identities in Microsoft Entra
Conclusion

And there you have it! You have successfully bridged the gap between the enterprise-grade infrastructure of Microsoft Foundry and the local autonomy of OpenClaw. By following these steps, you are not just running a chatbot; you are running a sophisticated agent capable of reasoning, coding, and executing tasks with the full power of GPT-5.2-Codex behind it.

The combination of Azure's reliability and OpenClaw's flexibility opens up a world of possibilities. Whether you are building an automated DevOps assistant, a research agent, or just exploring the bleeding edge of AI, you now have a robust foundation to build on. Now it is time to let your agent loose on some real tasks. Go forth, experiment with different system prompts, and see what you can build. If you run into any interesting edge cases or come up with a unique configuration, let me know in the comments below. Happy coding!

Conditional Access for Agent Identities in Microsoft Entra

AI agents are rapidly becoming part of everyday enterprise operations: summarizing incidents, analyzing logs, orchestrating workflows, or even acting as digital colleagues. As organizations adopt these intelligent automations, securing them becomes just as important as securing human identities. Microsoft Entra introduces Agent Identities and extends Conditional Access to them, but with very limited controls compared to traditional users and workload identities. This post breaks down what Agent Identities are, how Conditional Access applies to them, and what the current limitations are.

What Exactly Are Agent Identities?

Microsoft Entra now supports a new identity type designed specifically for AI systems:

- Agent Identity – like an app/service principal, but specialized for AI
- Agent User – an identity that behaves more like a human user
- Agent Blueprint – a template used to create agent identities

This model exists because AI systems behave differently than humans or applications: they can act autonomously, operate continuously, and make decisions without user input. AI-driven automation must be governed, and that is where Conditional Access comes in.

Conditional Access for Agents, with Important Limitations

Today, Conditional Access for agent identities is purposely minimal. Microsoft clearly states that Conditional Access applies only when:

- An agent identity requests a token
- An agent user requests a token

It does not apply when:

- A blueprint acquires a token to create identities
- An agent performs an intermediate token exchange

What Controls Are Actually Available Today?

Supported today:

| Category | Supported? | Details |
| --- | --- | --- |
| Identity targeting | ✔ Yes | You can include/exclude agent identities and agent users |
| Block access | ✔ Yes | This is the only grant control currently available |
| Agent risk (preview) | ✔ Yes | Early-stage risk evaluation |
| Sign-in evaluation | ✔ Yes | Token acquisition is governed by CA |

Not supported today. These CA controls do not apply to agent identities:

- MFA
- Authentication strength
- Device compliance
- Approved client apps
- App protection policies
- Session controls
- User sign-in frequency
- Terms of Use
- Location conditions (network/device-based)
- Client apps (legacy/modern access)

Why? Because agents do not perform interactive authentication and do not use device signals or session context like humans. Their authentication is purely machine-driven.

How Conditional Access Works for Agents

When an agent identity (or agent user) requests a token, Microsoft Entra:

1. Identifies the requesting agent
2. Checks CA policy assignments
3. Evaluates any agent-risk conditions
4. Allows or blocks token issuance accordingly

That's it. No MFA prompt, no device check, no authentication strength evaluation. This makes CA for agents fundamentally different from CA for humans.

Why Is Conditional Access So Limited for Agents?

Two major reasons. First, agents cannot satisfy user-based controls. AI agents cannot:

- Perform MFA
- Use biometrics
- Run on compliant devices
- Follow session prompts

These are human-driven processes. Second, agents authenticate via secure credential flows. They use:

- Client credentials
- Federated identity credentials
- Token exchange flows

So CA is limited to identity-level allow/block and risk-based token decisions.
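To make the evaluation point concrete, here is roughly what the token request looks like for an agent built on the client credentials flow. This is a generic OAuth 2.0 sketch with placeholder values, not a documented agent-specific API; Conditional Access is evaluated when Entra processes this request, so a blocking policy surfaces as a failed token response rather than anything interactive.

```bash
# Client credentials token request, the kind of flow an agent identity uses.
# Conditional Access is evaluated at this point: a "block" policy on the
# agent identity causes this call to fail instead of returning a token.
curl "https://login.microsoftonline.com/<TENANT_ID>/oauth2/v2.0/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=<AGENT_CLIENT_ID>" \
  -d "client_secret=<AGENT_CLIENT_SECRET>" \
  -d "scope=https://graph.microsoft.com/.default"
```

Note that there is no interactive step anywhere in this exchange, which is exactly why MFA, device, and session controls cannot apply.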
Practical Use Cases (Given Today's Limitations)

Even with limited controls, CA for agents is still important.

Stop compromised agents from continuing to operate. If Microsoft Entra detects high agent risk, CA can block token issuance, which immediately halts the agent's ability to act.

Enforce separation of duties for AI agents. Even though you cannot apply MFA or authentication strength, you can separate agents into "allowed" and "blocked" groups and apply different CA rules per department or system.

Prevent AI sprawl. Large enterprises may generate hundreds of AI agents. CA gives central admins control: only approved, vetted agents can operate, and the rest are blocked at token-request time (a sketch of such a block policy follows below).
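What might a block policy look like in practice? Microsoft has not published a stable public schema for targeting agent identities in this preview, so treat the following as an illustrative sketch built on the standard Conditional Access policy endpoint in Microsoft Graph; the assumption that an agent identity can be referenced through includeUsers, and all object IDs, are placeholders rather than documented behavior.

```bash
# Sketch: block token issuance for a specific (hypothetical) agent identity,
# starting in report-only mode so you can observe the impact first.
curl -X POST "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies" \
  -H "Authorization: Bearer <ADMIN_GRAPH_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "displayName": "Block unapproved AI agents",
    "state": "enabledForReportingButNotEnforced",
    "conditions": {
      "clientAppTypes": ["all"],
      "applications": { "includeApplications": ["All"] },
      "users": { "includeUsers": ["<AGENT_OBJECT_ID>"] }
    },
    "grantControls": {
      "operator": "OR",
      "builtInControls": ["block"]
    }
  }'
```

Starting in report-only mode lets you confirm which token requests the policy would catch before switching state to "enabled".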
Why Agent Blueprints Cannot Be Governed by CA

Blueprints are templates, not active identities, and blueprint token flows are system-level operations rather than access attempts. Therefore:

- No CA evaluation
- No controls applied
- Not counted as agent activity

Only actual agent identities are governed by CA.

What the Future Might Include

Microsoft hints that the capabilities will expand:

- Agent risk scoring
- Agent behavior analytics
- More granular CA conditions for agents
- Additional grant controls
- Policy scoping at the task or capability level

But as of today, CA for agents remains intentionally constrained, allowing safe onboarding of the new identity type without accidental disruption.

Final Summary

Conditional Access for Agent Identities is currently a lightweight enforcement mechanism designed to block unauthorized or risky agents, not a full policy suite like the one we have for human users.

What it does:

- Controls whether an agent identity can acquire a token
- Allows blocking specific agents
- Implements early agent-risk logic
- Applies Zero Trust principles at the identity perimeter

What it does not do:

- Enforce MFA
- Enforce authentication strength
- Enforce device or location conditions
- Apply session controls
- Govern blueprints

As organizations adopt more autonomous agents, this foundational layer keeps AI identities visible and controllable, and sets the stage for richer governance in the future.

Unable to delete Foundry Agent identity Entra app in Azure

I'm trying to delete an Entra app in Azure that was created by a Foundry Agent identity blueprint, as it's currently unused and is triggering Entra ID hygiene alerts. However, I'm getting an error saying that delete is not supported. Is there another way to delete an unused Entra app for an agent identity blueprint?

Error detail: "Agent Blueprints are not supported on the API version used in this request."
Unable to publish Foundry agent to M365 Copilot or Teams

I'm encountering an issue while publishing an agent in Microsoft Foundry to M365 Copilot or Teams. After creating the agent and the Foundry resource, the process automatically created a Bot Service resource. However, I noticed that this resource has the same ID as the Application ID shown in the configuration. Is this expected behavior? If not, how should I resolve it?

I followed the steps in the official documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/publish-copilot?view=foundry

Despite this, I keep getting the following error:

"There was a problem submitting the agent. Response status code does not indicate success: 401 (Unauthorized). Status Code: 401"

Any guidance on what might be causing this and how to fix it would be greatly appreciated.