agent
7 TopicsBuilding a Multi-Agent System with Azure AI Agent Service: Campus Event Management
Personal Background My name is Peace Silly. I studied French and Spanish at the University of Oxford, where I developed a strong interest in how language is structured and interpreted. That curiosity about syntax and meaning eventually led me to computer science, which I came to see as another language built on logic and structure. In the academic year 2024–2025, I completed the MSc Computer Science at University College London, where I developed this project as part of my Master’s thesis. Project Introduction Can large-scale event management be handled through a simple chat interface? This was the question that guided my Master’s thesis project at UCL. As part of the Industry Exchange Network (IXN) and in collaboration with Microsoft, I set out to explore how conversational interfaces and autonomous AI agents could simplify one of the most underestimated coordination challenges in campus life: managing events across multiple departments, societies, and facilities. At large universities, event management is rarely straightforward. Rooms are shared between academic timetables, student societies, and one-off events. A single lecture theatre might host a departmental seminar in the morning, a society meeting in the afternoon, and a careers talk in the evening, each relying on different systems, staff, and communication chains. Double bookings, last-minute cancellations, and maintenance issues are common, and coordinating changes often means long email threads, manual spreadsheets, and frustrated users. These inefficiencies do more than waste time; they directly affect how a campus functions day to day. When venues are unavailable or notifications fail to reach the right people, even small scheduling errors can ripple across entire departments. A smarter, more adaptive approach was needed, one that could manage complex workflows autonomously while remaining intuitive and human for end users. The result was the Event Management Multi-Agent System, a cloud-based platform where staff and students can query events, book rooms, and reschedule activities simply by chatting. Behind the scenes, a network of Azure-powered AI agents collaborates to handle scheduling, communication, and maintenance in real time, working together to keep the campus running smoothly. The user scenario shown in the figure below exemplifies the vision that guided the development of this multi-agent system. Starting with Microsoft Learning Resources I began my journey with Microsoft’s tutorial Build Your First Agent with Azure AI Foundry which introduced the fundamentals of the Azure AI Agent Service and provided an ideal foundation for experimentation. Within a few weeks, using the Azure Foundry environment, I extended those foundations into a fully functional multi-agent system. Azure Foundry’s visual interface was an invaluable learning space. It allowed me to deploy, test, and adjust model parameters such as temperature, system prompts, and function calling while observing how each change influenced the agents’ reasoning and collaboration. Through these experiments, I developed a strong conceptual understanding of orchestration and coordination before moving to the command line for more complex development later. When development issues inevitably arose, I relied on the Discord support community and the GitHub forum for troubleshooting. These communities were instrumental in addressing configuration issues and providing practical examples, ensuring that each agent performed reliably within the shared-thread framework. This early engagement with Microsoft’s learning materials not only accelerated my technical progress but also shaped how I approached experimentation, debugging, and iteration. It transformed a steep learning curve into a structured, hands-on process that mirrored professional software development practice. A Decentralised Team of AI Agents The system’s intelligence is distributed across three specialised agents, powered by OpenAI’s GPT-4.1 models through Azure OpenAI Service. They each perform a distinct role within the event management workflow: Scheduling Agent – interprets natural language requests, checks room availability, and allocates suitable venues. Communications Agent – notifies stakeholders when events are booked, modified, or cancelled. Maintenance Agent – monitors room readiness, posts fault reports when venues become unavailable, and triggers rescheduling when needed. Each agent operates independently but communicates through a shared thread, a transparent message log that serves as the coordination backbone. This thread acts as a persistent state space where agents post updates, react to changes, and maintain a record of every decision. For example, when a maintenance fault is detected, the Maintenance Agent logs the issue, the Scheduling Agent identifies an alternative venue, and the Communications Agent automatically notifies attendees. These interactions happen autonomously, with each agent responding to the evolving context recorded in the shared thread. Interfaces and Backend The system was designed with both developer-focused and user-facing interfaces, supporting rapid iteration and intuitive interaction. The Terminal Interface Initially, the agents were deployed and tested through a terminal interface, which provided a controlled environment for debugging and verifying logic step by step. This setup allowed quick testing of individual agents and observation of their interactions within the shared thread. The Chat Interface As the project evolved, I introduced a lightweight chat interface to make the system accessible to staff and students. This interface allows users to book rooms, query events, and reschedule activities using plain language. Recognising that some users might still want to see what happens behind the scenes, I added an optional toggle that reveals the intermediate steps of agent reasoning. This transparency feature proved valuable for debugging and for more technical users who wanted to understand how the agents collaborated. When a user interacts with the chat interface, they are effectively communicating with the Scheduling Agent, which acts as the primary entry point. The Scheduling Agent interprets natural-language commands such as “Book the Engineering Auditorium for Friday at 2 PM” or “Reschedule the robotics demo to another room.” It then coordinates with the Maintenance and Communications Agents to complete the process. Behind the scenes, the chat interface connects to a FastAPI backend responsible for core logic and data access. A Flask + HTMX layer handles lightweight rendering and interactivity, while the Azure AI Agent Service manages orchestration and shared-thread coordination. This combination enables seamless agent communication and reliable task execution without exposing any of the underlying complexity to the end user. Automated Notifications and Fault Detection Once an event is scheduled, the Scheduling Agent posts the confirmation to the shared thread. The Communications Agent, which subscribes to thread updates, automatically sends notifications to all relevant stakeholders by email. This ensures that every participant stays informed without any manual follow-up. The Maintenance Agent runs routine availability checks. If a fault is detected, it logs the issue to the shared thread, prompting the Scheduling Agent to find an alternative room. The Communications Agent then notifies attendees of the change, ensuring minimal disruption to ongoing events. Testing and Evaluation The system underwent several layers of testing to validate both functional and non-functional requirements. Unit and Integration Tests Backend reliability was evaluated through unit and integration tests to ensure that room allocation, conflict detection, and database operations behaved as intended. Automated test scripts verified end-to-end workflows for event creation, modification, and cancellation across all agents. Integration results confirmed that the shared-thread orchestration functioned correctly, with all test cases passing consistently. However, coverage analysis revealed that approximately 60% of the codebase was tested, leaving some areas such as Azure service integration and error-handling paths outside automated validation. These trade-offs were deliberate, balancing test depth with project scope and the constraints of mocking live dependencies. Azure AI Evaluation While functional testing confirmed correctness, it did not capture the agents’ reasoning or language quality. To assess this, I used Azure AI Evaluation, which measures conversational performance across metrics such as relevance, coherence, fluency, and groundedness. The results showed high scores in relevance (4.33) and groundedness (4.67), confirming the agents’ ability to generate accurate and context-aware responses. However, slightly lower fluency scores and weaker performance in multi-turn tasks revealed a retrieval–execution gap typical in task-oriented dialogue systems. Limitations and Insights The evaluation also surfaced several key limitations: Synthetic data: All tests were conducted with simulated datasets rather than live campus systems, limiting generalisability. Scalability: A non-functional requirement in the form of horizontal scalability was not tested. The architecture supports scaling conceptually but requires validation under heavier load. Despite these constraints, the testing process confirmed that the system was both technically reliable and linguistically robust, capable of autonomous coordination under normal conditions. The results provided a realistic picture of what worked well and what future iterations should focus on improving. Impact and Future Work This project demonstrates how conversational AI and multi-agent orchestration can streamline real operational processes. By combining Azure AI Agent Services with modular design principles, the system automates scheduling, communication, and maintenance while keeping the user experience simple and intuitive. The architecture also establishes a foundation for future extensions: Predictive maintenance to anticipate venue faults before they occur. Microsoft Teams integration for seamless in-chat scheduling. Scalability testing and real-user trials to validate performance at institutional scale. Beyond its technical results, the project underscores the potential of multi-agent systems in real-world coordination tasks. It illustrates how modularity, transparency, and intelligent orchestration can make everyday workflows more efficient and human-centred. Acknowledgements What began with a simple Microsoft tutorial evolved into a working prototype that reimagines how campuses could manage their daily operations through conversation and collaboration. This was both a challenging and rewarding journey, and I am deeply grateful to Professor Graham Roberts (UCL) and Professor Lee Stott (Microsoft) for their guidance, feedback, and support throughout the project.931Views4likes1CommentIntegrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration
Step 1: Deploying Models on Microsoft Foundry Let us kick things off in the Azure portal. To get our OpenClaw agent thinking like a genius, we need to deploy our models in Microsoft Foundry. For this guide, we are going to focus on deploying gpt-5.2-codex on Microsoft Foundry with OpenClaw. Navigate to your AI Hub, head over to the model catalog, choose the model you wish to use with OpenClaw and hit deploy. Once your deployment is successful, head to the endpoints section. Important: Grab your Endpoint URL and your API Keys right now and save them in a secure note. We will need these exact values to connect OpenClaw in a few minutes. Step 2: Installing and Initializing OpenClaw Next up, we need to get OpenClaw running on your machine. Open up your terminal and run the official installation script: curl -fsSL https://openclaw.ai/install.sh | bash The wizard will walk you through a few prompts. Here is exactly how to answer them to link up with our Azure setup: First Page (Model Selection): Choose "Skip for now". Second Page (Provider): Select azure-openai-responses. Model Selection: Select gpt-5.2-codex , For now only the models listed (hosted on Microsoft Foundry) in the picture below are available to be used with OpenClaw. Follow the rest of the standard prompts to finish the initial setup. Step 3: Editing the OpenClaw Configuration File Now for the fun part. We need to manually configure OpenClaw to talk to Microsoft Foundry. Open your configuration file located at ~/.openclaw/openclaw.json in your favorite text editor. Replace the contents of the models and agents sections with the following code block: { "models": { "providers": { "azure-openai-responses": { "baseUrl": "https://<YOUR_RESOURCE_NAME>.openai.azure.com/openai/v1", "apiKey": "<YOUR_AZURE_OPENAI_API_KEY>", "api": "openai-responses", "authHeader": false, "headers": { "api-key": "<YOUR_AZURE_OPENAI_API_KEY>" }, "models": [ { "id": "gpt-5.2-codex", "name": "GPT-5.2-Codex (Azure)", "reasoning": true, "input": ["text", "image"], "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }, "contextWindow": 400000, "maxTokens": 16384, "compat": { "supportsStore": false } }, { "id": "gpt-5.2", "name": "GPT-5.2 (Azure)", "reasoning": false, "input": ["text", "image"], "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }, "contextWindow": 272000, "maxTokens": 16384, "compat": { "supportsStore": false } } ] } } }, "agents": { "defaults": { "model": { "primary": "azure-openai-responses/gpt-5.2-codex" }, "models": { "azure-openai-responses/gpt-5.2-codex": {} }, "workspace": "/home/<USERNAME>/.openclaw/workspace", "compaction": { "mode": "safeguard" }, "maxConcurrent": 4, "subagents": { "maxConcurrent": 8 } } } } You will notice a few placeholders in that JSON. Here is exactly what you need to swap out: Placeholder Variable What It Is Where to Find It <YOUR_RESOURCE_NAME> The unique name of your Azure OpenAI resource. Found in your Azure Portal under the Azure OpenAI resource overview. <YOUR_AZURE_OPENAI_API_KEY> The secret key required to authenticate your requests. Found in Microsoft Foundry under your project endpoints or Azure Portal keys section. <USERNAME> Your local computer's user profile name. Open your terminal and type whoami to find this. Step 4: Restart the Gateway After saving the configuration file, you must restart the OpenClaw gateway for the new Foundry settings to take effect. Run this simple command: openclaw gateway restart Configuration Notes & Deep Dive If you are curious about why we configured the JSON that way, here is a quick breakdown of the technical details. Authentication Differences Azure OpenAI uses the api-key HTTP header for authentication. This is entirely different from the standard OpenAI Authorization: Bearer header. Our configuration file addresses this in two ways: Setting "authHeader": false completely disables the default Bearer header. Adding "headers": { "api-key": "<key>" } forces OpenClaw to send the API key via Azure's native header format. Important Note: Your API key must appear in both the apiKey field AND the headers.api-key field within the JSON for this to work correctly. The Base URL Azure OpenAI's v1-compatible endpoint follows this specific format: https://<your_resource_name>.openai.azure.com/openai/v1 The beautiful thing about this v1 endpoint is that it is largely compatible with the standard OpenAI API and does not require you to manually pass an api-version query parameter. Model Compatibility Settings "compat": { "supportsStore": false } disables the store parameter since Azure OpenAI does not currently support it. "reasoning": true enables the thinking mode for GPT-5.2-Codex. This supports low, medium, high, and xhigh levels. "reasoning": false is set for GPT-5.2 because it is a standard, non-reasoning model. Model Specifications & Cost Tracking If you want OpenClaw to accurately track your token usage costs, you can update the cost fields from 0 to the current Azure pricing. Here are the specs and costs for the models we just deployed: Model Specifications Model Context Window Max Output Tokens Image Input Reasoning gpt-5.2-codex 400,000 tokens 16,384 tokens Yes Yes gpt-5.2 272,000 tokens 16,384 tokens Yes No Current Cost (Adjust in JSON) Model Input (per 1M tokens) Output (per 1M tokens) Cached Input (per 1M tokens) gpt-5.2-codex $1.75 $14.00 $0.175 gpt-5.2 $2.00 $8.00 $0.50 Conclusion: And there you have it! You have successfully bridged the gap between the enterprise-grade infrastructure of Microsoft Foundry and the local autonomy of OpenClaw. By following these steps, you are not just running a chatbot; you are running a sophisticated agent capable of reasoning, coding, and executing tasks with the full power of GPT-5.2-codex behind it. The combination of Azure's reliability and OpenClaw's flexibility opens up a world of possibilities. Whether you are building an automated devops assistant, a research agent, or just exploring the bleeding edge of AI, you now have a robust foundation to build upon. Now it is time to let your agent loose on some real tasks. Go forth, experiment with different system prompts, and see what you can build. If you run into any interesting edge cases or come up with a unique configuration, let me know in the comments below. Happy coding!11KViews2likes2CommentsBuild a Fully Offline RAG App with Foundry Local: No Cloud Required
A practical guide to building an on-device AI support agent using Retrieval-Augmented Generation, JavaScript, and Microsoft Foundry Local. The Problem: AI That Can't Go Offline Most AI-powered applications today are firmly tethered to the cloud. They assume stable internet, low-latency API calls, and the comfort of a managed endpoint. But what happens when your users are in an environment with zero connectivity a gas pipeline in a remote field, a factory floor, an underground facility? That's exactly the scenario that motivated this project: a fully offline RAG-powered support agent that runs entirely on a laptop. No cloud. No API keys. No outbound network calls. Just a local model, a local vector store, and domain-specific documents all accessible from a browser on any device. The Gas Field Support Agent - running entirely on-device What is RAG and Why Should You Care? Retrieval-Augmented Generation (RAG) is a pattern that makes language models genuinely useful for domain-specific tasks. Instead of hoping the model "knows" the answer from pre-training, you: Retrieve relevant chunks from your own documents Augment the model's prompt with those chunks as context Generate a response grounded in your actual data The result: fewer hallucinations, traceable answers, and an AI that works with your content. If you're building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want. Why fully offline? Data sovereignty, air-gapped environments, field operations, latency-sensitive workflows, and regulatory constraints all demand AI that doesn't phone home. Running everything locally gives you complete control over your data and eliminates any external dependency. The Tech Stack This project is deliberately simple — no frameworks, no build steps, no Docker: Layer Technology Why AI Model Foundry Local + Phi-3.5 Mini Runs locally, OpenAI-compatible API, no GPU needed Backend Node.js + Express Lightweight, fast, universally known Vector Store SQLite via better-sqlite3 Zero infrastructure, single file on disk Retrieval TF-IDF + cosine similarity No embedding model required, fully offline Frontend Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is just four npm packages: express , openai , foundry-local-sdk , and better-sqlite3 . Architecture Overview The system has five layers — all running on a single machine: Five-layer architecture: Client → Server → RAG Pipeline → Data → AI Model Client Layer — A single HTML file served by Express, with quick-action buttons and responsive chat Server Layer — Express.js handles API routes for chat (streaming + non-streaming), document upload, and health checks RAG Pipeline — The chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorization Data Layer — SQLite stores document chunks and their TF-IDF vectors; source docs live as .md files AI Layer — Foundry Local runs Phi-3.5 Mini Instruct on CPU/NPU, exposing an OpenAI-compatible API Getting Started in 5 Minutes You need two prerequisites: Node.js 20+ — nodejs.org Foundry Local — Microsoft's on-device AI runtime: Terminal winget install Microsoft.FoundryLocal Then clone, install, ingest, and run: git clone https://github.com/leestott/local-rag.git cd local-rag npm install npm run ingest # Index the 20 gas engineering documents npm start # Start the server + Foundry Local Open http://127.0.0.1:3000 and start chatting. Foundry Local auto-downloads Phi-3.5 Mini (~2 GB) on first run. How the RAG Pipeline Works Let's trace what happens when a user asks: "How do I detect a gas leak?" RAG query flow: Browser → Server → Vector Store → Model → Streaming response Step 1: Document Ingestion Before any queries happen, npm run ingest reads every .md file from the docs/ folder, splits each into overlapping chunks (~200 tokens, 25-token overlap), computes a TF-IDF vector for each chunk, and stores everything in SQLite. Chunking example docs/01-gas-leak-detection.md → Chunk 1: "Gas Leak Detection – Safety Warnings: Ensure all ignition..." → Chunk 2: "...sources are eliminated. Step-by-step: 1. Perform visual..." → Chunk 3: "...inspection of all joints. 2. Check calibration date..." The overlap ensures no information falls between chunk boundaries — a critical detail in any RAG system. Step 2: Query → Retrieval When the user sends a question, the server converts it into a TF-IDF vector, compares it against every stored chunk using cosine similarity, and returns the top-K most relevant results. For 20 documents (~200 chunks), this executes in under 10ms. src/vectorStore.js /** Retrieve top-K most relevant chunks for a query. */ search(query, topK = 5) { const queryTf = termFrequency(query); const rows = this.db.prepare("SELECT * FROM chunks").all(); const scored = rows.map((row) => { const chunkTf = new Map(JSON.parse(row.tf_json)); const score = cosineSimilarity(queryTf, chunkTf); return { ...row, score }; }); scored.sort((a, b) => b.score - a.score); return scored.slice(0, topK).filter((r) => r.score > 0); } Step 3: Prompt Construction The retrieved chunks are injected into the prompt alongside system instructions: Prompt structure System: You are an offline gas field support agent. Safety-first... Context: [Chunk 1: Gas Leak Detection – Safety Warnings...] [Chunk 2: Gas Leak Detection – Step-by-step...] [Chunk 3: Purging Procedures – Related safety...] User: How do I detect a gas leak? Step 4: Generation + Streaming The prompt is sent to Foundry Local via the OpenAI-compatible API. The response streams back token-by-token through Server-Sent Events (SSE) to the browser: Safety-first response with structured guidance Expandable sources with relevance scores Foundry Local: Your Local AI Runtime Foundry Local is what makes the "offline" part possible. It's a runtime from Microsoft that runs small language models (SLMs) on CPU or NPU — no GPU required. It exposes an OpenAI-compatible API and manages model downloads, caching, and lifecycle automatically. The integration code is minimal if you've used the OpenAI SDK before, this will feel instantly familiar: src/chatEngine.js import { FoundryLocalManager } from "foundry-local-sdk"; import { OpenAI } from "openai"; // Start Foundry Local and load the model const manager = new FoundryLocalManager(); const modelInfo = await manager.init("phi-3.5-mini"); // Use the standard OpenAI client — pointed at the local endpoint const client = new OpenAI({ baseURL: manager.endpoint, apiKey: manager.apiKey, }); // Chat completions work exactly like the cloud API const stream = await client.chat.completions.create({ model: modelInfo.id, messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ], stream: true, }); Portability matters Because Foundry Local uses the OpenAI API format, any code you write here can be ported to Azure OpenAI or OpenAI's cloud API with a single config change. You're not locked in. Why TF-IDF Instead of Embeddings? Most RAG tutorials use embedding models for retrieval. We chose TF-IDF for this project because: Fully offline — no embedding model to download or run Zero latency — vectorization is instantaneous (just math on word frequencies) Good enough — for a curated collection of 20 domain-specific documents, TF-IDF retrieves the right chunks reliably Transparent — you can inspect the vocabulary and weights, unlike neural embeddings For larger collections (thousands of documents) or when semantic similarity matters more than keyword overlap, you'd swap in an embedding model. But for this use case, TF-IDF keeps the stack simple and dependency-free. Mobile-Responsive Field UI Field engineers use this app on phones and tablets often wearing gloves. The UI is designed for harsh conditions with a dark, high-contrast theme, large touch targets (minimum 48px), and horizontally scrollable quick-action buttons. Desktop view Mobile view The entire frontend is a single index.html file — no React, no build step, no bundler. This keeps the project accessible and easy to deploy anywhere. Runtime Document Upload Users can upload new documents without restarting the server. The upload endpoint receives markdown content, chunks it, computes TF-IDF vectors, and inserts the chunks into SQLite — all in memory, immediately available for retrieval. Drag-and-drop document upload with instant indexing Adapt This for Your Own Domain This project is a scenario sample designed to be forked and customized. Here's the three-step process: 1. Replace the Documents Delete the gas engineering docs in docs/ and add your own .md files with optional YAML front-matter: docs/my-procedure.md --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the System Prompt Open src/prompts.js and rewrite the instructions for your domain: src/prompts.js export const SYSTEM_PROMPT = `You are an offline support agent for [YOUR DOMAIN]. Rules: - Only answer using the retrieved context - If the answer isn't in the context, say so - Use structured responses: Summary → Details → Reference `; 3. Tune the Retrieval Adjust chunking and retrieval parameters in src/config.js : src/config.js export const config = { model: "phi-3.5-mini", chunkSize: 200, // smaller = more precise, less context per chunk chunkOverlap: 25, // prevents info from falling between chunks topK: 3, // chunks per query (more = richer context, slower) }; Extending to Multi-Agent Architectures Once you have a working RAG agent, the natural next step is multi-agent orchestration where specialized agents collaborate to handle complex workflows. With Foundry Local's OpenAI-compatible API, you can compose multiple agent roles on the same machine: Multi-agent concept // Each agent is just a different system prompt + RAG scope const agents = { safety: { prompt: safetyPrompt, docs: "safety/*.md" }, diagnosis: { prompt: diagnosisPrompt, docs: "faults/*.md" }, procedure: { prompt: procedurePrompt, docs: "procedures/*.md" }, }; // Router determines which agent handles the query function route(query) { if (query.match(/safety|warning|hazard/i)) return agents.safety; if (query.match(/fault|error|code/i)) return agents.diagnosis; return agents.procedure; } // Each agent uses the same Foundry Local model endpoint const response = await client.chat.completions.create({ model: modelInfo.id, messages: [ { role: "system", content: selectedAgent.prompt }, { role: "system", content: `Context:\n${retrievedChunks}` }, { role: "user", content: userQuery } ], stream: true, }); This pattern lets you build specialized agent pipelines a triage agent routes to the right specialist, each with its own document scope and system prompt, all running on the same local Foundry instance. For production multi-agent systems, explore Microsoft Foundry for cloud-scale orchestration when connectivity is available. Local-first, cloud-ready Start with Foundry Local for development and offline scenarios. When your agents need cloud scale, swap to Azure AI Foundry with the same OpenAI-compatible API your agent code stays the same. Key Takeaways 1 RAG = Retrieve + Augment + Generate Ground your AI in real documents — dramatically reducing hallucination and making answers traceable. 2 Foundry Local makes local AI accessible OpenAI-compatible API running on CPU/NPU. No GPU required. No cloud dependency. 3 TF-IDF + SQLite is viable For small-to-medium document collections, you don't need a dedicated vector database. 4 Same API, local or cloud Build locally with Foundry Local, deploy with Azure OpenAI — zero code changes. What's Next? Embedding-based retrieval — swap TF-IDF for a local embedding model for better semantic matching Conversation memory — persist chat history across sessions Multi-agent routing — specialized agents for safety, diagnostics, and procedures PWA packaging — make it installable as a standalone app on mobile devices Hybrid retrieval — combine keyword search with semantic embeddings for best results Get the code Clone the repo, swap in your own documents, and start building: git clone https://github.com/leestott/local-rag.git github.com/leestott/local-rag — MIT licensed, contributions welcome. Open source under the MIT License. Built with Foundry Local and Node.js.1.3KViews1like0CommentsBuilding an Auditable Security Layer for Agentic AI
Most agent failures do not look like breaches. They look like a normal chat, a normal answer, and a normal tool call. Until the next morning, when a single question collapses the whole story: who authorized that action. You think you deployed an agent. In reality, you deployed an unbounded automation pipeline that happens to speak English. I’m Hazem Ali — Microsoft AI MVP, Distinguished AI & ML Architect, Founder & CEO at Skytells. For over 20 years, I’ve built secure, scalable enterprise AI across cloud and edge, with a focus on agent security and sovereign, governed AI architectures. My work on these systems is widely referenced by practitioners across multiple regions. Hazem Ali honored to receive an official speaker invitation under the patronage of H.H. Sheikh Dr. Sultan bin Muhammad Al Qasimi, Member of the UAE Supreme Council and Ruler of Sharjah, to speak at the Sharjah International Conference on Linguistic Intelligence (SICLI), organized by the American University of Sharjah (AUS) and the Emirates Scholar Center for Research and Studies. This piece is a collaboration with Hammad Atta a Practice Lead – AI Security & Cloud Strategy and Dr. Yasir Mehmood , Dr Muhammad Zeeshan Baig, Dr. Muhammad Aatif, Dr. MUHAMMAD AZIZ UL HAQ. We align on one core idea: agent security is not about making the model behave. It is about building enforceable boundaries around the model and proving every privileged step. This article is meant to sit next to my earlier Tech Community piece, Zero-Trust Agent Architecture: How To Actually Secure Your Agents, and go one level deeper into the mechanics you can implement on Azure today. Let me break it down. The Principle: The model is not your boundary Let me break it down in the way I’d explain it in a design review. A boundary is something that still holds when the component on the other side is adversarial, confused, or simply wrong. An LLM is none of those reliably. In an agent, the model is not just a generator. It becomes a planner and scheduler. It decides when to retrieve, which tool to call, how to shape arguments, and when to loop. That means your real attack surface is not “bad output.” It is the control-flow graph the model is allowed to traverse. So if your “security” lives inside the prompt, you are putting policy in the same token stream the attacker can influence. That is not a boundary. That is a suggestion. The only stable design is to treat the model like an untrusted proposer and the runtime like the verifier. Here is the chain I use. Each gate is external to the model and survives manipulation. Context Gate: Everything that enters the model is treated as executable influence, not “text.” Capability Gate: Tools are invoked as constrained capabilities, not free-form function calls. Evidence Gate: Every privileged step produces a verifiable artifact, not a story. Retrieval Control Plane: What the agent can see is governed by labels and identity, not prompt etiquette. Detection Layer: Drift and probing become alerts, not surprises. Now the rare part, the part most people miss: the boundary is not “block or allow.” The boundary is stateful. Once the runtime sees a suspicious signal, the entire session must transition into a degraded capability state, and every downstream gate must enforce that state. 1. Treat context as executable influence, and preserve provenance If you do RAG, your documents are not “supporting info.” They are an input channel. That makes the biggest prompt-injection risk not the user. It is your documents. Microsoft’s Prompt Shields covers user prompt attacks (scanned at the user input intervention point) and document attacks (scanned at the user input and tool response intervention points). When enabled, each request returns annotation results with detected and filtered values that your runtime can translate into a policy decision: block, degrade, or allow. Provenance Collapse. Most teams concatenate prompt + policy + retrieved chunks into one blob. The moment you do that, you lose the one thing you need for a defensible boundary: you can no longer reliably tell which tokens came from where. That is how “context” becomes “authority.” For indirect/document attacks, Microsoft guidance recommends delimiting context documents inside the prompt using """<documents> ... </documents>""" to improve indirect attack detection. That delimiter is not formatting. It is a provenance marker that improves indirect attack detection through Prompt Shields. Minimal, practical pattern: // Provenance-preserving prompt construction for indirect/document attack detection function buildPrompt(system: string, user: string, retrievedDocs: string[]): string { const docs = retrievedDocs.map((d) => `- ${d}`).join("\n"); return [ system, "", `User: ${user}`, "", `""" <documents>\n${docs}\n</documents> """`, ].join("\n"); } Then treat Prompt Shields output as a session security event, not a banner: type RiskState = "NORMAL" | "SUSPECT" | "BLOCK"; type FilterPolicy = "BLOCK_ON_FILTERED" | "DEGRADE_ON_FILTERED"; function computeRiskState( shields: { detected: boolean; filtered?: boolean }, labels: string[], policy: FilterPolicy = "DEGRADE_ON_FILTERED", ): RiskState { // detected => hard stop if (shields.detected) return "BLOCK"; // filtered is an annotation signal: block or degrade by policy if (shields.filtered) { return policy === "BLOCK_ON_FILTERED" ? "BLOCK" : "SUSPECT"; } // example: sensitivity-based degradation independent of shield hits const sensitive = labels.some((l) => ["Confidential", "HighlyConfidential", "Regulated"].includes(l), ); return sensitive ? "SUSPECT" : "NORMAL"; } When the signal is clear, you block and log. When it is suspicious, you do not warn. You downgrade authority. QSAF Alignment: Prompt Injection Protection (Domain 1): QSAF-PI-001 (static pattern blacklist), QSAF-PI-002 (dynamic LLM analysis), QSAF-PI-003 (semantic embedding comparison) All addressed by Prompt Shields and provenance marking. Context Manipulation (Domain 2): QSAF-RC-004 (context drift), QSAF-RC-007 (nested prompt injection) – mitigated by stateful risk calculation. 2. Tools are capabilities with constraints, not functions When the model proposes a tool call, your runtime should re-derive what is allowed from identity plus risk state, then enforce it at the gateway. type ToolRequest = { tool: string; args: unknown; }; type Capabilities = { allowWrite: boolean; allowedTools: Set<string>; }; function deriveCapabilities(risk: RiskState, roles: string[]): Capabilities { const baseAllowed = new Set(["search_kb", "get_profile", "summarize"]); const isAdmin = roles.includes("Admin"); if (risk === "SUSPECT") { return { allowWrite: false, allowedTools: baseAllowed }; } if (risk === "BLOCK") { return { allowWrite: false, allowedTools: new Set() }; } // NORMAL const tools = new Set([ ...baseAllowed, ...(isAdmin ? ["update_record", "issue_refund"] : []), ]); return { allowWrite: isAdmin, allowedTools: tools }; } function authorizeTool(req: ToolRequest, caps: Capabilities): void { if (!caps.allowedTools.has(req.tool)) throw new Error("ToolNotAllowed"); if (!caps.allowWrite && req.tool.startsWith("update_")) { throw new Error("WriteDenied"); } } The model can ask. It cannot grant itself permission. QSAF Alignment: Plugin Abuse Monitoring (Domain 3): QSAF-PL-001 (whitelist enforcement), QSAF-PL-003 (restrict sensitive plugins), QSAF-PL-006 (rate‑limiting) – implemented via capability derivation and gateway policies. Behavioral Anomaly Detection (Domain 5): QSAF-BA-006 (plugin execution pattern deviance) – detected by comparing actual calls against derived capabilities. The Integrity Gate: Hash-chain the authority, not the output Let me add the part that makes investigations clean. Most teams treat integrity like an audit log problem. That is not enough. Logs explain. Integrity proves. The hard truth is that agent authority is assembled out of pieces: the system instruction, the user prompt, retrieved chunks, risk annotations, and finally the tool intent. If you do not bind those pieces together cryptographically, an incident review becomes a story-telling session. This is why QSAF has an entire domain for payload integrity and signing, including prompt hash signing, nonce or replay protection, and a hash chain lineage that tracks how a session evolved. Here is how you can map that into the runtime verifies. You build a canonical “authority envelope” for every privileged hop, compute a digest, and then: link it to the previous hop (hash chain) include a nonce (replay control) sign the digest with Azure Key Vault (Key Vault signs digests, it does not hash your content for you) import crypto from "crypto"; type AuthorityEnvelope = { sessionId: string; turnId: number; policyVersion: string; // provenance-preserved components systemHash: string; userHash: string; documentsHash: string; // hash of structured retrieved chunks (not just rendered text) shields: { detected: boolean; filtered: boolean; }; riskState: "NORMAL" | "SUSPECT" | "BLOCK"; // proposed action (if any) tool?: { name: string; argsHash: string; }; // anti-replay + lineage nonce: string; prevDigest?: string; ts: string; }; function sha256(bytes: string): string { return crypto.createHash("sha256").update(bytes).digest("hex"); } // Canonicalization matters. JSON.stringify is OK if you control key order. // For cross-language, use RFC 8785 (JCS) canonical JSON. function canonicalJson(x: unknown): string { return JSON.stringify(x); } function buildEnvelope( input: Omit<AuthorityEnvelope, "nonce" | "ts">, ): AuthorityEnvelope { return { ...input, nonce: crypto.randomUUID(), ts: new Date().toISOString(), }; } function digestEnvelope(env: AuthorityEnvelope): string { return sha256(canonicalJson(env)); } Then you call Key Vault to sign that digest (REST sign), and optionally verify later (REST verify). The rare failure mode this blocks is subtle: authority splicing. Without a hash chain, it is possible for the runtime to correctly validate a tool call, but later be unable to prove which retrieved chunk, which Prompt Shields result, and which policy version were in force when that call was authorized. With the chain, every privileged hop becomes tamper-evident. This is the point: Prompt Shields tells you “this looks dangerous.” Document delimiters preserve provenance. The integrity gate makes the runtime able to say, later, with evidence: “This is exactly what I accepted as authority.” QSAF Alignment: Payload Integrity & Signing (Domain 6): QSAF-PY-001 (prompt hash signing), QSAF-PY-005 (nonce/replay control), QSAF-PY-006 (hash chain lineage) – directly implemented via the envelope and chaining. Tools must sit behind a wall that can say “no” Tool calls are where language becomes authority. If an agent can call APIs that mutate state, your security story is not about the response text. It is about whether the tool call is allowed under explicit policy. This is exactly where Azure API Management belongs: as the tool gateway that enforces authentication and authorization before any tool request reaches your backend. The validate-jwt policy is the canonical enforcement mechanism for validating JWTs at the gateway. The design goal is simple: The model can request a tool call. The gateway decides if it is permitted. A capability token approach keeps it clean: <!-- APIM inbound policy sketch --> <validate-jwt header-name="Authorization" failed-validation-httpcode="401"> <required-claims> <claim name="scp"> <value>tools.read</value> </claim> </required-claims> </validate-jwt> The claim name (scp, roles, or custom claims) depends on your token issuer; the point is enforcing authorization at the gateway, not inside model text. Now you can enforce “read-only mode” by issuing tokens that simply do not carry write scopes. The model can try to call a write tool. It still gets denied by policy. Evidence is not logs. Evidence is a signed chain. Logs help you debug. Evidence helps you prove. So you hash the session envelope and the tool intent, then sign the digest using Azure Key Vault Keys. Key Vault sign creates a signature from a digest, and verify verifies a signature against a digest. Key Vault does not hash your content for you. Hash locally, then sign the digest.), and Key Vault documentation is explicit that signing is sign-hash, not “sign arbitrary content.” You hash locally, then ask Key Vault to sign the hash. import crypto from "crypto"; const sha256 = (x: unknown): string => crypto.createHash("sha256").update(JSON.stringify(x)).digest("hex"); type IntentEnvelope = { sessionId: string; userId: string; promptHash: string; documentsHash: string; tool: string; argsHash: string; nonce: string; ts: string; policyVersion: string; }; function buildIntent( sessionId: string, userId: string, prompt: string, docs: unknown, tool: string, args: unknown, policyVersion: string, ): IntentEnvelope { return { sessionId, userId, promptHash: sha256(prompt), documentsHash: sha256(docs), tool, argsHash: sha256(args), nonce: crypto.randomUUID(), ts: new Date().toISOString(), policyVersion, }; } Once you do this, your system stops “explaining.” It starts proving. Govern what the agent can see, not only what it can say RAG without governance eventually becomes a data exposure feature. This is why I treat retrieval as a governed operation. Microsoft Purview sensitivity labels give you a practical way to classify content and build retrieval rules on top of that classification. Microsoft documents creating and configuring sensitivity labels in Purview. The pattern is simple: Label the corpus. Filter retrieval by label and identity policy. Log label distribution per completion. Alert when a low-privilege identity retrieves high-sensitivity labels. This is how you keep sovereignty real. Not in a slide deck. In the retrieval path. Operate it like a security system: posture and detection Inline gates reduce risk. They do not eliminate it. Systems drift. People add tools. Policies get loosened. Attacks evolve. Microsoft Defender for Cloud’s Defender CSPM plan includes AI security posture management for generative AI apps and AI agents (Preview), including discovery/inventory of AI agents deployed with Azure AI Foundry. Then you use Microsoft Sentinel to turn your telemetry into incidents, with scheduled analytics rules. Your detections should match the gates you built: Repeated Prompt Shields detections from the same identity or session. Tool-call spikes after a suspicious document signal. APIM denials for write endpoints from sessions in read-only mode. High-sensitivity label retrieval by identities that should never touch that tier. QSAF Alignment: Behavioral Anomaly Detection (Domain 5): QSAF-BA-001 (session entropy), QSAF-BA-004 (repeated intent mutation), QSAF-BA-007 (unified risk score) – detected via Sentinel rules. Cross‑Environment Defense (Domain 9): QSAF-CE-006 (coordinated alert response) – using Sentinel incidents and playbooks. Where the reference checklist fits, quietly Behind the scenes, we use a control checklist lens to ensure we cover prompt/context attacks, tool misuse, integrity, governance, and operational monitoring. The point is not to rename Microsoft features into framework terms. The point is to make the system enforceable and auditable using Azure-native gates. Closing Zero trust for agents is not a slogan. It is a build. Prompt Shields gives you a front gate for both user prompt attacks and document attacks, with clear annotations like detected and filtered. API Management gives you a tool boundary that can say “no” regardless of what the model tries, using validate-jwt. Signed intent gives you evidence, using Key Vault’s sign-hash semantics. Purview labels give you governed retrieval. Sentinel and Defender give you an operating model, not wishful thinking. If you want the conceptual spine and the architectural principles that frame this pipeline, start with my earlier Tech Community pieces, then come back here and implement the gates. Thanks for reading — Hazem Ali395Views1like0CommentsRunning Phi-4 Locally with Microsoft Foundry Local: A Step-by-Step Guide
In our previous post, we explored how Phi-4 represents a new frontier in AI efficiency that delivers performance comparable to models 5x its size while being small enough to run on your laptop. Today, we're taking the next step: getting Phi-4 up and running locally on your machine using Microsoft Foundry Local. Whether you're a developer building AI-powered applications, an educator exploring AI capabilities, or simply curious about running state-of-the-art models without relying on cloud APIs, this guide will walk you through the entire process. Microsoft Foundry Local brings the power of Azure AI Foundry to your local device without requiring an Azure subscription, making local AI development more accessible than ever. So why do you want to run Phi-4 Locally? Before we dive into the setup, let's quickly recap why running models locally matters: Privacy and Control: Your data never leaves your machine. This is crucial for sensitive applications in healthcare, finance, or education where data privacy is paramount. Cost Efficiency: No API costs, no rate limits. Once you have the model downloaded, inference is completely free. Speed and Reliability: No network latency or dependency on external services. Your AI applications work even when you're offline. Learning and Experimentation: Full control over model parameters, prompts, and fine-tuning opportunities without restrictions. With Phi-4's compact size, these benefits are now accessible to anyone with a modern laptop—no expensive GPU required. What You'll Need Before we begin, make sure you have: Operating System: Windows 10/11, macOS (Intel or Apple Silicon), or Linux RAM: Minimum 16GB (32GB recommended for optimal performance) Storage: At least 5 - 10GB of free disk space Processor: Any modern CPU (GPU optional but provides faster inference) Note: Phi-4 works remarkably well even on consumer hardware 😀. Step 1: Installing Microsoft Foundry Local Microsoft Foundry Local is designed to make running AI models locally as simple as possible. It handles model downloads, manages memory efficiently, provides OpenAI-compatible APIs, and automatically optimizes for your hardware. For Windows Users: Open PowerShell or Command Prompt and run: winget install Microsoft.FoundryLocal For macOS Users (Apple Silicon): Open Terminal and run: brew install microsoft/foundrylocal/foundrylocal Verify Installation: Open your terminal and type. This should return the Microsoft Foundry Local version, confirming installation: foundry --version Step 2: Downloading Phi-4-Mini For this tutorial, we'll use Phi-4-mini, the lightweight 3.8 billion parameter version that's perfect for learning and experimentation. Open your terminal and run: foundry model run phi-4-mini You should see your download begin and something similar to the image below Available Phi Models on Foundry Local While we're using phi-4-mini for this guide, Foundry Local offers several Phi model variants and other open-source models optimized for different hardware and use cases: Model Hardware Type Size Best For phi-4-mini GPU chat-completion 3.72 GB Learning, fast responses, resource-constrained environments with GPU phi-4-mini CPU chat-completion 4.80 GB Learning, fast responses, CPU-only systems phi-4-mini-reasoning GPU chat-completion 3.15 GB Reasoning tasks with GPU acceleration phi-4-mini-reasoning CPU chat-completion 4.52 GB Mathematical proofs, logic puzzles with lower resource requirements phi-4 GPU chat-completion 8.37 GB Maximum reasoning performance, complex tasks with GPU phi-4 CPU chat-completion 10.16 GB Maximum reasoning performance, CPU-only systems phi-3.5-mini GPU chat-completion 2.16 GB Most lightweight option with GPU support phi-3.5-mini CPU chat-completion 2.53 GB Most lightweight option, CPU-optimized phi-3-mini-128k GPU chat-completion 2.13 GB Extended context (128k tokens), GPU-optimized phi-3-mini-128k CPU chat-completion 2.54 GB Extended context (128k tokens), CPU-optimized phi-3-mini-4k GPU chat-completion 2.13 GB Standard context (4k tokens), GPU-optimized phi-3-mini-4k CPU chat-completion 2.53 GB Standard context (4k tokens), CPU-optimized Note: Foundry Local automatically selects the best variant for your hardware. If you have an NVIDIA GPU, it will use the GPU-optimized version. Otherwise, it will use the CPU-optimized version. run the command below to see full list of models foundry model list Step 3: Test It Out Once the download completes, an interactive session will begin. Let's test Phi-4-mini's capabilities with a few different prompts: Example 1: Explanation Phi-4-mini provides a thorough, well-structured explanation! It starts with the basic definition, explains the process in biological systems, gives real-world examples (plant cells, human blood cells). The response is detailed yet accessible. Example 2: Mathematical Problem Solving Excellent step-by-step solution! Phi-4-mini breaks down the problem methodically: 1. Distributes on the left side 2. Isolates the variable terms 3. Simplifies progressively 4. Arrives at the final answer: x = 11 The model shows its work clearly, making it easy to follow the logic and ideal for educational purposes Example 3: Code Generation The model provides a concise Python function using string slicing ([::-1]) - the most Pythonic approach to reversing a string. It includes clear documentation with a docstring explaining the function's purpose, provides example usage demonstrating the output, and even explains how the slicing notation works under the hood. The response shows that the model understands not just how to write the code, but why this approach is preferred - noting that the [::-1] slice notation means "start at the end of the string and end at position 0, move with the step -1, negative one, which means one step backwards." This showcases the model's ability to generate production-ready code with proper documentation while being educational about Python idioms. To exit the interactive session, type `/bye` Step 4: Extending Phi-4 with Real-Time Tools Understanding Phi-4's Knowledge Cutoff Like all language models, Phi-4 has a knowledge cutoff date from its training data (typically several months old). This means it won't know about very recent events, current prices, or breaking news. For example, if you ask "Who won the 2024 NBA championship?" it might not have the answer. The good thing is, there's a powerful work-around. While Phi-4 is incredibly capable, connecting it to external tools like web search, databases, or APIs transforms it from a static knowledge base into a dynamic reasoning engine. This is where Microsoft Foundry's REST API comes in. Microsoft Foundry provides a simple API that lets you integrate Phi-4 into Python applications and connect it to real-time data sources. Here's a practical example: building a web-enhanced AI assistant. Web-Enhanced AI Assistant This simple application combines Phi-4's reasoning with real-time web search, allowing it to answer current questions accurately. Prerequisites: pip install foundry-local-sdk requests ddgs Create phi4_web_assistant.py: import requests from foundry_local import FoundryLocalManager from ddgs import DDGS import json def search_web(query): """Search the web and return top results""" try: results = list(DDGS().text(query, max_results=3)) if not results: return "No search results found." search_summary = "\n\n".join([ f"[Source {i+1}] {r['title']}\n{r['body'][:500]}" for i, r in enumerate(results) ]) return search_summary except Exception as e: return f"Search failed: {e}" def ask_phi4(endpoint, model_id, prompt): """Send a prompt to Phi-4 and stream response""" response = requests.post( f"{endpoint}/chat/completions", json={ "model": model_id, "messages": [{"role": "user", "content": prompt}], "stream": True }, stream=True, timeout=180 ) full_response = "" for line in response.iter_lines(): if line: line_text = line.decode('utf-8') if line_text.startswith('data: '): line_text = line_text[6:] # Remove 'data: ' prefix if line_text.strip() == '[DONE]': break try: data = json.loads(line_text) if 'choices' in data and len(data['choices']) > 0: delta = data['choices'][0].get('delta', {}) if 'content' in delta: chunk = delta['content'] print(chunk, end="", flush=True) full_response += chunk except json.JSONDecodeError: continue print() return full_response def web_enhanced_query(question): """Combine web search with Phi-4 reasoning""" # By using an alias, the most suitable model will be downloaded # to your device automatically alias = "phi-4-mini" # Create a FoundryLocalManager instance. This will start the Foundry # Local service if it is not already running and load the specified model. manager = FoundryLocalManager(alias) model_info = manager.get_model_info(alias) print("🔍 Searching the web...\n") search_results = search_web(question) prompt = f"""Here are recent search results: {search_results} Question: {question} Using only the information above, give a clear answer with specific details.""" print("🤖 Phi-4 Answer:\n") return ask_phi4(manager.endpoint, model_info.id, prompt) if __name__ == "__main__": # Try different questions question = "Who won the 2024 NBA championship?" # question = "What is the latest iPhone model released in 2024?" # question = "What is the current price of Bitcoin?" print(f"Question: {question}\n") print("=" * 60 + "\n") web_enhanced_query(question) print("\n" + "=" * 60) Run It: python phi4_web_assistant.py What Makes This Powerful By connecting Phi-4 to external tools, you create an intelligent system that: Accesses Real-Time Information: Get news, weather, sports scores, and breaking developments Verifies Facts: Cross-reference information with multiple sources Extends Capabilities: Connect to databases, APIs, file systems, or any other tool Enables Complex Applications: Build research assistants, customer support bots, educational tutors, and personal assistants This same pattern can be applied to connect Phi-4 to: Databases: Query your company's internal data APIs: Weather services, stock prices, translation services File Systems: Analyze documents and spreadsheets IoT Devices: Control smart home systems The possibilities are endless when you combine local AI reasoning with real-world data access. Troubleshooting Common Issues Service not running: Make sure Foundry Local is properly installed and the service is running. Try restarting with foundry --version to verify installation. Model downloads slowly: Check your internet connection and ensure you have enough disk space (5-10GB per model). Out of memory: Close other applications or try using a smaller model variant like phi-3.5-mini instead of the full phi-4. Connection issues: Verify that no other services are using the same ports. Foundry Local typically runs on http://localhost:5272. Model not found: Run foundry model list to see available models, then use foundry model run <model-name> to download and run a specific model. Your Next Steps with Foundry Local Congratulations! You now have Phi-4 running locally through Microsoft Foundry Local and understand how to extend it with external tools like web search. This combination of local AI reasoning with real-time data access opens up countless possibilities for building intelligent applications. Coming in Future Posts In the coming weeks, we'll explore advanced topics using Hugging Face: Fine-tuning Phi models on your own data for domain-specific applications Phi-4-multimodal: Analyze images, process audio, and combine multiple data types Advanced deployment patterns: RAG systems and multi-agent orchestration Resources to Explore EdgeAI for Beginners Course: Comprehensive 36-45 hour course covering Edge AI fundamentals, optimization, and production deployment Phi-4 Technical Report: Deep dive into architecture and benchmarks Phi Cookbook on GitHub: Practical examples and recipes Foundry Local Documentation: Complete technical documentation and API reference Module 08: Foundry Local Toolkit: 10 comprehensive samples including RAG applications and multi-agent systems Keep experimenting with Foundry Local, and stay tuned as we unlock the full potential of Edge AI! What will you build with Phi-4? Share your ideas and projects in the comments below!3.9KViews1like1Comment