rag
87 TopicsWhy Data Platforms Must Become Intelligence Platforms for AI Agents to Work
The promise and the gap Your organization has invested in an AI agent. You ask it: "Prepare a summary of Q3 revenue by region, including year-over-year trends and top product lines." The agent finds revenue numbers in a SQL warehouse, product metadata in Dataverse, regional mappings in SharePoint, historical data in Azure Blob Storage, and organizational context in Microsoft Graph. Five data sources. Five schemas. No shared definitions. The result? The agent hallucinates, returns incomplete data, or asks a dozen clarifying questions that defeat its purpose. This isn't a model limitation — modern AI models are highly capable. The real constraint is that enterprise data is not structured for reasoning. Traditional data platforms were built for humans to query. Intelligence platforms must be built for agents to _reason_ over. That distinction is the subject of this post. What you'll understand Why fragmented enterprise data blocks effective AI agents What distinguishes a storage platform from an intelligence platform How Microsoft Fabric and Azure AI Foundry work together to enable trustworthy, agent-ready data access The enterprise pain: Fragmented data breaks AI agents Enterprise data is spread across relational databases, data lakes, business applications, collaboration platforms, third-party APIs, and Microsoft Graph — each with its own schema and security model. Humans navigate this fragmentation through institutional knowledge and years of muscle memory. A seasoned analyst knows that "revenue" in the data warehouse means net revenue after returns, while "revenue" in the CRM means gross bookings. An AI agent does not. The cost of this fragmentation isn't hypothetical. Each new AI agent deployment can trigger another round of bespoke data preparation — custom integrations and transformation pipelines just to make data usable, let alone agent-ready. This approach doesn't scale. Why agents struggle without a semantic layer To produce a trustworthy answer, an AI agent needs: (1) **data access** to reach relevant sources, (2) **semantic context** to understand what the data _means_ (business definitions, relationships, hierarchies), and (3) **trust signals** like lineage, permissions, and freshness metadata. Traditional platforms provide the first but rarely the second or third — leaving agents to infer meaning from column names and table structures. This is fragile at best and misleading at worst. Figure 1: Without a shared semantic layer, AI agents must interpret raw, disconnected data across multiple systems — often leading to inconsistent or incomplete results. From storage to intelligence: What must change The fix isn't another ETL pipeline or another data integration tool. The fix is a fundamental shift in what we expect from a data platform. A storage platform asks: "Where is the data, and how do I access it?" An intelligence platform asks: "What does the data mean, who can use it, and how can an agent reason over it?" This shift requires four foundational pillars: Pillar 1: Unified data access OneLake, the data lake built into Microsoft Fabric, provides a single logical namespace across an organization. Whether data originates in a Fabric lakehouse, a warehouse, or an external storage account, OneLake makes it accessible through one interface — using shortcuts and mirroring rather than requiring data migration. This respects existing investments while reducing fragmentation. Pillar 2: Shared semantic layer Semantic models in Microsoft Fabric define business measures, table relationships, human-readable field descriptions, and row-level security. When an agent queries a semantic model instead of raw tables, it gets _answers_ — like `Total Revenue = $42.3M for North America in Q3` — not raw result sets requiring interpretation and aggregation. Before vs After: What changes for an agent? Without semantic layer: Queries raw tables Infers business meaning Risk of incorrect aggregation With semantic layer: Queries `[Total Revenue]` Uses business-defined logic Gets consistent, governed results Pillar 3: Context enrichment Microsoft Graph adds organizational signals — people and roles, activity patterns, and permissions — helping agents produce responses that are not just accurate, but _relevant_ and _appropriately scoped_ to the person asking. Pillar 4: Agent-ready APIs Data Agents in Microsoft Fabric (currently in preview) provide a natural-language interface to semantic models and lakehouses. Instead of generating SQL, an AI agent can ask: "What was Q3 revenue by region?" and receive a structured, sourced response. This is the critical difference: the platform provides structured context and business logic, helping reduce the reasoning burden on the agent. Figure 2: An intelligence platform adds semantic context, trust signals, and agent-ready APIs on top of unified data access — enabling AI agents to combine structured data, business definitions, and relationships to produce more consistent responses. Microsoft Fabric as the intelligence layer Microsoft Fabric is often described as a unified analytics platform. That description is accurate but incomplete. In the context of AI agents, Fabric's role is better understood as an **intelligence layer** — a platform that doesn't just store and process data, but _makes data understandable_ to autonomous systems. Let's look at each capability through the lens of agent readiness. OneLake: One namespace, many sources OneLake provides a single logical namespace backed by Azure Data Lake Storage Gen2. For AI agents, this means one authentication context, one discovery mechanism, and one governance surface. Key capabilities: **shortcuts** (reference external data without copying), **mirroring** (replicate from Azure SQL, Cosmos DB, or Snowflake), and a **unified security model**. For more on OneLake architecture, see [OneLake documentation on Microsoft Learn](https://learn.microsoft.com/fabric/onelake/onelake-overview). Semantic models: Business logic that agents can understand Semantic models (built on the Analysis Services engine) transform raw tables into business concepts: Raw Table Column Semantic Model Measure `fact_sales.amount` `[Total Revenue]` — Sum of net sales after returns `fact_sales.amount / dim_product.cost` `[Gross Margin %]` — Revenue minus COGS as a percentage `fact_sales.qty` YoY comparison `[YoY Growth %]` — Year-over-year quantity growth Code Snippet 1 — Querying a Fabric Semantic Model with Semantic Link (Python) import sempy.fabric as fabric # Query business-defined measures — no need to know underlying table schemas dax_query = """ EVALUATE SUMMARIZECOLUMNS( 'Geography'[Region], 'Calendar'[FiscalQuarter], "Total Revenue", [Total Revenue], "YoY Growth %", [YoY Growth %] ) """ result_df = fabric.evaluate_dax( dataset="Contoso Sales Analytics", workspace="Contoso Analytics Workspace", dax_string=dax_query ) print(result_df.head()) # NOTE: Output shown is illustrative and based on the semantic model definition # Output (illustrative): # Region FiscalQuarter Total Revenue YoY Growth % # North America Q3 FY2026 42300000 8.2 # Europe Q3 FY2026 31500000 5.7 Key takeaway: The agent doesn’t need to know that revenue is in `fact_sales.amount` or that fiscal quarters don’t align with calendar quarters. The semantic model handles all of this. Code Snippet 2 — Discovering Available Models and Measures (Python) Before an agent can query, it needs to _discover_ what data is available. Semantic Link provides programmatic access to model metadata — enabling agents to find relevant measures without hardcoded knowledge. import sempy.fabric as fabric # Discover available semantic models in the workspace datasets = fabric.list_datasets(workspace="Contoso Analytics Workspace") print(datasets[["Dataset Name", "Description"]]) # NOTE: Output shown is illustrative and based on the semantic model definition # Output (illustrative): # Dataset Name Description # Contoso Sales Analytics Revenue, margins, and growth metrics # Contoso HR Analytics Headcount, attrition, and hiring pipeline # Contoso Supply Chain Inventory, logistics, and supplier data # Inspect available measures — these are the business-defined metrics an agent can query measures = fabric.list_measures( dataset="Contoso Sales Analytics", workspace="Contoso Analytics Workspace" ) print(measures[["Table Name", "Measure Name", "Description"]]) # Output (illustrative): # Table Name Measure Name Description # Sales Total Revenue Sum of net sales after returns # Sales Gross Margin % Revenue minus COGS as a percentage # Sales YoY Growth % Year-over-year quantity growth Key takeaway: An agent can programmatically discover which semantic models exist and what measures they expose — turning the platform into a self-describing data catalog that agents can navigate autonomously. For more on Semantic Link, see the Semantic Link documentation on Microsoft Learn. Data Agents: Natural-language access for AI (preview) Note: Fabric Data Agents are currently in preview. See [Microsoft preview terms](https://learn.microsoft.com/legal/microsoft-fabric-preview) for details. A Data Agent wraps a semantic model and exposes it as a natural-language-queryable endpoint. An AI Foundry agent can register a Fabric Data Agent as a tool — when it needs data, it calls the Data Agent like any other tool. Important: In production scenarios, use managed identities or Microsoft Entra ID authentication. Always follow the [principle of least privilege](https://learn.microsoft.com/entra/identity-platform/secure-least-privileged-access) when configuring agent access. Microsoft Graph: Organizational context Microsoft Graph adds the final layer: who is asking (role-appropriate detail), what’s relevant (trending datasets), and who should review (data stewards). Fabric’s integration with Graph brings these signals into the data platform so agents produce contextually appropriate responses. Tying it together: Azure AI Foundry + Microsoft Fabric The real power of the intelligence platform concept emerges when you see how Azure AI Foundry and Microsoft Fabric are designed to work together. The integration pattern Azure AI Foundry provides the orchestration layer (conversations, tool selection, safety, response generation). Microsoft Fabric provides the data intelligence layer (data access, semantic context, structured query resolution). The integration follows a tool-calling pattern: 1.User prompt → End user asks a question through an AI Foundry-powered application. 2.Tool call → The agent selects the appropriate Fabric Data Agent and sends a natural-language query. 3.Semantic resolution → The Data Agent translates the query into DAX against the semantic model and executes it via OneLake. 4.Structured response → Results flow back through the stack, with each layer adding context (business definitions, permissions verification, data lineage). 5.User response → The AI Foundry agent presents a grounded, sourced answer to the user. Why these matters No custom ETL for agents — Agents query the intelligence platform directly No prompt-stuffing — The semantic model provides business context at query time No trust gap — Governed semantic models enforce row-level security and lineage No one-off integrations — Multiple agents reuse the same Data Agents Code Snippet 3 — Azure AI Foundry Agent with Fabric Data Agent Tool (Python) The following example shows how an Azure AI Foundry agent registers a Fabric Data Agent as a tool and uses it to answer a business question. The agent handles tool selection, query routing, and response grounding automatically. from azure.ai.projects import AIProjectClient from azure.ai.projects.models import FabricTool from azure.identity import DefaultAzureCredential # Connect to Azure AI Foundry project project_client = AIProjectClient.from_connection_string( credential=DefaultAzureCredential(), conn_str="<your-ai-foundry-connection-string>" ) # Register a Fabric Data Agent as a grounding tool # The connection references a Fabric workspace with semantic models fabric_tool = FabricTool(connection_id="<fabric-connection-id>") # Create an agent that uses the Fabric Data Agent for data queries agent = project_client.agents.create_agent( model="gpt-4o", name="Contoso Revenue Analyst", instructions="""You are a business analytics assistant for Contoso. Use the Fabric Data Agent tool to answer questions about revenue, margins, and growth. Always cite the source semantic model.""", tools=fabric_tool.definitions ) # Start a conversation thread = project_client.agents.create_thread() message = project_client.agents.create_message( thread_id=thread.id, role="user", content="What was Q3 revenue by region, and which region grew fastest?" ) # The agent automatically calls the Fabric Data Agent tool, # queries the semantic model, and returns a grounded response run = project_client.agents.create_and_process_run( thread_id=thread.id, agent_id=agent.id ) # Retrieve the agent's response messages = project_client.agents.list_messages(thread_id=thread.id) print(messages.data[0].content[0].text.value) # NOTE: Output shown is illustrative and based on the semantic model definition # Output (illustrative): # "Based on the Contoso Sales Analytics model, Q3 FY2026 revenue by region: # - North America: $42.3M (+8.2% YoY) # - Europe: $31.5M (+5.7% YoY) # - Asia Pacific: $18.9M (+12.1% YoY) — fastest growing # Source: Contoso Sales Analytics semantic model, OneLake" Key takeaway: The AI Foundry agent never writes SQL or DAX. It calls the Fabric Data Agent as a tool, which resolves the query against the semantic model. The response comes back grounded with source attribution — matching the five-step integration pattern described above. Figure 3: Each layer adds context — semantic models provide business definitions, Graph adds permissions awareness, and Data Agents provide the natural-language interface. Getting started: Practical next steps You don't need to redesign your entire data platform to begin this shift. Start with one high-value domain and expand incrementally. Step 1: Consolidate data access through OneLake Create OneLake shortcuts to your most critical data sources — core business metrics, customer data, financial records. No migration needed. [Create OneLake shortcuts](https://learn.microsoft.com/fabric/onelake/create-onelake-shortcut) Step 2: Build semantic models with business definitions For each major domain (sales, finance, operations), create a semantic model with key measures, table relationships, human-readable descriptions, and row-level security. [Create semantic models in Microsoft Fabric](https://learn.microsoft.com/fabric/data-warehouse/semantic-models) Step 3: Enable Data Agents (preview) Expose your semantic models as natural-language endpoints. Start with a single domain to validate the pattern. Note: Review the [preview terms](https://learn.microsoft.com/legal/microsoft-fabric-preview) and plan for API changes. [Fabric Data Agents overview](https://learn.microsoft.com/fabric/data-science/concept-data-agent) Step 4: Connect Azure AI Foundry agents Register Data Agents as tools in your AI Foundry agent configuration. Azure AI Foundry documentation Conclusion: The bottleneck isn't the model — it's the platform Models can reason, plan, and hold multi-turn conversations. But in the enterprise, the bottleneck for effective AI agents is the data platform underneath. Agents can’t reason over data they can’t find, apply business logic that isn’t encoded, respect permissions that aren’t enforced, or cite sources without lineage. The shift from storage to intelligence requires unified data access, a shared semantic layer, organizational context, and agent-ready APIs. Microsoft Fabric provides these capabilities, and its integration with Azure AI Foundry makes this intelligence layer accessible to AI agents. Disclaimer: Some features described in this post, including Fabric Data Agents, are currently in preview. Preview features may change before general availability, and their availability, functionality, and pricing may differ from the final release. See [Microsoft preview terms](https://learn.microsoft.com/legal/microsoft-fabric-preview) for details.Building Your First Local RAG Application with Foundry Local
A developer's guide to building an offline, mobile-responsive AI support agent using Retrieval-Augmented Generation, the Foundry Local SDK, and JavaScript. Imagine you are a gas field engineer standing beside a pipeline in a remote location. There is no Wi-Fi, no mobile signal, and you need a safety procedure right now. What do you do? This is the exact problem that inspired this project: a fully offline RAG-powered support agent that runs entirely on your machine. No cloud. No API keys. No outbound network calls. Just a local language model, a local vector store, and your own documents, all accessible from a browser on any device. In this post, you will learn how it works, how to build your own, and the key architectural decisions behind it. If you have ever wanted to build an AI application that runs locally and answers questions grounded in your own data, this is the place to start. The finished application: a browser-based AI support agent that runs entirely on your machine. What Is Retrieval-Augmented Generation? Retrieval-Augmented Generation (RAG) is a pattern that makes AI models genuinely useful for domain-specific tasks. Rather than hoping the model "knows" the answer from its training data, you: Retrieve relevant chunks from your own documents using a vector store Augment the model's prompt with those chunks as context Generate a response grounded in your actual data The result is fewer hallucinations, traceable answers with source attribution, and an AI that works with your content rather than relying on general knowledge. If you are building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want. RAG vs CAG: Understanding the Trade-offs If you have explored AI application patterns before, you have likely encountered Context-Augmented Generation (CAG). Both RAG and CAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations. RAG (Retrieval-Augmented Generation) How it works: Documents are split into chunks, vectorised, and stored in a database. At query time, the most relevant chunks are retrieved and injected into the prompt. Strengths: Scales to thousands or millions of documents Fine-grained retrieval at chunk level with source attribution Documents can be added or updated dynamically without restarting Token-efficient: only relevant chunks are sent to the model Supports runtime document upload via the web UI Limitations: More complex architecture: requires a vector store and chunking strategy Retrieval quality depends on chunking parameters and scoring method May miss relevant content if the retrieval step does not surface it CAG (Context-Augmented Generation) How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt. Strengths: Drastically simpler architecture with no vector database or embeddings All information is always available to the model Minimal dependencies and easy to set up Near-instant document selection Limitations: Constrained by the model's context window size Best suited to small, curated document sets (tens of documents) Adding documents requires an application restart Want to compare these patterns hands-on? There is a CAG-based implementation of the same gas field scenario using whole-document context injection. Clone both repositories, run them side by side, and see how the architectures differ in practice. When Should You Choose Which? Consideration Choose RAG Choose CAG Document count Hundreds or thousands Tens of documents Document updates Frequent or dynamic (runtime upload) Infrequent (restart to reload) Source attribution Per-chunk with relevance scores Per-document Setup complexity Moderate (ingestion step required) Minimal Query precision Better for large or diverse collections Good for keyword-matchable content Infrastructure SQLite vector store (single file) None beyond the runtime For the sample application in this post (20 gas engineering procedure documents with runtime upload), RAG is the clear winner. If your document set is small and static, CAG may be simpler. Both patterns run fully offline using Foundry Local. Foundry Local: Your On-Device AI Runtime Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download). What makes it particularly useful for developers: No GPU required: runs on CPU or NPU, making it accessible on standard laptops and desktops Native SDK bindings: in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server Automatic model management: downloads, caches, and loads models automatically Hardware-optimised variant selection: the SDK picks the best variant for your hardware (GPU, NPU, or CPU) Real-time progress callbacks: ideal for building loading UIs that show download and initialisation progress The integration code is refreshingly minimal: import { FoundryLocalManager } from "foundry-local-sdk"; // Create a manager and discover models via the catalogue const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" }); const model = await manager.catalog.getModel("phi-3.5-mini"); // Download if not cached, then load into memory if (!model.isCached) { await model.download((progress) => { console.log(`Download: ${Math.round(progress * 100)}%`); }); } await model.load(); // Create a chat client for direct in-process inference const chatClient = model.createChatClient(); const response = await chatClient.completeChat([ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ]); That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application. The Technology Stack The sample application is deliberately simple. No frameworks, no build steps, no Docker: Layer Technology Purpose AI Model Foundry Local + Phi-3.5 Mini Runs locally via native SDK bindings, no GPU required Back end Node.js + Express Lightweight HTTP server, everyone knows it Vector Store SQLite (via better-sqlite3 ) Zero infrastructure, single file on disc Retrieval TF-IDF + cosine similarity No embedding model required, fully offline Front end Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is three npm packages: express , foundry-local-sdk , and better-sqlite3 . Architecture Overview The five-layer architecture, all running on a single machine. The system has five layers, all running on a single machine: Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface Server layer: Express.js starts immediately and serves the UI plus SSE status and chat endpoints RAG pipeline: the chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorisation; the prompts module provides safety-first system instructions Data layer: SQLite stores document chunks and their TF-IDF vectors; documents live as .md files in the docs/ folder AI layer: Foundry Local runs Phi-3.5 Mini on CPU or NPU via native SDK bindings Building the Solution Step by Step Prerequisites You need two things installed on your machine: Node.js 20 or later: download from nodejs.org Foundry Local: Microsoft's on-device AI runtime: winget install Microsoft.FoundryLocal The SDK will automatically download the Phi-3.5 Mini model (approximately 2 GB) the first time you run the application. Getting the Code Running # Clone the repository git clone https://github.com/leestott/local-rag.git cd local-rag # Install dependencies npm install # Ingest the 20 gas engineering documents into the vector store npm run ingest # Start the server npm start Open http://127.0.0.1:3000 in your browser. You will see the status indicator whilst the model loads. Once the model is ready, the status changes to "Offline Ready" and you can start chatting. Desktop view Mobile view How the RAG Pipeline Works Let us trace what happens when a user asks: "How do I detect a gas leak?" The query flow from browser to model and back. 1 Documents are ingested and indexed When you run npm run ingest , every .md file in the docs/ folder is read, parsed (with optional YAML front-matter for title, category, and ID), split into overlapping chunks of approximately 200 tokens, and stored in SQLite with TF-IDF vectors. 2 Model is loaded via the SDK The Foundry Local SDK discovers the model in the local catalogue and loads it into memory. If the model is not already cached, it downloads it first (with progress streamed to the browser via SSE). 3 User sends a question The question arrives at the Express server. The chat engine converts it into a TF-IDF vector, uses an inverted index to find candidate chunks, and scores them using cosine similarity. The top 3 chunks are returned in under 1 ms. 4 Prompt is constructed The engine builds a messages array containing: the system prompt (with safety-first instructions), the retrieved chunks as context, the conversation history, and the user's question. 5 Model generates a grounded response The prompt is sent to the locally loaded model via the Foundry Local SDK's native chat client. The response streams back token by token through Server-Sent Events to the browser. Source references with relevance scores are included. A response with safety warnings and step-by-step guidance The sources panel shows which chunks were used and their relevance Key Code Walkthrough The Vector Store (TF-IDF + SQLite) The vector store uses SQLite to persist document chunks alongside their TF-IDF vectors. At query time, an inverted index finds candidate chunks that share terms with the query, then cosine similarity ranks them: // src/vectorStore.js search(query, topK = 5) { const queryTf = termFrequency(query); this._ensureCache(); // Build in-memory cache on first access // Use inverted index to find candidates sharing at least one term const candidateIndices = new Set(); for (const term of queryTf.keys()) { const indices = this._invertedIndex.get(term); if (indices) { for (const idx of indices) candidateIndices.add(idx); } } // Score only candidates, not all rows const scored = []; for (const idx of candidateIndices) { const row = this._rowCache[idx]; const score = cosineSimilarity(queryTf, row.tf); if (score > 0) scored.push({ ...row, score }); } scored.sort((a, b) => b.score - a.score); return scored.slice(0, topK); } The inverted index, in-memory row cache, and prepared SQL statements bring retrieval time to sub-millisecond for typical query loads. Why TF-IDF Instead of Embeddings? Most RAG tutorials use embedding models for retrieval. This project uses TF-IDF because: Fully offline: no embedding model to download or run Zero latency: vectorisation is instantaneous (it is just maths on word frequencies) Good enough: for 20 domain-specific documents, TF-IDF retrieves the right chunks reliably Transparent: you can inspect the vocabulary and weights, unlike neural embeddings For larger collections or when semantic similarity matters more than keyword overlap, you would swap in an embedding model. For this use case, TF-IDF keeps the stack simple and dependency-free. The System Prompt For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses: // src/prompts.js export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers. Behaviour Rules: - Always prioritise safety. If a procedure involves risk, explicitly call it out. - Do not hallucinate procedures, measurements, or tolerances. - If the answer is not in the provided context, say: "This information is not available in the local knowledge base." Response Format: - Summary (1-2 lines) - Safety Warnings (if applicable) - Step-by-step Guidance - Reference (document name + section)`; This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling. Runtime Document Upload Unlike the CAG approach, RAG supports adding documents without restarting the server. Click the upload button to add new .md or .txt files. They are chunked, vectorised, and indexed immediately. The upload modal with the complete list of indexed documents. Adapting This for Your Own Domain The sample project is designed to be forked and adapted. Here is how to make it yours in four steps: 1. Replace the documents Delete the gas engineering documents in docs/ and add your own markdown files. The ingestion pipeline handles any markdown content with optional YAML front-matter: --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the system prompt Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations. 3. Tune the retrieval In src/config.js : chunkSize: 200 : smaller chunks give more precise retrieval, less context per chunk chunkOverlap: 25 : prevents information falling between chunks topK: 3 : how many chunks to retrieve per query (more gives more context but slower generation) 4. Swap the model Change config.model in src/config.js to any model available in the Foundry Local catalogue. Smaller models give faster responses on constrained devices; larger models give better quality. Building a Field-Ready UI The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy. Design decisions that matter for field use: Dark, high-contrast theme with 18px base font size for readability in bright sunlight Large touch targets (minimum 44px) for operation with gloves or PPE Quick-action buttons that wrap on mobile so all options are visible without scrolling Responsive layout that works from 320px to 1920px+ screen widths Streaming responses via SSE, so the user sees tokens arriving in real time The mobile chat experience, optimised for field use. Testing The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed: # Run all tests npm test Tests cover the chunker, vector store, configuration, and server endpoints. Use them as a starting point when you adapt the project for your own domain. Ideas for Extending the Project Once you have the basics running, there are plenty of directions to explore: Embedding-based retrieval: use a local embedding model for better semantic matching on diverse queries Conversation memory: persist chat history across sessions using local storage or a lightweight database Multi-modal support: add image-based queries (photographing a fault code, for example) PWA packaging: make it installable as a standalone offline application on mobile devices Hybrid retrieval: combine TF-IDF keyword search with semantic embeddings for best results Try the CAG approach: compare with the local-cag sample to see which pattern suits your use case Ready to Build Your Own? Clone the RAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the CAG approach to see which pattern suits your use case best. Get the RAG Sample Get the CAG Sample Summary Building a local RAG application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and SQLite, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own documents. The key takeaways: RAG is ideal for scalable, dynamic document sets where you need fine-grained retrieval with source attribution. Documents can be added at runtime without restarting. CAG is simpler when you have a small, stable set of documents that fit in the context window. See the local-cag sample to compare. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required. TF-IDF + SQLite is a viable vector store for small-to-medium collections, with sub-millisecond retrieval thanks to inverted indexing and in-memory caching. Start simple, iterate outwards. Begin with RAG and a handful of documents. If your needs are simpler, try CAG. Both patterns run entirely offline. Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code. This project is open source under the MIT licence. It is a scenario sample for learning and experimentation, not production medical or safety advice. local-rag on GitHub · local-cag on GitHub · Foundry LocalGetting Started with Foundry Local: A Student Guide to the Microsoft Foundry Local Lab
If you want to start building AI applications on your own machine, the Microsoft Foundry Local Lab is one of the most useful places to begin. It is a practical workshop that takes you from first-time setup through to agents, retrieval, evaluation, speech transcription, tool calling, and a browser-based interface. The material is hands-on, cross-language, and designed to show how modern AI apps can run locally rather than depending on a cloud service for every step. This blog post is aimed at students, self-taught developers, and anyone learning how AI applications are put together in practice. Instead of treating large language models as a black box, the lab shows you how to install and manage local models, connect to them with code, structure tasks into workflows, and test whether the results are actually good enough. If you have been looking for a learning path that feels more like building real software and less like copying isolated snippets, this workshop is a strong starting point. What Is Foundry Local? Foundry Local is a local runtime for downloading, managing, and serving AI models on your own hardware. It exposes an OpenAI-compatible interface, which means you can work with familiar SDK patterns while keeping execution on your device. For learners, that matters for three reasons. First, it lowers the barrier to experimentation because you can run projects without setting up a cloud account for every test. Second, it helps you understand the moving parts behind AI applications, including model lifecycle, local inference, and application architecture. Third, it encourages privacy-aware development because the examples are designed to keep data on the machine wherever possible. The Foundry Local Lab uses that local-first approach to teach the full journey from simple prompts to multi-agent systems. It includes examples in Python, JavaScript, and C#, so you can follow the language that fits your course, your existing skills, or the platform you want to build on. Why This Lab Works Well for Learners A lot of AI tutorials stop at the moment a model replies to a prompt. That is useful for a first demo, but it does not teach you how to build a proper application. The Foundry Local Lab goes further. It is organised as a sequence of parts, each one adding a new idea and giving you working code to explore. You do not just ask a model to respond. You learn how to manage the service, choose a language SDK, construct retrieval pipelines, build agents, evaluate outputs, and expose the result through a usable interface. That sequence is especially helpful for students because the parts build on each other. Early labs focus on confidence and setup. Middle labs focus on architecture and patterns. Later labs move into more advanced ideas that are common in real projects, such as tool calling, evaluation, and custom model packaging. By the end, you have seen not just what a local AI app looks like, but how its different layers fit together. Before You Start The workshop expects a reasonably modern machine and at least one programming language environment. The core prerequisites are straightforward: install Foundry Local, clone the repository, and choose whether you want to work in Python, JavaScript, or C#. You do not need to master all three. In fact, most learners will get more value by picking one language first, completing the full path in that language, and only then comparing how the same patterns look elsewhere. If you are new to AI development, do not be put off by the number of parts. The early sections are accessible, and the later ones become much easier once you have completed the foundations. Think of the lab as a structured course rather than a single tutorial. What You Learn in Each Lab https://github.com/microsoft-foundry/foundry-local-lab Part 1: Getting Started with Foundry Local The first part introduces the basics of Foundry Local and gets you up and running. You learn how to install the CLI, inspect the model catalogue, download a model, and run it locally. This part also introduces practical details such as model aliases and dynamic service ports, which are small but important pieces of real development work. For students, the value of this part is confidence. You prove that local inference works on your machine, you see how the service behaves, and you learn the operational basics before writing any application code. By the end of Part 1, you should understand what Foundry Local does, how to start it, and how local model serving fits into an application workflow. Part 2: Foundry Local SDK Deep Dive Once the CLI makes sense, the workshop moves into the SDK. This part explains why application developers often use the SDK instead of relying only on terminal commands. You learn how to manage the service programmatically, browse available models, control model download and loading, and understand model metadata such as aliases and hardware-aware selection. This is where learners start to move from using a tool to building with a platform. You begin to see the difference between running a model manually and integrating it into software. By the end of this section, you should understand the API surface you will use in your own projects and know how to bootstrap the SDK in Python, JavaScript, or C#. Part 3: SDKs and APIs Part 3 turns the SDK concepts into a working chat application. You connect code to the local inference server and use the OpenAI-compatible API for streaming chat completions. The lab includes examples in all three supported languages, which makes it especially useful if you are comparing ecosystems or learning how the same idea is expressed through different syntax and libraries. The key learning outcome here is not just that you can get a response from a model. It is that you understand the boundary between your application and the local model service. You learn how messages are structured, how streaming works, and how to write the sort of integration code that becomes the foundation for every later lab. Part 4: Retrieval-Augmented Generation This is where the workshop starts to feel like modern AI engineering rather than basic prompting. In the retrieval-augmented generation lab, you build a simple RAG pipeline that grounds answers in supplied data. You work with an in-memory knowledge base, apply retrieval logic, score matches, and compose prompts that include grounded context. For learners, this part is important because it demonstrates a core truth of AI app development: a model on its own is often not enough. Useful applications usually need access to documents, notes, or structured information. By the end of Part 4, you understand why retrieval matters, how to pass retrieved context into a prompt, and how a pipeline can make answers more relevant and reliable. Part 5: Building AI Agents Part 5 introduces the concept of an agent. Instead of a one-off prompt and response, you begin to define behaviour through system instructions, roles, and conversation state. The lab uses the ChatAgent pattern and the Microsoft Agent Framework to show how an agent can maintain a purpose, respond with a persona, and return structured output such as JSON. This part helps learners understand the difference between a raw model call and a reusable application component. You learn how to design instructions that shape behaviour, how multi-turn interaction differs from single prompts, and why structured output matters when an AI component has to work inside a broader system. Part 6: Multi-Agent Workflows Once a single agent makes sense, the workshop expands the idea into a multi-agent workflow. The example pipeline uses roles such as researcher, writer, and editor, with outputs passed from one stage to the next. You explore sequential orchestration, shared configuration, and feedback loops between specialised components. For students, this lab is a very clear introduction to decomposition. Instead of asking one model to do everything at once, you break a task into smaller responsibilities. That pattern is useful well beyond AI. By the end of Part 6, you should understand why teams build multi-agent systems, how hand-offs are structured, and what trade-offs appear when more components are added to a workflow. Part 7: Zava Creative Writer Capstone Application The Zava Creative Writer is the capstone project that brings the earlier ideas together into a more production-style application. It uses multiple specialised agents, structured JSON hand-offs, product catalogue search, streaming output, and evaluation-style feedback loops. Rather than showing an isolated feature, this part shows how separate patterns combine into a complete system. This is one of the most valuable parts of the workshop for learner developers because it narrows the gap between tutorial code and real application design. You can see how orchestration, agent roles, and practical interfaces fit together. By the end of Part 7, you should be able to recognise the architecture of a serious local AI app and understand how the earlier labs support it. Part 8: Evaluation-Led Development Many beginner AI projects stop once the output looks good once or twice. This lab teaches a much stronger habit: evaluation-led development. You work with golden datasets, rule-based checks, and LLM-as-judge scoring to compare prompt or agent variants systematically. The goal is to move from anecdotal testing to repeatable assessment. This matters enormously for students because evaluation is one of the clearest differences between a classroom demo and dependable software. By the end of Part 8, you should understand how to define success criteria, compare outputs at scale, and use evidence rather than intuition when improving an AI component. Part 9: Voice Transcription with Whisper Part 9 broadens the workshop beyond text generation by introducing speech-to-text with Whisper running locally. You use the Foundry Local SDK to download and load the model, then transcribe local audio files through the compatible API surface. The emphasis is on privacy-first processing, with audio kept on-device. This section is a useful reminder that local AI development is not limited to chatbots. Learners see how a different modality fits into the same ecosystem and how local execution supports sensitive workloads. By the end of this lab, you should understand the transcription flow, the relevant client methods, and how speech features can be integrated into broader applications. Part 10: Using Custom or Hugging Face Models After learning the standard path, the workshop shows how to work with custom or Hugging Face models. This includes compiling models into optimised ONNX format with ONNX Runtime GenAI, choosing hardware-specific options, applying quantisation strategies, creating configuration files, and adding compiled models to the Foundry Local cache. For learner developers, this part opens the door to model engineering rather than simple model consumption. You begin to understand that model choice, optimisation, and packaging affect performance and usability. By the end of Part 10, you should have a clearer picture of how models move from an external source into a runnable local setup and why deployment format matters. Part 11: Tool Calling with Local Models Tool calling is one of the most practical patterns in current AI development, and this lab covers it directly. You define tool schemas, allow the model to request function calls, handle the multi-turn interaction loop, execute the tools locally, and return results back to the model. The examples include practical scenarios such as weather and population tools. This lab teaches learners how to move beyond generation into action. A model is no longer limited to producing text. It can decide when external data or a function is needed and incorporate that result into a useful answer. By the end of Part 11, you should understand the tool-calling flow and how AI systems connect reasoning with deterministic software behaviour. Part 12: Building a Web UI for the Zava Creative Writer Part 12 adds a browser-based front end to the capstone application. You learn how to serve a shared interface from Python, JavaScript, or C#, stream updates to the browser, consume NDJSON with the Fetch API and ReadableStream, and show live agent status as content is produced in real time. This part is especially good for students who want to build portfolio projects. It turns backend orchestration into something visible and interactive. By the end of Part 12, you should understand how to connect a local AI backend to a web interface and how streaming changes the user experience compared with waiting for one final response. Part 13: Workshop Complete The final part is a summary and extension point. It reviews what you have built across the previous sections and suggests ways to continue. Although it is not a new technical lab in the same way as the earlier parts, it plays an important role in learning. It helps you consolidate the architecture, the terminology, and the development patterns you have encountered. For learners, reflection matters. By the end of Part 13, you should be able to describe the full stack of a local AI application, from model management to user interface, and identify which area you want to deepen next. What Students Gain from the Full Workshop Taken together, these labs do more than teach Foundry Local itself. They teach how AI applications are built. You learn operational basics such as model setup and service management. You learn application integration through SDKs and APIs. You learn system design through RAG, agents, multi-agent orchestration, and web interfaces. You learn engineering discipline through evaluation. You also see how text, speech, custom models, and tool calling all fit into one local-first development workflow. That breadth makes the workshop useful in several settings. A student can use it as a self-study path. A lecturer can use it as source material for practical sessions. A learner developer can use it to build portfolio pieces and to understand which AI patterns are worth learning next. Because the repository includes Python, JavaScript, and C#, it also works well for comparing how architectural ideas transfer across languages. How to Approach the Lab as a Beginner If you are starting from scratch, the best route is simple. Complete Parts 1 to 3 in your preferred language first. That gives you the essential setup and integration skills. Then move into Parts 4 to 6 to understand how AI application patterns are composed. After that, use Parts 7 and 8 to learn how larger systems and evaluation fit together. Finally, explore Parts 9 to 12 based on your interests, whether that is speech, tooling, model customisation, or front-end work. It is also worth keeping notes as you go. Record what each part adds to your understanding, what code files matter, and what assumptions each example makes. That habit will help you move from following the labs to adapting the patterns in your own projects. Final Thoughts The Microsoft Foundry Local Lab is a strong introduction to local AI development because it treats learners like developers rather than spectators. You install, run, connect, orchestrate, evaluate, and present working systems. That makes it far more valuable than a short demo that only proves a model can answer a question. If you are a student or learner developer who wants to understand how AI applications are really built, this lab gives you a clear path. Start with the basics, pick one language, and work through the parts in order. By the time you finish, you will not just have used Foundry Local. You will have a practical foundation for building local AI applications with far more confidence and much better judgement.235Views0likes0CommentsBuilding an Offline AI Interview Coach with Foundry Local, RAG, and SQLite
How to build a 100% offline, AI-powered interview preparation tool using Microsoft Foundry Local, Retrieval-Augmented Generation, and nothing but JavaScript. Foundry Local 100% Offline RAG + TF-IDF JavaScript / Node.js Contents Introduction What is RAG and Why Offline? Architecture Overview Setting Up Foundry Local Building the RAG Pipeline The Chat Engine Dual Interfaces: Web & CLI Testing Adapting for Your Own Use Case What I Learned Getting Started Introduction Imagine preparing for a job interview with an AI assistant that knows your CV inside and out, understands the job you're applying for, and generates tailored questions, all without ever sending your data to the cloud. That's exactly what Interview Doctor does. Interview Doctor's web UI, a polished, dark-themed interface running entirely on your local machine. In this post, I'll walk you through how I built an interview prep tool as a fully offline JavaScript application using: Foundry Local — Microsoft's on-device AI runtime SQLite — for storing document chunks and TF-IDF vectors RAG (Retrieval-Augmented Generation) — to ground the AI in your actual documents Express.js — for the web server Node.js built-in test runner — for testing with zero extra dependencies No cloud. No API keys. No internet required. Everything runs on your machine. What is RAG and Why Does It Matter? Retrieval-Augmented Generation (RAG) is a pattern that makes AI models dramatically more useful for domain-specific tasks. Instead of relying solely on what a model learned during training (which can be outdated or generic), RAG: Retrieves relevant chunks from your own documents Augments the model's prompt with those chunks as context Generates a response grounded in your actual data For Interview Doctor, this means the AI doesn't just ask generic interview questions, it asks questions specific to your CV, your experience, and the specific job you're applying for. Why Offline RAG? Privacy is the obvious benefit, your CV and job applications never leave your device. But there's more: No API costs — run as many queries as you want No rate limits — iterate rapidly during your prep Works anywhere — on a plane, in a café with bad Wi-Fi, anywhere Consistent performance — no cold starts, no API latency Architecture Overview Complete architecture showing all components and data flow. The application has two interfaces (CLI and Web) that share the same core engine: Document Ingestion — PDFs and markdown files are chunked and indexed Vector Store — SQLite stores chunks with TF-IDF vectors Retrieval — queries are matched against stored chunks using cosine similarity Generation — relevant chunks are injected into the prompt sent to the local LLM Step 1: Setting Up Foundry Local First, install Foundry Local: # Windows winget install Microsoft.FoundryLocal # macOS brew install microsoft/foundrylocal/foundrylocal The JavaScript SDK handles everything else — starting the service, downloading the model, and connecting: import { FoundryLocalManager } from "foundry-local-sdk"; import { OpenAI } from "openai"; const manager = new FoundryLocalManager(); const modelInfo = await manager.init("phi-3.5-mini"); // Foundry Local exposes an OpenAI-compatible API const openai = new OpenAI({ baseURL: manager.endpoint, // Dynamic port, discovered by SDK apiKey: manager.apiKey, }); ⚠️ Key Insight Foundry Local uses a dynamic port never hardcode localhost:5272 . Always use manager.endpoint which is discovered by the SDK at runtime. Step 2: Building the RAG Pipeline Document Chunking Documents are split into overlapping chunks of ~200 tokens. The overlap ensures important context isn't lost at chunk boundaries: export function chunkText(text, maxTokens = 200, overlapTokens = 25) { const words = text.split(/\s+/).filter(Boolean); if (words.length <= maxTokens) return [text.trim()]; const chunks = []; let start = 0; while (start < words.length) { const end = Math.min(start + maxTokens, words.length); chunks.push(words.slice(start, end).join(" ")); if (end >= words.length) break; start = end - overlapTokens; } return chunks; } Why 200 tokens with 25-token overlap? Small chunks keep retrieved context compact for the model's limited context window. Overlap prevents information loss at boundaries. And it's all pure string operations, no dependencies needed. TF-IDF Vectors Instead of using a separate embedding model (which would consume precious memory alongside the LLM), we use TF-IDF, a classic information retrieval technique: export function termFrequency(text) { const tf = new Map(); const tokens = text .toLowerCase() .replace(/[^a-z0-9\-']/g, " ") .split(/\s+/) .filter((t) => t.length > 1); for (const t of tokens) { tf.set(t, (tf.get(t) || 0) + 1); } return tf; } export function cosineSimilarity(a, b) { let dot = 0, normA = 0, normB = 0; for (const [term, freq] of a) { normA += freq * freq; if (b.has(term)) dot += freq * b.get(term); } for (const [, freq] of b) normB += freq * freq; if (normA === 0 || normB === 0) return 0; return dot / (Math.sqrt(normA) * Math.sqrt(normB)); } Each document chunk becomes a sparse vector of word frequencies. At query time, we compute cosine similarity between the query vector and all stored chunk vectors to find the most relevant matches. SQLite as a Vector Store Chunks and their TF-IDF vectors are stored in SQLite using sql.js (pure JavaScript — no native compilation needed): export class VectorStore { // Created via: const store = await VectorStore.create(dbPath) insert(docId, title, category, chunkIndex, content) { const tf = termFrequency(content); const tfJson = JSON.stringify([...tf]); this.db.run( "INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)", [docId, title, category, chunkIndex, content, tfJson] ); this.save(); } search(query, topK = 5) { const queryTf = termFrequency(query); // Score each chunk by cosine similarity, return top-K } } 💡 Why SQLite for Vectors? For a CV plus a few job descriptions (dozens of chunks), brute-force cosine similarity over SQLite rows is near-instant (~1ms). No need for Pinecone, Qdrant, or Chroma — just a single .db file on disk. Step 3: The RAG Chat Engine The chat engine ties retrieval and generation together: async *queryStream(userMessage, history = []) { // 1. Retrieve relevant CV/JD chunks const chunks = this.retrieve(userMessage); const context = this._buildContext(chunks); // 2. Build the prompt with retrieved context const messages = [ { role: "system", content: SYSTEM_PROMPT }, { role: "system", content: `Retrieved context:\n\n${context}` }, ...history, { role: "user", content: userMessage }, ]; // 3. Stream from the local model const stream = await this.openai.chat.completions.create({ model: this.modelId, messages, temperature: 0.3, stream: true, }); // 4. Yield chunks as they arrive for await (const chunk of stream) { const content = chunk.choices[0]?.delta?.content; if (content) yield { type: "text", data: content }; } } The flow is straightforward: vectorize the query, retrieve with cosine similarity, build a prompt with context, and stream from the local LLM. The temperature: 0.3 keeps responses focused — important for interview preparation where consistency matters. Step 4: Dual Interfaces — Web & CLI Web UI The web frontend is a single HTML file with inline CSS and JavaScript — no build step, no framework, no React or Vue. It communicates with the Express backend via REST and SSE: File upload via multipart/form-data Streaming chat via Server-Sent Events (SSE) Quick-action buttons for common follow-up queries (coaching tips, gap analysis, mock interview) The setup form with job title, seniority level, and a pasted job description — ready to generate tailored interview questions. CLI The CLI provides the same experience in the terminal with ANSI-coloured output: npm run cli It walks you through uploading your CV, entering the job details, and then generates streaming questions. Follow-up questions work interactively. Both interfaces share the same ChatEngine class, they're thin layers over identical logic. Edge Mode For constrained devices, toggle Edge mode to use a compact system prompt that fits within smaller context windows: Edge mode activated, uses a minimal prompt for devices with limited resources. Step 5: Testing Tests use the Node.js built-in test runner, no Jest, no Mocha, no extra dependencies: import { describe, it } from "node:test"; import assert from "node:assert/strict"; describe("chunkText", () => { it("returns single chunk for short text", () => { const chunks = chunkText("short text", 200, 25); assert.equal(chunks.length, 1); }); it("maintains overlap between chunks", () => { // Verifies overlapping tokens between consecutive chunks }); }); npm test Tests cover the chunker, vector store, config, prompts, and server API contract, all without needing Foundry Local running. Adapting for Your Own Use Case Interview Doctor is a pattern, not just a product. You can adapt it for any domain: What to Change How Domain documents Replace files in docs/ with your content System prompt Edit src/prompts.js Chunk sizes Adjust config.chunkSize and config.chunkOverlap Model Change config.model — run foundry model list UI Modify public/index.html — it's a single file Ideas for Adaptation Customer support bot — ingest your product docs and FAQs Code review assistant — ingest coding standards and best practices Study guide — ingest textbooks and lecture notes Compliance checker — ingest regulatory documents Onboarding assistant — ingest company handbooks and processes What I Learned Offline AI is production-ready. Foundry Local + small models like Phi-3.5 Mini are genuinely useful for focused tasks. You don't need vector databases for small collections. SQLite + TF-IDF is fast, simple, and has zero infrastructure overhead. RAG quality depends on chunking. Getting chunk sizes right for your use case is more impactful than the retrieval algorithm. The OpenAI-compatible API is a game-changer. Switching from cloud to local was mostly just changing the baseURL . Dual interfaces are easy when you share the engine. The CLI and Web UI are thin layers over the same ChatEngine class. ⚡ Performance Notes On a typical laptop (no GPU): ingestion takes under 1 second for ~20 documents, retrieval is ~1ms, and the first LLM token arrives in 2-5 seconds. Foundry Local automatically selects the best model variant for your hardware (CUDA GPU, NPU, or CPU). Getting Started git clone https://github.com/leestott/interview-doctor-js.git cd interview-doctor-js npm install npm run ingest npm start # Web UI at http://127.0.0.1:3000 # or npm run cli # Interactive terminal The full source code is on GitHub. Star it, fork it, adapt it — and good luck with your interviews! Resources Foundry Local — Microsoft's on-device AI runtime Foundry Local SDK (npm) — JavaScript SDK Foundry Local GitHub — Source, samples, and documentation Local RAG Reference — Reference RAG implementation Interview Doctor (JavaScript) — This project's source codeBuild an Offline Hybrid RAG Stack with ONNX and Foundry Local
If you are building local AI applications, basic retrieval augmented generation is often only the starting point. This sample shows a more practical pattern: combine lexical retrieval, ONNX based semantic embeddings, and a Foundry Local chat model so the assistant stays grounded, remains offline, and degrades cleanly when the semantic path is unavailable. Why this sample is worth studying Many local RAG samples rely on a single retrieval strategy. That is usually enough for a proof of concept, but it breaks down quickly in production. Exact keywords, acronyms, and document codes behave differently from natural language questions and paraphrased requests. This repository keeps the original lexical retrieval path, adds local ONNX embeddings for semantic search, and fuses both signals in a hybrid ranking mode. The generation step runs through Foundry Local, so the entire assistant can remain on device. Lexical mode handles exact terms and structured vocabulary. Semantic mode handles paraphrases and more natural language phrasing. Hybrid mode combines both and is usually the best default. Lexical fallback protects the user experience if the embedding pipeline cannot start. Architectural overview The sample has two main flows: an offline ingestion pipeline and a local query pipeline. The architecture splits cleanly into offline ingestion at the top and runtime query handling at the bottom. Offline ingestion pipeline Read Markdown files from docs/ . Parse front matter and split each document into overlapping chunks. Generate dense embeddings when the ONNX model is available. Store chunks in SQLite with both sparse lexical features and optional dense vectors. Local query pipeline The browser posts a question to the Express API. ChatEngine resolves the requested retrieval mode. VectorStore retrieves lexical, semantic, or hybrid results. The prompt is assembled with the retrieved context and sent to a Foundry Local chat model. The answer is returned with source references and retrieval metadata. The sequence diagram shows the difference between lexical retrieval and hybrid retrieval. In hybrid mode, the query is embedded first, then lexical and semantic scores are fused before prompt assembly. Repository structure and core components The implementation is compact and readable. The main files to understand are listed below. src/config.js : retrieval defaults, paths, and model settings. src/embeddingEngine.js : local ONNX embedding generation through Transformers.js. src/vectorStore.js : SQLite storage plus lexical, semantic, and hybrid ranking. src/chatEngine.js : retrieval mode resolution, prompt assembly, and Foundry Local model execution. src/ingest.js : document ingestion and embedding generation during indexing. src/server.js : REST endpoints, streaming endpoints, upload support, and health reporting. Getting started To run the sample, you need Node.js 20 or newer, Foundry Local, and a local ONNX embedding model. The default model path is models/embeddings/bge-small-en-v1.5 . cd c:\Users\leestott\local-hybrid-retrival-onnx npm install huggingface-cli download BAAI/bge-small-en-v1.5 --local-dir models/embeddings/bge-small-en-v1.5 npm run ingest npm start Ingestion writes the local SQLite database to data/rag.db . If the embedding model is available, each chunk gets a dense vector as well as lexical features. If the embedding model is missing, ingestion still succeeds and the application remains usable in lexical mode. Best practice: local AI applications should treat model files, SQLite data, and native runtime compatibility as part of the deployable system, not as optional developer conveniences. Code walkthrough 1. Retrieval configuration The sample makes its retrieval behaviour explicit in configuration. That is useful for testing and for operator visibility. export const config = { model: "phi-3.5-mini", docsDir: path.join(ROOT, "docs"), dbPath: path.join(ROOT, "data", "rag.db"), chunkSize: 200, chunkOverlap: 25, topK: 3, retrievalMode: process.env.RETRIEVAL_MODE || "hybrid", retrievalModes: ["lexical", "semantic", "hybrid"], fallbackRetrievalMode: "lexical", retrievalWeights: { lexical: 0.45, semantic: 0.55, }, }; Those defaults tell you a lot about the intended operating profile. Chunks are small, the number of returned chunks is low, and the fallback path is explicit. 2. Local ONNX embeddings The embedding engine disables remote model loading and only uses local files. That matters for privacy, repeatability, and air gapped operation. env.allowLocalModels = true; env.allowRemoteModels = false; this.extractor = await pipeline("feature-extraction", resolvedPath, { local_files_only: true, }); const output = await this.extractor(text, { pooling: "mean", normalize: true, }); The mean pooling and normalisation step make the vectors suitable for cosine similarity based ranking. 3. Hybrid storage and ranking in SQLite Instead of adding a separate vector database, the sample stores lexical and semantic representations in the same SQLite table. That keeps the local footprint low and the implementation easy to debug. searchHybrid(query, queryEmbedding, topK = 5, weights = { lexical: 0.45, semantic: 0.55 }) { const lexicalResults = this.searchLexical(query, topK * 3); const semanticResults = this.searchSemantic(queryEmbedding, topK * 3); if (semanticResults.length === 0) { return lexicalResults.slice(0, topK).map((row) => ({ ...row, retrievalMode: "lexical", })); } const fused = [...combined.values()].map((row) => ({ ...row, score: (row.lexicalScore * lexicalWeight) + (row.semanticScore * semanticWeight), })); fused.sort((a, b) => b.score - a.score); return fused.slice(0, topK); } The important point is not just the weighted fusion. It is the fallback behaviour. If semantic retrieval cannot provide results, the user still gets lexical grounding instead of an empty context window. 4. Retrieval mode resolution in ChatEngine ChatEngine keeps the runtime behaviour predictable. It validates the requested mode and falls back to lexical search when semantic retrieval is unavailable. resolveRetrievalMode(requestedMode) { const desiredMode = config.retrievalModes.includes(requestedMode) ? requestedMode : config.retrievalMode; if ((desiredMode === "semantic" || desiredMode === "hybrid") && !this.semanticAvailable) { return config.fallbackRetrievalMode; } return desiredMode; } This is a sensible production design because local runtime failures are common. Missing model files or native dependency mismatches should reduce quality, not crash the entire assistant. 5. Foundry Local model management The sample uses FoundryLocalManager to discover, download, cache, and load the configured chat model. const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" }); const catalog = manager.catalog; this.model = await catalog.getModel(config.model); if (!this.model.isCached) { await this.model.download((progress) => { const pct = Math.round(progress * 100); this._emitStatus("download", `Downloading ${this.modelAlias}... ${pct}%`, progress); }); } await this.model.load(); this.chatClient = this.model.createChatClient(); this.chatClient.settings.temperature = 0.1; This gives the app a better local startup experience. The server can expose a status stream while the model initialises in the background. User experience and screenshots The client is intentionally simple, which makes it useful during evaluation. You can switch retrieval mode, test questions quickly, and inspect the retrieved sources. The landing page exposes retrieval mode directly in the UI. That makes it easy to compare lexical, semantic, and hybrid behaviour during testing. The sources panel shows grounding evidence and retrieval scores, which is useful when validating whether better answers are coming from better retrieval or just model phrasing. Best practices for ONNX RAG and Foundry Local Keep lexical fallback alive. Exact identifiers and runtime failures both make this necessary. Persist sparse and dense features together where possible. It simplifies debugging and operational reasoning. Use small chunks and conservative topK values for local context budgets. Expose health and status endpoints so users can see when the model is still loading or embeddings are unavailable. Test retrieval quality separately from generation quality. Pin and validate native runtime dependencies, especially ONNX Runtime, before tuning prompts. Practical warning: this repository already shows why runtime validation matters. A local app can ingest documents successfully and still fail at model initialisation if the native runtime stack is misaligned. How this compares with RAG and CAG The strongest value in this sample comes from where it sits between a basic local RAG baseline and a curated CAG design. Dimension Classic local RAG This hybrid ONNX RAG sample CAG Context assembly Retrieve chunks at query time, often lexically, then inject them into the prompt. Retrieve chunks at query time with lexical, semantic, or fused scoring, then inject the strongest results into the prompt. Use a prepared or cached context pack instead of fresh retrieval for every request. Main strength Easy to implement and easy to explain. Better recall for paraphrases without giving up exact match behaviour or offline execution. Predictable prompts and low query time overhead. Main weakness Misses synonyms and natural language reformulations. More moving parts, larger local asset footprint, and native runtime compatibility to manage. Coverage depends on curation quality and goes stale more easily. Failure behaviour Weak retrieval leads to weak grounding. Semantic failure can degrade to lexical retrieval if designed properly, which this sample does. Prepared context can be too narrow for new or unexpected questions. Best fit Simple local assistants and proof of concept systems. Offline copilots and technical assistants that need stronger recall across varied phrasing. Stable workflows with tightly bounded, curated knowledge. Samples Related samples: - Foundry Local RAG - https://github.com/leestott/local-rag - Foundry Local CAG - https://github.com/leestott/local-cag - Foundry Local hybrid-retrival-onnx https://github.com/leestott/local-hybrid-retrival-onnx Specific benefits of this hybrid approach over classic RAG It captures paraphrased questions that lexical search would often miss. It still preserves exact match performance for codes, terms, and product names. It gives operators a controlled degradation path when the semantic stack is unavailable. It stays local and inspectable without introducing a separate hosted vector service. Specific differences from CAG CAG shifts effort into context curation before the request. This sample retrieves evidence dynamically at runtime. CAG can be faster for fixed workflows, but it is usually less flexible when the document set changes. This hybrid RAG design is better suited to open ended knowledge search and growing document collections. What to validate before shipping Measure retrieval quality in each mode using exact term, acronym, and paraphrase queries. Check that sources shown in the UI reflect genuinely distinct evidence, not repeated chunks. Confirm the application remains usable when semantic retrieval is unavailable. Verify ONNX Runtime compatibility on the real target machines, not only on the development laptop. Test model download, cache, and startup behaviour with a clean environment. Final take For developers getting started with ONNX RAG and Foundry Local, this sample is a good technical reference because it demonstrates a realistic local architecture rather than a minimal demo. It shows how to build a grounded assistant that remains offline, supports multiple retrieval modes, and fails gracefully. Compared with classic local RAG, the hybrid design provides better recall and better resilience. Compared with CAG, it remains more flexible for changing document sets and less dependent on pre curated context packs. If you want a practical starting point for offline grounded AI on developer workstations or edge devices, this is the most balanced pattern in the repository set.238Views0likes0CommentsVectorless Reasoning-Based RAG: A New Approach to Retrieval-Augmented Generation
Introduction Retrieval-Augmented Generation (RAG) has become a widely adopted architecture for building AI applications that combine Large Language Models (LLMs) with external knowledge sources. Traditional RAG pipelines rely heavily on vector embeddings and similarity search to retrieve relevant documents. While this works well for many scenarios, it introduces challenges such as: Requires chunking documents into small segments Important context can be split across chunks Embedding generation and vector databases add infrastructure complexity A new paradigm called Vectorless Reasoning-Based RAG is emerging to address these challenges. One framework enabling this approach is PageIndex, an open-source document indexing system that organizes documents into a hierarchical tree structure and allows Large Language Models (LLMs) to perform reasoning-based retrieval over that structure. Vectorless Reasoning-Based RAG Instead of vectors, this approach uses structured document navigation. User Query ->Document Tree Structure ->LLM Reasoning ->Relevant Nodes Retrieved ->LLM Generates Answer This mimics how humans read documents: Look at the table of contents Identify relevant sections Read the relevant content Answer the question Core features No Vector Database: It relies on document structure and LLM reasoning for retrieval. It does not depend on vector similarity search. No Chunking: Documents are not split into artificial chunks. Instead, they are organized using their natural structure, such as pages and sections. Human-like Retrieval: The system mimics how human experts read documents. It navigates through sections and extracts information from relevant parts. Better Explainability and Traceability: Retrieval is based on reasoning. The results can be traced back to specific pages and sections. This makes the process easier to interpret. It avoids opaque and approximate vector search, often called “vibe retrieval.” When to Use Vectorless RAG Vectorless RAG works best when: Data is structured or semi-structured Documents have clear metadata Knowledge sources are well organized Queries require reasoning rather than semantic similarity Examples: enterprise knowledge bases internal documentation systems compliance and policy search healthcare documentation financial reporting Implementing Vectorless RAG with Azure AI Foundry Step 1 : Install Pageindex using pip command, from pageindex import PageIndexClient import pageindex.utils as utils # Get your PageIndex API key from https://dash.pageindex.ai/api-keys PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY" pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY) Step 2 : Set up your LLM Example using Azure OpenAI: from openai import AsyncAzureOpenAI client = AsyncAzureOpenAI( api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version=AZURE_OPENAI_API_VERSION ) async def call_llm(prompt, temperature=0): response = await client.chat.completions.create( model=AZURE_DEPLOYMENT_NAME, messages=[{"role": "user", "content": prompt}], temperature=temperature ) return response.choices[0].message.content.strip() Step 3: Page Tree Generation import os, requests pdf_url = "https://arxiv.org/pdf/2501.12948.pdf" //give the pdf url for tree generation, here given one for example pdf_path = os.path.join("../data", pdf_url.split('/')[-1]) os.makedirs(os.path.dirname(pdf_path), exist_ok=True) response = requests.get(pdf_url) with open(pdf_path, "wb") as f: f.write(response.content) print(f"Downloaded {pdf_url}") doc_id = pi_client.submit_document(pdf_path)["doc_id"] print('Document Submitted:', doc_id) Step 4 : Print the generated pageindex tree structure if pi_client.is_retrieval_ready(doc_id): tree = pi_client.get_tree(doc_id, node_summary=True)['result'] print('Simplified Tree Structure of the Document:') utils.print_tree(tree) else: print("Processing document, please try again later...") Step 5 : Use LLM for tree search and identify nodes that might contain relevant context import json query = "What are the conclusions in this document?" tree_without_text = utils.remove_fields(tree.copy(), fields=['text']) search_prompt = f""" You are given a question and a tree structure of a document. Each node contains a node id, node title, and a corresponding summary. Your task is to find all nodes that are likely to contain the answer to the question. Question: {query} Document tree structure: {json.dumps(tree_without_text, indent=2)} Please reply in the following JSON format: {{ "thinking": "<Your thinking process on which nodes are relevant to the question>", "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"] }} Directly return the final JSON structure. Do not output anything else. """ tree_search_result = await call_llm(search_prompt) Step 6 : Print retrieved nodes and reasoning process node_map = utils.create_node_mapping(tree) tree_search_result_json = json.loads(tree_search_result) print('Reasoning Process:') utils.print_wrapped(tree_search_result_json['thinking']) print('\nRetrieved Nodes:') for node_id in tree_search_result_json["node_list"]: node = node_map[node_id] print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}") Step 7: Answer generation node_list = json.loads(tree_search_result)["node_list"] relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list) print('Retrieved Context:\n') utils.print_wrapped(relevant_content[:1000] + '...') answer_prompt = f""" Answer the question based on the context: Question: {query} Context: {relevant_content} Provide a clear, concise answer based only on the context provided. """ print('Generated Answer:\n') answer = await call_llm(answer_prompt) utils.print_wrapped(answer) When to Use Each Approach Both vector-based RAG and vectorless RAG have their strengths. Choosing the right approach depends on the nature of the documents and the type of retrieval required. When to Use Vector Database–Based RAG Vector-based retrieval works best when dealing with large collections of unrelated or loosely structured documents. In such cases, semantic similarity is often sufficient to identify relevant information quickly. Use vector RAG when: Searching across many independent documents Semantic similarity is sufficient to locate relevant content Real-time retrieval is required over very large datasets Common use cases include: Customer support knowledge bases Conversational chatbots Product and content search systems When to Use Vectorless RAG Vectorless approaches such as PageIndex are better suited for long, structured documents where understanding the logical organization of the content is important. Use vectorless RAG when: Documents contain clear hierarchical structure Logical reasoning across sections is required High retrieval accuracy is critical Typical examples include: Financial filings and regulatory reports Legal documents and contracts Technical manuals and documentation Academic and research papers In these scenarios, navigating the document structure allows the system to identify the exact section that logically contains the answer, rather than relying only on semantic similarity. Conclusion Vector databases significantly advanced RAG architectures by enabling scalable semantic search across large datasets. However, they are not the optimal solution for every type of document. Vectorless approaches such as PageIndex introduce a different philosophy: instead of retrieving text that is merely semantically similar, they retrieve text that is logically relevant by reasoning over the structure of the document. As RAG architectures continue to evolve, the future will likely combine the strengths of both approaches. Hybrid systems that integrate vector search for broad retrieval and reasoning-based navigation for precision may offer the best balance of scalability and accuracy for enterprise AI applications.3.1KViews2likes0CommentsAnnouncing the IQ Series: Foundry IQ
AI agents are rapidly becoming a new way to build applications. But for agents to be truly useful, they need access to the knowledge and context that helps them reason about the world they operate in. That’s where Foundry IQ comes in. Today we’re announcing the IQ Series: Foundry IQ, a new set of developer-focused episodes exploring how to build knowledge-centric AI systems using Foundry IQ. The series focuses on the core ideas behind how modern AI systems work with knowledge, how they retrieve information, reason across sources, synthesize answers, and orchestrate multi-step interactions. Instead of treating retrieval as a single step in a pipeline, Foundry IQ approaches knowledge as something that AI systems actively work with throughout the reasoning process. The IQ Series breaks down these concepts and shows how they come together when building real AI applications. You can explore the series and all the accompanying samples here: 👉 https://aka.ms/iq-series What is Foundry IQ? Foundry IQ helps AI systems work with knowledge in a more structured and intentional way. Rather than wiring retrieval logic directly into every application, developers can define knowledge bases that connect to documents, data sources, and other information systems. AI agents can then query these knowledge bases to gather the context they need to generate responses, make decisions, or complete tasks. This model allows knowledge to be organized, reused, and combined across applications, instead of being rebuilt for each new scenario. What's covered in the IQ Series? The Foundry IQ episodes in the IQ Series explore the key building blocks behind knowledge-driven AI systems from how knowledge enters the system to how agents ultimately query and use it. The series is released as three weekly episodes: Foundry IQ: Unlocking Knowledge for Your Agents — March 18, 2026: Introduces Foundry IQ and the core ideas behind it. The episode explains how AI agents work with knowledge and walks through the main components of the Foundry IQ that support knowledge-driven applications. Foundry IQ: Building the Data Pipeline with Knowledge Sources — March 25, 2026: Focuses on Knowledge Sources and how different types of content flow into Foundry IQ. It explores how systems such as SharePoint, Fabric, OneLake, Azure Blob Storage, Azure AI Search, and the web contribute information that AI systems can later retrieve and use. Foundry IQ: Querying the Multi-Source AI Knowledge Bases — April 1, 2026: Dives into the Knowledge Bases and how multiple knowledge sources can be organized behind a single endpoint. The episode demonstrates how AI systems query across these sources and synthesize information to answer complex questions. Each episode includes a short executive introduction, a tech talk exploring the topic in depth, and a visual recap with doodle summaries of the key ideas. Alongside the episodes, the GitHub repository provides cookbooks with sample code, summary of the episodes, and additinal learning resources, so developers can explore the concepts and apply them in their own projects. Explore the Repo All episodes and supporting materials live in the IQ Series repository: 👉 https://aka.ms/iq-series Inside the repository you’ll find: The Foundry IQ episode links Cookbooks for each episode Links to documentation and additional resources If you're building AI agents or exploring how AI systems can work with knowledge, the IQ Series is a great place to start. Watch the episodes and explore the cookbooks! We’re excited to see what you build and welcome your feedback & ideas as the series evolves.Build a Fully Offline RAG App with Foundry Local: No Cloud Required
A practical guide to building an on-device AI support agent using Retrieval-Augmented Generation, JavaScript, and Microsoft Foundry Local. The Problem: AI That Can't Go Offline Most AI-powered applications today are firmly tethered to the cloud. They assume stable internet, low-latency API calls, and the comfort of a managed endpoint. But what happens when your users are in an environment with zero connectivity a gas pipeline in a remote field, a factory floor, an underground facility? That's exactly the scenario that motivated this project: a fully offline RAG-powered support agent that runs entirely on a laptop. No cloud. No API keys. No outbound network calls. Just a local model, a local vector store, and domain-specific documents all accessible from a browser on any device. The Gas Field Support Agent - running entirely on-device What is RAG and Why Should You Care? Retrieval-Augmented Generation (RAG) is a pattern that makes language models genuinely useful for domain-specific tasks. Instead of hoping the model "knows" the answer from pre-training, you: Retrieve relevant chunks from your own documents Augment the model's prompt with those chunks as context Generate a response grounded in your actual data The result: fewer hallucinations, traceable answers, and an AI that works with your content. If you're building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want. Why fully offline? Data sovereignty, air-gapped environments, field operations, latency-sensitive workflows, and regulatory constraints all demand AI that doesn't phone home. Running everything locally gives you complete control over your data and eliminates any external dependency. The Tech Stack This project is deliberately simple — no frameworks, no build steps, no Docker: Layer Technology Why AI Model Foundry Local + Phi-3.5 Mini Runs locally, OpenAI-compatible API, no GPU needed Backend Node.js + Express Lightweight, fast, universally known Vector Store SQLite via better-sqlite3 Zero infrastructure, single file on disk Retrieval TF-IDF + cosine similarity No embedding model required, fully offline Frontend Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is just four npm packages: express , openai , foundry-local-sdk , and better-sqlite3 . Architecture Overview The system has five layers — all running on a single machine: Five-layer architecture: Client → Server → RAG Pipeline → Data → AI Model Client Layer — A single HTML file served by Express, with quick-action buttons and responsive chat Server Layer — Express.js handles API routes for chat (streaming + non-streaming), document upload, and health checks RAG Pipeline — The chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorization Data Layer — SQLite stores document chunks and their TF-IDF vectors; source docs live as .md files AI Layer — Foundry Local runs Phi-3.5 Mini Instruct on CPU/NPU, exposing an OpenAI-compatible API Getting Started in 5 Minutes You need two prerequisites: Node.js 20+ — nodejs.org Foundry Local — Microsoft's on-device AI runtime: Terminal winget install Microsoft.FoundryLocal Then clone, install, ingest, and run: git clone https://github.com/leestott/local-rag.git cd local-rag npm install npm run ingest # Index the 20 gas engineering documents npm start # Start the server + Foundry Local Open http://127.0.0.1:3000 and start chatting. Foundry Local auto-downloads Phi-3.5 Mini (~2 GB) on first run. How the RAG Pipeline Works Let's trace what happens when a user asks: "How do I detect a gas leak?" RAG query flow: Browser → Server → Vector Store → Model → Streaming response Step 1: Document Ingestion Before any queries happen, npm run ingest reads every .md file from the docs/ folder, splits each into overlapping chunks (~200 tokens, 25-token overlap), computes a TF-IDF vector for each chunk, and stores everything in SQLite. Chunking example docs/01-gas-leak-detection.md → Chunk 1: "Gas Leak Detection – Safety Warnings: Ensure all ignition..." → Chunk 2: "...sources are eliminated. Step-by-step: 1. Perform visual..." → Chunk 3: "...inspection of all joints. 2. Check calibration date..." The overlap ensures no information falls between chunk boundaries — a critical detail in any RAG system. Step 2: Query → Retrieval When the user sends a question, the server converts it into a TF-IDF vector, compares it against every stored chunk using cosine similarity, and returns the top-K most relevant results. For 20 documents (~200 chunks), this executes in under 10ms. src/vectorStore.js /** Retrieve top-K most relevant chunks for a query. */ search(query, topK = 5) { const queryTf = termFrequency(query); const rows = this.db.prepare("SELECT * FROM chunks").all(); const scored = rows.map((row) => { const chunkTf = new Map(JSON.parse(row.tf_json)); const score = cosineSimilarity(queryTf, chunkTf); return { ...row, score }; }); scored.sort((a, b) => b.score - a.score); return scored.slice(0, topK).filter((r) => r.score > 0); } Step 3: Prompt Construction The retrieved chunks are injected into the prompt alongside system instructions: Prompt structure System: You are an offline gas field support agent. Safety-first... Context: [Chunk 1: Gas Leak Detection – Safety Warnings...] [Chunk 2: Gas Leak Detection – Step-by-step...] [Chunk 3: Purging Procedures – Related safety...] User: How do I detect a gas leak? Step 4: Generation + Streaming The prompt is sent to Foundry Local via the OpenAI-compatible API. The response streams back token-by-token through Server-Sent Events (SSE) to the browser: Safety-first response with structured guidance Expandable sources with relevance scores Foundry Local: Your Local AI Runtime Foundry Local is what makes the "offline" part possible. It's a runtime from Microsoft that runs small language models (SLMs) on CPU or NPU — no GPU required. It exposes an OpenAI-compatible API and manages model downloads, caching, and lifecycle automatically. The integration code is minimal if you've used the OpenAI SDK before, this will feel instantly familiar: src/chatEngine.js import { FoundryLocalManager } from "foundry-local-sdk"; import { OpenAI } from "openai"; // Start Foundry Local and load the model const manager = new FoundryLocalManager(); const modelInfo = await manager.init("phi-3.5-mini"); // Use the standard OpenAI client — pointed at the local endpoint const client = new OpenAI({ baseURL: manager.endpoint, apiKey: manager.apiKey, }); // Chat completions work exactly like the cloud API const stream = await client.chat.completions.create({ model: modelInfo.id, messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ], stream: true, }); Portability matters Because Foundry Local uses the OpenAI API format, any code you write here can be ported to Azure OpenAI or OpenAI's cloud API with a single config change. You're not locked in. Why TF-IDF Instead of Embeddings? Most RAG tutorials use embedding models for retrieval. We chose TF-IDF for this project because: Fully offline — no embedding model to download or run Zero latency — vectorization is instantaneous (just math on word frequencies) Good enough — for a curated collection of 20 domain-specific documents, TF-IDF retrieves the right chunks reliably Transparent — you can inspect the vocabulary and weights, unlike neural embeddings For larger collections (thousands of documents) or when semantic similarity matters more than keyword overlap, you'd swap in an embedding model. But for this use case, TF-IDF keeps the stack simple and dependency-free. Mobile-Responsive Field UI Field engineers use this app on phones and tablets often wearing gloves. The UI is designed for harsh conditions with a dark, high-contrast theme, large touch targets (minimum 48px), and horizontally scrollable quick-action buttons. Desktop view Mobile view The entire frontend is a single index.html file — no React, no build step, no bundler. This keeps the project accessible and easy to deploy anywhere. Runtime Document Upload Users can upload new documents without restarting the server. The upload endpoint receives markdown content, chunks it, computes TF-IDF vectors, and inserts the chunks into SQLite — all in memory, immediately available for retrieval. Drag-and-drop document upload with instant indexing Adapt This for Your Own Domain This project is a scenario sample designed to be forked and customized. Here's the three-step process: 1. Replace the Documents Delete the gas engineering docs in docs/ and add your own .md files with optional YAML front-matter: docs/my-procedure.md --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the System Prompt Open src/prompts.js and rewrite the instructions for your domain: src/prompts.js export const SYSTEM_PROMPT = `You are an offline support agent for [YOUR DOMAIN]. Rules: - Only answer using the retrieved context - If the answer isn't in the context, say so - Use structured responses: Summary → Details → Reference `; 3. Tune the Retrieval Adjust chunking and retrieval parameters in src/config.js : src/config.js export const config = { model: "phi-3.5-mini", chunkSize: 200, // smaller = more precise, less context per chunk chunkOverlap: 25, // prevents info from falling between chunks topK: 3, // chunks per query (more = richer context, slower) }; Extending to Multi-Agent Architectures Once you have a working RAG agent, the natural next step is multi-agent orchestration where specialized agents collaborate to handle complex workflows. With Foundry Local's OpenAI-compatible API, you can compose multiple agent roles on the same machine: Multi-agent concept // Each agent is just a different system prompt + RAG scope const agents = { safety: { prompt: safetyPrompt, docs: "safety/*.md" }, diagnosis: { prompt: diagnosisPrompt, docs: "faults/*.md" }, procedure: { prompt: procedurePrompt, docs: "procedures/*.md" }, }; // Router determines which agent handles the query function route(query) { if (query.match(/safety|warning|hazard/i)) return agents.safety; if (query.match(/fault|error|code/i)) return agents.diagnosis; return agents.procedure; } // Each agent uses the same Foundry Local model endpoint const response = await client.chat.completions.create({ model: modelInfo.id, messages: [ { role: "system", content: selectedAgent.prompt }, { role: "system", content: `Context:\n${retrievedChunks}` }, { role: "user", content: userQuery } ], stream: true, }); This pattern lets you build specialized agent pipelines a triage agent routes to the right specialist, each with its own document scope and system prompt, all running on the same local Foundry instance. For production multi-agent systems, explore Microsoft Foundry for cloud-scale orchestration when connectivity is available. Local-first, cloud-ready Start with Foundry Local for development and offline scenarios. When your agents need cloud scale, swap to Azure AI Foundry with the same OpenAI-compatible API your agent code stays the same. Key Takeaways 1 RAG = Retrieve + Augment + Generate Ground your AI in real documents — dramatically reducing hallucination and making answers traceable. 2 Foundry Local makes local AI accessible OpenAI-compatible API running on CPU/NPU. No GPU required. No cloud dependency. 3 TF-IDF + SQLite is viable For small-to-medium document collections, you don't need a dedicated vector database. 4 Same API, local or cloud Build locally with Foundry Local, deploy with Azure OpenAI — zero code changes. What's Next? Embedding-based retrieval — swap TF-IDF for a local embedding model for better semantic matching Conversation memory — persist chat history across sessions Multi-agent routing — specialized agents for safety, diagnostics, and procedures PWA packaging — make it installable as a standalone app on mobile devices Hybrid retrieval — combine keyword search with semantic embeddings for best results Get the code Clone the repo, swap in your own documents, and start building: git clone https://github.com/leestott/local-rag.git github.com/leestott/local-rag — MIT licensed, contributions welcome. Open source under the MIT License. Built with Foundry Local and Node.js.578Views1like0CommentsBuilding High-Performance Agentic Systems
Most enterprise chatbots fail in the same quiet way. They answer questions. They impress in demos. And then they stall in production. Knowledge goes stale. Answers cannot be audited. The system cannot act beyond generating text. When workflows require coordination, execution, or accountability, the chatbot stops being useful. Agentic systems exist because that model is insufficient. Instead of treating the LLM as the product, agentic architecture embeds it inside a bounded control loop: plan → act (tools) → observe → refine The model becomes one component in a runtime system with explicit state management, safety policies, identity enforcement, and operational telemetry. This shift is not speculative. A late-2025 MIT Sloan Management Review / BCG study reports that 35% of organizations have already adopted AI agents, with another 44% planning deployment. Microsoft is advancing open protocols for what it calls the “agentic web,” including Agent-to-Agent (A2A) interoperability and Model Context Protocol (MCP), with integration paths emerging across Copilot Studio and Azure AI Foundry. The real question is no longer whether agents are coming. It is whether enterprise architecture is ready for them. This article translates “agentic” into engineering reality: the runtime layers, latency and cost levers, orchestration patterns, and governance controls required for production deployment. The Core Capabilities of Agentic AI What makes an AI “agentic” is not a single feature—it’s the interaction of different capabilities. Together, they form the minimum set needed to move from “answering” to “operating”. Autonomy – Goal-Driven Task Completion Traditional bots are reactive: they wait for a prompt and produce output. Autonomy introduces a goal state and a control loop. The agent is given an objective (or a trigger) and it can decide the next step without being micromanaged. The critical engineering distinction is that autonomy must be bounded: in production, you implement it with explicit budgets and stop conditions—maximum tool calls, maximum retries, timeouts, and confidence thresholds. The typical execution shape is a loop: plan → act → observe → refine. A project-management agent, for example, doesn’t just answer “what’s the status?” It monitors signals (work items, commits, build health), detects a risk pattern (slippage, dependency blockage), and then either surfaces an alert or prepares a remediation action (re-plan milestones, notify owners). In high-stakes environments, autonomy is usually human-in-the-loop by design: the agent can draft changes, propose next actions, and only execute after approval. Over time, teams expand the autonomy envelope for low-risk actions while keeping approvals for irreversible or financially sensitive operations. Tool Integration – Taking Action and Staying Current A standalone LLM cannot fetch live enterprise state and cannot change it. Tool integration is how an agent becomes operational: it can query systems of record, call APIs, trigger workflows, and produce outputs that reflect the current world rather than the model’s pretraining snapshot. There are two classes of tools that matter in enterprise agents: Retrieval tools (grounding / RAG)When the agent needs facts, it retrieves them. This is the backbone of reducing hallucination: instead of guessing, the agent pulls authoritative content (SharePoint, Confluence, policy repositories, CRM records, Fabric datasets) and uses it as evidence. In practice, retrieval works best when it is engineered as a pipeline: query rewrite (optional) → hybrid search (keyword + vector) → filtering (metadata/ACL) → reranking → compact context injection. The point is not “stuff the prompt with documents,” but “inject only the minimum evidence required to answer accurately.” Action tools (function calling / connectors) These are the hands of the agent: update a CRM record, create a ticket, send an email, schedule a meeting, generate a report, run a pipeline. Tool integration shifts value from “advice” to “execution,” but also introduces risk—so action tools need guardrails: least-privilege permissions, input validation, idempotency keys, and post-condition checks (confirm the update actually happened). In Microsoft ecosystems, this tool plane often maps to Graph actions + business connectors (via Logic Apps/Power Automate) + custom APIs, with Copilot Studio (low code) or Foundry-style runtimes (pro code) orchestrating the calls. Memory (Context & Learning) – Context Awareness and Adaptation “Memory” is not just a long prompt. In agentic systems, memory is an explicit state strategy: Working memory: what the agent has learned during the current run (intermediate tool results, constraints, partial plans). Session memory: what should persist across turns (user preferences, ongoing tasks, summarized history). Long-term memory: enterprise knowledge the agent can retrieve (indexed documents, structured facts, embeddings + metadata). Short-term memory enables multi-step workflows without repeating questions. An HR onboarding agent can carry a new hire’s details from intake through provisioning without re-asking, because the workflow state is persisted and referenced. Long-term “learning” is typically implemented through feedback loops rather than real-time model weight updates: capturing corrections, storing validated outcomes, and periodically improving prompts, routing logic, retrieval configuration, or (where appropriate) fine-tuning. The key design rule is that memory must be policy-aware: retention rules, PII handling, and permission trimming apply to stored state as much as they apply to retrieved documents. Orchestration – Coordinating Multi-Agent Teams Complex enterprise work is rarely single-skill. Orchestration is how agentic systems scale capability without turning one agent into an unmaintainable monolith. The pattern is “manager + specialists”: an orchestrator decomposes the goal into subtasks, routes each to the best tool or sub-agent, and then composes a final response. This can be done sequentially or in parallel. Employee onboarding is a classic: HR intake, IT account creation, equipment provisioning, and training scheduling can run in parallel where dependencies allow. The engineering challenge is making orchestration reliable: defining strict input/output contracts between agents (often structured JSON), handling failures (timeouts, partial completion), and ensuring only one component has authority to send the final user-facing message to avoid conflicting outputs. In Microsoft terms, orchestration can be implemented as agentic flows in Copilot Studio, connected-agent patterns in Foundry, or explicit orchestrators in code using structured tool schemas and shared state. Strategic Impact – How Agentic AI Changes Knowledge Work Agentic AI is no longer an experimental overlay to enterprise systems. It is becoming an embedded operational layer inside core workflows. Unlike earlier chatbot deployments that answered isolated questions, modern enterprise agents execute end-to-end processes, interact with structured systems, maintain context, and operate within governed boundaries. The shift is not about conversational intelligence alone; it is about workflow execution at scale. The transformation becomes clearer when examining real implementations across industries. In legal services, agentic systems have moved beyond document summarization into operational case automation. Assembly Software’s NeosAI, built on Azure AI infrastructure, integrates directly into legal case management systems and automates document analysis, structured data extraction, and first-draft generation of legal correspondence. What makes this deployment impactful is not merely the generative drafting capability, but the integration architecture. NeosAI is not an isolated chatbot; it operates within the same document management systems, billing systems, and communication platforms lawyers already use. Firms report time savings of up to 25 hours per case, with document drafting cycles reduced from days to minutes for first-pass outputs. Importantly, the system runs within secure Azure environments with zero data retention policies, addressing one of the most sensitive concerns in legal AI adoption: client confidentiality. JPMorgan’s COiN platform represents another dimension of legal and financial automation. Instead of conversational assistance, COiN performs structured contract intelligence at production scale. It analyzes more than 12,000 commercial loan agreements annually, extracting over 150 clause attributes per document. Work that previously required approximately 360,000 human hours now executes in seconds. The architecture emphasizes structured NLP pipelines, taxonomy-based clause classification, and private cloud deployment for regulatory compliance. Rather than replacing legal professionals, the system flags unusual clauses for human review, maintaining oversight while dramatically accelerating analysis. Over time, COiN has also served as a knowledge retention mechanism, preserving institutional contract intelligence that would otherwise be lost with employee turnover. In financial services, the impact is similarly structural. Morgan Stanley’s internal AI Assistant allows wealth advisors to query over 100,000 proprietary research documents using natural language. Adoption has reached nearly universal usage across advisor teams, not because it replaces expertise, but because it compresses research time and surfaces insights instantly. Building on this foundation, the firm introduced an AI meeting debrief agent that transcribes client conversations using speech-to-text models and generates CRM notes and follow-up drafts through GPT-based reasoning. Advisors review outputs before finalization, preserving human judgment. The result is faster client engagement and measurable productivity improvements. What differentiates Morgan Stanley’s approach is not only deployment scale, but disciplined evaluation before release. The firm established rigorous benchmarking frameworks to test model outputs against expert standards for accuracy, compliance, and clarity. Only after meeting defined thresholds were systems expanded firmwide. This pattern—evaluation before scale—is becoming a defining trait of successful enterprise agent deployment. Human Resources provides a different perspective on agentic AI. Johnson Controls deployed an AI HR assistant inside Slack to manage policy questions, payroll inquiries, and onboarding support across a global workforce exceeding 100,000 employees. By embedding the agent in a channel employees already use, adoption barriers were reduced significantly. The result was a 30–40% reduction in live HR call volume, allowing HR teams to redirect focus toward strategic workforce initiatives. Similarly, Ciena integrated an AI assistant directly into Microsoft Teams, unifying HR and IT support through a single conversational interface. Employees no longer navigate separate portals; the agent orchestrates requests across backend systems such as Workday and ServiceNow. The technical lesson here is clear: integration breadth drives usability, and usability drives adoption. Engineering and IT operations reveal perhaps the most technically sophisticated application of agentic AI: multi-agent orchestration. In a proof-of-concept developed through collaboration between Microsoft and ServiceNow, an AI-driven incident response system coordinates multiple agents during high-priority outages. Microsoft 365 Copilot transcribes live war-room discussions and extracts action items, while ServiceNow’s Now Assist executes operational updates within IT service management systems. A Semantic Kernel–based manager agent maintains shared context and synchronizes activity across platforms. This eliminates the longstanding gap between real-time discussion and structured documentation, automatically generating incident reports while freeing engineers to focus on remediation rather than clerical tasks. The system demonstrates that orchestration is not conceptual—it is operational. Across these examples, the pattern is consistent. Agentic AI changes knowledge work by absorbing structured cognitive labor: document parsing, compliance classification, research synthesis, workflow routing, transcription, and task coordination. Humans remain essential for judgment, ethics, and accountability, but the operational layer increasingly runs through AI-mediated execution. The result is not incremental productivity improvement; it is structural acceleration of knowledge processes. Design and Governance Challenges – Managing the Risks As agentic AI shifts from answering questions to executing workflows, governance must mature accordingly. These systems retrieve enterprise data, invoke APIs, update records, and coordinate across platforms. That makes them operational actors inside your architecture—not just assistants. The primary shift is this: autonomy increases responsibility. Agents must be observable. Every retrieval, reasoning step, and tool invocation should be traceable. Without structured telemetry and audit trails, enterprises lose visibility into why an agent acted the way it did. Agents must also operate within scoped authority. Least-privilege access, role-based identity, and bounded credentials are essential. An HR agent should not access finance systems. A finance agent should not modify compliance data without policy constraints. Autonomy only works when it is deliberately constrained. Execution boundaries are equally critical. High-risk actions—financial approvals, legal submissions, production changes—should include embedded thresholds or human approval gates. Autonomy should be progressive, not absolute. Cost and performance must be governed just like cloud infrastructure. Agentic systems can trigger recursive calls and model loops. Without usage monitoring, rate limits, and model-tier routing, compute consumption can escalate unpredictably. Finally, agentic systems require continuous evaluation. Real-world testing, live monitoring, and drift detection ensure the system remains aligned with business rules and compliance requirements. These are not “set and forget” deployments. In short, agentic AI becomes sustainable only when autonomy is paired with observability, scoped authority, embedded guardrails, cost control, and structured oversight. Conclusion – Towards the Agentic Enterprise The organizations achieving meaningful returns from agentic AI share a common pattern. They do not treat AI agents as experimental tools. They design them as production systems with defined roles, scoped authority, measurable KPIs, embedded observability, and formal governance layers. When autonomy is paired with integration, memory, orchestration, and governance discipline, agentic AI becomes more than automation—it becomes an operational architecture. Enterprises that master this architecture are not merely reducing costs; they are redefining how knowledge work is executed. In this emerging model, human professionals focus on strategic judgment and innovation, while AI agents manage structured cognitive execution at scale. The competitive advantage will not belong to those who deploy the most AI, but to those who deploy it with architectural rigor and governance maturity. Before we rush to deploy more agents, a few questions are worth asking: If an AI agent executes a workflow in your enterprise today, can you trace every reasoning step and tool invocation behind that decision? Does your architecture treat AI as a conversational layer - or as an operational actor with scoped identity, cost controls, and policy enforcement? Where should autonomy stop in your organization - and who defines that boundary? Agentic AI is not just a capability shift. It is an architectural decision. Curious to hear how others are designing their control planes and orchestration layers. References MIT Sloan – “Agentic AI, Explained” by Beth Stackpole: A foundational overview of agentic AI, its distinction from traditional generative AI, and its implications for enterprise workflows, governance, and strategy. Microsoft TechCommunity – “Introducing Multi-Agent Orchestration in Foundry Agent Service”: Details Microsoft’s multi-agent orchestration capabilities, including Connected Agents, Multi-Agent Workflows, and integration with A2A and MCP protocols. Microsoft Learn – “Extend the Capabilities of Your Agent – Copilot Studio”: Explains how to build and extend custom agents in Microsoft Copilot Studio using tools, connectors, and enterprise data sources. Assembly Software’s NeosAI case – Microsoft Customer Stories JPMorgan COiN platform – GreenData Case Study HR support AI (Johnson Controls, Ciena, Databricks) – Moveworks case studies ServiceNow & Semantic Kernel multi-agent P1 Incident – Microsoft Semantic Kernel BlogLevel up your Python + AI skills with our complete series
We've just wrapped up our live series on Python + AI, a comprehensive nine-part journey diving deep into how to use generative AI models from Python. The series introduced multiple types of models, including LLMs, embedding models, and vision models. We dug into popular techniques like RAG, tool calling, and structured outputs. We assessed AI quality and safety using automated evaluations and red-teaming. Finally, we developed AI agents using popular Python agents frameworks and explored the new Model Context Protocol (MCP). To help you apply what you've learned, all of our code examples work with GitHub Models, a service that provides free models to every GitHub account holder for experimentation and education. Even if you missed the live series, you can still access all the material using the links below! If you're an instructor, feel free to use the slides and code examples in your own classes. If you're a Spanish speaker, check out the Spanish version of the series. Python + AI: Large Language Models 📺 Watch recording In this session, we explore Large Language Models (LLMs), the models that power ChatGPT and GitHub Copilot. We use Python to interact with LLMs using popular packages like the OpenAI SDK and LangChain. We experiment with prompt engineering and few-shot examples to improve outputs. We also demonstrate how to build a full-stack app powered by LLMs and explain the importance of concurrency and streaming for user-facing AI apps. Slides for this session Code repository with examples: python-openai-demos Python + AI: Vector embeddings 📺 Watch recording In our second session, we dive into a different type of model: the vector embedding model. A vector embedding is a way to encode text or images as an array of floating-point numbers. Vector embeddings enable similarity search across many types of content. In this session, we explore different vector embedding models, such as the OpenAI text-embedding-3 series, through both visualizations and Python code. We compare distance metrics, use quantization to reduce vector size, and experiment with multimodal embedding models. Slides for this session Code repository with examples: vector-embedding-demos Python + AI: Retrieval Augmented Generation 📺 Watch recording In our third session, we explore one of the most popular techniques used with LLMs: Retrieval Augmented Generation. RAG is an approach that provides context to the LLM, enabling it to deliver well-grounded answers for a particular domain. The RAG approach works with many types of data sources, including CSVs, webpages, documents, and databases. In this session, we walk through RAG flows in Python, starting with a simple flow and culminating in a full-stack RAG application based on Azure AI Search. Slides for this session Code repository with examples: python-openai-demos Python + AI: Vision models 📺 Watch recording Our fourth session is all about vision models! Vision models are LLMs that can accept both text and images, such as GPT-4o and GPT-4o mini. You can use these models for image captioning, data extraction, question answering, classification, and more! We use Python to send images to vision models, build a basic chat-with-images app, and create a multimodal search engine. Slides for this session Code repository with examples: openai-chat-vision-quickstart Python + AI: Structured outputs 📺 Watch recording In our fifth session, we discover how to get LLMs to output structured responses that adhere to a schema. In Python, all you need to do is define a Pydantic BaseModel to get validated output that perfectly meets your needs. We focus on the structured outputs mode available in OpenAI models, but you can use similar techniques with other model providers. Our examples demonstrate the many ways you can use structured responses, such as entity extraction, classification, and agentic workflows. Slides for this session Code repository with examples: python-openai-demos Python + AI: Quality and safety 📺 Watch recording This session covers a crucial topic: how to use AI safely and how to evaluate the quality of AI outputs. There are multiple mitigation layers when working with LLMs: the model itself, a safety system on top, the prompting and context, and the application user experience. We focus on Azure tools that make it easier to deploy safe AI systems into production. We demonstrate how to configure the Azure AI Content Safety system when working with Azure AI models and how to handle errors in Python code. Then we use the Azure AI Evaluation SDK to evaluate the safety and quality of output from your LLM. Slides for this session Code repository with examples: ai-quality-safety-demos Python + AI: Tool calling 📺 Watch recording In the final part of the series, we focus on the technologies needed to build AI agents, starting with the foundation: tool calling (also known as function calling). We define tool call specifications using both JSON schema and Python function definitions, then send these definitions to the LLM. We demonstrate how to properly handle tool call responses from LLMs, enable parallel tool calling, and iterate over multiple tool calls. Understanding tool calling is absolutely essential before diving into agents, so don't skip over this foundational session. Slides for this session Code repository with examples: python-openai-demos Python + AI: Agents 📺 Watch recording In the penultimate session, we build AI agents! We use Python AI agent frameworks such as the new agent-framework from Microsoft and the popular LangGraph framework. Our agents start simple and then increase in complexity, demonstrating different architectures such as multiple tools, supervisor patterns, graphs, and human-in-the-loop workflows. Slides for this session Code repository with examples: python-ai-agent-frameworks-demos Python + AI: Model Context Protocol 📺 Watch recording In the final session, we dive into the hottest technology of 2025: MCP (Model Context Protocol). This open protocol makes it easy to extend AI agents and chatbots with custom functionality, making them more powerful and flexible. We demonstrate how to use the Python FastMCP SDK to build an MCP server running locally and consume that server from chatbots like GitHub Copilot. Then we build our own MCP client to consume the server. Finally, we discover how easy it is to connect AI agent frameworks like LangGraph and Microsoft agent-framework to MCP servers. With great power comes great responsibility, so we briefly discuss the security risks that come with MCP, both as a user and as a developer. Slides for this session Code repository with examples: python-mcp-demo8.1KViews3likes0Comments