Microsoft Developer Community Blog

Build a Fully Offline AI App with Foundry Local and CAG

Lee_Stott
Apr 02, 2026
A hands-on guide to building an on-device AI support agent using Context-Augmented Generation, JavaScript, and Foundry Local.

You have probably heard the AI pitch: "just call our API." But what happens when your application needs to work without an internet connection? Perhaps your users are field engineers standing next to a pipeline in the middle of nowhere, or your organisation has strict data privacy requirements, or you simply want to build something that works without a cloud bill.

This post walks you through how to build a fully offline, on-device AI application using Foundry Local and a pattern called Context-Augmented Generation (CAG). By the end, you will have a clear understanding of what CAG is, how it compares to RAG, and the practical steps to build your own solution.

Screenshot of the Gas Field Support Agent landing page, showing a dark-themed chat interface with quick-action buttons for common questions

The finished application: a browser-based AI support agent that runs entirely on your machine.

What Is Context-Augmented Generation?

Context-Augmented Generation (CAG) is a pattern for making AI models useful with your own domain-specific content. Instead of hoping the model "knows" the answer from its training data, you pre-load your entire knowledge base into the model's context window at startup. Every query the model handles has access to all of your documents, all of the time.

The flow is straightforward:

  1. Load your documents into memory when the application starts.
  2. Inject the most relevant documents into the prompt alongside the user's question.
  3. Generate a response grounded in your content.

There is no retrieval pipeline, no vector database, and no embedding model. Your documents are read from disk, held in memory, and selected per query using simple keyword scoring. The model generates answers grounded in your content rather than relying on what it learnt during training.
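That per-query selection step can be sketched in a few lines of JavaScript. This is an illustrative sketch, not the sample's actual code; `scoreDocument` and `selectDocuments` are hypothetical names:

```javascript
// Hypothetical keyword-scoring selection: count how many query words
// appear in each document, then keep the highest-scoring few.
function scoreDocument(doc, query) {
  const words = query.toLowerCase().split(/\W+/).filter(w => w.length > 2);
  const haystack = (doc.title + " " + doc.content).toLowerCase();
  return words.reduce((score, w) => score + (haystack.includes(w) ? 1 : 0), 0);
}

function selectDocuments(docs, query, topN = 3) {
  return docs
    .map(doc => ({ doc, score: scoreDocument(doc, query) }))
    .filter(e => e.score > 0)                 // drop documents with no match
    .sort((a, b) => b.score - a.score)        // best matches first
    .slice(0, topN)
    .map(e => e.doc);
}
```

Because this is plain string matching, it is fast and trivially debuggable, which is exactly the trade-off CAG makes against semantic search.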

CAG vs RAG: Understanding the Trade-offs

If you have explored AI application patterns before, you have likely encountered Retrieval-Augmented Generation (RAG). Both CAG and RAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations.

CAG (Context-Augmented Generation)

How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt.

Strengths:

  • Drastically simpler architecture with no vector database, no embeddings, and no retrieval pipeline
  • Works fully offline with no external services
  • Minimal dependencies (just two npm packages in this sample)
  • Near-instant document selection with no embedding latency
  • Easy to set up, debug, and reason about

Limitations:

  • Constrained by the model's context window size
  • Best suited to small, curated document sets (tens of documents, not thousands)
  • Keyword scoring is less precise than semantic similarity for ambiguous queries
  • Adding documents requires an application restart

RAG (Retrieval-Augmented Generation)

How it works: Documents are chunked, embedded into vectors, and stored in a database. At query time, the most semantically similar chunks are retrieved and injected into the prompt.

Strengths:

  • Scales to thousands or millions of documents
  • Semantic search finds relevant content even when the user's wording differs from the source material
  • Documents can be added or updated dynamically without restarting
  • Fine-grained retrieval (chunk-level) can be more token-efficient for large collections

Limitations:

  • More complex architecture: requires an embedding model, a vector database, and a chunking strategy
  • Retrieval quality depends heavily on chunking, embedding model choice, and tuning
  • Additional latency from the embedding and search steps
  • More dependencies and infrastructure to manage

Want to compare these patterns hands-on? There is a RAG-based implementation of the same gas field scenario using vector search and embeddings. Clone both repositories, run them side by side, and see how the architectures differ in practice.

When Should You Choose Which?

| Consideration       | Choose CAG                         | Choose RAG                               |
|---------------------|------------------------------------|------------------------------------------|
| Document count      | Tens of documents                  | Hundreds or thousands                    |
| Offline requirement | Essential                          | Optional (can run locally too)           |
| Setup complexity    | Minimal                            | Moderate to high                         |
| Document updates    | Infrequent (restart to reload)     | Frequent or dynamic                      |
| Query precision     | Good for keyword-matchable content | Better for semantically diverse queries  |
| Infrastructure      | None beyond the runtime            | Vector database, embedding model         |

For the sample application in this post (20 gas engineering procedure documents on a local machine), CAG is the clear winner. If your use case grows to hundreds of documents or requires real-time ingestion, RAG becomes the better choice. Both patterns can run offline using Foundry Local.

Foundry Local: Your On-Device AI Runtime

Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download).

In this sample, your application is responsible for deciding which model to use, and it does that through the foundry-local-sdk. The app creates a FoundryLocalManager, asks the SDK for the local model catalogue, and then runs a small selection policy from src/modelSelector.js. That policy looks at the machine's available RAM, filters out models that are too large, ranks the remaining chat models by preference, and then returns the best fit for that device.

Why does it work this way? Because shipping one fixed model would either exclude lower-spec machines or underuse more capable ones. A 14B model may be perfectly reasonable on a 32 GB workstation, but the same choice would be slow or unusable on an 8 GB laptop. By selecting at runtime, the same codebase can run across a wider range of developer machines without manual tuning.

What makes it particularly useful for developers:

  • No GPU required — runs on CPU or NPU, making it accessible on standard laptops and desktops
  • Native SDK bindings — in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server
  • Automatic model management — downloads, caches, and loads models automatically
  • Dynamic model selection — the SDK can evaluate your device's available RAM and pick the best model from the catalogue
  • Real-time progress callbacks — ideal for building loading UIs that show download and initialisation progress

The integration code is refreshingly minimal. Here is the core pattern:

import { FoundryLocalManager } from "foundry-local-sdk";

// Create a manager and get the model catalogue
const manager = FoundryLocalManager.create({ appName: "my-app" });

// Auto-select the best model for this device based on available RAM
// (selectBestModel is the selection policy from src/modelSelector.js)
const models = await manager.catalog.getModels();
const model = selectBestModel(models);

// Download if not cached, then load into memory
if (!model.isCached) {
  await model.download((progress) => {
    console.log(`Download: ${progress.toFixed(0)}%`);
  });
}
await model.load();

// Create a chat client for direct in-process inference
const chatClient = model.createChatClient();
const response = await chatClient.completeChat([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "How do I detect a gas leak?" }
]);

That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application.

The download step matters for a simple reason: offline inference only works once the model files exist locally. The SDK checks whether the chosen model is already cached on the machine. If it is not, the application asks Foundry Local to download it once, store it locally, and then load it into memory. After that first run, the cached model can be reused, which is why subsequent launches are much faster and can operate without any network dependency.

Put another way, there are two cooperating pieces here. Your application chooses which model is appropriate for the device and the scenario. Foundry Local and its SDK handle the mechanics of making that model available locally, caching it, loading it, and exposing a chat client for inference. That separation keeps the application logic clear whilst letting the runtime handle the heavy lifting.

The Technology Stack

The sample application is deliberately simple. No frameworks, no build steps, no Docker:

| Layer     | Technology                           | Purpose                                                                  |
|-----------|--------------------------------------|--------------------------------------------------------------------------|
| AI Model  | Foundry Local + auto-selected model  | Runs locally via native SDK bindings; best model chosen for your device  |
| Back end  | Node.js + Express                    | Lightweight HTTP server, familiar to most developers                     |
| Context   | Markdown files pre-loaded at startup | No vector database, no embeddings, no retrieval step                     |
| Front end | Single HTML file with inline CSS     | No build step, mobile-responsive, field-ready                            |

The total dependency footprint is two npm packages: express and foundry-local-sdk.

Architecture Overview

Architecture diagram showing four layers: Client (HTML/CSS/JS), Server (Express.js), CAG Engine (document loading, keyword scoring, prompt construction), and AI Layer (Foundry Local with in-process inference)

The four-layer architecture, all running on a single machine.

The system has four layers, all running in a single process on your device:

  • Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface
  • Server layer: Express.js starts immediately and serves the UI plus an SSE status endpoint; API routes handle chat (streaming and non-streaming), context listing, and health checks
  • CAG engine: loads all domain documents at startup, selects the most relevant ones per query using keyword scoring, and injects them into the prompt
  • AI layer: Foundry Local runs the auto-selected model on CPU/NPU via native SDK bindings (in-process inference, no HTTP round-trips)

Building the Solution Step by Step

Prerequisites

You need two things installed on your machine:

  1. Node.js 20 or later: download from nodejs.org
  2. Foundry Local: Microsoft's on-device AI runtime:
    winget install Microsoft.FoundryLocal

Foundry Local will automatically select and download the best model for your device the first time you run the application. You can override this by setting the FOUNDRY_MODEL environment variable to a specific model alias.

Getting the Code Running

 
# Clone the repository
git clone https://github.com/leestott/local-cag.git
cd local-cag

# Install dependencies
npm install

# Start the server
npm start
 

Open http://127.0.0.1:3000 in your browser. You will see a loading overlay with a progress bar whilst the model downloads (first run only) and loads into memory. Once the model is ready, the overlay fades away and you can start chatting.

Desktop view of the application showing the chat interface with quick-action buttons

Desktop view

Mobile view of the application showing the responsive layout on a smaller screen

Mobile view

How the CAG Pipeline Works

Let us trace what happens when a user asks: "How do I detect a gas leak?"

Sequence diagram showing the CAG query flow: user sends a question, the server selects relevant documents, constructs a prompt, sends it to Foundry Local, and streams the response back

The query flow from browser to model and back.

1. Server starts and loads documents

When you run npm start, the Express server starts on port 3000. All .md files in the docs/ folder are read, parsed (with optional YAML front-matter for title, category, and ID), and grouped by category. A document index is built listing all available topics.

2. Model is selected and loaded

The model selector evaluates your system's available RAM and picks the best model from the Foundry Local catalogue. If the model is not already cached, it downloads it (with progress streamed to the browser via SSE). The model is then loaded into memory for in-process inference.

3. User sends a question

The question arrives at the Express server. The chat engine selects the top 3 most relevant documents using keyword scoring.

4. Prompt is constructed

The engine builds a messages array containing: the system prompt (with safety-first instructions), the document index (so the model knows all available topics), the 3 selected documents (approximately 6,000 characters), the conversation history, and the user's question.
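Assembled as code, that messages array might look like the sketch below. The function name `buildMessages` and the exact field layout are assumptions for illustration, not the sample's verbatim implementation:

```javascript
// Illustrative prompt assembly: system instructions first, then the topic
// index, then the selected documents, then history, then the new question.
function buildMessages(systemPrompt, docIndex, selectedDocs, history, question) {
  const context = selectedDocs
    .map(d => `## ${d.title}\n${d.content}`)
    .join("\n\n");
  return [
    { role: "system", content: systemPrompt },
    { role: "system", content: `Available topics:\n${docIndex}` },
    { role: "system", content: `Context documents:\n${context}` },
    ...history,                              // prior { role, content } turns
    { role: "user", content: question },
  ];
}
```

Keeping the document index separate from the selected documents lets the model say "that topic exists but was not retrieved" instead of hallucinating.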

5. Model generates a grounded response

The prompt is sent to the locally loaded model via the Foundry Local SDK's native bindings. The response streams back token by token through Server-Sent Events to the browser.

Chat response showing safety warnings followed by step-by-step gas leak detection guidance

A response with safety warnings and step-by-step guidance

Sources panel showing the specific documents referenced in the response

The sources panel shows which documents were used

Key Code Walkthrough

Loading Documents (the Context Module)

The context module reads all markdown files from the docs/ folder at startup. Each document can have optional YAML front-matter for metadata:

 
// src/context.js
import fs from "node:fs";
import path from "node:path";
// `config` and `parseFrontMatter` are defined elsewhere in the sample

export function loadDocuments() {
  const files = fs.readdirSync(config.docsDir)
    .filter(f => f.endsWith(".md"))
    .sort();

  const docs = [];
  for (const file of files) {
    const raw = fs.readFileSync(path.join(config.docsDir, file), "utf-8");
    const { meta, body } = parseFrontMatter(raw);
    docs.push({
      id: meta.id || path.basename(file, ".md"),
      title: meta.title || file,
      category: meta.category || "General",
      content: body.trim(),
    });
  }
  return docs;
}
 

There is no chunking, no vector computation, and no database. The documents are held in memory as plain text.
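The `parseFrontMatter` helper is not shown in the excerpt above. A minimal sketch, assuming simple `key: value` pairs between `---` delimiters (no nested YAML), could look like this:

```javascript
// Minimal front-matter parser sketch; the sample's actual implementation
// may differ. Splits "---\nkey: value\n---" metadata from the body.
function parseFrontMatter(raw) {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: raw };   // no front-matter present
  const meta = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: raw.slice(match[0].length) };
}
```

Because the metadata fields all have fallbacks in `loadDocuments`, documents without any front-matter still load cleanly.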

Dynamic Model Selection

Rather than hard-coding a model, the application evaluates your system at runtime:

 
// src/modelSelector.js
import os from "node:os";

const totalRamMb = os.totalmem() / (1024 * 1024);
const budgetMb = totalRamMb * 0.6; // Use up to 60% of system RAM

// Filter to models that fit, rank by quality, boost cached models
const candidates = allModels.filter(m =>
  m.task === "chat-completion" &&
  m.fileSizeMb <= budgetMb
);

// Returns the best model: e.g. phi-4 on a 32 GB machine,
// or phi-3.5-mini on a laptop with 8 GB RAM
 

This means the same application runs on a powerful workstation (selecting a 14B parameter model) or a constrained laptop (selecting a 3.8B model), with no code changes required.

This is worth calling out because it is one of the most practical parts of the sample. Developers do not have to decide up front which single model every user should run. The application makes that decision at startup based on the hardware budget you set, then asks Foundry Local to fetch the model if it is missing. The result is a smoother first-run experience and fewer support headaches when the same app is used on mixed hardware.
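The ranking step elided in the snippet above ("rank by quality, boost cached models") could be sketched as follows. The preference list and the cached-model boost are assumptions about the policy, not the sample's exact logic:

```javascript
// Hypothetical ranking: prefer already-cached models (no download needed),
// then fall back to a quality preference order.
const PREFERENCE = ["phi-4", "phi-3.5-mini"]; // higher quality first (assumed)

function preferenceRank(model) {
  const i = PREFERENCE.findIndex(alias => model.alias.startsWith(alias));
  return i === -1 ? PREFERENCE.length : i;    // unknown models rank last
}

function pickBestModel(candidates) {
  return [...candidates].sort((a, b) => {
    if (a.isCached !== b.isCached) return a.isCached ? -1 : 1;
    return preferenceRank(a) - preferenceRank(b);
  })[0] ?? null;                              // null if nothing fits the budget
}
```

The cached-model boost matters in practice: a slightly smaller model that is already on disk gives a far better first impression than a marginally better one behind a multi-gigabyte download.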

The System Prompt

For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses:

 
// src/prompts.js
export const SYSTEM_PROMPT = `You are a local, offline support agent
for gas field inspection and maintenance engineers.

Behaviour Rules:
- Always prioritise safety. If a procedure involves risk,
  explicitly call it out.
- Do not hallucinate procedures, measurements, or tolerances.
- If the answer is not in the provided context, say:
  "This information is not available in the local knowledge base."

Response Format:
- Summary (1-2 lines)
- Safety Warnings (if applicable)
- Step-by-step Guidance
- Reference (document name + section)`;
 

This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling.

Adapting This for Your Own Domain

The sample project is designed to be forked and adapted. Here is how to make it yours in three steps:

1. Replace the documents

Delete the gas engineering documents in docs/ and add your own markdown files. The context module handles any markdown content with optional YAML front-matter:

 
---
title: Troubleshooting Widget Errors
category: Support
id: KB-001
---

# Troubleshooting Widget Errors
...your content here...
 

2. Edit the system prompt

Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations.

3. Override the model (optional)

By default the application auto-selects the best model. To force a specific model:

 
# See available models
foundry model list

# Force a smaller, faster model
FOUNDRY_MODEL=phi-3.5-mini npm start

# Or a larger, higher-quality model
FOUNDRY_MODEL=phi-4 npm start
 

Smaller models give faster responses on constrained devices. Larger models give better quality. The auto-selector picks the largest model that fits within 60% of your system RAM.
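The 60% budget rule is simple enough to verify by hand; this tiny illustrative helper works through the numbers:

```javascript
// Worked example of the 60% RAM budget rule (illustrative).
function ramBudgetMb(totalRamGb) {
  return totalRamGb * 1024 * 0.6;
}
// ramBudgetMb(32) → 19660.8 MB; ramBudgetMb(8) → 4915.2 MB.
// A far larger model fits the workstation budget than the laptop's.
```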

Building a Field-Ready UI

The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy.

Design decisions that matter for field use:

  • Dark, high-contrast theme with 18px base font size for readability in bright sunlight
  • Large touch targets (minimum 48px) for operation with gloves or PPE
  • Quick-action buttons for common questions, so engineers do not need to type on a phone
  • Responsive layout that works from 320px to 1920px+ screen widths
  • Streaming responses via SSE, so the user sees tokens arriving in real time

Mobile view of the chat interface showing a conversation with the AI agent on a small screen

The mobile chat experience, optimised for field use.

Visual Startup Progress with SSE

A standout feature of this application is the loading experience. When the user opens the browser, they see a progress overlay showing exactly what the application is doing:

  1. Loading domain documents
  2. Initialising the Foundry Local SDK
  3. Selecting the best model for the device
  4. Downloading the model (with a percentage progress bar, first run only)
  5. Loading the model into memory

This works because the Express server starts before the model finishes loading. The browser connects immediately and receives real-time status updates via Server-Sent Events. Chat endpoints return 503 whilst the model is loading, so the UI cannot send queries prematurely.

 
// Server-side: broadcast status to all connected browsers
// (statusClients is the set of open SSE response streams; initState
//  caches the latest status for clients that connect late)
function broadcastStatus(state) {
  initState = state;
  const payload = `data: ${JSON.stringify(state)}\n\n`;
  for (const client of statusClients) {
    client.write(payload);
  }
}

// During initialisation:
broadcastStatus({ stage: "downloading", message: "Downloading phi-4...", progress: 42 });
 

This pattern is worth adopting in any application where model loading takes more than a few seconds. Users should never stare at a blank screen wondering whether something is broken.
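On the browser side, the counterpart is a small EventSource handler. The `/api/status` path and the `ui` object here are assumptions for illustration:

```javascript
// Illustrative client-side handler for the SSE status stream.
// Returns true once the "ready" stage arrives so the caller can close it.
function handleStatusEvent(event, ui) {
  const state = JSON.parse(event.data);   // e.g. { stage, message, progress }
  ui.update(state);                       // hypothetical overlay helper
  return state.stage === "ready";
}

// In the browser (sketch):
// const source = new EventSource("/api/status");
// source.onmessage = (e) => { if (handleStatusEvent(e, overlay)) source.close(); };
```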

Testing

The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed:

 
# Run all tests
npm test
 

Tests cover configuration, server endpoints, and document loading. Use them as a starting point when you adapt the project for your own domain.

Ideas for Extending the Project

Once you have the basics running, there are plenty of directions to explore:

  • Conversation memory: persist chat history across sessions using local storage or a lightweight database
  • Hybrid CAG + RAG: add a vector retrieval step for larger document collections that exceed the context window
  • Multi-modal support: add image-based queries (photographing a fault code, for example)
  • PWA packaging: make it installable as a standalone offline application on mobile devices
  • Custom model fine-tuning: fine-tune a model on your domain data for even better answers

Ready to Build Your Own?

Clone the CAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the RAG approach to see which pattern suits your use case best.

Get the CAG Sample    Get the RAG Sample

Summary

Building a local AI application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and a set of domain documents, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own content.

The key takeaways:

  1. CAG is ideal for small, curated document sets where simplicity and offline capability matter most. No vector database, no embeddings, no retrieval pipeline.
  2. RAG scales further when you have hundreds or thousands of documents, or need semantic search for ambiguous queries. See the local-rag sample to compare.
  3. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required.
  4. The architecture is transferable. Replace the gas engineering documents with your own content, update the system prompt, and you have a domain-specific AI agent for any field.
  5. Start simple, iterate outwards. Begin with CAG and a handful of documents. If your needs outgrow the context window, graduate to RAG. Both patterns can run entirely offline.

Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code.

Updated Mar 13, 2026
Version 1.0