Microsoft Developer Community Blog

Building an Offline AI Interview Coach with Foundry Local, RAG, and SQLite

Lee_Stott
Mar 27, 2026

How to build a 100% offline, AI-powered interview preparation tool using Microsoft Foundry Local, Retrieval-Augmented Generation, and nothing but JavaScript.

Tags: Foundry Local · 100% Offline · RAG + TF-IDF · JavaScript / Node.js

Introduction

Imagine preparing for a job interview with an AI assistant that knows your CV inside and out, understands the job you're applying for, and generates tailored questions, all without ever sending your data to the cloud. That's exactly what Interview Doctor does.

Interview Doctor - Landing Page

Interview Doctor's web UI, a polished, dark-themed interface running entirely on your local machine.

In this post, I'll walk you through how I built an interview prep tool as a fully offline JavaScript application using:

  • Foundry Local — Microsoft's on-device AI runtime
  • SQLite — for storing document chunks and TF-IDF vectors
  • RAG (Retrieval-Augmented Generation) — to ground the AI in your actual documents
  • Express.js — for the web server
  • Node.js built-in test runner — for testing with zero extra dependencies

No cloud. No API keys. No internet required. Everything runs on your machine.

What is RAG and Why Does It Matter?

Retrieval-Augmented Generation (RAG) is a pattern that makes AI models dramatically more useful for domain-specific tasks. Instead of relying solely on what a model learned during training (which can be outdated or generic), RAG:

  1. Retrieves relevant chunks from your own documents
  2. Augments the model's prompt with those chunks as context
  3. Generates a response grounded in your actual data

For Interview Doctor, this means the AI doesn't just ask generic interview questions; it asks questions tailored to your CV, your experience, and the exact job you're applying for.

Why Offline RAG?

Privacy is the obvious benefit: your CV and job applications never leave your device. But there's more:

  • No API costs — run as many queries as you want
  • No rate limits — iterate rapidly during your prep
  • Works anywhere — on a plane, in a café with bad Wi-Fi, anywhere
  • Consistent performance — no cold starts, no API latency

Architecture Overview

Interview Doctor Architecture Diagram

Complete architecture showing all components and data flow.

The application has two interfaces (CLI and Web) that share the same core engine:

  1. Document Ingestion — PDFs and markdown files are chunked and indexed
  2. Vector Store — SQLite stores chunks with TF-IDF vectors
  3. Retrieval — queries are matched against stored chunks using cosine similarity
  4. Generation — relevant chunks are injected into the prompt sent to the local LLM

Step 1: Setting Up Foundry Local

First, install Foundry Local:

# Windows
winget install Microsoft.FoundryLocal

# macOS
brew install microsoft/foundrylocal/foundrylocal

The JavaScript SDK handles everything else — starting the service, downloading the model, and connecting:

import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";

const manager = new FoundryLocalManager();
const modelInfo = await manager.init("phi-3.5-mini");

// Foundry Local exposes an OpenAI-compatible API
const openai = new OpenAI({
  baseURL: manager.endpoint,  // Dynamic port, discovered by SDK
  apiKey: manager.apiKey,
});

 

⚠️ Key Insight

Foundry Local uses a dynamic port, so never hardcode localhost:5272. Always use manager.endpoint, which the SDK discovers at runtime.

Step 2: Building the RAG Pipeline

Document Chunking

Documents are split into overlapping chunks of ~200 tokens. The overlap ensures important context isn't lost at chunk boundaries:

export function chunkText(text, maxTokens = 200, overlapTokens = 25) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) return [text.trim()];

  const chunks = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + maxTokens, words.length);
    chunks.push(words.slice(start, end).join(" "));
    if (end >= words.length) break;
    start = end - overlapTokens;
  }
  return chunks;
}

Why 200 tokens with a 25-token overlap? Small chunks keep the retrieved context compact for the model's limited context window, and the overlap prevents information loss at boundaries. It's all pure string operations: no dependencies needed.
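As a quick sanity check of those numbers, here's chunkText (repeated from above so the snippet runs standalone) applied to a synthetic 450-word input:

```javascript
// chunkText repeated verbatim from above so this snippet runs standalone.
function chunkText(text, maxTokens = 200, overlapTokens = 25) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) return [text.trim()];

  const chunks = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + maxTokens, words.length);
    chunks.push(words.slice(start, end).join(" "));
    if (end >= words.length) break;
    start = end - overlapTokens;
  }
  return chunks;
}

// 450 synthetic "words" -> chunks of 200 with a 25-word overlap.
const text = Array.from({ length: 450 }, (_, i) => `word${i}`).join(" ");
const chunks = chunkText(text, 200, 25);
console.log(chunks.length); // 3

// The last 25 words of chunk 0 are exactly the first 25 words of chunk 1.
const tail = chunks[0].split(" ").slice(-25).join(" ");
const head = chunks[1].split(" ").slice(0, 25).join(" ");
console.log(tail === head); // true
```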

TF-IDF Vectors

Instead of using a separate embedding model (which would consume precious memory alongside the LLM), we use TF-IDF, a classic information retrieval technique:

 
export function termFrequency(text) {
  const tf = new Map();
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9\-']/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 1);
  for (const t of tokens) {
    tf.set(t, (tf.get(t) || 0) + 1);
  }
  return tf;
}

export function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Each document chunk becomes a sparse vector of word frequencies. At query time, we compute cosine similarity between the query vector and all stored chunk vectors to find the most relevant matches.
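Here's a toy end-to-end check of that claim (both helpers are repeated from above so the snippet runs on its own; the example strings are made up):

```javascript
// Both helpers repeated verbatim from above so this snippet runs standalone.
function termFrequency(text) {
  const tf = new Map();
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9\-']/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 1);
  for (const t of tokens) {
    tf.set(t, (tf.get(t) || 0) + 1);
  }
  return tf;
}

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A query sharing the term "team" with chunk `a` outranks an unrelated chunk.
const query = termFrequency("tell me about your team leadership");
const a = termFrequency("Led a team of five engineers");
const b = termFrequency("Enjoys hiking and photography");
console.log(cosineSimilarity(query, a) > cosineSimilarity(query, b)); // true
console.log(cosineSimilarity(query, b)); // 0 (no shared terms)
```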

SQLite as a Vector Store

Chunks and their TF-IDF vectors are stored in SQLite using sql.js (pure JavaScript — no native compilation needed):

export class VectorStore {
  // Created via: const store = await VectorStore.create(dbPath)

  insert(docId, title, category, chunkIndex, content) {
    const tf = termFrequency(content);
    const tfJson = JSON.stringify([...tf]);
    this.db.run(
      "INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)",
      [docId, title, category, chunkIndex, content, tfJson]
    );
    this.save();
  }

  search(query, topK = 5) {
    const queryTf = termFrequency(query);
    // Score each chunk by cosine similarity, return top-K
  }
}

 

💡 Why SQLite for Vectors?

For a CV plus a few job descriptions (dozens of chunks), brute-force cosine similarity over SQLite rows is near-instant (~1ms). No need for Pinecone, Qdrant, or Chroma — just a single .db file on disk.
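The elided search() body boils down to exactly that brute force: score every row, sort, take the top K. Here's an illustrative sketch as a standalone function (the stub rows and pre-built tf maps are invented for the example, not the class's actual code; cosineSimilarity is the helper shown earlier):

```javascript
// cosineSimilarity repeated verbatim from above; search() is an illustrative
// sketch of what the elided method body does, operating on stubbed rows.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute force: score every stored chunk against the query, sort, take top-K.
function search(rows, queryTf, topK = 5) {
  return rows
    .map((row) => ({ ...row, score: cosineSimilarity(queryTf, row.tf) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Stub rows standing in for the SQLite chunks table (tf maps pre-built).
const rows = [
  { content: "Built REST APIs in Node.js",
    tf: new Map([["built", 1], ["rest", 1], ["apis", 1], ["node", 1], ["js", 1]]) },
  { content: "Fluent in French and Spanish",
    tf: new Map([["fluent", 1], ["french", 1], ["spanish", 1]]) },
];
const queryTf = new Map([["node", 1], ["js", 1], ["experience", 1]]);
console.log(search(rows, queryTf, 1)[0].content); // "Built REST APIs in Node.js"
```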

Step 3: The RAG Chat Engine

The chat engine ties retrieval and generation together:

async *queryStream(userMessage, history = []) {
  // 1. Retrieve relevant CV/JD chunks
  const chunks = this.retrieve(userMessage);
  const context = this._buildContext(chunks);

  // 2. Build the prompt with retrieved context
  const messages = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "system", content: `Retrieved context:\n\n${context}` },
    ...history,
    { role: "user", content: userMessage },
  ];

  // 3. Stream from the local model
  const stream = await this.openai.chat.completions.create({
    model: this.modelId,
    messages,
    temperature: 0.3,
    stream: true,
  });

  // 4. Yield chunks as they arrive
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield { type: "text", data: content };
  }
}

The flow is straightforward: vectorize the query, retrieve with cosine similarity, build a prompt with the retrieved context, and stream from the local LLM. A temperature of 0.3 keeps responses focused, which matters for interview preparation where consistency counts.
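To make the streaming shape concrete, here's a self-contained stand-in: a fake SDK stream feeding the same yield logic as queryStream() above. fakeStream and its hard-coded deltas are invented for illustration (run as an ES module, since it uses top-level await):

```javascript
// fakeStream stands in for the OpenAI SDK stream; the deltas are invented.
async function* fakeStream() {
  for (const delta of ["Tell ", "me ", "about ", "your ", "last ", "role."]) {
    yield { choices: [{ delta: { content: delta } }] };
  }
}

// Same yield logic as queryStream() above, minus retrieval and the real model.
async function* queryStream() {
  for await (const chunk of fakeStream()) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield { type: "text", data: content };
  }
}

// A consumer (the CLI or the SSE route) simply for-awaits the events.
let reply = "";
for await (const event of queryStream()) {
  if (event.type === "text") reply += event.data;
}
console.log(reply); // "Tell me about your last role."
```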

Step 4: Dual Interfaces — Web & CLI

Web UI

The web frontend is a single HTML file with inline CSS and JavaScript — no build step, no framework, no React or Vue. It communicates with the Express backend via REST and SSE:

  • File upload via multipart/form-data
  • Streaming chat via Server-Sent Events (SSE)
  • Quick-action buttons for common follow-up queries (coaching tips, gap analysis, mock interview)

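For illustration, the SSE side might look like the sketch below. The chatHandler and sseFrame names and the shared engine instance are assumptions rather than the project's actual code; the handler would be mounted with app.post("/api/chat", chatHandler) on the Express app:

```javascript
// sseFrame and chatHandler are illustrative names, not the project's code.
// One SSE frame per streamed event: "data: <json>\n\n".
function sseFrame(event) {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Handler shape for Express; `engine` is assumed to be the shared
// ChatEngine instance whose queryStream() is shown above.
async function chatHandler(req, res) {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  for await (const event of engine.queryStream(req.body.message, req.body.history ?? [])) {
    res.write(sseFrame(event));
  }
  res.write("data: [DONE]\n\n");
  res.end();
}

console.log(sseFrame({ type: "text", data: "Hi" }));
```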
Interview Doctor - Form filled with job details

The setup form with job title, seniority level, and a pasted job description — ready to generate tailored interview questions.

CLI

The CLI provides the same experience in the terminal with ANSI-coloured output:

npm run cli

It walks you through uploading your CV, entering the job details, and then generates streaming questions. Follow-up questions work interactively. Both interfaces share the same ChatEngine class; each is a thin layer over identical logic.

Edge Mode

For constrained devices, toggle Edge mode to use a compact system prompt that fits within smaller context windows:

Interview Doctor - Edge Mode enabled

Edge mode activated: it uses a minimal prompt for devices with limited resources.

Step 5: Testing

Tests use the Node.js built-in test runner (no Jest, no Mocha, no extra dependencies):

import { describe, it } from "node:test";
import assert from "node:assert/strict";

describe("chunkText", () => {
  it("returns single chunk for short text", () => {
    const chunks = chunkText("short text", 200, 25);
    assert.equal(chunks.length, 1);
  });

  it("maintains overlap between chunks", () => {
    const text = Array.from({ length: 450 }, (_, i) => `w${i}`).join(" ");
    const [first, second] = chunkText(text, 200, 25);
    // The last 25 words of one chunk are the first 25 of the next
    assert.deepEqual(second.split(" ").slice(0, 25), first.split(" ").slice(-25));
  });
});
Run the suite with:

npm test

Tests cover the chunker, vector store, config, prompts, and server API contract, all without needing Foundry Local running.

Adapting for Your Own Use Case

Interview Doctor is a pattern, not just a product. You can adapt it for any domain:

What to change, and how:

  • Domain documents: replace the files in docs/ with your content
  • System prompt: edit src/prompts.js
  • Chunk sizes: adjust config.chunkSize and config.chunkOverlap
  • Model: change config.model (run foundry model list to see what's available)
  • UI: modify public/index.html (it's a single file)

Ideas for Adaptation

  • Customer support bot — ingest your product docs and FAQs
  • Code review assistant — ingest coding standards and best practices
  • Study guide — ingest textbooks and lecture notes
  • Compliance checker — ingest regulatory documents
  • Onboarding assistant — ingest company handbooks and processes

What I Learned

  1. Offline AI is production-ready. Foundry Local + small models like Phi-3.5 Mini are genuinely useful for focused tasks.
  2. You don't need vector databases for small collections. SQLite + TF-IDF is fast, simple, and has zero infrastructure overhead.
  3. RAG quality depends on chunking. Getting chunk sizes right for your use case is more impactful than the retrieval algorithm.
  4. The OpenAI-compatible API is a game-changer. Switching from cloud to local was mostly just changing the baseURL.
  5. Dual interfaces are easy when you share the engine. The CLI and Web UI are thin layers over the same ChatEngine class.

⚡ Performance Notes

On a typical laptop (no GPU): ingestion takes under 1 second for ~20 documents, retrieval is ~1ms, and the first LLM token arrives in 2-5 seconds. Foundry Local automatically selects the best model variant for your hardware (CUDA GPU, NPU, or CPU).

Getting Started

git clone https://github.com/leestott/interview-doctor-js.git
cd interview-doctor-js
npm install
npm run ingest
npm start      # Web UI at http://127.0.0.1:3000
# or
npm run cli    # Interactive terminal

The full source code is on GitHub. Star it, fork it, adapt it — and good luck with your interviews!


Published Mar 27, 2026
Version 1.0