How to build a 100% offline, AI-powered interview preparation tool using Microsoft Foundry Local, Retrieval-Augmented Generation, and nothing but JavaScript.
Introduction
Imagine preparing for a job interview with an AI assistant that knows your CV inside and out, understands the job you're applying for, and generates tailored questions, all without ever sending your data to the cloud. That's exactly what Interview Doctor does.
Interview Doctor's web UI, a polished, dark-themed interface running entirely on your local machine.
In this post, I'll walk you through how I built an interview prep tool as a fully offline JavaScript application using:
- Foundry Local — Microsoft's on-device AI runtime
- SQLite — for storing document chunks and TF-IDF vectors
- RAG (Retrieval-Augmented Generation) — to ground the AI in your actual documents
- Express.js — for the web server
- Node.js built-in test runner — for testing with zero extra dependencies
No cloud. No API keys. No internet required. Everything runs on your machine.
What is RAG and Why Does It Matter?
Retrieval-Augmented Generation (RAG) is a pattern that makes AI models dramatically more useful for domain-specific tasks. Instead of relying solely on what a model learned during training (which can be outdated or generic), RAG:
- Retrieves relevant chunks from your own documents
- Augments the model's prompt with those chunks as context
- Generates a response grounded in your actual data
For Interview Doctor, this means the AI doesn't just ask generic interview questions; it asks questions specific to your CV, your experience, and the particular job you're applying for.
Why Offline RAG?
Privacy is the obvious benefit: your CV and job applications never leave your device. But there's more:
- No API costs — run as many queries as you want
- No rate limits — iterate rapidly during your prep
- Works anywhere — on a plane, in a café with bad Wi-Fi, anywhere
- Consistent performance — no cold starts, no API latency
Architecture Overview
Complete architecture showing all components and data flow.
The application has two interfaces (CLI and Web) that share the same core engine:
- Document Ingestion — PDFs and markdown files are chunked and indexed
- Vector Store — SQLite stores chunks with TF-IDF vectors
- Retrieval — queries are matched against stored chunks using cosine similarity
- Generation — relevant chunks are injected into the prompt sent to the local LLM
Step 1: Setting Up Foundry Local
First, install Foundry Local:
# Windows
winget install Microsoft.FoundryLocal
# macOS
brew install microsoft/foundrylocal/foundrylocal
The JavaScript SDK handles everything else — starting the service, downloading the model, and connecting:
import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";
const manager = new FoundryLocalManager();
const modelInfo = await manager.init("phi-3.5-mini");
// Foundry Local exposes an OpenAI-compatible API
const openai = new OpenAI({
baseURL: manager.endpoint, // Dynamic port, discovered by SDK
apiKey: manager.apiKey,
});
⚠️ Key Insight
Foundry Local uses a dynamic port, so never hardcode localhost:5272. Always use manager.endpoint, which the SDK discovers at runtime.
Step 2: Building the RAG Pipeline
Document Chunking
Documents are split into overlapping chunks of ~200 tokens. The overlap ensures important context isn't lost at chunk boundaries:
export function chunkText(text, maxTokens = 200, overlapTokens = 25) {
const words = text.split(/\s+/).filter(Boolean);
if (words.length <= maxTokens) return [text.trim()];
const chunks = [];
let start = 0;
while (start < words.length) {
const end = Math.min(start + maxTokens, words.length);
chunks.push(words.slice(start, end).join(" "));
if (end >= words.length) break;
start = end - overlapTokens;
}
return chunks;
}
Why 200 tokens with 25-token overlap? Small chunks keep retrieved context compact for the model's limited context window. Overlap prevents information loss at boundaries. And it's all pure string operations, no dependencies needed.
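To make the overlap concrete, here's the chunker run on a synthetic 450-word document (the function from above is repeated so the snippet runs standalone):

```javascript
// chunkText as defined above, repeated so this example is self-contained
function chunkText(text, maxTokens = 200, overlapTokens = 25) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) return [text.trim()];
  const chunks = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + maxTokens, words.length);
    chunks.push(words.slice(start, end).join(" "));
    if (end >= words.length) break;
    start = end - overlapTokens;
  }
  return chunks;
}

// A 450-word document: "w1 w2 ... w450"
const doc = Array.from({ length: 450 }, (_, i) => `w${i + 1}`).join(" ");
const chunks = chunkText(doc);

// Chunk 1 covers w1..w200, chunk 2 covers w176..w375 (25 words shared
// with chunk 1), chunk 3 covers w351..w450.
console.log(chunks.length); // 3
```

A fact mentioned at the very end of one chunk therefore also appears at the start of the next, so retrieval can still surface it even when the query matches the boundary region.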
TF-IDF Vectors
Instead of using a separate embedding model (which would consume precious memory alongside the LLM), we use TF-IDF, a classic information retrieval technique:
export function termFrequency(text) {
const tf = new Map();
const tokens = text
.toLowerCase()
.replace(/[^a-z0-9\-']/g, " ")
.split(/\s+/)
.filter((t) => t.length > 1);
for (const t of tokens) {
tf.set(t, (tf.get(t) || 0) + 1);
}
return tf;
}
export function cosineSimilarity(a, b) {
let dot = 0, normA = 0, normB = 0;
for (const [term, freq] of a) {
normA += freq * freq;
if (b.has(term)) dot += freq * b.get(term);
}
for (const [, freq] of b) normB += freq * freq;
if (normA === 0 || normB === 0) return 0;
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
Each document chunk becomes a sparse vector of word frequencies. At query time, we compute cosine similarity between the query vector and all stored chunk vectors to find the most relevant matches.
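The snippets above show the TF half; the IDF half (not excerpted here) down-weights terms that appear in nearly every chunk, so filler words stop dominating the similarity score. A minimal sketch of how that weighting can be layered on top of termFrequency — inverseDocumentFrequency and applyIdf are illustrative names, not necessarily the project's actual API:

```javascript
// Given the TF maps of all chunks, compute idf(term) = log(N / df(term)),
// where df(term) is the number of chunks containing the term.
function inverseDocumentFrequency(tfMaps) {
  const df = new Map();
  for (const tf of tfMaps) {
    for (const term of tf.keys()) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }
  const idf = new Map();
  for (const [term, count] of df) {
    idf.set(term, Math.log(tfMaps.length / count));
  }
  return idf;
}

// Re-weight a TF vector: terms present in every chunk get idf 0 and vanish,
// while distinctive terms (skills, company names) dominate the score.
function applyIdf(tf, idf) {
  const weighted = new Map();
  for (const [term, freq] of tf) {
    weighted.set(term, freq * (idf.get(term) || 0));
  }
  return weighted;
}
```

Cosine similarity then runs over the weighted vectors instead of the raw frequency maps; the rest of the pipeline is unchanged.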
SQLite as a Vector Store
Chunks and their TF-IDF vectors are stored in SQLite using sql.js (pure JavaScript — no native compilation needed):
export class VectorStore {
// Created via: const store = await VectorStore.create(dbPath)
insert(docId, title, category, chunkIndex, content) {
const tf = termFrequency(content);
const tfJson = JSON.stringify([...tf]);
this.db.run(
"INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)",
[docId, title, category, chunkIndex, content, tfJson]
);
this.save();
}
search(query, topK = 5) {
const queryTf = termFrequency(query);
// Score each chunk by cosine similarity, return top-K
}
}
💡 Why SQLite for Vectors?
For a CV plus a few job descriptions (dozens of chunks), brute-force cosine similarity over SQLite rows is near-instant (~1ms). No need for Pinecone, Qdrant, or Chroma — just a single .db file on disk.
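The elided search body is straightforward to reconstruct in spirit: score every stored chunk against the query vector and keep the best K. A standalone sketch, with chunks held in a plain array rather than SQLite for brevity (termFrequency and cosineSimilarity are repeated from earlier so the snippet runs on its own):

```javascript
// Compact copies of the helpers shown earlier in the post
function termFrequency(text) {
  const tf = new Map();
  const tokens = text.toLowerCase().replace(/[^a-z0-9\-']/g, " ").split(/\s+/);
  for (const t of tokens) {
    if (t.length > 1) tf.set(t, (tf.get(t) || 0) + 1);
  }
  return tf;
}

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K retrieval: score everything, sort, slice.
function searchChunks(chunks, query, topK = 5) {
  const queryTf = termFrequency(query);
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(queryTf, termFrequency(c.content)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const chunks = [
  { id: 1, content: "Led a team building Node.js microservices" },
  { id: 2, content: "Organised the office summer party" },
];
const top = searchChunks(chunks, "node.js team leadership experience", 1);
console.log(top[0].id); // 1
```

With dozens of chunks this linear scan is effectively free, which is exactly why no dedicated vector database is needed at this scale.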
Step 3: The RAG Chat Engine
The chat engine ties retrieval and generation together:
async *queryStream(userMessage, history = []) {
// 1. Retrieve relevant CV/JD chunks
const chunks = this.retrieve(userMessage);
const context = this._buildContext(chunks);
// 2. Build the prompt with retrieved context
const messages = [
{ role: "system", content: SYSTEM_PROMPT },
{ role: "system", content: `Retrieved context:\n\n${context}` },
...history,
{ role: "user", content: userMessage },
];
// 3. Stream from the local model
const stream = await this.openai.chat.completions.create({
model: this.modelId,
messages,
temperature: 0.3,
stream: true,
});
// 4. Yield chunks as they arrive
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) yield { type: "text", data: content };
}
}
The flow is straightforward: vectorize the query, retrieve with cosine similarity, build a prompt with context, and stream from the local LLM. A temperature of 0.3 keeps responses focused, which is important for interview preparation where consistency matters.
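The _buildContext helper isn't shown above; a plausible sketch is simply formatting the retrieved chunks into a labelled block the model can refer back to. This is a hypothetical shape — the real project may format its context differently:

```javascript
// Turn retrieved chunks into one context string for the system prompt.
// Labelling each chunk with its source document helps the model attribute
// claims ("According to your CV...") instead of blending sources together.
function buildContext(chunks) {
  if (chunks.length === 0) return "(no relevant context found)";
  return chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.category}):\n${c.content}`)
    .join("\n\n");
}

const context = buildContext([
  { title: "cv.pdf", category: "cv", content: "5 years of Node.js experience." },
]);
// → "[1] cv.pdf (cv):\n5 years of Node.js experience."
```

Returning an explicit "(no relevant context found)" placeholder, rather than an empty string, makes it obvious to the model (and to anyone debugging prompts) that retrieval came back empty.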
Step 4: Dual Interfaces — Web & CLI
Web UI
The web frontend is a single HTML file with inline CSS and JavaScript — no build step, no framework, no React or Vue. It communicates with the Express backend via REST and SSE:
- File upload via multipart/form-data
- Streaming chat via Server-Sent Events (SSE)
- Quick-action buttons for common follow-up queries (coaching tips, gap analysis, mock interview)
The setup form with job title, seniority level, and a pasted job description — ready to generate tailored interview questions.
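Under the hood, SSE is just a line-based text protocol over a long-lived HTTP response: each event is a data: line terminated by a blank line. A minimal sketch of how the Express side could forward model tokens — sseFrame is a hypothetical helper, not the project's exact code:

```javascript
// Format one Server-Sent Events message. JSON-encoding the payload keeps
// the frame valid even when a token contains newlines.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// In an Express handler this would be used roughly as:
//   res.setHeader("Content-Type", "text/event-stream");
//   for await (const chunk of engine.queryStream(message, history)) {
//     res.write(sseFrame(chunk));
//   }
//   res.end();

console.log(sseFrame({ type: "text", data: "Hello" }));
// → data: {"type":"text","data":"Hello"}
```

On the browser side, an EventSource (or a streamed fetch) receives each frame as it is written, which is what makes the reply appear token by token.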
CLI
The CLI provides the same experience in the terminal with ANSI-coloured output:
npm run cli
It walks you through uploading your CV, entering the job details, and then generates streaming questions. Follow-up questions work interactively. Both interfaces share the same ChatEngine class; they're thin layers over identical logic.
Edge Mode
For constrained devices, toggle Edge mode to use a compact system prompt that fits within smaller context windows:
Edge mode activated, using a minimal prompt for devices with limited resources.
Step 5: Testing
Tests use the Node.js built-in test runner, so there's no Jest, no Mocha, and no extra dependencies:
import { describe, it } from "node:test";
import assert from "node:assert/strict";
describe("chunkText", () => {
it("returns single chunk for short text", () => {
const chunks = chunkText("short text", 200, 25);
assert.equal(chunks.length, 1);
});
it("maintains overlap between chunks", () => {
// Verifies overlapping tokens between consecutive chunks
});
});
npm test
Tests cover the chunker, vector store, config, prompts, and server API contract, all without needing Foundry Local running.
Adapting for Your Own Use Case
Interview Doctor is a pattern, not just a product. You can adapt it for any domain:
| What to Change | How |
|---|---|
| Domain documents | Replace files in docs/ with your content |
| System prompt | Edit src/prompts.js |
| Chunk sizes | Adjust config.chunkSize and config.chunkOverlap |
| Model | Change config.model — run foundry model list |
| UI | Modify public/index.html — it's a single file |
Ideas for Adaptation
- Customer support bot — ingest your product docs and FAQs
- Code review assistant — ingest coding standards and best practices
- Study guide — ingest textbooks and lecture notes
- Compliance checker — ingest regulatory documents
- Onboarding assistant — ingest company handbooks and processes
What I Learned
- Offline AI is production-ready. Foundry Local + small models like Phi-3.5 Mini are genuinely useful for focused tasks.
- You don't need vector databases for small collections. SQLite + TF-IDF is fast, simple, and has zero infrastructure overhead.
- RAG quality depends on chunking. Getting chunk sizes right for your use case is more impactful than the retrieval algorithm.
- The OpenAI-compatible API is a game-changer. Switching from cloud to local was mostly just changing the baseURL.
- Dual interfaces are easy when you share the engine. The CLI and Web UI are thin layers over the same ChatEngine class.
⚡ Performance Notes
On a typical laptop (no GPU): ingestion takes under 1 second for ~20 documents, retrieval is ~1ms, and the first LLM token arrives in 2-5 seconds. Foundry Local automatically selects the best model variant for your hardware (CUDA GPU, NPU, or CPU).
Getting Started
git clone https://github.com/leestott/interview-doctor-js.git
cd interview-doctor-js
npm install
npm run ingest
npm start # Web UI at http://127.0.0.1:3000
# or
npm run cli # Interactive terminal
The full source code is on GitHub. Star it, fork it, adapt it — and good luck with your interviews!
Resources
- Foundry Local — Microsoft's on-device AI runtime
- Foundry Local SDK (npm) — JavaScript SDK
- Foundry Local GitHub — Source, samples, and documentation
- Local RAG Reference — Reference RAG implementation
- Interview Doctor (JavaScript) — This project's source code