javascript
45 Topics

JS AI Build-a-thon Setup in 5 Easy Steps
TL;DR: You're 5 Steps Away from an AI Adventure

Set up your project repo, follow the quests, build cool stuff, and level up. Everything's automated, community-backed, and designed to help you actually learn AI, using the skills you already have. Let's build the future. One quest at a time.

Join the Build-a-thon | Chat on Discord

Use AI for Free with GitHub Models and TypeScript!
Learn how to use AI for free with GitHub Models! Test models like GPT-4o without paying for APIs or setting up infrastructure. This step-by-step guide shows how to integrate GitHub Models with TypeScript in the Microblog AI Remix project. Start exploring AI for free today!

If You're Building AI on Azure, ECS 2026 is Where You Need to Be
Let me be direct: there's a lot of noise in the conference calendar. Generic cloud events. Vendor showcases dressed up as technical content. Sessions that look great on paper but leave you with nothing you can actually ship on Monday.

ECS 2026 isn't that.

As someone who will be on stage in Cologne this May, I can tell you the European Collaboration Summit, combined with the European AI & Cloud Summit and European BizApps Summit, is one of the few events I've seen where engineers leave with real, production-applicable knowledge.

Three days. Three summits. 3,000+ attendees. One of the largest Microsoft-focused events in Europe, and it keeps getting better.

If you're building AI systems on Azure, designing cloud-native architectures, or trying to figure out how to take your AI experiments to production, this is where the conversation is happening.

What ECS 2026 Actually Is

ECS 2026 runs May 5-7 at Confex in Cologne, Germany. It brings together three co-located summits under one roof:

- European Collaboration Summit: Microsoft 365, Teams, Copilot, and governance
- European AI & Cloud Summit: Azure architecture, AI agents, cloud security, responsible AI
- European BizApps Summit: Power Platform, Microsoft Fabric, Dynamics

For Azure engineers and AI developers, the European AI & Cloud Summit is your primary destination. But don't ignore the overlap: some of the most interesting AI conversations happen at the intersection of collaboration tooling and cloud infrastructure.

The scale matters here: 3,000+ attendees, 100+ sessions, multiple deep-dive tracks, and a speaker lineup that includes Microsoft executives, Regional Directors, and MVPs who have built, broken, and rebuilt production systems.

The Azure + AI Track - What's Actually On the Agenda

The AI & Cloud Summit agenda is built around real technical depth. Not "intro to AI" content, but actual architecture decisions, patterns that work, and lessons from things that didn't.
Here's what you can expect:

AI Agents and Agentic Systems

This is where the energy is right now, and ECS is leaning in. Expect sessions covering how to design agent workflows, chain reasoning steps, handle memory and state, and integrate with Azure AI services. Marco Casalaina, VP of Products for Azure AI at Microsoft, is speaking. If you want to understand the direction of the Azure AI platform from the people building it, this is a direct line.

Azure Architecture at Scale

Cloud-native patterns, microservices, containers, and the architectural decisions that determine whether your system holds up under real load. These sessions go beyond theory: you'll hear from engineers who've shipped these designs at enterprise scale.

Observability, DevOps, and Production AI

Getting AI to production is harder than the demos suggest. Sessions here cover monitoring AI systems, integrating LLMs into CI/CD pipelines, and building the operational practices that keep AI in production reliable and governable.

Cloud Security and Compliance

Security isn't optional when you're putting AI in front of users or connecting it to enterprise data. Tracks cover identity, access patterns, responsible AI governance, and how to design systems that satisfy compliance requirements without becoming unmaintainable.

Pre-Conference Deep Dives

One underrated part of ECS: the pre-conference workshops. These are extended, hands-on sessions, typically 3-6 hours, that let you go deep on a single topic with an expert. Think of them as intensive short courses where you can actually work through the material, not just watch slides.

If you're newer to a particular area of Azure AI, or you want to build fluency in a specific pattern before the main conference sessions, these are worth the early travel.

The Speaker Quality Is Different Here

The ECS speaker roster includes Microsoft executives, Microsoft MVPs, and Regional Directors: people who have real accountability for the products and patterns they're presenting.
You'll hear from over 20 Microsoft speakers:

- Marco Casalaina: VP of Products, Azure AI at Microsoft
- Adam Harmetz: VP of Product at Microsoft, Enterprise Agent

And dozens of MVPs and Regional Directors who are in the field every day, solving the same problems you are. These aren't keynote-only speakers: they're in the session rooms, at the hallway track, available for real conversations.

The Hallway Track Is Not a Cliché

I know "networking" sounds like a corporate afterthought. At ECS it genuinely isn't.

When you put 3,000 practitioners (engineers, architects, DevOps leads, security specialists) in one venue for three days, the conversations between sessions are often more valuable than the sessions themselves. You get candid answers to "how are you actually handling X in production?" that you won't find in documentation.

The European Microsoft community is tight-knit and collaborative. ECS is where that community concentrates.

Why This Matters Right Now

We're in a period where AI development is moving fast but the engineering discipline around it is still maturing. Most teams are figuring out:

- How to move from AI prototype to production system
- How to instrument and observe AI behaviour reliably
- How to design agent systems that don't become unmaintainable
- How to satisfy security and compliance requirements in AI-integrated architectures

ECS 2026 is one of the few places where you can get direct answers to these questions from people who've solved them, not theoretically, but in production, on Azure, in the last 12 months.

If you go, you'll come back with practical patterns you can apply immediately. That's the bar I hold events to. ECS consistently clears it.

Register and Explore the Agenda

- Register for ECS 2026: ecs.events
- Explore the AI & Cloud Summit agenda: cloudsummit.eu/en/agenda
- Dates: May 5-7, 2026 | Location: Confex, Cologne, Germany

Early registration is worth it: the pre-conference workshops fill up.
And if you're coming, find me: I'll be the one talking too much about AI agents and Azure deployments. See you in Cologne.

Building Your First Local RAG Application with Foundry Local
A developer's guide to building an offline, mobile-responsive AI support agent using Retrieval-Augmented Generation, the Foundry Local SDK, and JavaScript.

Imagine you are a gas field engineer standing beside a pipeline in a remote location. There is no Wi-Fi, no mobile signal, and you need a safety procedure right now. What do you do?

This is the exact problem that inspired this project: a fully offline RAG-powered support agent that runs entirely on your machine. No cloud. No API keys. No outbound network calls. Just a local language model, a local vector store, and your own documents, all accessible from a browser on any device.

In this post, you will learn how it works, how to build your own, and the key architectural decisions behind it. If you have ever wanted to build an AI application that runs locally and answers questions grounded in your own data, this is the place to start.

[Figure: the finished application, a browser-based AI support agent that runs entirely on your machine.]

What Is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is a pattern that makes AI models genuinely useful for domain-specific tasks. Rather than hoping the model "knows" the answer from its training data, you:

1. Retrieve relevant chunks from your own documents using a vector store
2. Augment the model's prompt with those chunks as context
3. Generate a response grounded in your actual data

The result is fewer hallucinations, traceable answers with source attribution, and an AI that works with your content rather than relying on general knowledge. If you are building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want.

RAG vs CAG: Understanding the Trade-offs

If you have explored AI application patterns before, you have likely encountered Context-Augmented Generation (CAG). Both RAG and CAG solve the same core problem: grounding an AI model's answers in your own content.
They take different approaches, and each has genuine strengths and limitations.

RAG (Retrieval-Augmented Generation)

How it works: Documents are split into chunks, vectorised, and stored in a database. At query time, the most relevant chunks are retrieved and injected into the prompt.

Strengths:

- Scales to thousands or millions of documents
- Fine-grained retrieval at chunk level with source attribution
- Documents can be added or updated dynamically without restarting
- Token-efficient: only relevant chunks are sent to the model
- Supports runtime document upload via the web UI

Limitations:

- More complex architecture: requires a vector store and chunking strategy
- Retrieval quality depends on chunking parameters and scoring method
- May miss relevant content if the retrieval step does not surface it

CAG (Context-Augmented Generation)

How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt.

Strengths:

- Drastically simpler architecture with no vector database or embeddings
- All information is always available to the model
- Minimal dependencies and easy to set up
- Near-instant document selection

Limitations:

- Constrained by the model's context window size
- Best suited to small, curated document sets (tens of documents)
- Adding documents requires an application restart

Want to compare these patterns hands-on? There is a CAG-based implementation of the same gas field scenario using whole-document context injection. Clone both repositories, run them side by side, and see how the architectures differ in practice.

When Should You Choose Which?
| Consideration | Choose RAG | Choose CAG |
|---|---|---|
| Document count | Hundreds or thousands | Tens of documents |
| Document updates | Frequent or dynamic (runtime upload) | Infrequent (restart to reload) |
| Source attribution | Per-chunk with relevance scores | Per-document |
| Setup complexity | Moderate (ingestion step required) | Minimal |
| Query precision | Better for large or diverse collections | Good for keyword-matchable content |
| Infrastructure | SQLite vector store (single file) | None beyond the runtime |

For the sample application in this post (20 gas engineering procedure documents with runtime upload), RAG is the better fit. If your document set is small and static, CAG may be simpler. Both patterns run fully offline using Foundry Local.

Foundry Local: Your On-Device AI Runtime

Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download).

What makes it particularly useful for developers:

- No GPU required: runs on CPU or NPU, making it accessible on standard laptops and desktops
- Native SDK bindings: in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server
- Automatic model management: downloads, caches, and loads models automatically
- Hardware-optimised variant selection: the SDK picks the best variant for your hardware (GPU, NPU, or CPU)
- Real-time progress callbacks: ideal for building loading UIs that show download and initialisation progress

The integration code is refreshingly minimal:

    import { FoundryLocalManager } from "foundry-local-sdk";

    // Create a manager and discover models via the catalogue
    const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" });
    const model = await manager.catalog.getModel("phi-3.5-mini");

    // Download if not cached, then load into memory
    if (!model.isCached) {
      await model.download((progress) => {
        console.log(`Download: ${Math.round(progress * 100)}%`);
      });
    }
    await model.load();

    // Create a chat client for direct in-process inference
    const chatClient = model.createChatClient();
    const response = await chatClient.completeChat([
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "How do I detect a gas leak?" }
    ]);

That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application.

The Technology Stack

The sample application is deliberately simple. No frameworks, no build steps, no Docker:

| Layer | Technology | Purpose |
|---|---|---|
| AI Model | Foundry Local + Phi-3.5 Mini | Runs locally via native SDK bindings, no GPU required |
| Back end | Node.js + Express | Lightweight HTTP server, everyone knows it |
| Vector Store | SQLite (via better-sqlite3) | Zero infrastructure, single file on disk |
| Retrieval | TF-IDF + cosine similarity | No embedding model required, fully offline |
| Front end | Single HTML file with inline CSS | No build step, mobile-responsive, field-ready |

The total dependency footprint is three npm packages: express, foundry-local-sdk, and better-sqlite3.

Architecture Overview

[Figure: the five-layer architecture, all running on a single machine.]
The system has five layers, all running on a single machine:

- Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface
- Server layer: Express.js starts immediately and serves the UI plus SSE status and chat endpoints
- RAG pipeline: the chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorisation; the prompts module provides safety-first system instructions
- Data layer: SQLite stores document chunks and their TF-IDF vectors; documents live as .md files in the docs/ folder
- AI layer: Foundry Local runs Phi-3.5 Mini on CPU or NPU via native SDK bindings

Building the Solution Step by Step

Prerequisites

You need two things installed on your machine:

- Node.js 20 or later: download from nodejs.org
- Foundry Local, Microsoft's on-device AI runtime (Windows install command shown):

    winget install Microsoft.FoundryLocal

The SDK will automatically download the Phi-3.5 Mini model (approximately 2 GB) the first time you run the application.

Getting the Code Running

    # Clone the repository
    git clone https://github.com/leestott/local-rag.git
    cd local-rag

    # Install dependencies
    npm install

    # Ingest the 20 gas engineering documents into the vector store
    npm run ingest

    # Start the server
    npm start

Open http://127.0.0.1:3000 in your browser. You will see the status indicator whilst the model loads. Once the model is ready, the status changes to "Offline Ready" and you can start chatting.

[Figures: desktop view and mobile view.]

How the RAG Pipeline Works

Let us trace what happens when a user asks: "How do I detect a gas leak?"

[Figure: the query flow from browser to model and back.]

1. Documents are ingested and indexed
When you run npm run ingest, every .md file in the docs/ folder is read, parsed (with optional YAML front-matter for title, category, and ID), split into overlapping chunks of approximately 200 tokens, and stored in SQLite with TF-IDF vectors.
2. Model is loaded via the SDK
The Foundry Local SDK discovers the model in the local catalogue and loads it into memory. If the model is not already cached, it downloads it first (with progress streamed to the browser via SSE).

3. User sends a question
The question arrives at the Express server. The chat engine converts it into a TF-IDF vector, uses an inverted index to find candidate chunks, and scores them using cosine similarity. The top 3 chunks are returned in under 1 ms.

4. Prompt is constructed
The engine builds a messages array containing: the system prompt (with safety-first instructions), the retrieved chunks as context, the conversation history, and the user's question.

5. Model generates a grounded response
The prompt is sent to the locally loaded model via the Foundry Local SDK's native chat client. The response streams back token by token through Server-Sent Events to the browser. Source references with relevance scores are included.

[Figures: a response with safety warnings and step-by-step guidance; the sources panel showing which chunks were used and their relevance.]

Key Code Walkthrough

The Vector Store (TF-IDF + SQLite)

The vector store uses SQLite to persist document chunks alongside their TF-IDF vectors.
At query time, an inverted index finds candidate chunks that share terms with the query, then cosine similarity ranks them:

    // src/vectorStore.js
    search(query, topK = 5) {
      const queryTf = termFrequency(query);
      this._ensureCache(); // Build in-memory cache on first access

      // Use inverted index to find candidates sharing at least one term
      const candidateIndices = new Set();
      for (const term of queryTf.keys()) {
        const indices = this._invertedIndex.get(term);
        if (indices) {
          for (const idx of indices) candidateIndices.add(idx);
        }
      }

      // Score only candidates, not all rows
      const scored = [];
      for (const idx of candidateIndices) {
        const row = this._rowCache[idx];
        const score = cosineSimilarity(queryTf, row.tf);
        if (score > 0) scored.push({ ...row, score });
      }

      scored.sort((a, b) => b.score - a.score);
      return scored.slice(0, topK);
    }

The inverted index, in-memory row cache, and prepared SQL statements bring retrieval time to sub-millisecond for typical query loads.

Why TF-IDF Instead of Embeddings?

Most RAG tutorials use embedding models for retrieval. This project uses TF-IDF because:

- Fully offline: no embedding model to download or run
- Negligible latency: vectorisation is effectively instantaneous (it is just maths on word frequencies)
- Good enough: for 20 domain-specific documents, TF-IDF retrieves the right chunks reliably
- Transparent: you can inspect the vocabulary and weights, unlike neural embeddings

For larger collections or when semantic similarity matters more than keyword overlap, you would swap in an embedding model. For this use case, TF-IDF keeps the stack simple and dependency-free.

The System Prompt

For safety-critical domains, the system prompt is engineered to prioritise safety, discourage hallucination, and enforce structured responses:

    // src/prompts.js
    export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers.

    Behaviour Rules:
    - Always prioritise safety. If a procedure involves risk, explicitly call it out.
    - Do not hallucinate procedures, measurements, or tolerances.
    - If the answer is not in the provided context, say: "This information is not available in the local knowledge base."

    Response Format:
    - Summary (1-2 lines)
    - Safety Warnings (if applicable)
    - Step-by-step Guidance
    - Reference (document name + section)`;

This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling.

Runtime Document Upload

Unlike the CAG approach, RAG supports adding documents without restarting the server. Click the upload button to add new .md or .txt files. They are chunked, vectorised, and indexed immediately.

[Figure: the upload modal with the complete list of indexed documents.]

Adapting This for Your Own Domain

The sample project is designed to be forked and adapted. Here is how to make it yours in four steps:

1. Replace the documents
Delete the gas engineering documents in docs/ and add your own markdown files. The ingestion pipeline handles any markdown content with optional YAML front-matter:

    ---
    title: Troubleshooting Widget Errors
    category: Support
    id: KB-001
    ---
    # Troubleshooting Widget Errors
    ...your content here...

2. Edit the system prompt
Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations.

3. Tune the retrieval
In src/config.js:

- chunkSize: 200 (smaller chunks give more precise retrieval, but less context per chunk)
- chunkOverlap: 25 (prevents information falling between chunks)
- topK: 3 (how many chunks to retrieve per query; more gives more context but slower generation)

4. Swap the model
Change config.model in src/config.js to any model available in the Foundry Local catalogue. Smaller models give faster responses on constrained devices; larger models give better quality.

Building a Field-Ready UI

The front end is a single HTML file with inline CSS.
No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy.

Design decisions that matter for field use:

- Dark, high-contrast theme with 18px base font size for readability in bright sunlight
- Large touch targets (minimum 44px) for operation with gloves or PPE
- Quick-action buttons that wrap on mobile so all options are visible without scrolling
- Responsive layout that works from 320px to 1920px+ screen widths
- Streaming responses via SSE, so the user sees tokens arriving in real time

[Figure: the mobile chat experience, optimised for field use.]

Testing

The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed:

    # Run all tests
    npm test

Tests cover the chunker, vector store, configuration, and server endpoints. Use them as a starting point when you adapt the project for your own domain.

Ideas for Extending the Project

Once you have the basics running, there are plenty of directions to explore:

- Embedding-based retrieval: use a local embedding model for better semantic matching on diverse queries
- Conversation memory: persist chat history across sessions using local storage or a lightweight database
- Multi-modal support: add image-based queries (photographing a fault code, for example)
- PWA packaging: make it installable as a standalone offline application on mobile devices
- Hybrid retrieval: combine TF-IDF keyword search with semantic embeddings for best results
- Try the CAG approach: compare with the local-cag sample to see which pattern suits your use case

Ready to Build Your Own?

Clone the RAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the CAG approach to see which pattern suits your use case best.

Get the RAG Sample | Get the CAG Sample

Summary

Building a local RAG application does not require a PhD in machine learning or a cloud budget.
With Foundry Local, Node.js, and SQLite, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own documents.

The key takeaways:

- RAG is ideal for scalable, dynamic document sets where you need fine-grained retrieval with source attribution. Documents can be added at runtime without restarting.
- CAG is simpler when you have a small, stable set of documents that fit in the context window. See the local-cag sample to compare.
- Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required.
- TF-IDF + SQLite is a viable vector store for small-to-medium collections, with sub-millisecond retrieval thanks to inverted indexing and in-memory caching.
- Start simple, iterate outwards. Begin with RAG and a handful of documents. If your needs are simpler, try CAG. Both patterns run entirely offline.

Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code.

This project is open source under the MIT licence. It is a scenario sample for learning and experimentation, not production medical or safety advice.

local-rag on GitHub · local-cag on GitHub · Foundry Local

Building an Offline AI Interview Coach with Foundry Local, RAG, and SQLite
How to build a 100% offline, AI-powered interview preparation tool using Microsoft Foundry Local, Retrieval-Augmented Generation, and nothing but JavaScript.

Foundry Local | 100% Offline | RAG + TF-IDF | JavaScript / Node.js

Contents

- Introduction
- What is RAG and Why Offline?
- Architecture Overview
- Setting Up Foundry Local
- Building the RAG Pipeline
- The Chat Engine
- Dual Interfaces: Web & CLI
- Testing
- Adapting for Your Own Use Case
- What I Learned
- Getting Started

Introduction

Imagine preparing for a job interview with an AI assistant that knows your CV inside and out, understands the job you're applying for, and generates tailored questions, all without ever sending your data to the cloud. That's exactly what Interview Doctor does.

[Figure: Interview Doctor's web UI, a polished, dark-themed interface running entirely on your local machine.]

In this post, I'll walk you through how I built an interview prep tool as a fully offline JavaScript application using:

- Foundry Local: Microsoft's on-device AI runtime
- SQLite: for storing document chunks and TF-IDF vectors
- RAG (Retrieval-Augmented Generation): to ground the AI in your actual documents
- Express.js: for the web server
- Node.js built-in test runner: for testing with zero extra dependencies

No cloud. No API keys. No internet required. Everything runs on your machine.

What is RAG and Why Does It Matter?

Retrieval-Augmented Generation (RAG) is a pattern that makes AI models dramatically more useful for domain-specific tasks. Instead of relying solely on what a model learned during training (which can be outdated or generic), RAG:

1. Retrieves relevant chunks from your own documents
2. Augments the model's prompt with those chunks as context
3. Generates a response grounded in your actual data

For Interview Doctor, this means the AI doesn't just ask generic interview questions; it asks questions specific to your CV, your experience, and the specific job you're applying for.

Why Offline RAG?
Privacy is the obvious benefit: your CV and job applications never leave your device. But there's more:

- No API costs: run as many queries as you want
- No rate limits: iterate rapidly during your prep
- Works anywhere: on a plane, in a café with bad Wi-Fi, anywhere
- Consistent performance: no cold starts, no API latency

Architecture Overview

[Figure: complete architecture showing all components and data flow.]

The application has two interfaces (CLI and Web) that share the same core engine:

- Document Ingestion: PDFs and markdown files are chunked and indexed
- Vector Store: SQLite stores chunks with TF-IDF vectors
- Retrieval: queries are matched against stored chunks using cosine similarity
- Generation: relevant chunks are injected into the prompt sent to the local LLM

Step 1: Setting Up Foundry Local

First, install Foundry Local:

    # Windows
    winget install Microsoft.FoundryLocal

    # macOS
    brew install microsoft/foundrylocal/foundrylocal

The JavaScript SDK handles everything else (starting the service, downloading the model, and connecting):

    import { FoundryLocalManager } from "foundry-local-sdk";
    import { OpenAI } from "openai";

    const manager = new FoundryLocalManager();
    const modelInfo = await manager.init("phi-3.5-mini");

    // Foundry Local exposes an OpenAI-compatible API
    const openai = new OpenAI({
      baseURL: manager.endpoint, // Dynamic port, discovered by SDK
      apiKey: manager.apiKey,
    });

Key Insight: Foundry Local uses a dynamic port, so never hardcode localhost:5272. Always use manager.endpoint, which is discovered by the SDK at runtime.

Step 2: Building the RAG Pipeline

Document Chunking

Documents are split into overlapping chunks of ~200 tokens.
The overlap ensures important context isn't lost at chunk boundaries. Tokens are approximated here by whitespace-separated words:

    export function chunkText(text, maxTokens = 200, overlapTokens = 25) {
      const words = text.split(/\s+/).filter(Boolean);
      // Short inputs fit in a single chunk
      if (words.length <= maxTokens) return [text.trim()];

      const chunks = [];
      let start = 0;
      while (start < words.length) {
        const end = Math.min(start + maxTokens, words.length);
        chunks.push(words.slice(start, end).join(" "));
        if (end >= words.length) break;
        // Step back by the overlap; assumes overlapTokens < maxTokens,
        // otherwise the window would never advance
        start = end - overlapTokens;
      }
      return chunks;
    }

Why 200 tokens with 25-token overlap? Small chunks keep retrieved context compact for the model's limited context window. Overlap prevents information loss at boundaries. And it's all pure string operations, no dependencies needed.

TF-IDF Vectors

Instead of using a separate embedding model (which would consume precious memory alongside the LLM), we use TF-IDF, a classic information retrieval technique. The snippet below shows the term-frequency vectors and the cosine similarity used for scoring:

    export function termFrequency(text) {
      const tf = new Map();
      const tokens = text
        .toLowerCase()
        .replace(/[^a-z0-9\-']/g, " ")
        .split(/\s+/)
        .filter((t) => t.length > 1);
      for (const t of tokens) {
        tf.set(t, (tf.get(t) || 0) + 1);
      }
      return tf;
    }

    export function cosineSimilarity(a, b) {
      let dot = 0, normA = 0, normB = 0;
      for (const [term, freq] of a) {
        normA += freq * freq;
        if (b.has(term)) dot += freq * b.get(term);
      }
      for (const [, freq] of b) normB += freq * freq;
      if (normA === 0 || normB === 0) return 0;
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

Each document chunk becomes a sparse vector of word frequencies. At query time, we compute cosine similarity between the query vector and all stored chunk vectors to find the most relevant matches.
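To make the query-time step concrete, here is a minimal, self-contained sketch of brute-force ranking over a handful of chunks. It repeats the two functions above so it runs on its own; `rankChunks` and the sample chunks are hypothetical names introduced for illustration and are not taken from the project.

```javascript
// Term-frequency vector: lower-cased tokens mapped to their counts.
function termFrequency(text) {
  const tf = new Map();
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9\-']/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 1);
  for (const t of tokens) tf.set(t, (tf.get(t) || 0) + 1);
  return tf;
}

// Cosine similarity between two sparse term-frequency Maps.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper (not from the project): score every chunk against
// the query, drop non-matching chunks, and keep the best topK.
function rankChunks(query, chunks, topK = 3) {
  const queryTf = termFrequency(query);
  return chunks
    .map((content, index) => ({
      index,
      content,
      score: cosineSimilarity(queryTf, termFrequency(content)),
    }))
    .filter((r) => r.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Illustrative data only.
const chunks = [
  "Led a team migrating services to Kubernetes on Azure",
  "Volunteer football coach at a local school",
  "Built CI pipelines for Kubernetes deployments",
];
const results = rankChunks("Kubernetes experience on Azure", chunks, 2);
for (const r of results) {
  console.log(`chunk ${r.index}: ${r.score.toFixed(3)}`);
}
```

The chunk that shares the most query terms ranks first, and the chunk with no overlapping terms is dropped entirely. That is also the main limitation of this kind of keyword scoring: a chunk that is relevant but uses different wording will not surface.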
SQLite as a Vector Store

Chunks and their TF-IDF vectors are stored in SQLite using sql.js (pure JavaScript, no native compilation needed):

    export class VectorStore {
      // Created via: const store = await VectorStore.create(dbPath)

      insert(docId, title, category, chunkIndex, content) {
        const tf = termFrequency(content);
        const tfJson = JSON.stringify([...tf]);
        this.db.run(
          "INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)",
          [docId, title, category, chunkIndex, content, tfJson]
        );
        this.save();
      }

      search(query, topK = 5) {
        const queryTf = termFrequency(query);
        // Score each chunk by cosine similarity, return top-K
      }
    }

Why SQLite for Vectors? For a CV plus a few job descriptions (dozens of chunks), brute-force cosine similarity over SQLite rows is near-instant (~1ms). No need for Pinecone, Qdrant, or Chroma, just a single .db file on disk.

Step 3: The RAG Chat Engine

The chat engine ties retrieval and generation together:

    async *queryStream(userMessage, history = []) {
      // 1. Retrieve relevant CV/JD chunks
      const chunks = this.retrieve(userMessage);
      const context = this._buildContext(chunks);

      // 2. Build the prompt with retrieved context
      const messages = [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "system", content: `Retrieved context:\n\n${context}` },
        ...history,
        { role: "user", content: userMessage },
      ];

      // 3. Stream from the local model
      const stream = await this.openai.chat.completions.create({
        model: this.modelId,
        messages,
        temperature: 0.3,
        stream: true,
      });

      // 4. Yield chunks as they arrive
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) yield { type: "text", data: content };
      }
    }

The flow is straightforward: vectorize the query, retrieve with cosine similarity, build a prompt with context, and stream from the local LLM. The low temperature (0.3) keeps responses focused, which is important for interview preparation where consistency matters.
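The post does not show how the retrieved chunks are turned into the context string, so the following is only a sketch of one way the prompt-assembly step could look. `buildContext`, `buildMessages`, the chunk fields (`title`, `content`, `score`) and the sample data are all assumptions made for illustration; the project's actual `_buildContext` may format things differently.

```javascript
// Hypothetical stand-in for the engine's _buildContext (not shown in the post):
// number each chunk and label it with its source so the model can cite it.
function buildContext(chunks) {
  return chunks
    .map((c, i) => `[${i + 1}] ${c.title} (score ${c.score.toFixed(2)})\n${c.content}`)
    .join("\n\n");
}

// Assemble the messages array in the same order as queryStream:
// system prompt, retrieved context, prior turns, then the new question.
function buildMessages(systemPrompt, chunks, history, userMessage) {
  return [
    { role: "system", content: systemPrompt },
    { role: "system", content: `Retrieved context:\n\n${buildContext(chunks)}` },
    ...history,
    { role: "user", content: userMessage },
  ];
}

// Illustrative data only.
const retrieved = [
  { title: "CV", content: "Five years building Node.js services.", score: 0.42 },
  { title: "Job description", content: "Seeking a senior Node.js engineer.", score: 0.31 },
];
const priorTurns = [{ role: "assistant", content: "Which role are you preparing for?" }];
const messages = buildMessages(
  "You are an interview coach.",
  retrieved,
  priorTurns,
  "Ask me a question about my backend experience."
);
console.log(messages[1].content);
```

Keeping the context in its own system message, separate from the fixed instructions, makes it easy to log or inspect exactly what the model was given for each answer.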
Step 4: Dual Interfaces: Web & CLI

Web UI

The web frontend is a single HTML file with inline CSS and JavaScript: no build step, no framework, no React or Vue. It communicates with the Express backend via REST and SSE:

- File upload via multipart/form-data
- Streaming chat via Server-Sent Events (SSE)
- Quick-action buttons for common follow-up queries (coaching tips, gap analysis, mock interview)

[Figure: the setup form with job title, seniority level, and a pasted job description, ready to generate tailored interview questions.]

CLI

The CLI provides the same experience in the terminal with ANSI-coloured output:

    npm run cli

It walks you through uploading your CV, entering the job details, and then generates streaming questions. Follow-up questions work interactively. Both interfaces share the same ChatEngine class; they're thin layers over identical logic.

Edge Mode

For constrained devices, toggle Edge mode to use a compact system prompt that fits within smaller context windows.

[Figure: Edge mode activated, using a minimal prompt for devices with limited resources.]

Step 5: Testing

Tests use the Node.js built-in test runner: no Jest, no Mocha, no extra dependencies.

    import { describe, it } from "node:test";
    import assert from "node:assert/strict";
    // chunkText comes from the project's chunker module

    describe("chunkText", () => {
      it("returns single chunk for short text", () => {
        const chunks = chunkText("short text", 200, 25);
        assert.equal(chunks.length, 1);
      });

      it("maintains overlap between chunks", () => {
        // Verifies overlapping tokens between consecutive chunks
      });
    });

    npm test

Tests cover the chunker, vector store, config, prompts, and server API contract, all without needing Foundry Local running.

Adapting for Your Own Use Case

Interview Doctor is a pattern, not just a product.
You can adapt it for any domain:

What to Change      How
Domain documents    Replace files in docs/ with your content
System prompt       Edit src/prompts.js
Chunk sizes         Adjust config.chunkSize and config.chunkOverlap
Model               Change config.model (run foundry model list to see available models)
UI                  Modify public/index.html (it's a single file)

Ideas for Adaptation

- Customer support bot: ingest your product docs and FAQs
- Code review assistant: ingest coding standards and best practices
- Study guide: ingest textbooks and lecture notes
- Compliance checker: ingest regulatory documents
- Onboarding assistant: ingest company handbooks and processes

What I Learned

- Offline AI is production-ready. Foundry Local + small models like Phi-3.5 Mini are genuinely useful for focused tasks.
- You don't need vector databases for small collections. SQLite + TF-IDF is fast, simple, and has zero infrastructure overhead.
- RAG quality depends on chunking. Getting chunk sizes right for your use case is more impactful than the retrieval algorithm.
- The OpenAI-compatible API is a game-changer. Switching from cloud to local was mostly just changing the baseURL.
- Dual interfaces are easy when you share the engine. The CLI and Web UI are thin layers over the same ChatEngine class.

Performance Notes: On a typical laptop (no GPU), ingestion takes under 1 second for ~20 documents, retrieval is ~1ms, and the first LLM token arrives in 2-5 seconds. Foundry Local automatically selects the best model variant for your hardware (CUDA GPU, NPU, or CPU).

Getting Started

git clone https://github.com/leestott/interview-doctor-js.git
cd interview-doctor-js
npm install
npm run ingest
npm start       # Web UI at http://127.0.0.1:3000
# or
npm run cli     # Interactive terminal

The full source code is on GitHub. Star it, fork it, adapt it, and good luck with your interviews!
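The chunk-size and overlap settings discussed in this article are easier to reason about with a concrete splitter in hand. The sketch below is one possible chunkText, written to match the call shape used in the tests (text, chunk size, overlap). It assumes that a "token" is simply a whitespace-separated word; the project's real chunker may tokenize differently.

```javascript
// Hypothetical sketch: treats each whitespace-separated word as one token.
function chunkText(text, chunkSize = 200, chunkOverlap = 25) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - chunkOverlap; // how far each window advances
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Usage: a larger overlap repeats more context across chunk boundaries.
const sample = "one two three four five six seven eight nine ten";
console.log(chunkText(sample, 4, 1));
```

Each window starts chunkSize minus chunkOverlap tokens after the previous one, so consecutive chunks share exactly chunkOverlap tokens. Smaller chunks give more precise retrieval hits but less surrounding context per hit, which is the trade-off behind the advice to tune these values for your documents.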
Resources

- Foundry Local: Microsoft's on-device AI runtime
- Foundry Local SDK (npm): JavaScript SDK
- Foundry Local GitHub: source, samples, and documentation
- Local RAG Reference: reference RAG implementation
- Interview Doctor (JavaScript): this project's source code

Teaching AI Development Through Gamification:
Introduction

Learning AI development can feel overwhelming. Developers face abstract concepts like embeddings, prompt engineering, and workflow orchestration, topics that traditional tutorials struggle to make tangible. How do you teach someone what an embedding "feels like" or why prompt engineering matters beyond theoretical examples?

The answer lies in experiential learning through gamification. Instead of reading about AI concepts, what if developers could play a game that teaches these ideas through progressively challenging levels, immediate feedback, and real AI interactions? This article explores exactly that: building an educational adventure game that turns AI learning from abstract theory into hands-on exploration.

We'll dive into Foundry Local Learning Adventure, a JavaScript-based game that teaches AI fundamentals through five interactive levels. You'll learn how to create engaging educational experiences, integrate local AI models using Foundry Local, design progressive difficulty curves, and build cross-platform applications that run both in browsers and terminals. Whether you're an educator designing technical curriculum or a developer building learning tools, this architecture offers a practical blueprint for gamified technical education.

Why Gamification Works for Technical Learning

Traditional technical education follows a predictable pattern: read documentation, watch tutorials, attempt exercises, struggle with setup, eventually give up. The problem isn't content quality; it's engagement and friction. Gamification addresses both issues at once.

By framing learning as progression through levels, you create intrinsic motivation. Each completed challenge feels like unlocking a new ability in a game, triggering the same dopamine response that keeps players engaged in entertainment experiences. Progress is visible, achievements are celebrated, and setbacks feel like natural parts of the journey rather than personal failures.
More importantly, gamification reduces friction. Instead of "install dependencies, configure API keys, read documentation, write code, debug errors," learners simply start the game and begin playing. The game handles setup, provides guardrails, and offers immediate feedback. When a concept clicks, the game celebrates it. When learners struggle, hints appear automatically.

For AI development specifically, gamification solves a unique challenge: making probabilistic, non-deterministic systems feel approachable. Traditional programming has clear right and wrong answers, but AI outputs vary. A game can frame this variability as exploration rather than failure, teaching developers to evaluate AI responses critically while maintaining confidence.

Architecture Overview: Dual-Platform Design for Maximum Reach

The Foundry Local Learning Adventure implements a dual-platform architecture with separate but consistent implementations for web browsers and command-line terminals. This design maximizes accessibility: learners can start playing instantly in a browser, then graduate to CLI mode for the full terminal experience when they're ready to go deeper.

The web version prioritizes zero-friction onboarding. It's deployed to GitHub Pages and can also be opened locally via a simple HTTP server, with no build step and no package managers. The game starts with simulated AI responses in demo mode but, crucially, also supports real AI responses when Foundry Local is installed. The web version auto-discovers Foundry Local's dynamic port through a foundry-port.json file (written by the startup scripts) or by scanning common ports. Progress saves to localStorage, badges unlock as you complete challenges, and an AI-powered mentor named Sage guides you through a chat widget in the corner. This version suits classrooms, conference demos, and learners who want to try before committing to a full CLI setup.

The CLI version provides the full terminal experience with real AI interactions.
Built on Node.js with ES modules, this version features a custom FoundryLocalClient class that connects to Foundry Local's OpenAI-compatible REST API. Instead of relying on an external SDK, the game implements its own API client with automatic port discovery, model selection, and graceful fallback to demo mode. The terminal interface includes a rich command system ( play , hint , ask , explain , progress , badges ) and the Sage mentor provides contextual guidance throughout. Both versions implement the same five levels and learning objectives independently. The CLI uses game/src/game.js , levels.js , and mentor.js as ES modules, while the web version uses game/web/game-web.js and game-data.js . A key innovation is the automatic port discovery system, which eliminates manual configuration: // 3-tier port discovery strategy (game/src/game.js) class FoundryLocalClient { constructor() { this.commonPorts = [61341, 5272, 51319, 5000, 8080]; this.mode = 'demo'; // 'local', 'azure', or 'demo' } async initialize() { // Tier 1: CLI discovery - parse 'foundry service status' output const cliPort = await this.discoverPortViaCLI(); if (cliPort) { this.baseUrl = cliPort; this.mode = 'local'; return; } // Tier 2: Try configured URL from config.json if (await this.tryFoundryUrl(config.foundryLocal.baseUrl)) { this.mode = 'local'; return; } // Tier 3: Scan common ports for (const port of this.commonPorts) { if (await this.tryFoundryUrl(`http://127.0.0.1:${port}`)) { this.mode = 'local'; return; } } // Fallback: demo mode with simulated responses console.log('đĄ Running in demo mode (no Foundry Local detected)'); this.mode = 'demo'; } async chat(messages, options = {}) { if (this.mode === 'demo') return this.getDemoResponse(messages); const response = await fetch(`${this.baseUrl}/v1/chat/completions`, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ model: this.selectedModel, messages, temperature: options.temperature || 0.7, max_tokens: 
options.max_tokens || 300 }) }); const data = await response.json(); return data.choices[0].message.content; } } This architecture demonstrates several key principles for educational software: Progressive disclosure: Start simple (web demo mode), add complexity optionally (real AI via Foundry Local or Azure) Consistent learning outcomes: Both platforms teach the same five concepts through independently implemented but equivalent experiences Zero barriers to entry: No installation required for the web version eliminates the #1 reason learners abandon technical tutorials Automatic service discovery: The 3-tier port discovery strategy means no manual configuration, just install Foundry Local and play Graceful degradation: Three connection modes (local, Azure, demo) ensure the game always works regardless of setup Level Design: Teaching AI Concepts Through Progressive Challenges The game's five levels form a carefully designed curriculum that builds AI understanding incrementally. Each level introduces one core concept, provides hands-on practice, and validates learning before proceeding. Level 1: Meet the Model teaches the fundamental request-response pattern. Learners send their first message to an AI and see it respond. The challenge is deliberately trivial, just say hello, because the goal is building confidence. The level succeeds when the learner realizes "I can talk to an AI and it understands me." This moment of agency sets the foundation for everything else. The implementation focuses on positive reinforcement. 
In the CLI version, the Sage mentor celebrates each completion with contextual messages, while the web version displays inline celebration banners with badge animations: // Level 1 execution (game/src/game.js) async executeLevel1() { const level = this.levels.getLevel(1); this.displayLevelHeader(level); // Sage introduces the level const intro = await this.mentor.introduceLevel(1); console.log(`\nđ§ Sage: ${intro}`); const userPrompt = await this.askQuestion('\nYour prompt: '); console.log('\nđ¤ AI is thinking...'); const response = await this.client.chat([ { role: 'system', content: 'You are Sage, a friendly AI mentor.' }, { role: 'user', content: userPrompt } ]); console.log(`\nđ¨ AI Response:\n${response}`); if (response && response.length > 10) { // Sage celebrates const celebration = await this.mentor.celebrateLevelComplete(1); console.log(`\nđ§ Sage: ${celebration}`); console.log('\nđŻ You earned the Prompt Apprentice badge!'); console.log('đ +100 points'); this.progress.completeLevel(1, 100, 'đŻ Prompt Apprentice'); } } This celebration pattern repeats throughout, explicit acknowledgment of success via the Sage mentor, explanation of what was learned, and a preview of what's next. The mentor system ( game/src/mentor.js ) provides contextual encouragement using AI-generated or pre-written fallback messages, transforming abstract concepts into concrete achievements. Level 2: Prompt Mastery introduces prompt quality through comparison. The game presents a deliberately poor prompt: "tell me stuff about coding." Learners must rewrite it to be specific, contextual, and actionable. The game runs both prompts, displays results side-by-side, and asks learners to evaluate the difference. 
// Level 2: Prompt Improvement (game/src/game.js) async executeLevel2() { const level = this.levels.getLevel(2); this.displayLevelHeader(level); const intro = await this.mentor.introduceLevel(2); console.log(`\nđ§ Sage: ${intro}`); // Show the bad prompt const badPrompt = "tell me stuff about coding"; console.log(`\nâ Poor prompt: "${badPrompt}"`); console.log('\nđ¤ Getting response to bad prompt...'); const badResponse = await this.client.chat([ { role: 'user', content: badPrompt } ]); console.log(`\nđ Bad prompt result:\n${badResponse}`); // Get the learner's improved version console.log('\nâď¸ Now write a BETTER prompt about the same topic:'); const goodPrompt = await this.askQuestion('Your improved prompt: '); console.log('\nđ¤ Getting response to your prompt...'); const goodResponse = await this.client.chat([ { role: 'user', content: goodPrompt } ]); console.log(`\nđ Your prompt result:\n${goodResponse}`); // Evaluate: improved prompt should be longer and more specific const isImproved = goodPrompt.length > badPrompt.length && goodResponse.length > 0; if (isImproved) { const celebration = await this.mentor.celebrateLevelComplete(2); console.log(`\nđ§ Sage: ${celebration}`); console.log('\n⨠You earned the Prompt Engineer badge!'); console.log('đ +150 points'); this.progress.completeLevel(2, 150, '⨠Prompt Engineer'); } else { const hint = await this.mentor.provideHint(2); console.log(`\nđĄ Sage: ${hint}`); } } This comparative approach is powerful, learners don't just read about prompt engineering, they experience its impact directly. The before/after comparison makes quality differences undeniable. Level 3: Embeddings Explorer demystifies semantic search through practical demonstration. Learners search a knowledge base about Foundry Local using natural language queries. The game shows how embedding similarity works by returning relevant content even when exact keywords don't match. 
// Level 3: Embedding Search (game/src/game.js) async executeLevel3() { const level = this.levels.getLevel(3); this.displayLevelHeader(level); // Knowledge base loaded from game/data/knowledge-base.json const knowledgeBase = [ { id: 1, content: "Foundry Local runs AI models entirely on your device" }, { id: 2, content: "Embeddings convert text into numerical vectors" }, { id: 3, content: "Cosine similarity measures how related two texts are" }, // ... more entries about AI and Foundry Local ]; const query = await this.askQuestion('\nđ Search query: '); // Get embedding for user's query const queryEmbedding = await this.client.getEmbedding(query); // Get embeddings for all knowledge base entries const results = []; for (const item of knowledgeBase) { const itemEmbedding = await this.client.getEmbedding(item.content); const similarity = this.cosineSimilarity(queryEmbedding, itemEmbedding); results.push({ ...item, similarity }); } // Sort by similarity and show top matches results.sort((a, b) => b.similarity - a.similarity); console.log('\nđ Top matches:'); results.slice(0, 3).forEach((r, i) => { console.log(` ${i + 1}. (${(r.similarity * 100).toFixed(1)}%) ${r.content}`); }); } // Cosine similarity calculation (also in TaskHandler) cosineSimilarity(a, b) { const dot = a.reduce((sum, val, i) => sum + val * b[i], 0); const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0)); const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0)); return dot / (magA * magB); } // Demo mode generates pseudo-embeddings when Foundry isn't available getPseudoEmbedding(text) { // 128-dimension hash-based vector for offline demonstration const embedding = new Array(128).fill(0); for (let i = 0; i < text.length; i++) { embedding[i % 128] += text.charCodeAt(i) / 1000; } return embedding; } Learners query things like "How do I run AI offline?" and discover content about Foundry Local's offline capabilitiesâeven though the word "offline" appears nowhere in the result. 
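One detail in the cosineSimilarity helper above deserves care: it divides by the product of the two magnitudes, which is zero for an all-zero vector (for example, the pseudo-embedding of an empty string), so the result would be NaN. The level code also re-embeds every knowledge base entry on each query. The sketch below shows a guarded similarity function and a ranking step that assumes each entry carries an embedding computed once up front. The names safeCosineSimilarity and rankBySimilarity, and the embedding property on entries, are hypothetical and not taken from the game's source.

```javascript
// Hypothetical variant of the helper above: returns 0 instead of NaN when
// either vector has zero magnitude, and rejects mismatched lengths.
function safeCosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError("Vectors must have the same length");
  }
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank entries that already carry a precomputed `embedding` array, so only
// the query needs to be embedded at search time.
function rankBySimilarity(queryEmbedding, entries, topK = 3) {
  return entries
    .map((entry) => ({
      ...entry,
      similarity: safeCosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```

Caching the entry embeddings trades a little memory for one embedding call per search instead of one per knowledge base entry, which matters more as the knowledge base grows.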
When Foundry Local is running, the game calls the /v1/embeddings endpoint for real vector representations. In demo mode, a pseudo-embedding function generates 128-dimension hash-based vectors that still demonstrate the concept of similarity search. This concrete demonstration of semantic understanding beats any theoretical explanation. Level 4: Workflow Wizard teaches AI pipeline composition. Learners build a three-step workflow: summarize text â extract keywords â generate questions. Each step uses the previous output as input, demonstrating how complex AI tasks decompose into chains of simpler operations. // Level 4: Workflow Builder (game/src/game.js) async executeLevel4() { const level = this.levels.getLevel(4); this.displayLevelHeader(level); const intro = await this.mentor.introduceLevel(4); console.log(`\nđ§ Sage: ${intro}`); console.log('\nđ Enter text for the 3-step AI pipeline:'); const inputText = await this.askQuestion('Input text: '); // Step 1: Summarize console.log('\nâď¸ Step 1: Summarizing...'); const summary = await this.client.chat([ { role: 'system', content: 'Summarize this in 2 sentences.' }, { role: 'user', content: inputText } ]); console.log(` Result: ${summary}`); // Step 2: Extract keywords (chained from Step 1 output) console.log('\nđ Step 2: Extracting keywords...'); const keywords = await this.client.chat([ { role: 'system', content: 'Extract 5 important keywords.' }, { role: 'user', content: summary } ]); console.log(` Keywords: ${keywords}`); // Step 3: Generate questions (chained from Step 2 output) console.log('\nâ Step 3: Generating study questions...'); const questions = await this.client.chat([ { role: 'system', content: 'Create 3 quiz questions about these topics.' 
}, { role: 'user', content: keywords } ]); console.log(` Questions:\n${questions}`); console.log('\nâ Workflow complete!'); const celebration = await this.mentor.celebrateLevelComplete(4); console.log(`\nđ§ Sage: ${celebration}`); console.log('\n⥠You earned the Workflow Wizard badge!'); console.log('đ +250 points'); this.progress.completeLevel(4, 250, '⥠Workflow Wizard'); } This level bridges the gap between "toy examples" and real applications. Learners see firsthand how combining simple AI operations creates sophisticated functionality. Level 5: Build Your Own Tool challenges learners to create a custom AI-powered tool by selecting from pre-built templates and configuring them. Rather than asking learners to write arbitrary code, the game provides four structured templates that demonstrate how AI tools work in practice: // Level 5: Tool Builder templates (game/web/game-web.js) const TOOL_TEMPLATES = [ { id: 'summarizer', name: 'đ Text Summarizer', description: 'Summarizes long text into key points', systemPrompt: 'You are a text summarization tool. Provide concise summaries.', exampleInput: 'Paste any long article or document...' }, { id: 'translator', name: 'đ Code Translator', description: 'Translates code between programming languages', systemPrompt: 'You are a code translation tool. Convert code accurately.', exampleInput: 'function hello() { console.log("Hello!"); }' }, { id: 'reviewer', name: 'đ Code Reviewer', description: 'Reviews code for bugs, style, and improvements', systemPrompt: 'You are a code review tool. Identify issues and suggest fixes.', exampleInput: 'Paste code to review...' 
}, { id: 'custom', name: '⨠Custom Tool', description: 'Design your own AI tool with a custom system prompt', systemPrompt: '', // Learner provides this exampleInput: '' } ]; // Tool testing sends the configured system prompt + user input to Foundry Local async function testTool(template, userInput) { const response = await callFoundryAPI([ { role: 'system', content: template.systemPrompt }, { role: 'user', content: userInput } ]); console.log(`đ§ Tool output: ${response}`); return response; } This template-based approach is safer and more educational than arbitrary code execution. Learners select a template, customize its system prompt, test it with sample input, and see how the AI responds differently based on the tool's configuration. The "Custom Tool" option lets advanced learners design their own system prompts from scratch. Completing this level marks true understandingâlearners aren't just using AI, they're shaping what it can do through prompt design and tool composition. Building the Web Version: Zero-Install Educational Experience The web version demonstrates how to create educational software that requires absolutely zero setup. This is critical for workshops, classroom settings, and casual learners who won't commit to installation until they see value. The architecture is deliberately simple, vanilla JavaScript with ES6 modules, no build tools, no package managers. 
The HTML includes a multi-screen layout with a welcome screen, level selection grid, game area, and modals for progress, badges, help, and game completion: <!-- game/web/index.html --> <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Foundry Local Learning Adventure</title> <link rel="stylesheet" href="styles.css"> </head> <body> <!-- Welcome Screen with name input --> <div id="welcome-screen" class="screen active"> <h1>đŽ Foundry Local Learning Adventure</h1> <p>Master Microsoft Foundry AI - One Level at a Time!</p> <input type="text" id="player-name" placeholder="Enter your name"> <button id="start-btn">Start Adventure</button> <div id="foundry-status"><!-- Auto-detected connection status --></div> </div> <!-- Menu Screen with level grid --> <div id="menu-screen" class="screen"> <div class="level-grid"> <!-- 5 level cards with lock/unlock states --> </div> <div class="stats-bar"> <span id="points-display">0 points</span> <span id="badges-count">0/5 badges</span> </div> </div> <!-- Level Screen with task area --> <div id="level-screen" class="screen"> <div id="level-header"></div> <div id="task-area"><!-- Level-specific UI loads here --></div> <div id="response-area"></div> <div id="hint-area"></div> </div> <!-- Sage Mentor Chat Widget (fixed bottom-right) --> <div id="mentor-chat" class="mentor-widget"> <div class="mentor-header">đ§ Sage (AI Mentor)</div> <div id="mentor-messages"></div> <input type="text" id="mentor-input" placeholder="Ask Sage anything..."> </div> <script type="module" src="game-data.js"></script> <script type="module" src="game-web.js"></script> </body> </html> A critical feature of the web version is its ability to connect to a real Foundry Local instance. 
On startup, the game checks for a foundry-port.json file (written by the cross-platform start scripts) and falls back to scanning common ports: // game/web/game-web.js - Foundry Local auto-discovery let foundryConnection = { connected: false, baseUrl: null }; async function checkFoundryConnection() { // Try reading port from discovery file (written by start scripts) const discoveredPort = await readDiscoveredPort(); if (discoveredPort) { try { const resp = await fetch(`${discoveredPort}/v1/models`); if (resp.ok) { foundryConnection = { connected: true, baseUrl: discoveredPort }; updateStatusBadge('đ˘ Foundry Local Connected'); return; } } catch (e) { /* continue to port scan */ } } // Scan common Foundry Local ports const ports = [61341, 5272, 51319, 5000, 8080]; for (const port of ports) { try { const resp = await fetch(`http://127.0.0.1:${port}/v1/models`); if (resp.ok) { foundryConnection = { connected: true, baseUrl: `http://127.0.0.1:${port}` }; updateStatusBadge('đ˘ Foundry Local Connected'); return; } } catch (e) { continue; } } // Demo mode - use simulated responses from DEMO_RESPONSES updateStatusBadge('đĄ Demo Mode (install Foundry Local for real AI)'); } async function callFoundryAPI(messages) { if (!foundryConnection.connected) { return getDemoResponse(messages); // Simulated responses } const resp = await fetch(`${foundryConnection.baseUrl}/v1/chat/completions`, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ model: 'auto', messages, temperature: 0.7 }) }); const data = await resp.json(); return data.choices[0].message.content; } The web version also includes level-specific UIs: each level type has its own builder function that constructs the appropriate interface. For example, Level 2 (Prompt Improvement) shows a split-view with the bad prompt result on one side and the learner's improved prompt on the other. Level 3 (Embeddings) presents a search interface with similarity scores. 
Level 5 (Tool Builder) offers a template selector with four options (Text Summarizer, Code Translator, Code Reviewer, and Custom). This architecture teaches several patterns for web-based educational tools: LocalStorage for persistence: Progress survives page refreshes without requiring accounts or databases ES6 modules for organization: Clean separation between game data ( game-data.js ) and engine ( game-web.js ) Hybrid AI mode: Real AI when Foundry Local is available, simulated responses when it's notâsame code path for both Multi-screen navigation: Welcome, menu, level, and completion screens provide clear progression Always-available mentor: The Sage chat widget in the corner lets learners ask questions at any point Implementing the CLI Version with Real AI Integration The CLI version provides the authentic AI development experience. This version requires Node.js and Foundry Local, but rewards setup effort with genuine model interactions. Installation uses a startup script that handles prerequisites: #!/bin/bash # scripts/start-game.sh echo "đŽ Starting Foundry Local Learning Adventure..." # Check Node.js if ! command -v node &> /dev/null; then echo "â Node.js not found. Install from https://nodejs.org/" exit 1 fi # Check Foundry Local if ! command -v foundry &> /dev/null; then echo "â Foundry Local not found." echo " Install: winget install Microsoft.FoundryLocal" exit 1 fi # Start Foundry service echo "đ Starting Foundry Local service..." foundry service start # Wait for service sleep 2 # Load model echo "đŚ Loading Phi-4 model..." foundry model load phi-4 # Install dependencies echo "đĽ Installing game dependencies..." npm install # Start game echo "â Launching game..." 
npm start The game logic integrates with Foundry Local using the official SDK: // game/src/game.js import { FoundryLocalClient } from 'foundry-local-sdk'; import readline from 'readline/promises'; const client = new FoundryLocalClient({ endpoint: 'http://127.0.0.1:5272' // Default Foundry Local port }); async function getAIResponse(prompt, level) { try { const startTime = Date.now(); const completion = await client.chat.completions.create({ model: 'phi-4', messages: [ { role: 'system', content: `You are Sage, a friendly AI mentor teaching ${LEVELS[level-1].title}.` }, { role: 'user', content: prompt } ], temperature: 0.7, max_tokens: 300 }); const latency = Date.now() - startTime; console.log(`\nâąď¸ AI responded in ${latency}ms`); return completion.choices[0].message.content; } catch (error) { console.error('â AI error:', error.message); console.log('đĄ Falling back to demo mode...'); return getDemoResponse(prompt, level); } } async function playLevel(levelNumber) { const level = LEVELS[levelNumber - 1]; console.clear(); console.log(`\n${'='.repeat(60)}`); console.log(` Level ${levelNumber}: ${level.title}`); console.log(`${'='.repeat(60)}\n`); console.log(`đŻ ${level.objective}\n`); console.log(`đ ${level.description}\n`); const rl = readline.createInterface({ input: process.stdin, output: process.stdout }); const userPrompt = await rl.question('Your prompt: '); rl.close(); console.log('\nđ¤ AI is thinking...'); const response = await getAIResponse(userPrompt, levelNumber); console.log(`\nđ¨ AI Response:\n${response}\n`); // Evaluate success if (level.successCriteria(response, userPrompt)) { celebrateSuccess(level); updateProgress(levelNumber); if (levelNumber < 5) { const playNext = await askYesNo('Play next level?'); if (playNext) { await playLevel(levelNumber + 1); } } else { showGameComplete(); } } else { console.log(`\nđĄ Hint: ${level.hints[0]}\n`); const retry = await askYesNo('Try again?'); if (retry) { await playLevel(levelNumber); } } } The CLI version 
adds several enhancements that deepen learning:

- Latency visibility: Display response times so learners understand local vs cloud performance differences
- Graceful fallback: If Foundry Local fails, switch to demo mode automatically rather than crashing
- Interactive prompts: Use readline for natural command-line interaction patterns
- Progress persistence: Save to JSON files so learners can pause and resume
- Command history: Log all prompts and responses for learners to review their progression

Key Takeaways and Educational Design Principles

Building effective educational software for technical audiences requires balancing several competing concerns: accessibility vs authenticity, simplicity vs depth, guidance vs exploration. The Foundry Local Learning Adventure succeeds by making deliberate architectural choices that prioritize learner experience.

Key principles demonstrated:

- Zero-friction starts win: The web version eliminates all setup barriers, maximizing the chance learners will actually begin
- Automatic service discovery: The 3-tier port discovery strategy means no manual configuration, just install Foundry Local and play
- Progressive challenge curves build confidence: Each level introduces exactly one new concept, building on previous knowledge
- Immediate feedback accelerates learning: Learners know instantly if they succeeded, with Sage providing contextual explanations
- Real tools create transferable skills: The CLI version uses professional developer patterns (OpenAI-compatible REST APIs, ES modules, readline) that apply beyond the game
- Celebration creates emotional investment: Badges, points, and Sage's encouragement transform learning into achievement
- Dual platforms expand reach: Web attracts casual learners, CLI converts them to serious practitioners, and both support real AI
- Graceful degradation ensures reliability: Three connection modes (local, Azure, demo) mean the game always works regardless of setup

To extend this approach for your own educational projects,
consider:

- Domain-specific challenges: Adapt level structure to your technical domain (e.g., API design, database optimization, security practices)
- Multiplayer competitions: Add leaderboards and time trials to introduce social motivation
- Adaptive difficulty: Track learner performance and adjust challenge difficulty dynamically
- Sandbox modes: After completing the curriculum, provide free-play areas for experimentation
- Community sharing: Let learners share custom levels or challenges they've created

The complete implementation with all levels, both web and CLI versions, comprehensive tests, and deployment guides is available at github.com/leestott/FoundryLocal-LearningAdventure. You can play the web version immediately at leestott.github.io/FoundryLocal-LearningAdventure or clone the repository to experience the full CLI version with real AI.

Resources and Further Reading

- Foundry Local Learning Adventure Repository - Complete source code for both web and CLI versions
- Play Online Now - Try the web version instantly in your browser (supports real AI with Foundry Local installed)
- Microsoft Foundry Local Documentation - Official SDK and CLI reference
- Contributing Guide - How to contribute new levels or improvements

Building a Privacy-First Hybrid AI Briefing Tool with Foundry Local and Azure OpenAI
Introduction

Management consultants face a critical challenge: they need instant AI-powered insights from sensitive client documents, but traditional cloud-only AI solutions create unacceptable data privacy risks. Every document uploaded to a cloud API potentially exposes confidential client information, violates data residency requirements, and creates compliance headaches.

The solution lies in a hybrid architecture that combines the speed and privacy of on-device AI with the sophistication of cloud models, but only when explicitly requested. This article walks through building a production-ready briefing assistant that runs AI inference locally first, then optionally refines outputs using Azure OpenAI for executive-quality presentations.

We'll explore a sample implementation using FL-Client-Briefing-Assistant, built with Next.js 14, TypeScript, and Microsoft Foundry Local. You'll learn how to architect privacy-first AI applications, implement sub-second local inference, and design transparent hybrid workflows that give users complete control over their data.

Why Hybrid AI Architecture Matters for Enterprise Applications

Before diving into implementation details, let's understand why a hybrid approach is essential for enterprise AI applications, particularly in consulting and professional services.

Cloud-only AI services like OpenAI's GPT-4 offer remarkable capabilities, but they introduce several critical challenges. First, every API call sends your data to external servers, creating audit trails and potential exposure points. For consultants handling merger documents, financial reports, or strategic plans, this is often a non-starter. Second, cloud APIs introduce latency, typically 2-5 seconds per request due to network round-trips and queue times. Third, costs scale linearly with usage, making high-volume document analysis expensive.

Local-only AI solves privacy and latency concerns but sacrifices quality.
Small language models (SLMs) running on laptops produce quick summaries, but they lack the nuanced reasoning and polish needed for C-suite presentations. You get fast, private results that may require significant manual refinement.

The hybrid approach gives you the best of both worlds: instant, private local processing as the default, with optional cloud refinement only when quality matters most. This architecture respects data privacy by default while maintaining the flexibility to produce executive-grade outputs when needed.

Architecture Overview: Three-Layer Design for Privacy and Performance

The FL-Client-Briefing-Assistant implements a clean three-layer architecture that separates concerns and ensures privacy at every level.

At the frontend, a Next.js 14 application provides the user interface with strong TypeScript typing throughout. Users interact with four quick-action templates: document summarization, talking points generation, risk analysis, and executive summaries. The UI clearly indicates which model (local or cloud) processed each request, ensuring transparency.

The middle tier consists of Next.js API routes that act as orchestration endpoints. These routes validate requests using Zod schemas, route to the appropriate inference service, and enforce privacy settings. Critically, the API layer never persists user content unless the user has explicitly opted in via privacy settings.

The inference layer contains two distinct services. The local service uses the Foundry Local SDK to communicate with a locally running Phi-4 model (or similar SLM). This provides sub-second inference (typically 500 ms to 1 s per response), completely offline. The cloud service connects to Azure OpenAI using the official JavaScript SDK, authenticated via Managed Identity or API keys, with proper timeout and retry logic.

Setting Up Foundry Local for On-Device Inference

Foundry Local is Microsoft's runtime for running AI models entirely on your device: no internet required, no data leaving your machine.
Here's how to get it running for this application.

First, install Foundry Local on Windows using Windows Package Manager:

    winget install Microsoft.FoundryLocal

After installation, start the service and verify that it is ready:

    foundry service start
    foundry service status

The status command shows the service endpoint, which runs on a dynamic port such as http://127.0.0.1:5272. This port can change between restarts, so your application must query it programmatically or be reconfigured after each restart.

Next, load an appropriate model. For briefing tasks, Phi-4 Mini provides an excellent balance of quality and speed (run foundry model list to confirm the exact alias available on your machine):

    foundry model load phi-4

The model downloads (approximately 3.6GB) and loads into memory. This takes 2-5 minutes on first run, but the download is cached between sessions. Once loaded, inference is nearly instant, and most requests complete in under 1 second.

In your application, configure the connection in .env.local. Because the Foundry Local port is dynamic, replace **** with the port reported by foundry service status:

    FOUNDRY_LOCAL_ENDPOINT=http://127.0.0.1:****

The application uses the Foundry Local SDK to query the running service:

    import { FoundryLocalClient } from 'foundry-local-sdk';

    const client = new FoundryLocalClient({
      endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT
    });

    const response = await client.chat.completions.create({
      model: 'phi-4',
      messages: [
        { role: 'system', content: 'You are a professional consultant assistant.' },
        { role: 'user', content: 'Summarize this document: ...' }
      ],
      max_tokens: 500,
      temperature: 0.3
    });

This code demonstrates several best practices:

- Explicit model specification: Always name the model to ensure consistency across environments
- System message framing: Set the appropriate professional context for consulting use cases
- Conservative temperature: Use 0.3 for factual summarization tasks to reduce hallucination
- Token limits: Cap outputs to prevent excessive generation times and costs

Implementing Privacy-First API Routes

The Next.js API routes form the security boundary of the application.
Every request must be validated, sanitized, and routed according to privacy settings before reaching inference services.

Here's the core local inference route (app/api/briefing/local/route.ts):

    import { NextRequest, NextResponse } from 'next/server';
    import { z } from 'zod';
    import { FoundryLocalClient } from 'foundry-local-sdk';

    const RequestSchema = z.object({
      prompt: z.string().min(10).max(5000),
      template: z.enum(['summary', 'talking-points', 'risk-analysis', 'executive']),
      context: z.string().optional()
    });

    export async function POST(request: NextRequest) {
      try {
        // Validate and parse request body
        const body = await request.json();
        const validated = RequestSchema.parse(body);

        // Initialize Foundry Local client
        const client = new FoundryLocalClient({
          endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT!
        });

        // Build system prompt based on template
        const systemPrompts = {
          'summary': 'You are a consultant creating concise document summaries.',
          'talking-points': 'You are preparing structured talking points for meetings.',
          'risk-analysis': 'You are analyzing risks and opportunities systematically.',
          'executive': 'You are crafting executive-level briefing notes.'
        };

        // Execute local inference and measure how long it takes
        const startTime = Date.now();
        const completion = await client.chat.completions.create({
          model: 'phi-4',
          messages: [
            { role: 'system', content: systemPrompts[validated.template] },
            { role: 'user', content: validated.prompt }
          ],
          temperature: 0.3,
          max_tokens: 500
        });
        const latency = Date.now() - startTime;

        // Return structured response with metadata
        return NextResponse.json({
          content: completion.choices[0].message.content,
          model: 'phi-4 (local)',
          latency_ms: latency,
          tokens: completion.usage?.total_tokens,
          timestamp: new Date().toISOString()
        });
      } catch (error) {
        if (error instanceof z.ZodError) {
          return NextResponse.json(
            { error: 'Invalid request format', details: error.errors },
            { status: 400 }
          );
        }
        console.error('Local inference error:', error);
        // `error` is typed `unknown` in a TypeScript catch clause, so narrow it before reading .message
        const message = error instanceof Error ? error.message : 'Unknown error';
        return NextResponse.json(
          { error: 'Inference failed', message },
          { status: 500 }
        );
      }
    }

This implementation demonstrates several critical security and quality patterns:

- Request validation with Zod: Every field is type-checked and bounded before processing, so malformed or oversized inputs are rejected early
- Template-based system prompts: Different use cases get optimized prompts, improving output quality and consistency
- Comprehensive error handling: Validation errors, inference failures, and network issues are caught and reported with appropriate HTTP status codes
- Performance tracking: Latency measurement enables monitoring and helps users understand response times
- Metadata enrichment: Responses include model attribution, token usage, and timestamps for auditing

The cloud refinement route follows a similar pattern but adds privacy checks:

    import { AzureOpenAI } from 'openai';

    export async function POST(request: NextRequest) {
      try {
        const body = await request.json();
        const validated = RequestSchema.parse(body);

        // Check privacy settings from cookie/header
        const confidentialMode = request.cookies.get('confidential-mode')?.value === 'true';
        if (confidentialMode) {
          return NextResponse.json(
            { error: 'Cloud refinement disabled in confidential mode' },
            { status: 403 }
          );
        }

        // Proceed with Azure OpenAI call only if privacy allows
        const client = new AzureOpenAI({
          endpoint: process.env.AZURE_OPENAI_ENDPOINT,
          apiKey: process.env.AZURE_OPENAI_KEY,
          deployment: process.env.AZURE_OPENAI_DEPLOYMENT,
          apiVersion: '2024-10-21' // assumed API version; use one your resource supports
        });

        const completion = await client.chat.completions.create({
          model: process.env.AZURE_OPENAI_DEPLOYMENT!,
          messages: [/* ... */],
          temperature: 0.5, // Slightly higher for creative refinement
          max_tokens: 800
        });

        return NextResponse.json({
          content: completion.choices[0].message.content,
          model: `${process.env.AZURE_OPENAI_DEPLOYMENT} (cloud)`,
          privacy_notice: 'Content processed by Azure OpenAI',
          // ... metadata
        });
      } catch (error) {
        // Error handling
      }
    }

The confidential mode check is crucial: it ensures that even if a user accidentally clicks the refinement button, no data leaves the device when privacy mode is enabled. Note that the server reads the setting from a cookie, so the client must keep that cookie in step with the user's preference. This fail-safe design prevents data leakage through UI mistakes or automated workflows.

Building the Frontend: Transparent Privacy Controls

The user interface must make privacy decisions explicit and visible. Users need to understand which AI service processed their content and make informed choices about cloud refinement.
The main briefing interface (app/page.tsx) implements this transparency through clear visual indicators:

    'use client';

    import { useState, useEffect } from 'react';
    import { PrivacySettings } from '@/components/PrivacySettings';

    // Shape of the API response plus the fields added client-side
    type BriefingResult = {
      content: string;
      model: string;
      latency_ms: number;
      template: string;
      processedBy: 'local' | 'cloud';
    };

    export default function BriefingAssistant() {
      const [confidentialMode, setConfidentialMode] = useState(true); // Privacy by default
      const [content, setContent] = useState('');
      const [result, setResult] = useState<BriefingResult | null>(null);
      const [loading, setLoading] = useState(false);

      // Load privacy preference from localStorage
      useEffect(() => {
        const saved = localStorage.getItem('confidential-mode');
        if (saved !== null) {
          setConfidentialMode(saved === 'true');
        }
      }, []);

      async function generateBriefing(template: string, useCloud: boolean = false) {
        if (useCloud && confidentialMode) {
          alert('Cloud refinement is disabled in confidential mode. Adjust settings to enable.');
          return;
        }

        setLoading(true);
        const endpoint = useCloud ? '/api/briefing/cloud' : '/api/briefing/local';

        try {
          const response = await fetch(endpoint, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ prompt: content, template })
          });
          const data = await response.json();
          if (!response.ok) {
            throw new Error(data.error ?? 'Request failed');
          }
          // Keep the template so the refine button can reuse it
          setResult({ ...data, template, processedBy: useCloud ? 'cloud' : 'local' });
        } catch (error) {
          console.error('Briefing generation failed:', error);
        } finally {
          setLoading(false);
        }
      }

      return (
        <div className="briefing-assistant">
          <header>
            <h1>Client Briefing Assistant</h1>
            <div className="status-bar">
              <span className={confidentialMode ? 'confidential' : 'standard'}>
                {confidentialMode ? 'Confidential Mode' : 'Standard Mode'}
              </span>
              <PrivacySettings
                confidentialMode={confidentialMode}
                onChange={setConfidentialMode}
              />
            </div>
          </header>

          <div className="quick-actions">
            <button onClick={() => generateBriefing('summary')}>Summarize Document</button>
            <button onClick={() => generateBriefing('talking-points')}>Generate Talking Points</button>
            <button onClick={() => generateBriefing('risk-analysis')}>Risk Analysis</button>
            <button onClick={() => generateBriefing('executive')}>Executive Summary</button>
          </div>

          <textarea
            value={content}
            onChange={(e) => setContent(e.target.value)}
            placeholder="Paste client document or meeting notes here..."
          />

          {result && (
            <div className="result-card">
              <div className="result-header">
                <span className="model-badge">{result.model}</span>
                <span className="latency">{result.latency_ms}ms</span>
              </div>
              <div className="result-content">{result.content}</div>
              {result.processedBy === 'local' && !confidentialMode && (
                <button
                  onClick={() => generateBriefing(result.template, true)}
                  className="refine-btn"
                >
                  Refine for Executive Presentation
                </button>
              )}
            </div>
          )}
        </div>
      );
    }

This interface design embodies several principles of responsible AI UX:

- Privacy by default: Confidential mode is enabled unless explicitly changed, so cloud usage always requires multiple intentional actions
- Clear attribution: Every result shows which model generated it and how long it took, building user trust through transparency
- Conditional refinement: The cloud refinement button only appears when privacy allows and local inference has completed, preventing premature cloud requests
- Persistent settings: Privacy preferences are stored in localStorage, respecting user choices across sessions
- Visual status indicators: The header always shows the current privacy mode, styled distinctly for confidential and standard

Testing Privacy and Performance Requirements

A privacy-first application demands rigorous testing to ensure data never
leaks unintentionally. The project includes comprehensive test suites using Vitest for unit tests and Playwright for end-to-end scenarios.

Here's a critical privacy test (tests/privacy.test.ts):

    import { describe, it, expect, beforeEach } from 'vitest';
    import { TestUtils } from './utils/test-helpers';

    describe('Privacy Controls', () => {
      let testUtils: TestUtils;

      beforeEach(() => {
        testUtils = new TestUtils();
        testUtils.enableConfidentialMode();
      });

      it('should prevent cloud API calls when confidential mode is enabled', async () => {
        const response = await testUtils.requestBriefing({
          template: 'summary',
          prompt: 'Confidential merger document...',
          cloud: true
        });
        expect(response.status).toBe(403);
        expect(response.error).toContain('disabled in confidential mode');
      });

      it('should allow local inference in confidential mode', async () => {
        const response = await testUtils.requestBriefing({
          template: 'summary',
          prompt: 'Confidential merger document...',
          cloud: false
        });
        expect(response.status).toBe(200);
        expect(response.model).toContain('local');
        expect(response.content).toBeTruthy();
      });

      it('should not persist sensitive content without opt-in', async () => {
        await testUtils.requestBriefing({
          template: 'executive',
          prompt: 'Strategic acquisition plan...',
          cloud: false
        });
        const history = await testUtils.getConversationHistory();
        expect(history).toHaveLength(0); // No storage by default
      });

      it('should support opt-in history with explicit consent', async () => {
        testUtils.enableHistorySaving();
        await testUtils.requestBriefing({
          template: 'executive',
          prompt: 'Strategic acquisition plan...',
          cloud: false
        });
        const history = await testUtils.getConversationHistory();
        expect(history).toHaveLength(1);
        expect(history[0].prompt).toContain('acquisition');
      });
    });

Performance testing ensures local inference meets the sub-second requirement:

    // Nearest-rank percentile helper (illustrative; the project's own helper may differ)
    function calculatePercentile(values: number[], percentile: number): number {
      const sorted = [...values].sort((a, b) => a - b);
      const rank = Math.ceil((percentile / 100) * sorted.length);
      return sorted[Math.max(0, rank - 1)];
    }

    describe('Performance SLA', () => {
      const testUtils = new TestUtils();

      it('should complete local inference in under 1 second', async () => {
        const samples: number[] = [];
        for (let i = 0; i < 10; i++) {
          const start = Date.now();
          await testUtils.requestBriefing({
            template: 'summary',
            prompt: 'Standard 500-word document...',
            cloud: false
          });
          samples.push(Date.now() - start);
        }
        const p95 = calculatePercentile(samples, 95);
        expect(p95).toBeLessThan(1000); // 95th percentile under 1s
      });

      it('should handle 5 concurrent requests without degradation', async () => {
        const requests = Array(5).fill(null).map(() =>
          testUtils.requestBriefing({
            template: 'talking-points',
            prompt: 'Meeting agenda...',
            cloud: false
          })
        );
        const results = await Promise.all(requests);
        expect(results.every(r => r.status === 200)).toBe(true);
        expect(results.every(r => r.latency_ms < 2000)).toBe(true);
      });
    });

These tests validate the core promise: local inference is fast, private, and reliable under realistic loads.

Deployment Considerations and Production Readiness

Moving from development to production requires addressing several operational concerns: model distribution, environment configuration, monitoring, and incident response.

For Foundry Local deployment, ensure IT teams pre-install the runtime and required models on consultant laptops. Use MDM (Mobile Device Management) systems or Group Policy to automate model downloads during onboarding. Models can be cached in shared network locations to avoid redundant downloads across teams.

Environment configuration should separate local and cloud credentials cleanly:

    # .env.local (local development)
    FOUNDRY_LOCAL_ENDPOINT=http://127.0.0.1:5272
    AZURE_OPENAI_ENDPOINT=https://your-org.openai.azure.com
    AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
    AZURE_OPENAI_KEY=your-key-here

    # For production, use Azure Managed Identity instead of API keys
    USE_MANAGED_IDENTITY=true

Managed Identity eliminates API key management: the application authenticates through Microsoft Entra ID (formerly Azure AD), with permissions controlled via Azure role-based access control (RBAC) assignments. This prevents key leakage and removes the need for manual key rotation.

Monitoring should track both local and cloud usage patterns.
Implement structured logging with clear privacy labels:

    logger.info('Briefing generated', {
      model: 'local',
      template: 'summary',
      latency_ms: 847,
      tokens: 312,
      privacy_mode: 'confidential',
      user_id: hash(userId), // Never log raw user IDs
      timestamp: new Date().toISOString()
    });

This approach enables operational insights (average latency, most-used templates, error rates) without exposing sensitive content or user identities.

For incident response, establish clear escalation paths. If Foundry Local fails, the application should degrade gracefully: inform users that local inference is unavailable and offer cloud-only mode (with explicit consent). If cloud services fail, local inference continues uninterrupted, ensuring the application remains useful even during Azure outages.

Key Takeaways and Next Steps

Building a privacy-first hybrid AI application requires careful architectural decisions that prioritize user data protection while maintaining high-quality outputs. The FL-Client-Briefing-Assistant demonstrates that you can achieve sub-second local inference, transparent privacy controls, and optional cloud refinement in a production-ready package.
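The graceful-degradation behaviour described in the incident-response guidance can be sketched in a few lines. This is a minimal illustration, not code from the FL-Client-Briefing-Assistant repository: the names `generateWithFallback`, `Inference`, `FallbackOptions`, and `confirmCloud` are all hypothetical, and the two inference functions are injected so the policy can be exercised without a running model.

```typescript
// Hypothetical sketch: none of these names come from the sample repository.
type Inference = (prompt: string) => Promise<string>;

interface FallbackOptions {
  confidentialMode: boolean;            // when true, the cloud is never used
  confirmCloud: () => Promise<boolean>; // asks the user for explicit consent
}

interface BriefingOutcome {
  content: string | null;
  processedBy: 'local' | 'cloud' | 'none';
  notice?: string;
}

async function generateWithFallback(
  prompt: string,
  local: Inference,
  cloud: Inference,
  options: FallbackOptions
): Promise<BriefingOutcome> {
  try {
    // Local inference is always attempted first.
    return { content: await local(prompt), processedBy: 'local' };
  } catch {
    // Local runtime unavailable: never fall through to the cloud silently.
    if (options.confidentialMode) {
      return {
        content: null,
        processedBy: 'none',
        notice: 'Local inference is unavailable and cloud use is disabled in confidential mode.'
      };
    }
    if (!(await options.confirmCloud())) {
      return {
        content: null,
        processedBy: 'none',
        notice: 'Local inference is unavailable; cloud processing was declined.'
      };
    }
    return {
      content: await cloud(prompt),
      processedBy: 'cloud',
      notice: 'Content processed by Azure OpenAI'
    };
  }
}
```

The design choice worth noting is that a local failure never triggers a cloud call on its own: confidential mode is checked first, and consent is requested only when the privacy setting already permits cloud use.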
Key lessons from this implementation:

- Privacy must be the default, not an opt-in feature: confidential mode should require explicit action to disable
- Transparency builds trust: always show users which model processed their data and how long it took
- Fallback strategies ensure reliability: graceful degradation when services fail keeps the application useful
- Testing validates promises: comprehensive tests for privacy, performance, and functionality are non-negotiable
- Operational visibility without privacy leaks: structured logging enables monitoring without exposing sensitive content

To extend this application, consider adding:

- Document parsing: Integrate PDF, DOCX, and PPTX extractors to analyze file uploads directly
- Multi-document synthesis: Combine insights from multiple client documents into unified briefings
- Custom templates: Allow consultants to define their own briefing formats and save them for reuse
- Offline mode indicators: Detect network connectivity and disable cloud features automatically
- Audit logging: For regulated industries, implement immutable audit trails showing when cloud refinement was used

The full implementation, including all code, tests, and deployment guides, is available at github.com/leestott/FL-Client-Briefing-Assistant. Clone the repository, follow the setup guide, and experience privacy-first AI in action.

Resources and Further Reading

- FL-Client-Briefing-Assistant Repository - Complete source code and documentation
- Microsoft Foundry Local Documentation - Official runtime documentation and API reference
- Azure OpenAI Service - Cloud refinement integration guide
- Project Specification - Detailed requirements and acceptance criteria
- Implementation Guide - Architecture decisions and design patterns
- Testing Guide - How to run and interpret comprehensive test suites

Real-Time AI Streaming with Azure OpenAI and SignalR
TL;DR

We'll build a real-time AI app where Azure OpenAI streams responses and SignalR broadcasts them live to an Angular client. Users see answers appear incrementally, just like ChatGPT, while Azure SignalR Service handles scale. You'll learn the architecture, streaming code, Angular integration, and optional enhancements like typing indicators and multi-agent scenarios.

Why This Matters

Modern users expect instant feedback. Waiting for a full AI response feels slow and breaks engagement. Streaming responses:

- Reduces perceived latency: Users see content as it's generated.
- Improves UX: Mimics ChatGPT's typing effect.
- Keeps users engaged: Especially for long-form answers.
- Scales for enterprise: Azure SignalR Service handles thousands of concurrent connections.

What you'll build

- A SignalR Hub that calls Azure OpenAI with streaming enabled and forwards partial output to clients as it arrives.
- An Angular client that connects over WebSockets/SSE to the hub and renders partial content with a typing indicator.
- An optional Azure SignalR Service layer for scalable connection management (thousands to millions of long-lived connections).

References: SignalR hosting & scale; Azure SignalR Service concepts.

Architecture

The hub calls Azure OpenAI with streaming enabled (await foreach over updates) and broadcasts partials to clients. Azure SignalR Service (optional) offloads connection scale and removes sticky-session complexity in multi-node deployments.

References: Streaming code pattern; scale/ARR affinity; Azure SignalR integration.
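To make the event flow in this architecture concrete, here is a minimal TypeScript sketch of how a client might fold the hub's messages into per-message state. The event names typing, partial, and completed are the ones the hub in this article sends; the `HubEvent`, `MessageState`, and `applyEvent` names are hypothetical and exist only for illustration.

```typescript
// Hypothetical client-side model of the hub's event stream.
type HubEvent =
  | { type: 'typing'; messageId: string; on: boolean }
  | { type: 'partial'; messageId: string; text: string }
  | { type: 'completed'; messageId: string };

interface MessageState {
  text: string;    // partial chunks concatenated in arrival order
  typing: boolean; // whether the typing indicator should be shown
  done: boolean;   // set once the hub signals completion
}

type StreamState = Record<string, MessageState>;

// Pure reducer: returns a new state object instead of mutating the old one.
function applyEvent(state: StreamState, event: HubEvent): StreamState {
  const current = state[event.messageId] ?? { text: '', typing: false, done: false };
  switch (event.type) {
    case 'typing':
      return { ...state, [event.messageId]: { ...current, typing: event.on } };
    case 'partial':
      return { ...state, [event.messageId]: { ...current, text: current.text + event.text } };
    case 'completed':
      return { ...state, [event.messageId]: { ...current, typing: false, done: true } };
  }
}

// Usage: fold a sequence of hub events into the final state.
const events: HubEvent[] = [
  { type: 'typing', messageId: 'm1', on: true },
  { type: 'partial', messageId: 'm1', text: 'Hello' },
  { type: 'partial', messageId: 'm1', text: ', world' },
  { type: 'completed', messageId: 'm1' }
];
const finalState = events.reduce(applyEvent, {} as StreamState);
```

Keying state by message ID keeps overlapping responses from interleaving their text, which a single shared buffer would not guarantee.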
Prerequisites

- Azure OpenAI resource with a deployed model (e.g., gpt-4o or gpt-4o-mini)
- .NET 8 API + ASP.NET Core SignalR backend
- Angular 16+ frontend (using @microsoft/signalr)

Step-by-Step Implementation

1) Backend: ASP.NET Core + SignalR

Install packages

    # SignalR server support ships with the ASP.NET Core shared framework on .NET 8,
    # so no separate SignalR server package is needed.
    dotnet add package Azure.AI.OpenAI --prerelease
    dotnet add package Azure.Identity
    dotnet add package Microsoft.Extensions.AI
    dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
    # Optional (managed scale): Azure SignalR Service
    dotnet add package Microsoft.Azure.SignalR

Using DefaultAzureCredential (Entra ID) avoids storing raw keys in code and is the recommended auth model for Azure services.

Program.cs

    var builder = WebApplication.CreateBuilder(args);

    builder.Services.AddSignalR();
    // To offload connection management to Azure SignalR Service, use this instead:
    // builder.Services.AddSignalR().AddAzureSignalR();

    builder.Services.AddSingleton<AiStreamingService>();

    var app = builder.Build();
    app.MapHub<ChatHub>("/chat");
    app.Run();

AiStreamingService.cs - streams content from Azure OpenAI

    using Azure.AI.OpenAI;
    using Azure.Identity;
    using Microsoft.Extensions.AI;

    public class AiStreamingService
    {
        private readonly IChatClient _chatClient;

        public AiStreamingService(IConfiguration config)
        {
            var endpoint = new Uri(config["AZURE_OPENAI_ENDPOINT"]!);
            var deployment = config["AZURE_OPENAI_DEPLOYMENT"]!; // e.g., "gpt-4o-mini"
            var azureClient = new AzureOpenAIClient(endpoint, new DefaultAzureCredential());
            _chatClient = azureClient.GetChatClient(deployment).AsIChatClient();
        }

        public async IAsyncEnumerable<string> StreamReplyAsync(string userMessage)
        {
            var messages = new List<ChatMessage>
            {
                new(ChatRole.System, "You are a helpful assistant."),
                new(ChatRole.User, userMessage)
            };

            await foreach (var update in _chatClient.GetStreamingResponseAsync(messages))
            {
                // update.Text contains only the text parts; tool calls and annotations are skipped
                if (!string.IsNullOrEmpty(update.Text))
                    yield return update.Text;
            }
        }
    }

Modern .NET AI extensions (Microsoft.Extensions.AI) expose a unified streaming pattern via IChatClient.GetStreamingResponseAsync.

ChatHub.cs - pushes partials to the caller

    using Microsoft.AspNetCore.SignalR;

    public class ChatHub : Hub
    {
        private readonly AiStreamingService _ai;

        public ChatHub(AiStreamingService ai) => _ai = ai;

        // Client calls: connection.invoke("AskAi", prompt)
        public async Task AskAi(string prompt)
        {
            var messageId = Guid.NewGuid().ToString("N");
            await Clients.Caller.SendAsync("typing", messageId, true);

            await foreach (var partial in _ai.StreamReplyAsync(prompt))
            {
                await Clients.Caller.SendAsync("partial", messageId, partial);
            }

            await Clients.Caller.SendAsync("typing", messageId, false);
            await Clients.Caller.SendAsync("completed", messageId);
        }
    }

2) Frontend: Angular client with @microsoft/signalr

Install the SignalR client

    npm i @microsoft/signalr

Create a SignalR service (Angular)

    // src/app/services/ai-stream.service.ts
    import { Injectable } from '@angular/core';
    import * as signalR from '@microsoft/signalr';
    import { BehaviorSubject, Observable } from 'rxjs';

    @Injectable({ providedIn: 'root' })
    export class AiStreamService {
      private connection?: signalR.HubConnection;
      private typing$ = new BehaviorSubject<boolean>(false);
      private partial$ = new BehaviorSubject<string>('');
      private completed$ = new BehaviorSubject<boolean>(false);

      get typing(): Observable<boolean> { return this.typing$.asObservable(); }
      get partial(): Observable<string> { return this.partial$.asObservable(); }
      get completed(): Observable<boolean> { return this.completed$.asObservable(); }

      async start(): Promise<void> {
        this.connection = new signalR.HubConnectionBuilder()
          .withUrl('/chat') // same origin; use absolute URL if CORS
          .withAutomaticReconnect()
          .configureLogging(signalR.LogLevel.Information)
          .build();

        this.connection.on('typing', (_id: string, on: boolean) => this.typing$.next(on));
        this.connection.on('partial', (_id: string, text: string) => {
          // Append incremental content
          this.partial$.next((this.partial$.value || '') + text);
        });
        this.connection.on('completed', (_id: string) => this.completed$.next(true));

        await this.connection.start();
      }

      async ask(prompt: string): Promise<void> {
        // Reset state per request
        this.partial$.next('');
        this.completed$.next(false);
        await this.connection?.invoke('AskAi', prompt);
      }
    }

Angular component

    // src/app/components/ai-chat/ai-chat.component.ts
    import { Component, OnInit } from '@angular/core';
    import { AiStreamService } from '../../services/ai-stream.service';

    @Component({
      selector: 'app-ai-chat',
      templateUrl: './ai-chat.component.html',
      styleUrls: ['./ai-chat.component.css']
    })
    export class AiChatComponent implements OnInit {
      prompt = '';
      output = '';
      typing = false;
      done = false;

      constructor(private ai: AiStreamService) {}

      async ngOnInit() {
        await this.ai.start();
        this.ai.typing.subscribe(on => this.typing = on);
        this.ai.partial.subscribe(text => this.output = text);
        this.ai.completed.subscribe(done => this.done = done);
      }

      async send() {
        this.output = '';
        this.done = false;
        await this.ai.ask(this.prompt);
      }
    }

HTML Template (the [(ngModel)] binding requires FormsModule to be imported)

    <!-- src/app/components/ai-chat/ai-chat.component.html -->
    <div class="chat">
      <div class="prompt">
        <input [(ngModel)]="prompt" placeholder="Ask me anything..." />
        <button (click)="send()">Send</button>
      </div>
      <div class="response">
        <pre>{{ output }}</pre>
        <div class="typing" *ngIf="typing">Assistant is typing...</div>
        <div class="done" *ngIf="done">Completed</div>
      </div>
    </div>

Streaming modes, content filters, and UX

Azure OpenAI streaming interacts with content filtering in two ways:

- Default streaming: The service buffers output into content chunks and runs content filters before each chunk is emitted; you still stream, but not necessarily token-by-token.
- Asynchronous Filter (optional): The service returns token-level updates immediately and runs filters asynchronously. You get ultra-smooth streaming but must handle delayed moderation signals (e.g., redaction or halting the stream).

Best practices

- Append partials in small batches client-side to avoid DOM thrash; finalize formatting on "completed".
- Log full messages server-side only after completion to keep histories consistent (mirrors agent frameworks).

Security & compliance

- Auth: Prefer Microsoft Entra ID (DefaultAzureCredential) to avoid key sprawl; use RBAC and Managed Identities where possible.
- Secrets: Store Azure SignalR connection strings in Key Vault and rotate periodically; never hardcode.
- CORS & cross-domain: When hosting frontend and hub on different origins, configure CORS and use absolute URLs in withUrl(...).

Connection management & scaling tips

- Persistent connection load: SignalR consumes TCP resources; separate heavy real-time workloads or use Azure SignalR to protect other apps.
- Sticky sessions (self-hosted): Required in most multi-server scenarios unless WebSockets-only + SkipNegotiation applies; Azure SignalR removes this requirement.

Learn more

- AI-Powered Group Chat sample (ASP.NET Core)
- Azure OpenAI .NET client (auth & streaming)
- SignalR JavaScript Client

AI Repo of the Week: Generative AI for Beginners with JavaScript
Introduction

Ready to explore the fascinating world of Generative AI using your JavaScript skills? This week's featured repository, Generative AI for Beginners with JavaScript, is your launchpad into the future of application development. Whether you're just starting out or looking to expand your AI toolbox, this open-source GitHub resource offers a rich, hands-on journey.

It includes interactive lessons, quizzes, and even time-travel storytelling featuring historical legends like Leonardo da Vinci and Ada Lovelace. Each chapter combines narrative-driven learning with practical exercises, helping you understand foundational AI concepts and apply them directly in code. It's immersive, educational, and genuinely fun.

What You'll Learn

1. Foundations of Generative AI and LLMs
Start with the basics: What is generative AI? How do large language models (LLMs) work? This chapter lays the groundwork for how these technologies are transforming JavaScript development.

2. Build Your First AI-Powered App
Walk through setting up your environment and creating your first AI app. Learn how to configure prompts and unlock the potential of AI in your own projects.

3. Prompt Engineering Essentials
Get hands-on with prompt engineering techniques that shape how AI models respond. Explore strategies for crafting prompts that are clear, targeted, and effective.

4. Structured Output with JSON
Learn how to guide the model to return structured data formats like JSON, which is critical for integrating AI into real-world applications.

5. Retrieval-Augmented Generation (RAG)
Go beyond static prompts by combining LLMs with external data sources. Discover how RAG lets your app pull in live, contextual information for more intelligent results.

6. Function Calling and Tool Use
Give your LLM new powers! Learn how to connect your own functions and tools to your app, enabling more dynamic and actionable AI interactions.

7. Model Context Protocol (MCP)
Dive into MCP, a new standard for organizing prompts, tools, and resources. Learn how it simplifies AI app development and fosters consistency across projects.

8. Enhancing MCP Clients with LLMs
Build on what you've learned by integrating LLMs directly into your MCP clients. See how to make them smarter, faster, and more helpful.

More chapters coming soon. Watch the repo for updates!

Companion App: Interact with History

Experience the power of generative AI in action through the companion web app, where you can chat with historical figures and witness how JavaScript brings AI to life in real time.

Conclusion

Generative AI for Beginners with JavaScript is more than a course: it's an adventure into how storytelling, coding, and AI can come together to create something fun and educational. Whether you're here to upskill, experiment, or build the next big thing, this repository is your all-in-one resource to get started with confidence.

Jump into the future of development: check out the repo and start building with AI today!

Add speech input & output to your app with the free browser APIs
One of the amazing benefits of modern machine learning is that computers can reliably turn text into speech, or transcribe speech into text, across multiple languages and accents. We can then use those capabilities to make our web apps more accessible for anyone who has a situational, temporary, or chronic issue that makes typing difficult. That describes so many people - for example, a parent holding a squirmy toddler in their arms, an athlete with a broken arm, or an individual with Parkinson's disease.

There are two approaches we can use to add speech capabilities to our apps:

- Use the built-in browser APIs: the SpeechRecognition API and SpeechSynthesis API.
- Use a cloud-based service, like the Azure Speech API.

Which one to use? The great thing about the browser APIs is that they're free and available in most modern browsers and operating systems. The drawback of the APIs is that they're often not as powerful and flexible as cloud-based services, and the speech output often sounds more robotic. There are also a few niche browser/OS combos where the built-in APIs don't work. That's why we decided to add both options to our most popular RAG chat solution, to give developers the option to decide for themselves.

However, in this post, I'm going to show you how to add speech capabilities using the free built-in browser APIs, since free APIs are often easier to get started with and it's important to do what we can to improve the accessibility of our apps. The GIF below shows the end result, a chat app with both speech input and output buttons:

All of the code described in this post is part of openai-chat-vision-quickstart, so you can grab the full code yourself after seeing how it works.

Speech input with SpeechRecognition API

To make it easier to add a speech input button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechInputButton.
First I construct the speech input button element with an instance of the SpeechRecognition API, making sure to use the browser's preferred language if one is set:

    class SpeechInputButton extends HTMLElement {
      constructor() {
        super();
        this.isRecording = false;
        // Some browsers expose the API only under a webkit prefix.
        const SpeechRecognition =
          window.SpeechRecognition || window.webkitSpeechRecognition;
        if (!SpeechRecognition) {
          // Same event name as the errors dispatched from startRecording().
          // Listeners are usually not attached yet at construction time,
          // so startRecording() repeats this check.
          this.dispatchEvent(
            new CustomEvent("speech-input-error", {
              detail: { error: "SpeechRecognition not supported" },
            })
          );
          return;
        }
        this.speechRecognition = new SpeechRecognition();
        this.speechRecognition.lang = navigator.language || navigator.userLanguage;
        this.speechRecognition.interimResults = false;
        this.speechRecognition.continuous = true;
        this.speechRecognition.maxAlternatives = 1;
      }

Then I define the connectedCallback() method that will be called whenever this custom element has been added to the DOM. When that happens, I define the inner HTML to render a button and attach event listeners for both mouse and keyboard events. Since we want this to be fully accessible, keyboard support is important.

      connectedCallback() {
        this.innerHTML = `
          <button class="btn btn-outline-secondary" type="button"
            title="Start recording (Shift + Space)">
            <i class="bi bi-mic"></i>
          </button>`;
        this.recordButton = this.querySelector('button');
        this.recordButton.addEventListener('click', () => this.toggleRecording());
        document.addEventListener('keydown', this.handleKeydown.bind(this));
      }

      handleKeydown(event) {
        if (event.key === 'Escape') {
          this.abortRecording();
        } else if (event.key === ' ' && event.shiftKey) { // Shift + Space
          event.preventDefault();
          this.toggleRecording();
        }
      }

      toggleRecording() {
        if (this.isRecording) {
          this.stopRecording();
        } else {
          this.startRecording();
        }
      }

The majority of the code is in the startRecording function. It sets up a listener for the "result" event from the SpeechRecognition instance, which contains the transcribed text.
It also sets up a listener for the "end" event, which is triggered either automatically after a few seconds of silence (in some browsers) or when the user ends the recording by clicking the button. Finally, it sets up a listener for any "error" events. Once all listeners are ready, it calls start() on the SpeechRecognition instance and styles the button to be in an active state.

      startRecording() {
        if (this.speechRecognition == null) {
          this.dispatchEvent(
            new CustomEvent("speech-input-error", {
              detail: { error: "SpeechRecognition not supported" },
            })
          );
          // Nothing to record with, so stop before touching the missing instance.
          return;
        }

        this.speechRecognition.onresult = (event) => {
          // Concatenate the top alternative of every result received so far.
          let input = "";
          for (const result of event.results) {
            input += result[0].transcript;
          }
          this.dispatchEvent(
            new CustomEvent("speech-input-result", {
              detail: { transcript: input },
            })
          );
        };

        this.speechRecognition.onend = () => {
          this.isRecording = false;
          this.renderButtonOff();
          this.dispatchEvent(new Event("speech-input-end"));
        };

        this.speechRecognition.onerror = (event) => {
          if (this.speechRecognition) {
            this.speechRecognition.stop();
            if (event.error === "no-speech") {
              this.dispatchEvent(
                new CustomEvent("speech-input-error", {
                  detail: { error: "No speech was detected. Please check your system audio settings and try again." },
                })
              );
            } else if (event.error === "language-not-supported") {
              this.dispatchEvent(
                new CustomEvent("speech-input-error", {
                  detail: { error: "The selected language is not supported. Please try a different language." },
                })
              );
            } else if (event.error !== "aborted") {
              // "aborted" is the result of the user pressing Escape, so it is not reported.
              this.dispatchEvent(
                new CustomEvent("speech-input-error", {
                  detail: { error: "An error occurred while recording. Please try again: " + event.error },
                })
              );
            }
          }
        };

        this.speechRecognition.start();
        this.isRecording = true;
        this.renderButtonOn();
      }

If the user stops the recording using the keyboard shortcut or button click, we call stop() on the SpeechRecognition instance. At that point, anything the user had said will be transcribed and become available via the "result" event.
      stopRecording() {
        if (this.speechRecognition) {
          this.speechRecognition.stop();
        }
      }

Alternatively, if the user presses the Escape keyboard shortcut, we instead call abort() on the SpeechRecognition instance, which stops the recording and discards any speech that has not yet been transcribed.

      abortRecording() {
        if (this.speechRecognition) {
          this.speechRecognition.abort();
        }
      }

Once the custom HTML element is fully defined, we register it with the desired tag name, speech-input-button:

    customElements.define("speech-input-button", SpeechInputButton);

To use the custom speech-input-button element in a chat application, we add it to the HTML for the chat form:

    <speech-input-button></speech-input-button>
    <input id="message" name="message" type="text">

Then we attach an event listener for the custom events dispatched by the element, and we update the input text field with the transcribed text:

    // The text field from the chat form above.
    const messageInput = document.getElementById("message");
    const speechInputButton = document.querySelector("speech-input-button");
    speechInputButton.addEventListener("speech-input-result", (event) => {
      messageInput.value += " " + event.detail.transcript.trim();
      messageInput.focus();
    });

You can see the full custom HTML element code in speech-input.js and the usage in index.html. There's also a fun pulsing animation for the button's active state in styles.css.

Speech output with SpeechSynthesis API
Once again, to make it easier to add a speech output button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechOutputButton. When defining the custom element, we specify an observed attribute named "text", to store whatever text should be turned into speech when the button is clicked.

    class SpeechOutputButton extends HTMLElement {
      static observedAttributes = ["text"];

In the constructor, we check to make sure the SpeechSynthesis API is supported, and remember the browser's preferred language for later use.
      constructor() {
        super();
        this.isPlaying = false;
        const SpeechSynthesis = window.speechSynthesis || window.webkitSpeechSynthesis;
        if (!SpeechSynthesis) {
          this.dispatchEvent(
            new CustomEvent("speech-output-error", {
              detail: { error: "SpeechSynthesis not supported" },
            })
          );
          return;
        }
        this.synth = SpeechSynthesis;
        this.lngCode = navigator.language || navigator.userLanguage;
      }

When the custom element is added to the DOM, I define the inner HTML to render a button and attach mouse and keyboard event listeners:

      connectedCallback() {
        this.innerHTML = `
          <button class="btn btn-outline-secondary" type="button">
            <i class="bi bi-volume-up"></i>
          </button>`;
        this.speechButton = this.querySelector("button");
        this.speechButton.addEventListener("click", () => this.toggleSpeechOutput());
        document.addEventListener('keydown', this.handleKeydown.bind(this));
      }

The majority of the code is in the toggleSpeechOutput function. If the speech is not yet playing, it creates a new SpeechSynthesisUtterance instance, passes it the "text" attribute, and sets the language and audio properties. It attempts to use a voice that's optimal for the desired language, but falls back to "en-US" if none is found. It attaches event listeners for the start and end events, which will change the button's style to look either active or inactive. Finally, it tells the SpeechSynthesis API to speak the utterance.

      toggleSpeechOutput() {
        if (!this.isConnected) {
          return;
        }
        const text = this.getAttribute("text");
        if (this.synth != null) {
          // A second press, or a missing or empty "text" attribute, stops playback.
          if (this.isPlaying || !text) {
            this.stopSpeech();
            return;
          }

          // Create a new utterance and play it.
          const utterance = new SpeechSynthesisUtterance(text);
          utterance.lang = this.lngCode;
          utterance.volume = 1;
          utterance.rate = 1;
          utterance.pitch = 1;

          // Prefer a voice for the browser language, else fall back to en-US.
          let voice = this.synth
            .getVoices()
            .filter((voice) => voice.lang === this.lngCode)[0];
          if (!voice) {
            voice = this.synth
              .getVoices()
              .filter((voice) => voice.lang === "en-US")[0];
          }
          // If neither is available, the browser's default voice is used.
          if (voice) {
            utterance.voice = voice;
          }

          utterance.onstart = () => {
            this.isPlaying = true;
            this.renderButtonOn();
          };
          utterance.onend = () => {
            this.isPlaying = false;
            this.renderButtonOff();
          };
          this.synth.speak(utterance);
        }
      }

When the user no longer wants to hear the speech output, indicated either via another press of the button or by pressing the Escape key, we call cancel() on the SpeechSynthesis API.

      stopSpeech() {
        if (this.synth) {
          this.synth.cancel();
          this.isPlaying = false;
          this.renderButtonOff();
        }
      }

Once the custom HTML element is fully defined, we register it with the desired tag name, speech-output-button:

    customElements.define("speech-output-button", SpeechOutputButton);

To use this custom speech-output-button element in a chat application, we construct it dynamically each time that we've received a full response from an LLM, and call setAttribute to pass in the text to be spoken:

    const speechOutput = document.createElement("speech-output-button");
    speechOutput.setAttribute("text", answer);
    messageDiv.appendChild(speechOutput);

You can see the full custom HTML element code in speech-output.js and the usage in index.html. This button also uses the same pulsing animation for the active state, defined in styles.css.

Acknowledgments
I want to give a huge shout-out to John Aziz for his amazing work adding speech input and output to the azure-search-openai-demo, as that was the basis for the code I shared in this blog post.
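A follow-up note on the voice lookup in toggleSpeechOutput: in some browsers getVoices() returns an empty list until the voiceschanged event has fired, and a voice's lang may be a regional variant of the requested language. The sketch below isolates the lookup in a pure function so it can be retried once voices have loaded. `pickVoice` and its base-language matching step are assumptions for illustration and are not part of the quickstart code.

```javascript
// Hypothetical helper: choose a voice for `lang`, trying an exact match,
// then the same base language (e.g. "fr" for "fr-CA"), then `fallbackLang`.
function pickVoice(voices, lang, fallbackLang = "en-US") {
  const baseOf = (code) => code.toLowerCase().split(/[-_]/)[0];
  const base = baseOf(lang);
  return (
    voices.find((v) => v.lang === lang) ||
    voices.find((v) => baseOf(v.lang) === base) ||
    voices.find((v) => v.lang === fallbackLang) ||
    null
  );
}

// Example with plain objects standing in for SpeechSynthesisVoice instances.
const sampleVoices = [
  { name: "A", lang: "en-US" },
  { name: "B", lang: "fr-FR" },
];
console.log(pickVoice(sampleVoices, "fr-CA").name); // "B" (same base language)
console.log(pickVoice(sampleVoices, "de-DE").name); // "A" (en-US fallback)
console.log(pickVoice([], "en-US")); // null: no voices loaded yet
```

In the browser, this could be called with the result of speechSynthesis.getVoices(), and called again from a voiceschanged handler if it returned null the first time.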