performance
Stop Experimenting, Start Building: AI Apps & Agents Dev Days Has You Covered
The AI landscape has shifted. The question is no longer “Can we build AI applications?” but “Can we build AI applications that actually work in production?” Demos are easy. Reliable, scalable, resilient AI systems that handle real-world complexity? That's where most teams struggle. If you're an AI developer, software engineer, or solution architect who's ready to move beyond prototypes and into production-grade AI, there's a series built specifically for you.

What Is AI Apps & Agents Dev Days?

AI Apps & Agents Dev Days is a monthly technical series from Microsoft Reactor, delivered in partnership with Microsoft and NVIDIA. You can explore the full series at https://developer.microsoft.com/en-us/reactor/series/s-1590/. This isn't a slide deck marathon. The series tagline says it best: “It's not about slides, it's about building.” Each session tackles real-world challenges, shares patterns that actually work, and digs into what's next in AI-driven app and agent design. You bring your curiosity, your code, and your questions. You leave with something you can ship. The sessions are led by experienced engineers and advocates from both Microsoft and NVIDIA, people like Pamela Fox, Bruno Capuano, Anthony Shaw, Gwyneth Peña-Siguenza, and solutions architects from NVIDIA's Cloud AI team. These aren't theorists; they're practitioners who build and ship the tools you use every day.

What You'll Learn

The series covers the full spectrum of building AI applications and agent-based systems. Here are the key themes:

Building AI Applications with Azure, GitHub, and Modern Tooling
Sessions walk through how to wire up AI capabilities using Azure services, GitHub workflows, and the latest SDKs. The focus is always on code-first learning: you'll see real implementations, not abstract architecture diagrams.

Designing and Orchestrating AI Agents
Agent development is one of the series' strongest threads. Sessions cover how to build agents that orchestrate long-running workflows, persist state automatically, recover from failures, and pause for human-in-the-loop input, without losing progress. For example, the session “AI Agents That Don't Break Under Pressure” demonstrates building durable, production-ready AI agents using the Microsoft Agent Framework, running on Azure Container Apps with NVIDIA serverless GPUs.

Scaling LLM Inference and Deploying to Production
Moving from a working prototype to a production deployment means grappling with inference performance, GPU infrastructure, and cost management. The series covers how to leverage NVIDIA GPU infrastructure alongside Azure services to scale inference effectively, including patterns for serverless GPU compute.

Real-World Architecture Patterns
Expect sessions on container-based deployments, distributed agent systems, and enterprise-grade architectures. You'll learn how to use services like Azure Container Apps to host resilient AI workloads, how Foundry IQ fits into agent architectures as a trusted knowledge source, and how to make architectural decisions that balance performance, cost, and scalability.

Why This Matters for Your Day Job

There's a critical gap between what most AI tutorials teach and what production systems actually require. This series bridges that gap:
Production-ready patterns, not demos. Every session focuses on code and architecture you can take directly into your projects. You'll learn patterns for state persistence, failure recovery, and durable execution — the things that break at 2 AM.
Enterprise applicability.
The scenarios covered — travel planning agents, multi-step workflows, GPU-accelerated inference — map directly to enterprise use cases. Whether you're building internal tooling or customer-facing AI features, the patterns transfer.
Honest trade-off discussions. The speakers don't shy away from the hard questions: When do you need serverless GPUs versus dedicated compute? How do you handle agent failures gracefully? What does it actually cost to run these systems at scale?

Watch On-Demand, Build at Your Own Pace

Every session is available on-demand. You can watch, pause, and build along at your own pace, with no need to rearrange your schedule. The full playlist is available on the series page linked above. This is particularly valuable for technical content. Pause a session while you replicate the architecture in your own environment. Rewind when you need to catch a configuration detail. Build alongside the presenters rather than just watching passively.

What You'll Walk Away With

After working through the series, you'll have:
Practical agent development skills — how to design, orchestrate, and deploy AI agents that handle real-world complexity, including state management, failure recovery, and human-in-the-loop patterns
Production architecture patterns — battle-tested approaches for deploying AI workloads on Azure Container Apps, leveraging NVIDIA GPU infrastructure, and building resilient distributed systems
Infrastructure decision-making confidence — a clearer understanding of when to use serverless GPUs, how to optimise inference costs, and how to choose the right compute strategy for your workload
Working code and reference implementations — the sessions are built around live coding and sample applications (like the Travel Planner agent demo), giving you starting points you can adapt immediately
A framework for continuous learning — with new sessions each month, you'll stay current as the AI platform evolves and new capabilities emerge

Start Building

The AI applications that will matter most aren't the ones with the flashiest demos — they're the ones that work reliably, scale gracefully, and solve real problems. That's exactly what this series helps you build. Whether you're designing your first AI agent system or hardening an existing one for production, the AI Apps & Agents Dev Days sessions give you the patterns, tools, and practical knowledge to move forward with confidence. Explore the series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ and start watching the on-demand sessions at the link above. The best time to level up your AI engineering skills was yesterday. The second-best time is right now, and these sessions make it easy to start.

Build a Fully Offline AI App with Foundry Local and CAG
A hands-on guide to building an on-device AI support agent using Context-Augmented Generation, JavaScript, and Foundry Local. You have probably heard the AI pitch: "just call our API." But what happens when your application needs to work without an internet connection? Perhaps your users are field engineers standing next to a pipeline in the middle of nowhere, or your organisation has strict data privacy requirements, or you simply want to build something that works without a cloud bill. This post walks you through how to build a fully offline, on-device AI application using Foundry Local and a pattern called Context-Augmented Generation (CAG). By the end, you will have a clear understanding of what CAG is, how it compares to RAG, and the practical steps to build your own solution. The finished application: a browser-based AI support agent that runs entirely on your machine. What Is Context-Augmented Generation? Context-Augmented Generation (CAG) is a pattern for making AI models useful with your own domain-specific content. Instead of hoping the model "knows" the answer from its training data, you pre-load your entire knowledge base into the model's context window at startup. Every query the model handles has access to all of your documents, all of the time. The flow is straightforward: Load your documents into memory when the application starts. Inject the most relevant documents into the prompt alongside the user's question. Generate a response grounded in your content. There is no retrieval pipeline, no vector database, and no embedding model. Your documents are read from disc, held in memory, and selected per query using simple keyword scoring. The model generates answers grounded in your content rather than relying on what it learnt during training. CAG vs RAG: Understanding the Trade-offs If you have explored AI application patterns before, you have likely encountered Retrieval-Augmented Generation (RAG). Both CAG and RAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations. CAG (Context-Augmented Generation) How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt. Strengths: Drastically simpler architecture with no vector database, no embeddings, and no retrieval pipeline Works fully offline with no external services Minimal dependencies (just two npm packages in this sample) Near-instant document selection with no embedding latency Easy to set up, debug, and reason about Limitations: Constrained by the model's context window size Best suited to small, curated document sets (tens of documents, not thousands) Keyword scoring is less precise than semantic similarity for ambiguous queries Adding documents requires an application restart RAG (Retrieval-Augmented Generation) How it works: Documents are chunked, embedded into vectors, and stored in a database. At query time, the most semantically similar chunks are retrieved and injected into the prompt. 
Strengths: Scales to thousands or millions of documents Semantic search finds relevant content even when the user's wording differs from the source material Documents can be added or updated dynamically without restarting Fine-grained retrieval (chunk-level) can be more token-efficient for large collections Limitations: More complex architecture: requires an embedding model, a vector database, and a chunking strategy Retrieval quality depends heavily on chunking, embedding model choice, and tuning Additional latency from the embedding and search steps More dependencies and infrastructure to manage Want to compare these patterns hands-on? There is a RAG-based implementation of the same gas field scenario using vector search and embeddings. Clone both repositories, run them side by side, and see how the architectures differ in practice. When Should You Choose Which? Consideration Choose CAG Choose RAG Document count Tens of documents Hundreds or thousands Offline requirement Essential Optional (can run locally too) Setup complexity Minimal Moderate to high Document updates Infrequent (restart to reload) Frequent or dynamic Query precision Good for keyword-matchable content Better for semantically diverse queries Infrastructure None beyond the runtime Vector database, embedding model For the sample application in this post (20 gas engineering procedure documents on a local machine), CAG is the clear winner. If your use case grows to hundreds of documents or requires real-time ingestion, RAG becomes the better choice. Both patterns can run offline using Foundry Local. Foundry Local: Your On-Device AI Runtime Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download). In this sample, your application is responsible for deciding which model to use, and it does that through the foundry-local-sdk . The app creates a FoundryLocalManager , asks the SDK for the local model catalogue, and then runs a small selection policy from src/modelSelector.js . That policy looks at the machine's available RAM, filters out models that are too large, ranks the remaining chat models by preference, and then returns the best fit for that device. Why does it work this way? Because shipping one fixed model would either exclude lower-spec machines or underuse more capable ones. A 14B model may be perfectly reasonable on a 32 GB workstation, but the same choice would be slow or unusable on an 8 GB laptop. By selecting at runtime, the same codebase can run across a wider range of developer machines without manual tuning. What makes it particularly useful for developers: No GPU required — runs on CPU or NPU, making it accessible on standard laptops and desktops Native SDK bindings — in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server Automatic model management — downloads, caches, and loads models automatically Dynamic model selection — the SDK can evaluate your device's available RAM and pick the best model from the catalogue Real-time progress callbacks — ideal for building loading UIs that show download and initialisation progress The integration code is refreshingly minimal. 
Here is the core pattern: import { FoundryLocalManager } from "foundry-local-sdk"; // Create a manager and get the model catalogue const manager = FoundryLocalManager.create({ appName: "my-app" }); // Auto-select the best model for this device based on available RAM const models = await manager.catalog.getModels(); const model = selectBestModel(models); // Download if not cached, then load into memory if (!model.isCached) { await model.download((progress) => { console.log(`Download: ${progress.toFixed(0)}%`); }); } await model.load(); // Create a chat client for direct in-process inference const chatClient = model.createChatClient(); const response = await chatClient.completeChat([ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ]); That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application. The download step matters for a simple reason: offline inference only works once the model files exist locally. The SDK checks whether the chosen model is already cached on the machine. If it is not, the application asks Foundry Local to download it once, store it locally, and then load it into memory. After that first run, the cached model can be reused, which is why subsequent launches are much faster and can operate without any network dependency. Put another way, there are two cooperating pieces here. Your application chooses which model is appropriate for the device and the scenario. Foundry Local and its SDK handle the mechanics of making that model available locally, caching it, loading it, and exposing a chat client for inference. That separation keeps the application logic clear whilst letting the runtime handle the heavy lifting. The Technology Stack The sample application is deliberately simple. No frameworks, no build steps, no Docker: Layer Technology Purpose AI Model Foundry Local + auto-selected model Runs locally via native SDK bindings; best model chosen for your device Back end Node.js + Express Lightweight HTTP server, everyone knows it Context Markdown files pre-loaded at startup No vector database, no embeddings, no retrieval step Front end Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is two npm packages: express and foundry-local-sdk . Architecture Overview The four-layer architecture, all running on a single machine. The system has four layers, all running in a single process on your device: Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface Server layer: Express.js starts immediately and serves the UI plus an SSE status endpoint; API routes handle chat (streaming and non-streaming), context listing, and health checks CAG engine: loads all domain documents at startup, selects the most relevant ones per query using keyword scoring, and injects them into the prompt AI layer: Foundry Local runs the auto-selected model on CPU/NPU via native SDK bindings (in-process inference, no HTTP round-trips) Building the Solution Step by Step Prerequisites You need two things installed on your machine: Node.js 20 or later: download from nodejs.org Foundry Local: Microsoft's on-device AI runtime: winget install Microsoft.FoundryLocal Foundry Local will automatically select and download the best model for your device the first time you run the application. 
You can override this by setting the FOUNDRY_MODEL environment variable to a specific model alias. Getting the Code Running # Clone the repository git clone https://github.com/leestott/local-cag.git cd local-cag # Install dependencies npm install # Start the server npm start Open http://127.0.0.1:3000 in your browser. You will see a loading overlay with a progress bar whilst the model downloads (first run only) and loads into memory. Once the model is ready, the overlay fades away and you can start chatting. Desktop view Mobile view How the CAG Pipeline Works Let us trace what happens when a user asks: "How do I detect a gas leak?" The query flow from browser to model and back. 1 Server starts and loads documents When you run npm start , the Express server starts on port 3000. All .md files in the docs/ folder are read, parsed (with optional YAML front-matter for title, category, and ID), and grouped by category. A document index is built listing all available topics. 2 Model is selected and loaded The model selector evaluates your system's available RAM and picks the best model from the Foundry Local catalogue. If the model is not already cached, it downloads it (with progress streamed to the browser via SSE). The model is then loaded into memory for in-process inference. 3 User sends a question The question arrives at the Express server. The chat engine selects the top 3 most relevant documents using keyword scoring. 4 Prompt is constructed The engine builds a messages array containing: the system prompt (with safety-first instructions), the document index (so the model knows all available topics), the 3 selected documents (approximately 6,000 characters), the conversation history, and the user's question. 5 Model generates a grounded response The prompt is sent to the locally loaded model via the Foundry Local SDK's native bindings. The response streams back token by token through Server-Sent Events to the browser. A response with safety warnings and step-by-step guidance The sources panel shows which documents were used Key Code Walkthrough Loading Documents (the Context Module) The context module reads all markdown files from the docs/ folder at startup. Each document can have optional YAML front-matter for metadata: // src/context.js export function loadDocuments() { const files = fs.readdirSync(config.docsDir) .filter(f => f.endsWith(".md")) .sort(); const docs = []; for (const file of files) { const raw = fs.readFileSync(path.join(config.docsDir, file), "utf-8"); const { meta, body } = parseFrontMatter(raw); docs.push({ id: meta.id || path.basename(file, ".md"), title: meta.title || file, category: meta.category || "General", content: body.trim(), }); } return docs; } There is no chunking, no vector computation, and no database. The documents are held in memory as plain text. Dynamic Model Selection Rather than hard-coding a model, the application evaluates your system at runtime: // src/modelSelector.js const totalRamMb = os.totalmem() / (1024 * 1024); const budgetMb = totalRamMb * 0.6; // Use up to 60% of system RAM // Filter to models that fit, rank by quality, boost cached models const candidates = allModels.filter(m => m.task === "chat-completion" && m.fileSizeMb <= budgetMb ); // Returns the best model: e.g. phi-4 on a 32 GB machine, // or phi-3.5-mini on a laptop with 8 GB RAM This means the same application runs on a powerful workstation (selecting a 14B parameter model) or a constrained laptop (selecting a 3.8B model), with no code changes required. 
This is worth calling out because it is one of the most practical parts of the sample. Developers do not have to decide up front which single model every user should run. The application makes that decision at startup based on the hardware budget you set, then asks Foundry Local to fetch the model if it is missing. The result is a smoother first-run experience and fewer support headaches when the same app is used on mixed hardware. The System Prompt For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses: // src/prompts.js export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers. Behaviour Rules: - Always prioritise safety. If a procedure involves risk, explicitly call it out. - Do not hallucinate procedures, measurements, or tolerances. - If the answer is not in the provided context, say: "This information is not available in the local knowledge base." Response Format: - Summary (1-2 lines) - Safety Warnings (if applicable) - Step-by-step Guidance - Reference (document name + section)`; This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling. Adapting This for Your Own Domain The sample project is designed to be forked and adapted. Here is how to make it yours in three steps: 1. Replace the documents Delete the gas engineering documents in docs/ and add your own markdown files. The context module handles any markdown content with optional YAML front-matter: --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the system prompt Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations. 3. Override the model (optional) By default the application auto-selects the best model. To force a specific model: # See available models foundry model list # Force a smaller, faster model FOUNDRY_MODEL=phi-3.5-mini npm start # Or a larger, higher-quality model FOUNDRY_MODEL=phi-4 npm start Smaller models give faster responses on constrained devices. Larger models give better quality. The auto-selector picks the largest model that fits within 60% of your system RAM. Building a Field-Ready UI The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy. Design decisions that matter for field use: Dark, high-contrast theme with 18px base font size for readability in bright sunlight Large touch targets (minimum 48px) for operation with gloves or PPE Quick-action buttons for common questions, so engineers do not need to type on a phone Responsive layout that works from 320px to 1920px+ screen widths Streaming responses via SSE, so the user sees tokens arriving in real time The mobile chat experience, optimised for field use. Visual Startup Progress with SSE A standout feature of this application is the loading experience. 
When the user opens the browser, they see a progress overlay showing exactly what the application is doing: Loading domain documents Initialising the Foundry Local SDK Selecting the best model for the device Downloading the model (with a percentage progress bar, first run only) Loading the model into memory This works because the Express server starts before the model finishes loading. The browser connects immediately and receives real-time status updates via Server-Sent Events. Chat endpoints return 503 whilst the model is loading, so the UI cannot send queries prematurely. // Server-side: broadcast status to all connected browsers function broadcastStatus(state) { initState = state; const payload = `data: ${JSON.stringify(state)}\n\n`; for (const client of statusClients) { client.write(payload); } } // During initialisation: broadcastStatus({ stage: "downloading", message: "Downloading phi-4...", progress: 42 }); This pattern is worth adopting in any application where model loading takes more than a few seconds. Users should never stare at a blank screen wondering whether something is broken. Testing The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed: # Run all tests npm test Tests cover configuration, server endpoints, and document loading. Use them as a starting point when you adapt the project for your own domain. Ideas for Extending the Project Once you have the basics running, there are plenty of directions to explore: Conversation memory: persist chat history across sessions using local storage or a lightweight database Hybrid CAG + RAG: add a vector retrieval step for larger document collections that exceed the context window Multi-modal support: add image-based queries (photographing a fault code, for example) PWA packaging: make it installable as a standalone offline application on mobile devices Custom model fine-tuning: fine-tune a model on your domain data for even better answers Ready to Build Your Own? Clone the CAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the RAG approach to see which pattern suits your use case best. Get the CAG Sample Get the RAG Sample Summary Building a local AI application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and a set of domain documents, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own content. The key takeaways: CAG is ideal for small, curated document sets where simplicity and offline capability matter most. No vector database, no embeddings, no retrieval pipeline. RAG scales further when you have hundreds or thousands of documents, or need semantic search for ambiguous queries. See the local-rag sample to compare. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required. The architecture is transferable. Replace the gas engineering documents with your own content, update the system prompt, and you have a domain-specific AI agent for any field. Start simple, iterate outwards. Begin with CAG and a handful of documents. If your needs outgrow the context window, graduate to RAG. Both patterns can run entirely offline. Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code. This project is open source under the MIT licence. 
It is a scenario sample for learning and experimentation, not production medical or safety advice. local-cag on GitHub · local-rag on GitHub · Foundry Local

Building a Smart Building HVAC Digital Twin with AI Copilot Using Foundry Local
Introduction Building operations teams face a constant challenge: optimizing HVAC systems for energy efficiency while maintaining occupant comfort and air quality. Traditional building management systems display raw sensor data, temperatures, pressures, CO₂ levels—but translating this into actionable insights requires deep HVAC expertise. What if operators could simply ask "Why is the third floor so warm?" and get an intelligent answer grounded in real building state? This article demonstrates building a sample smart building digital twin with an AI-powered operations copilot, implemented using DigitalTwin, React, Three.js, and Microsoft Foundry Local. You'll learn how to architect physics-based simulators that model thermal dynamics, implement 3D visualizations of building systems, integrate natural language AI control, and design fault injection systems for testing and training. Whether you're building IoT platforms for commercial real estate, designing energy management systems, or implementing predictive maintenance for building automation, this sample provides proven patterns for intelligent facility operations. Why Digital Twins Matter for Building Operations Physical buildings generate enormous operational data but lack intelligent interpretation layers. A 50,000 square foot office building might have 500+ sensors streaming metrics every minute, zone temperatures, humidity levels, equipment runtimes, energy consumption. Traditional BMS (Building Management Systems) visualize this data as charts and gauges, but operators must manually correlate patterns, diagnose issues, and predict failures. Digital twins solve this through physics-based simulation coupled with AI interpretation. Instead of just displaying current temperature readings, a digital twin models thermal dynamics, heat transfer rates, HVAC response characteristics, occupancy impacts. When conditions deviate from expectations, the twin compares observed versus predicted states, identifying root causes. Layer AI on top, and operators get natural language explanations: "The conference room is 3 degrees too warm because the VAV damper is stuck at 40% open, reducing airflow by 60%." This application focuses on HVAC, the largest building energy consumer, typically 40-50% of total usage. Optimizing HVAC by just 10% through better controls can save thousands of dollars monthly while improving occupant satisfaction. The digital twin enables "what-if" scenarios before making changes: "What happens to energy consumption and comfort if we raise the cooling setpoint by 2 degrees during peak demand response events?" Architecture: Three-Tier Digital Twin System The application implements a clean three-tier architecture separating visualization, simulation, and state management: The frontend uses React with Three.js for 3D visualization. Users see an interactive 3D model of the three-floor building with color-coded zones indicating temperature and CO₂ levels. Click any equipment, AHUs, VAVs, chillers, to see detailed telemetry. The control panel enables adjusting setpoints, running simulation steps, and activating demand response scenarios. Real-time charts display KPIs: energy consumption, comfort compliance, air quality levels. The backend Node.js/Express server orchestrates simulation and state management. It maintains the digital twin state as JSON, the single source of truth for all equipment, zones, and telemetry. REST API endpoints handle control requests, simulation steps, and AI copilot queries. 
WebSocket connections push real-time updates to the frontend for live monitoring. The HVAC simulator implements physics-based models: 1R1C thermal models for zones, affinity laws for fan power, chiller COP calculations, CO₂ mass balance equations. Foundry Local provides AI copilot capabilities. The backend uses foundry-local-sdk to query locally running models. Natural language queries ("How's the lobby temperature?") get answered with building state context. The copilot can explain anomalies, suggest optimizations, and even execute commands when explicitly requested. Implementing Physics-Based HVAC Simulation Accurate simulation requires modeling actual HVAC physics. The simulator implements several established building energy models: // backend/src/simulator/thermal-model.js class ZoneThermalModel { // 1R1C (one resistance, one capacitance) thermal model static calculateTemperatureChange(zone, delta_t_seconds) { const C_thermal = zone.volume * 1.2 * 1000; // Heat capacity (J/K) const R_thermal = zone.r_value * zone.envelope_area; // Thermal resistance // Internal heat gains (occupancy, equipment, lighting) const Q_internal = zone.occupancy * 100 + // 100W per person zone.equipment_load + zone.lighting_load; // Cooling/heating from HVAC const airflow_kg_s = zone.vav.airflow_cfm * 0.0004719; // CFM to kg/s const c_p_air = 1006; // Specific heat of air (J/kg·K) const Q_hvac = airflow_kg_s * c_p_air * (zone.vav.supply_temp - zone.temperature); // Envelope losses const Q_envelope = (zone.outdoor_temp - zone.temperature) / R_thermal; // Net energy balance const Q_net = Q_internal + Q_hvac + Q_envelope; // Temperature change: Q = C * dT/dt const dT = (Q_net / C_thermal) * delta_t_seconds; return zone.temperature + dT; } } This model captures essential thermal dynamics while remaining computationally fast enough for real-time simulation. It accounts for internal heat generation from occupants and equipment, HVAC cooling/heating contributions, and heat loss through the building envelope. 
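As a quick illustration of how the simulator invokes this model, here is a minimal usage sketch with hypothetical zone values; the real twin state in the repository carries many more fields, and the numbers below are only meant to show the call shape.

// Hypothetical zone snapshot: field names mirror those used by the model above
const zone = {
  volume: 300,                 // zone air volume (m³)
  r_value: 2.5,                // envelope thermal resistance
  envelope_area: 120,          // envelope area (m²)
  occupancy: 6,                // people currently in the zone
  equipment_load: 800,         // W
  lighting_load: 400,          // W
  outdoor_temp: 95,            // outdoor temperature, same unit as the UI examples (°F)
  temperature: 72,             // current zone temperature (°F)
  vav: { airflow_cfm: 450, supply_temp: 55 },
};

// Advance the zone by one 60-second simulation step
zone.temperature = ZoneThermalModel.calculateTemperatureChange(zone, 60);
console.log(`Zone temperature after one step: ${zone.temperature.toFixed(2)}`);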
The CO₂ model uses mass balance equations: class AirQualityModel { static calculateCO2Change(zone, delta_t_seconds) { // CO₂ generation from occupants const G_co2 = zone.occupancy * 0.0052; // L/s per person at rest // Outdoor air ventilation rate const V_oa = zone.vav.outdoor_air_cfm * 0.000471947; // CFM to m³/s // CO₂ concentration difference (indoor - outdoor) const delta_CO2 = zone.co2_ppm - 400; // Outdoor ~400ppm // Mass balance: dC/dt = (G - V*ΔC) / Volume const dCO2_dt = (G_co2 - V_oa * delta_CO2) / zone.volume; return zone.co2_ppm + (dCO2_dt * delta_t_seconds); } } These models execute every simulation step, updating the entire building state: async function simulateStep(twin, timestep_minutes) { const delta_t = timestep_minutes * 60; // Convert to seconds // Update each zone for (const zone of twin.zones) { zone.temperature = ZoneThermalModel.calculateTemperatureChange(zone, delta_t); zone.co2_ppm = AirQualityModel.calculateCO2Change(zone, delta_t); } // Update equipment based on zone demands for (const vav of twin.vavs) { updateVAVOperation(vav, twin.zones); } for (const ahu of twin.ahus) { updateAHUOperation(ahu, twin.vavs); } updateChillerOperation(twin.chiller, twin.ahus); updateBoilerOperation(twin.boiler, twin.ahus); // Calculate system KPIs twin.kpis = calculateSystemKPIs(twin); // Detect alerts twin.alerts = detectAnomalies(twin); // Persist updated state await saveTwinState(twin); return twin; } 3D Visualization with React and Three.js The frontend renders an interactive 3D building view that updates in real-time as conditions change. Using React Three Fiber simplifies Three.js integration with React's component model: // frontend/src/components/BuildingView3D.jsx import { Canvas } from '@react-three/fiber'; import { OrbitControls } from '@react-three/drei'; // JSX reconstructed for readability (element names and geometry props are representative; see the repository for the exact markup) export function BuildingView3D({ twinState }) { return ( <Canvas> <OrbitControls /> {/* Render building floors */} {twinState.zones.map(zone => ( <ZoneMesh key={zone.id} zone={zone} onClick={() => selectZone(zone.id)} /> ))} {/* Render equipment */} {twinState.ahus.map(ahu => ( <EquipmentMesh key={ahu.id} equipment={ahu} /> ))} </Canvas> ); } function ZoneMesh({ zone, onClick }) { const color = getTemperatureColor(zone.temperature, zone.setpoint); return ( <mesh position={zone.position} onClick={onClick}> <boxGeometry args={zone.size} /> <meshStandardMaterial color={color} /> </mesh> ); } function getTemperatureColor(current, setpoint) { const deviation = current - setpoint; if (Math.abs(deviation) < 1) return '#00ff00'; // Green: comfortable if (Math.abs(deviation) < 3) return '#ffff00'; // Yellow: acceptable return '#ff0000'; // Red: uncomfortable } This visualization immediately shows building state at a glance: operators see "hot spots" in red, comfortable zones in green, and can click any area for detailed metrics. Integrating AI Copilot for Natural Language Control The AI copilot transforms building data into conversational insights. Instead of navigating multiple screens, operators simply ask questions: // backend/src/routes/copilot.js import { FoundryLocalClient } from 'foundry-local-sdk'; const foundry = new FoundryLocalClient({ endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT }); router.post('/api/copilot/chat', async (req, res) => { const { message } = req.body; // Load current building state const twin = await loadTwinState(); // Build context for AI const context = buildBuildingContext(twin); const completion = await foundry.chat.completions.create({ model: 'phi-4', messages: [ { role: 'system', content: `You are an HVAC operations assistant for a 3-floor office building. Current Building State: ${context} Answer questions about equipment status, comfort conditions, and energy usage. Provide specific, actionable information based on the current data.
Do not speculate beyond provided information.` }, { role: 'user', content: message } ], temperature: 0.3, max_tokens: 300 }); res.json({ response: completion.choices[0].message.content, model: 'phi-4', timestamp: new Date().toISOString() }); }); function buildBuildingContext(twin) { const alerts = twin.alerts.filter(a => a.active); const zoneStatus = twin.zones.map(z => `${z.name}: ${z.temperature.toFixed(1)}°F (setpoint: ${z.setpoint}°F), CO₂: ${z.co2_ppm}ppm, ${z.occupancy} people` ); return ` Zones: ${zoneStatus.join('\n')} Active Alerts: ${alerts.length} ${alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Energy: ${twin.kpis.total_power_kw.toFixed(1)} kW Chiller: ${twin.chiller.percent_load}% load, ${twin.chiller.cop.toFixed(2)} COP `.trim(); } Example interactions: User: "Why is the third floor conference room warm?" AI: "The Executive Suite on Floor 3 is currently 75.2°F, 3 degrees above the 72°F setpoint. This is due to high occupancy (8 people) and VAV-F3-EXEC operating at reduced airflow (285 CFM vs 450 CFM capacity). AHU-2 is supplying adequate cool air at 55°F. Recommend checking VAV damper position—it may be partially stuck." User: "What's our current energy usage?" AI: "Total building power consumption is 47.3 kW. Breakdown: Chiller (24.1 kW, 51%), AHU fans (12.8 kW, 27%), Pumps (7.2 kW, 15%), Other (3.2 kW, 7%). Chiller is operating at 62% load with a COP of 4.8, which is efficient for current conditions." User: "Activate demand response level 2" AI: "I cannot directly control building systems for safety reasons. To activate DR Level 2, use the Control Panel and click 'Demand Response' → 'Level 2'. This will raise cooling setpoints by 3°F and reduce auxiliary loads, targeting 15% energy reduction." The AI provides grounded, specific answers citing actual equipment IDs and metrics. It refuses to directly execute control commands, instead guiding operators to explicit control interfaces, a critical safety pattern for building systems. Fault Injection for Testing and Training Real building operations experience equipment failures, stuck dampers, sensor drift, communication losses. 
The digital twin includes comprehensive fault injection capabilities to train operators and test control logic: // backend/src/simulator/fault-injector.js const FAULT_CATALOG = { chillerFailure: { description: 'Chiller compressor failure', apply: (twin) => { twin.chiller.status = 'FAULT'; twin.chiller.cooling_output = 0; twin.alerts.push({ id: 'chiller-fault', severity: 'CRITICAL', message: 'Chiller compressor failure - no cooling available', equipment: 'CHILLER-01' }); } }, stuckVAVDamper: { description: 'VAV damper stuck at current position', apply: (twin, vavId) => { const vav = twin.vavs.find(v => v.id === vavId); vav.damper_stuck = true; vav.damper_position_fixed = vav.damper_position; twin.alerts.push({ id: `vav-stuck-${vavId}`, severity: 'HIGH', message: `VAV ${vavId} damper stuck at ${vav.damper_position}%`, equipment: vavId }); } }, sensorDrift: { description: 'Temperature sensor reading 5°F high', apply: (twin, zoneId) => { const zone = twin.zones.find(z => z.id === zoneId); zone.sensor_drift = 5.0; zone.temperature_measured = zone.temperature_actual + 5.0; } }, communicationLoss: { description: 'Equipment communication timeout', apply: (twin, equipmentId) => { const equipment = findEquipmentById(twin, equipmentId); equipment.comm_status = 'OFFLINE'; equipment.stale_data = true; twin.alerts.push({ id: `comm-loss-${equipmentId}`, severity: 'MEDIUM', message: `Lost communication with ${equipmentId}`, equipment: equipmentId }); } } }; router.post('/api/twin/fault', async (req, res) => { const { faultType, targetEquipment } = req.body; const twin = await loadTwinState(); const fault = FAULT_CATALOG[faultType]; if (!fault) { return res.status(400).json({ error: 'Unknown fault type' }); } fault.apply(twin, targetEquipment); await saveTwinState(twin); res.json({ message: `Applied fault: ${fault.description}`, affectedEquipment: targetEquipment, timestamp: new Date().toISOString() }); }); Operators can inject faults to practice diagnosis and response. Training scenarios might include: "The chiller just failed during a heat wave, how do you maintain comfort?" or "Multiple VAV dampers are stuck, which zones need immediate attention?" Key Takeaways and Production Deployment Building a physics-based digital twin with AI capabilities requires balancing simulation accuracy with computational performance, providing intuitive visualization while maintaining technical depth, and enabling AI assistance without compromising safety. Key architectural lessons: Physics models enable prediction: Comparing predicted vs observed behavior identifies anomalies that simple thresholds miss 3D visualization improves spatial understanding: Operators immediately see which floors or zones need attention AI copilots accelerate diagnosis: Natural language queries get answers in seconds vs. minutes of manual data examination Fault injection validates readiness: Testing failure scenarios prepares operators for real incidents JSON state enables integration: Simple file-based state makes connecting to real BMS systems straightforward For production deployment, connect the twin to actual building systems via BACnet, Modbus, or MQTT integrations. Replace simulated telemetry with real sensor streams. Calibrate model parameters against historical building performance. Implement continuous learning where the twin's predictions improve as it observes actual building behavior. 
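To make that integration step concrete, the sketch below shows one way live telemetry could feed the twin, assuming an MQTT broker and the mqtt npm package; the topic layout and payload fields are hypothetical rather than part of the sample repository.

import mqtt from "mqtt";

// Hypothetical bridge: observed sensor readings overwrite simulated zone values
const client = mqtt.connect("mqtt://building-broker.local:1883");

client.on("connect", () => {
  // e.g. building/zones/<zoneId>/telemetry
  client.subscribe("building/zones/+/telemetry");
});

client.on("message", async (topic, payload) => {
  const zoneId = topic.split("/")[2];
  const reading = JSON.parse(payload.toString()); // { temperature, co2_ppm, occupancy }

  // loadTwinState / saveTwinState come from the sample's state module
  const twin = await loadTwinState();
  const zone = twin.zones.find(z => z.id === zoneId);
  if (!zone) return;

  // Record observed values; the simulated prediction stays available for anomaly detection
  zone.temperature_measured = reading.temperature;
  zone.co2_ppm = reading.co2_ppm;
  zone.occupancy = reading.occupancy;

  await saveTwinState(twin);
});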
The complete implementation with simulation engine, 3D visualization, AI copilot, and fault injection system is available at github.com/leestott/DigitalTwin. Clone the repository and run the startup scripts to explore the digital twin, no building hardware required.

Resources and Further Reading

Smart Building HVAC Digital Twin Repository - Complete source code and simulation engine
Setup and Quick Start Guide - Installation instructions and usage examples
Microsoft Foundry Local Documentation - AI integration reference
HVAC Simulation Documentation - Physics model details and calibration
Three.js Documentation - 3D visualization framework
ASHRAE Standards - Building energy modeling standards

Building Your First Local RAG Application with Foundry Local
A developer's guide to building an offline, mobile-responsive AI support agent using Retrieval-Augmented Generation, the Foundry Local SDK, and JavaScript. Imagine you are a gas field engineer standing beside a pipeline in a remote location. There is no Wi-Fi, no mobile signal, and you need a safety procedure right now. What do you do? This is the exact problem that inspired this project: a fully offline RAG-powered support agent that runs entirely on your machine. No cloud. No API keys. No outbound network calls. Just a local language model, a local vector store, and your own documents, all accessible from a browser on any device. In this post, you will learn how it works, how to build your own, and the key architectural decisions behind it. If you have ever wanted to build an AI application that runs locally and answers questions grounded in your own data, this is the place to start. The finished application: a browser-based AI support agent that runs entirely on your machine. What Is Retrieval-Augmented Generation? Retrieval-Augmented Generation (RAG) is a pattern that makes AI models genuinely useful for domain-specific tasks. Rather than hoping the model "knows" the answer from its training data, you: Retrieve relevant chunks from your own documents using a vector store Augment the model's prompt with those chunks as context Generate a response grounded in your actual data The result is fewer hallucinations, traceable answers with source attribution, and an AI that works with your content rather than relying on general knowledge. If you are building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want. RAG vs CAG: Understanding the Trade-offs If you have explored AI application patterns before, you have likely encountered Context-Augmented Generation (CAG). Both RAG and CAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations. RAG (Retrieval-Augmented Generation) How it works: Documents are split into chunks, vectorised, and stored in a database. At query time, the most relevant chunks are retrieved and injected into the prompt. Strengths: Scales to thousands or millions of documents Fine-grained retrieval at chunk level with source attribution Documents can be added or updated dynamically without restarting Token-efficient: only relevant chunks are sent to the model Supports runtime document upload via the web UI Limitations: More complex architecture: requires a vector store and chunking strategy Retrieval quality depends on chunking parameters and scoring method May miss relevant content if the retrieval step does not surface it CAG (Context-Augmented Generation) How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt. Strengths: Drastically simpler architecture with no vector database or embeddings All information is always available to the model Minimal dependencies and easy to set up Near-instant document selection Limitations: Constrained by the model's context window size Best suited to small, curated document sets (tens of documents) Adding documents requires an application restart Want to compare these patterns hands-on? There is a CAG-based implementation of the same gas field scenario using whole-document context injection. Clone both repositories, run them side by side, and see how the architectures differ in practice. 
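Before weighing up when to use each pattern, it helps to see the RAG loop from this sample condensed into one place. The sketch below compresses retrieve, augment, and generate into a single function; the vector store's search method and the chat client's completeChat call are covered in detail later in this post, so treat the surrounding glue and field names as illustrative rather than the repository's exact API.

// Condensed RAG loop: retrieve, augment, generate (names are illustrative)
async function answerQuestion(question, vectorStore, chatClient) {
  // 1. Retrieve: the top-k chunks most relevant to the question
  const chunks = vectorStore.search(question, 3);

  // 2. Augment: inject those chunks into the prompt as context
  const context = chunks.map(c => c.content).join("\n\n");
  const messages = [
    { role: "system", content: `Answer using only the context below.\n\n${context}` },
    { role: "user", content: question },
  ];

  // 3. Generate: the local model produces a grounded answer
  const response = await chatClient.completeChat(messages);
  return { answer: response, sources: chunks.map(c => ({ id: c.id, score: c.score })) };
}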
When Should You Choose Which? Consideration Choose RAG Choose CAG Document count Hundreds or thousands Tens of documents Document updates Frequent or dynamic (runtime upload) Infrequent (restart to reload) Source attribution Per-chunk with relevance scores Per-document Setup complexity Moderate (ingestion step required) Minimal Query precision Better for large or diverse collections Good for keyword-matchable content Infrastructure SQLite vector store (single file) None beyond the runtime For the sample application in this post (20 gas engineering procedure documents with runtime upload), RAG is the clear winner. If your document set is small and static, CAG may be simpler. Both patterns run fully offline using Foundry Local. Foundry Local: Your On-Device AI Runtime Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download). What makes it particularly useful for developers: No GPU required: runs on CPU or NPU, making it accessible on standard laptops and desktops Native SDK bindings: in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server Automatic model management: downloads, caches, and loads models automatically Hardware-optimised variant selection: the SDK picks the best variant for your hardware (GPU, NPU, or CPU) Real-time progress callbacks: ideal for building loading UIs that show download and initialisation progress The integration code is refreshingly minimal: import { FoundryLocalManager } from "foundry-local-sdk"; // Create a manager and discover models via the catalogue const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" }); const model = await manager.catalog.getModel("phi-3.5-mini"); // Download if not cached, then load into memory if (!model.isCached) { await model.download((progress) => { console.log(`Download: ${Math.round(progress * 100)}%`); }); } await model.load(); // Create a chat client for direct in-process inference const chatClient = model.createChatClient(); const response = await chatClient.completeChat([ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ]); That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application. The Technology Stack The sample application is deliberately simple. No frameworks, no build steps, no Docker: Layer Technology Purpose AI Model Foundry Local + Phi-3.5 Mini Runs locally via native SDK bindings, no GPU required Back end Node.js + Express Lightweight HTTP server, everyone knows it Vector Store SQLite (via better-sqlite3 ) Zero infrastructure, single file on disc Retrieval TF-IDF + cosine similarity No embedding model required, fully offline Front end Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is three npm packages: express , foundry-local-sdk , and better-sqlite3 . Architecture Overview The five-layer architecture, all running on a single machine. 
The system has five layers, all running on a single machine: Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface Server layer: Express.js starts immediately and serves the UI plus SSE status and chat endpoints RAG pipeline: the chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorisation; the prompts module provides safety-first system instructions Data layer: SQLite stores document chunks and their TF-IDF vectors; documents live as .md files in the docs/ folder AI layer: Foundry Local runs Phi-3.5 Mini on CPU or NPU via native SDK bindings Building the Solution Step by Step Prerequisites You need two things installed on your machine: Node.js 20 or later: download from nodejs.org Foundry Local: Microsoft's on-device AI runtime: winget install Microsoft.FoundryLocal The SDK will automatically download the Phi-3.5 Mini model (approximately 2 GB) the first time you run the application. Getting the Code Running # Clone the repository git clone https://github.com/leestott/local-rag.git cd local-rag # Install dependencies npm install # Ingest the 20 gas engineering documents into the vector store npm run ingest # Start the server npm start Open http://127.0.0.1:3000 in your browser. You will see the status indicator whilst the model loads. Once the model is ready, the status changes to "Offline Ready" and you can start chatting. Desktop view Mobile view How the RAG Pipeline Works Let us trace what happens when a user asks: "How do I detect a gas leak?" The query flow from browser to model and back. 1 Documents are ingested and indexed When you run npm run ingest , every .md file in the docs/ folder is read, parsed (with optional YAML front-matter for title, category, and ID), split into overlapping chunks of approximately 200 tokens, and stored in SQLite with TF-IDF vectors. 2 Model is loaded via the SDK The Foundry Local SDK discovers the model in the local catalogue and loads it into memory. If the model is not already cached, it downloads it first (with progress streamed to the browser via SSE). 3 User sends a question The question arrives at the Express server. The chat engine converts it into a TF-IDF vector, uses an inverted index to find candidate chunks, and scores them using cosine similarity. The top 3 chunks are returned in under 1 ms. 4 Prompt is constructed The engine builds a messages array containing: the system prompt (with safety-first instructions), the retrieved chunks as context, the conversation history, and the user's question. 5 Model generates a grounded response The prompt is sent to the locally loaded model via the Foundry Local SDK's native chat client. The response streams back token by token through Server-Sent Events to the browser. Source references with relevance scores are included. A response with safety warnings and step-by-step guidance The sources panel shows which chunks were used and their relevance Key Code Walkthrough The Vector Store (TF-IDF + SQLite) The vector store uses SQLite to persist document chunks alongside their TF-IDF vectors. 
At query time, an inverted index finds candidate chunks that share terms with the query, then cosine similarity ranks them: // src/vectorStore.js search(query, topK = 5) { const queryTf = termFrequency(query); this._ensureCache(); // Build in-memory cache on first access // Use inverted index to find candidates sharing at least one term const candidateIndices = new Set(); for (const term of queryTf.keys()) { const indices = this._invertedIndex.get(term); if (indices) { for (const idx of indices) candidateIndices.add(idx); } } // Score only candidates, not all rows const scored = []; for (const idx of candidateIndices) { const row = this._rowCache[idx]; const score = cosineSimilarity(queryTf, row.tf); if (score > 0) scored.push({ ...row, score }); } scored.sort((a, b) => b.score - a.score); return scored.slice(0, topK); } The inverted index, in-memory row cache, and prepared SQL statements bring retrieval time to sub-millisecond for typical query loads. Why TF-IDF Instead of Embeddings? Most RAG tutorials use embedding models for retrieval. This project uses TF-IDF because: Fully offline: no embedding model to download or run Zero latency: vectorisation is instantaneous (it is just maths on word frequencies) Good enough: for 20 domain-specific documents, TF-IDF retrieves the right chunks reliably Transparent: you can inspect the vocabulary and weights, unlike neural embeddings For larger collections or when semantic similarity matters more than keyword overlap, you would swap in an embedding model. For this use case, TF-IDF keeps the stack simple and dependency-free. The System Prompt For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses: // src/prompts.js export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers. Behaviour Rules: - Always prioritise safety. If a procedure involves risk, explicitly call it out. - Do not hallucinate procedures, measurements, or tolerances. - If the answer is not in the provided context, say: "This information is not available in the local knowledge base." Response Format: - Summary (1-2 lines) - Safety Warnings (if applicable) - Step-by-step Guidance - Reference (document name + section)`; This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling. Runtime Document Upload Unlike the CAG approach, RAG supports adding documents without restarting the server. Click the upload button to add new .md or .txt files. They are chunked, vectorised, and indexed immediately. The upload modal with the complete list of indexed documents. Adapting This for Your Own Domain The sample project is designed to be forked and adapted. Here is how to make it yours in four steps: 1. Replace the documents Delete the gas engineering documents in docs/ and add your own markdown files. The ingestion pipeline handles any markdown content with optional YAML front-matter: --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the system prompt Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations. 3. 
Tune the retrieval In src/config.js : chunkSize: 200 : smaller chunks give more precise retrieval, less context per chunk chunkOverlap: 25 : prevents information falling between chunks topK: 3 : how many chunks to retrieve per query (more gives more context but slower generation) 4. Swap the model Change config.model in src/config.js to any model available in the Foundry Local catalogue. Smaller models give faster responses on constrained devices; larger models give better quality. Building a Field-Ready UI The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy. Design decisions that matter for field use: Dark, high-contrast theme with 18px base font size for readability in bright sunlight Large touch targets (minimum 44px) for operation with gloves or PPE Quick-action buttons that wrap on mobile so all options are visible without scrolling Responsive layout that works from 320px to 1920px+ screen widths Streaming responses via SSE, so the user sees tokens arriving in real time The mobile chat experience, optimised for field use. Testing The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed: # Run all tests npm test Tests cover the chunker, vector store, configuration, and server endpoints. Use them as a starting point when you adapt the project for your own domain. Ideas for Extending the Project Once you have the basics running, there are plenty of directions to explore: Embedding-based retrieval: use a local embedding model for better semantic matching on diverse queries Conversation memory: persist chat history across sessions using local storage or a lightweight database Multi-modal support: add image-based queries (photographing a fault code, for example) PWA packaging: make it installable as a standalone offline application on mobile devices Hybrid retrieval: combine TF-IDF keyword search with semantic embeddings for best results Try the CAG approach: compare with the local-cag sample to see which pattern suits your use case Ready to Build Your Own? Clone the RAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the CAG approach to see which pattern suits your use case best. Get the RAG Sample Get the CAG Sample Summary Building a local RAG application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and SQLite, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own documents. The key takeaways: RAG is ideal for scalable, dynamic document sets where you need fine-grained retrieval with source attribution. Documents can be added at runtime without restarting. CAG is simpler when you have a small, stable set of documents that fit in the context window. See the local-cag sample to compare. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required. TF-IDF + SQLite is a viable vector store for small-to-medium collections, with sub-millisecond retrieval thanks to inverted indexing and in-memory caching. Start simple, iterate outwards. Begin with RAG and a handful of documents. If your needs are simpler, try CAG. Both patterns run entirely offline. Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code. 
This project is open source under the MIT licence. It is a scenario sample for learning and experimentation, not production medical or safety advice. local-rag on GitHub · local-cag on GitHub · Foundry Local
Building an Offline AI Interview Coach with Foundry Local, RAG, and SQLite
How to build a 100% offline, AI-powered interview preparation tool using Microsoft Foundry Local, Retrieval-Augmented Generation, and nothing but JavaScript. Foundry Local 100% Offline RAG + TF-IDF JavaScript / Node.js Contents Introduction What is RAG and Why Offline? Architecture Overview Setting Up Foundry Local Building the RAG Pipeline The Chat Engine Dual Interfaces: Web & CLI Testing Adapting for Your Own Use Case What I Learned Getting Started Introduction Imagine preparing for a job interview with an AI assistant that knows your CV inside and out, understands the job you're applying for, and generates tailored questions, all without ever sending your data to the cloud. That's exactly what Interview Doctor does. Interview Doctor's web UI, a polished, dark-themed interface running entirely on your local machine. In this post, I'll walk you through how I built an interview prep tool as a fully offline JavaScript application using: Foundry Local — Microsoft's on-device AI runtime SQLite — for storing document chunks and TF-IDF vectors RAG (Retrieval-Augmented Generation) — to ground the AI in your actual documents Express.js — for the web server Node.js built-in test runner — for testing with zero extra dependencies No cloud. No API keys. No internet required. Everything runs on your machine. What is RAG and Why Does It Matter? Retrieval-Augmented Generation (RAG) is a pattern that makes AI models dramatically more useful for domain-specific tasks. Instead of relying solely on what a model learned during training (which can be outdated or generic), RAG: Retrieves relevant chunks from your own documents Augments the model's prompt with those chunks as context Generates a response grounded in your actual data For Interview Doctor, this means the AI doesn't just ask generic interview questions, it asks questions specific to your CV, your experience, and the specific job you're applying for. Why Offline RAG? Privacy is the obvious benefit, your CV and job applications never leave your device. But there's more: No API costs — run as many queries as you want No rate limits — iterate rapidly during your prep Works anywhere — on a plane, in a café with bad Wi-Fi, anywhere Consistent performance — no cold starts, no API latency Architecture Overview Complete architecture showing all components and data flow. The application has two interfaces (CLI and Web) that share the same core engine: Document Ingestion — PDFs and markdown files are chunked and indexed Vector Store — SQLite stores chunks with TF-IDF vectors Retrieval — queries are matched against stored chunks using cosine similarity Generation — relevant chunks are injected into the prompt sent to the local LLM Step 1: Setting Up Foundry Local First, install Foundry Local: # Windows winget install Microsoft.FoundryLocal # macOS brew install microsoft/foundrylocal/foundrylocal The JavaScript SDK handles everything else — starting the service, downloading the model, and connecting: import { FoundryLocalManager } from "foundry-local-sdk"; import { OpenAI } from "openai"; const manager = new FoundryLocalManager(); const modelInfo = await manager.init("phi-3.5-mini"); // Foundry Local exposes an OpenAI-compatible API const openai = new OpenAI({ baseURL: manager.endpoint, // Dynamic port, discovered by SDK apiKey: manager.apiKey, }); ⚠️ Key Insight Foundry Local uses a dynamic port never hardcode localhost:5272 . Always use manager.endpoint which is discovered by the SDK at runtime. 
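To round out Step 1, here is a minimal sketch of making a first call with the client created above. The prompt is illustrative, not project code; the streaming pattern assumes the standard OpenAI Node SDK behaviour when stream: true is set, and modelInfo.id is the identifier resolved by manager.init():

// Quick smoke test of the local endpoint (illustrative prompt, not part of the project)
const stream = await openai.chat.completions.create({
  model: modelInfo.id, // model identifier resolved by manager.init()
  messages: [{ role: "user", content: "Suggest one common interview question." }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta); // tokens print as they arrive
}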
Step 2: Building the RAG Pipeline Document Chunking Documents are split into overlapping chunks of ~200 tokens. The overlap ensures important context isn't lost at chunk boundaries: export function chunkText(text, maxTokens = 200, overlapTokens = 25) { const words = text.split(/\s+/).filter(Boolean); if (words.length <= maxTokens) return [text.trim()]; const chunks = []; let start = 0; while (start < words.length) { const end = Math.min(start + maxTokens, words.length); chunks.push(words.slice(start, end).join(" ")); if (end >= words.length) break; start = end - overlapTokens; } return chunks; } Why 200 tokens with 25-token overlap? Small chunks keep retrieved context compact for the model's limited context window. Overlap prevents information loss at boundaries. And it's all pure string operations, no dependencies needed. TF-IDF Vectors Instead of using a separate embedding model (which would consume precious memory alongside the LLM), we use TF-IDF, a classic information retrieval technique: export function termFrequency(text) { const tf = new Map(); const tokens = text .toLowerCase() .replace(/[^a-z0-9\-']/g, " ") .split(/\s+/) .filter((t) => t.length > 1); for (const t of tokens) { tf.set(t, (tf.get(t) || 0) + 1); } return tf; } export function cosineSimilarity(a, b) { let dot = 0, normA = 0, normB = 0; for (const [term, freq] of a) { normA += freq * freq; if (b.has(term)) dot += freq * b.get(term); } for (const [, freq] of b) normB += freq * freq; if (normA === 0 || normB === 0) return 0; return dot / (Math.sqrt(normA) * Math.sqrt(normB)); } Each document chunk becomes a sparse vector of word frequencies. At query time, we compute cosine similarity between the query vector and all stored chunk vectors to find the most relevant matches. SQLite as a Vector Store Chunks and their TF-IDF vectors are stored in SQLite using sql.js (pure JavaScript — no native compilation needed): export class VectorStore { // Created via: const store = await VectorStore.create(dbPath) insert(docId, title, category, chunkIndex, content) { const tf = termFrequency(content); const tfJson = JSON.stringify([...tf]); this.db.run( "INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)", [docId, title, category, chunkIndex, content, tfJson] ); this.save(); } search(query, topK = 5) { const queryTf = termFrequency(query); // Score each chunk by cosine similarity, return top-K } } 💡 Why SQLite for Vectors? For a CV plus a few job descriptions (dozens of chunks), brute-force cosine similarity over SQLite rows is near-instant (~1ms). No need for Pinecone, Qdrant, or Chroma — just a single .db file on disk. Step 3: The RAG Chat Engine The chat engine ties retrieval and generation together: async *queryStream(userMessage, history = []) { // 1. Retrieve relevant CV/JD chunks const chunks = this.retrieve(userMessage); const context = this._buildContext(chunks); // 2. Build the prompt with retrieved context const messages = [ { role: "system", content: SYSTEM_PROMPT }, { role: "system", content: `Retrieved context:\n\n${context}` }, ...history, { role: "user", content: userMessage }, ]; // 3. Stream from the local model const stream = await this.openai.chat.completions.create({ model: this.modelId, messages, temperature: 0.3, stream: true, }); // 4. 
Yield chunks as they arrive for await (const chunk of stream) { const content = chunk.choices[0]?.delta?.content; if (content) yield { type: "text", data: content }; } } The flow is straightforward: vectorize the query, retrieve with cosine similarity, build a prompt with context, and stream from the local LLM. The temperature: 0.3 keeps responses focused — important for interview preparation where consistency matters. Step 4: Dual Interfaces — Web & CLI Web UI The web frontend is a single HTML file with inline CSS and JavaScript — no build step, no framework, no React or Vue. It communicates with the Express backend via REST and SSE: File upload via multipart/form-data Streaming chat via Server-Sent Events (SSE) Quick-action buttons for common follow-up queries (coaching tips, gap analysis, mock interview) The setup form with job title, seniority level, and a pasted job description — ready to generate tailored interview questions. CLI The CLI provides the same experience in the terminal with ANSI-coloured output: npm run cli It walks you through uploading your CV, entering the job details, and then generates streaming questions. Follow-up questions work interactively. Both interfaces share the same ChatEngine class, they're thin layers over identical logic. Edge Mode For constrained devices, toggle Edge mode to use a compact system prompt that fits within smaller context windows: Edge mode activated, uses a minimal prompt for devices with limited resources. Step 5: Testing Tests use the Node.js built-in test runner, no Jest, no Mocha, no extra dependencies: import { describe, it } from "node:test"; import assert from "node:assert/strict"; describe("chunkText", () => { it("returns single chunk for short text", () => { const chunks = chunkText("short text", 200, 25); assert.equal(chunks.length, 1); }); it("maintains overlap between chunks", () => { // Verifies overlapping tokens between consecutive chunks }); }); npm test Tests cover the chunker, vector store, config, prompts, and server API contract, all without needing Foundry Local running. Adapting for Your Own Use Case Interview Doctor is a pattern, not just a product. You can adapt it for any domain: What to Change How Domain documents Replace files in docs/ with your content System prompt Edit src/prompts.js Chunk sizes Adjust config.chunkSize and config.chunkOverlap Model Change config.model — run foundry model list UI Modify public/index.html — it's a single file Ideas for Adaptation Customer support bot — ingest your product docs and FAQs Code review assistant — ingest coding standards and best practices Study guide — ingest textbooks and lecture notes Compliance checker — ingest regulatory documents Onboarding assistant — ingest company handbooks and processes What I Learned Offline AI is production-ready. Foundry Local + small models like Phi-3.5 Mini are genuinely useful for focused tasks. You don't need vector databases for small collections. SQLite + TF-IDF is fast, simple, and has zero infrastructure overhead. RAG quality depends on chunking. Getting chunk sizes right for your use case is more impactful than the retrieval algorithm. The OpenAI-compatible API is a game-changer. Switching from cloud to local was mostly just changing the baseURL . Dual interfaces are easy when you share the engine. The CLI and Web UI are thin layers over the same ChatEngine class. ⚡ Performance Notes On a typical laptop (no GPU): ingestion takes under 1 second for ~20 documents, retrieval is ~1ms, and the first LLM token arrives in 2-5 seconds. 
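One of the lessons above, that the OpenAI-compatible API makes switching easy, can be shown concretely. A minimal sketch of the cloud-versus-local switch in an ES module (the cloud values are placeholders; only the baseURL and the source of the key change):

import { OpenAI } from "openai";
import { FoundryLocalManager } from "foundry-local-sdk";

// Cloud client: endpoint and key come from your provider (placeholder values).
const cloudClient = new OpenAI({
  baseURL: "https://api.openai.com/v1",
  apiKey: process.env.OPENAI_API_KEY,
});

// Local client: same SDK, but the endpoint and key come from Foundry Local.
const manager = new FoundryLocalManager();
await manager.init("phi-3.5-mini");
const localClient = new OpenAI({
  baseURL: manager.endpoint,
  apiKey: manager.apiKey,
});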
Foundry Local automatically selects the best model variant for your hardware (CUDA GPU, NPU, or CPU). Getting Started git clone https://github.com/leestott/interview-doctor-js.git cd interview-doctor-js npm install npm run ingest npm start # Web UI at http://127.0.0.1:3000 # or npm run cli # Interactive terminal The full source code is on GitHub. Star it, fork it, adapt it — and good luck with your interviews! Resources Foundry Local — Microsoft's on-device AI runtime Foundry Local SDK (npm) — JavaScript SDK Foundry Local GitHub — Source, samples, and documentation Local RAG Reference — Reference RAG implementation Interview Doctor (JavaScript) — This project's source code
Microsoft Foundry Labs: A Practical Fast Lane from Research to Real Developer Work
Why developers need a fast lane from research → prototypes AI engineering has a speed problem, but it is not a shortage of announcements. The hard part is turning research into a useful prototype before the next wave of models, tools, or agent patterns shows up. That gap matters. AI engineers want to compare quality, latency, and cost before they wire a model into a product. Full-stack teams want to test whether an agent workflow is real or just a demo. Platform and operations teams want to know when an experiment can graduate into something observable and supportable. Microsoft makes that case directly in introducing Microsoft Foundry Labs: breakthroughs are arriving faster, and time from research to product has compressed from years to months. If you build real systems, the question is not "What is the coolest demo?" It is "Which experiments are worth my next hour, and how do I evaluate them without creating demo-ware?" That is where Microsoft Foundry Labs becomes interesting. What is Microsoft Foundry Labs? Microsoft Foundry Labs is a place to explore early-stage experiments and prototypes from Microsoft, with an explicit focus on research-driven innovation. The homepage describes it as a way to get a glimpse of potential future directions for AI through experimental technologies from Microsoft Research and more. The announcement adds the operating idea: Labs is a single access point for developers to experiment with new models from Microsoft, explore frameworks, and share feedback. That framing matters. Labs is not just a gallery of flashy ideas. It is a developer-facing exploration surface for projects that are still close to research: models, agent systems, UX ideas, and tool experiments. Here are some things you can do on Labs: Play with tomorrow’s AI, today: 30+ experimental projects—from models to agents—are openly available to fork and build upon, alongside direct access to breakthrough research from Microsoft. Go from prototype to production, fast: Seamless integration with Microsoft Foundry gives you access to 11,000+ models with built-in compute, safety, observability, and governance—so you can move from local experimentation to full-scale production without complex containerization or switching platforms. Build with the people shaping the future of AI: Join a thriving community of 25,000+ developers across Discord and GitHub with direct access to Microsoft researchers and engineers to share feedback and help shape the most promising technologies. What Labs is not: it is not a promise that every project has a production deployment path today, a long-term support commitment, or a hardened enterprise operating model. Spotlight: a few Labs experiments worth a developer's attention Phi-4-Reasoning-Vision-15B: A compact open-weight multimodal reasoning model that is interesting if you care about the quality-versus-efficiency tradeoff in smaller reasoning systems. BitNet: A native 1-bit large language model that is compelling for engineers who care about memory, compute, and energy efficiency. Fara-7B: An ultra-compact agentic small language model designed for computer use, which makes it relevant for builders exploring UI automation and on-device agents. OmniParser V2: A screen parsing module that turns interfaces into actionable elements, directly relevant to computer-use and UI-interaction agents. If you want to inspect actual code, the Labs project pages also expose official repository links for some of these experiments, including OmniParser, Magentic-UI, and BitNet. Labs vs.
Foundry: how to think about the boundary The simplest mental model is this: Labs is the exploration edge; Foundry is the platform layer. The Microsoft Foundry documentation describes the broader platform as "the AI app and agent factory" to build, optimize, and govern AI apps and agents at scale. That is a different promise from Labs. Foundry is where you move from curiosity to implementation: model access, agent services, SDKs, observability, evaluation, monitoring, and governance. Labs helps you explore what might matter next. Foundry helps you build, optimize, and govern what matters now. Labs is where you test a research-shaped idea. Foundry is where you decide whether that idea can survive integration, evaluation, tracing, cost controls, and production scrutiny. That also means Labs is not a replacement for the broader Foundry workflow. If an experiment catches your attention, the next question is not "Can I ship this tomorrow?" It is "What is the integration path, and how will I measure whether it deserves promotion?" What's real today vs. what's experimental Real today: Labs is live as an official exploration hub, and Foundry is the broader platform for building, evaluating, monitoring, and governing AI apps and agents. Experimental by design: Labs projects are presented as experiments and prototypes, so they still need validation for your use case. A developer's lens: Models, Agents, Observability What makes Labs useful is not that it shows new things. It is that it gives developers a way to inspect those things through the same three concerns that matter in every serious AI system: model choice, agent design, and observability. Diagram description: imagine a loop with three boxes in a row: Models, Agents, and Observability. A forward arrow runs across the row, and a feedback arrow loops from Observability back to Models. The point is that evaluation data should change both model choices and agent design, instead of arriving too late. Models: what to look for in Labs experiments If you are model-curious, Labs should trigger an evaluation mindset, not a fandom mindset. When you see something like Phi-4-Reasoning-Vision-15B or BitNet on the Labs homepage, ask three things: what capability is being demonstrated, what constraints are obvious, and what the integration path would look like. This is where the Microsoft Foundry Playgrounds mindset is useful even if you started in Labs. The documentation emphasizes model comparison, prompt iteration, parameter tuning, tools, safety guardrails, and code export. It also pushes the right pre-production questions: price-to-performance, latency, tool integration, and code readiness. That is how I would use Labs for models: not to choose winners, but to generate hypotheses worth testing. If a Labs experiment looks promising, move quickly into a small evaluation matrix around capability, latency, cost, and integration friction. Agents: what Labs unlocks for agent builders Labs is especially interesting for agent builders because many of the projects point toward orchestration and tool-use patterns that matter in practice. The official announcement highlights projects across models and agentic frameworks, including Magentic-One and OmniParser v2. On the homepage, projects such as Fara-7B, OmniParser V2, TypeAgent, and Magentic-UI point in a similar direction: agents get more useful when they can reason over tools, interfaces, plans, and human feedback loops. 
For working developers, that means Labs can act as a scouting surface for agent patterns rather than just agent demos. Look for UI or computer-use style agents when your system needs to act through an interface rather than an API. Look for planning or tool-selection patterns when orchestration matters more than raw model quality. My suggestion: when a Labs project looks relevant to agent work, do not ask "Can I copy this architecture?" Ask "Which agent pattern is being explored here, and under what constraints would it be useful in my system?" Observability: how to experiment responsibly and measure what matters Observability is where prototypes usually go to die, because teams postpone it until after they have something flashy. That is backwards. If you care about real systems, tracing, evaluation, monitoring, and governance should start during prototyping. The Microsoft Foundry documentation already puts that operating model in plain view through guidance for tracing applications, evaluating agentic workflows, and monitoring generative AI apps. The Microsoft Foundry Playgrounds page is also explicit that the agents playground supports tracing and evaluation through AgentOps. At the governance layer, the AI gateway in Azure API Management documentation reinforces why this matters beyond demos. It covers monitoring and logging AI interactions, tracking token metrics, logging prompts and completions, managing quotas, applying safety policies, and governing models, agents, and tools. You do not need every one of those controls on day one, but you do need the habit: if a prototype cannot tell you what it did, why it failed, and what it cost, it is not ready to influence a roadmap. "Pick one and try it": a 20-minute hands-on path Keep this lightweight and tool-agnostic. The point is not to memorize a product UI. The point is to run a disciplined experiment. Browse Labs and pick an experiment aligned to your work. Start at Microsoft Foundry Labs and choose one project that is adjacent to a real problem you have: model efficiency, multimodal reasoning, UI agents, debugging workflows, or human-in-the-loop design. Read the project page and jump to the repo or paper if available. Use the Labs entry to understand the claim being made. Then read the supporting material, not just the summary sentence. Define one small test task and explicit success criteria. Keep it concrete: latency budget, accuracy target, cost ceiling, acceptable safety behavior, or failure rate under a narrow scenario. Capture telemetry from the start. At minimum, keep prompts or inputs, outputs, intermediate decisions, and failures. If the experiment involves tools or agents, include tool choices and obvious reasons for failure or recovery. Make a hard call. Decide whether to keep exploring or wait for a stronger production-grade path. "Interesting" is not the same as "ready for integration." Minimal experiment logger (my suggestion): if you want a lightweight way to avoid demo-ware, even a local JSONL log is enough to capture prompts, outputs, decisions, failures, and latency while you compare ideas from Labs. import json import time from pathlib import Path LOG_PATH = Path("experiment-log.jsonl") def record_event(name, payload): # Append one event per line so runs are easy to diff and analyze later. with LOG_PATH.open("a", encoding="utf-8") as handle: handle.write(json.dumps({"event": name, **payload}) + "\n") def run_experiment(user_input): started = time.time() try: # Replace this stub with your real model or agent call. 
output = user_input.upper() decision = "keep exploring" if len(output) < 80 else "wait" record_event( "experiment_result", { "input": user_input, "output": output, "decision": decision, "latency_ms": round((time.time() - started) * 1000, 2), "failure": None, }, ) except Exception as error: record_event( "experiment_result", { "input": user_input, "output": None, "decision": "failed", "latency_ms": round((time.time() - started) * 1000, 2), "failure": str(error), }, ) raise if __name__ == "__main__": run_experiment("Summarize the constraints of this Labs project.") That script is intentionally boring. That is the point. It gives you a repeatable, runnable starting point for comparing experiments without pretending you already have a full observability stack. Practical tips: how I evaluate Labs experiments before betting a roadmap on them Separate the idea from the implementation path. A strong research direction can still have a weak near-term integration story. Test one workload, not ten. Pick a narrow task that resembles your production reality and see whether the experiment moves the needle. Track cost and latency as first-class metrics. A novel capability that breaks your budget or response-time envelope is still a failed fit. Treat agent demos skeptically unless you can inspect behavior. Tool calls, traces, failure cases, and recovery paths matter more than polished output. Common pitfalls are predictable here. Do not confuse a research win with a deployment path. Labs is for exploration, so you still need to validate integration, safety, and operations. Do not evaluate with vague prompts. Use a narrow task and explicit success criteria, or you will end up comparing vibes instead of outcomes. Do not skip telemetry because the prototype is small. If you cannot inspect failures early, the prototype will teach you very little. Do not ignore known limitations. For example, the Fara-7B project page explicitly notes challenges on more complex tasks, instruction-following mistakes, and hallucinations, which is exactly the kind of constraint you should carry into evaluation. What to explore next Azure AI Foundry Labs matters because it gives developers a practical way to explore research-shaped ideas before they harden into mainstream patterns. The smart move is to use Labs as an input into better platform decisions: explore in Labs, validate with the discipline encouraged by Foundry playgrounds, and then bring the learnings back into the broader Foundry workflow. Takeaway 1: Labs is an exploration surface for early-stage, research-driven experiments and prototypes, not a blanket promise of production readiness. Takeaway 2: The right workflow is Labs for discovery, then Microsoft Foundry for implementation, optimization, evaluation, monitoring, and governance. Takeaway 3: Tracing, evaluations, and telemetry should start during prototyping, because that is how you avoid confusing a compelling demo with a viable system. If you are curious, start with Microsoft Foundry Labs, read the official context in Introducing Microsoft Foundry Labs, and then map what you learn into the platform guidance in Microsoft Foundry documentation. Try this next Open Microsoft Foundry Labs and choose one experiment that matches a real workload you care about. Use the mindset from Microsoft Foundry Playgrounds to define a small validation task around quality, latency, cost, and safety. Write down the minimum telemetry you need before continuing: inputs, outputs, decisions, failures, and token or cost signals. 
Read the relevant operating guidance in AI gateway in Azure API Management if your experiment may eventually need monitoring, quotas, safety policies, or governance. Promote only the experiments that can explain their value clearly in a Foundry-shaped build, evaluation, and observability workflow.
Building real-world AI automation with Foundry Local and the Microsoft Agent Framework
A hands-on guide to building real-world AI automation with Foundry Local, the Microsoft Agent Framework, and PyBullet. No cloud subscription, no API keys, no internet required. Why Developers Should Care About Offline AI Imagine telling a robot arm to "pick up the cube" and watching it execute the command in a physics simulator, all powered by a language model running on your laptop. No API calls leave your machine. No token costs accumulate. No internet connection is needed. That is what this project delivers, and every piece of it is open source and ready for you to fork, extend, and experiment with. Most AI demos today lean on cloud endpoints. That works for prototypes, but it introduces latency, ongoing costs, and data privacy concerns. For robotics and industrial automation, those trade-offs are unacceptable. You need inference that runs where the hardware is: on the factory floor, in the lab, or on your development machine. Foundry Local gives you an OpenAI-compatible endpoint running entirely on-device. Pair it with a multi-agent orchestration framework and a physics engine, and you have a complete pipeline that translates natural language into validated, safe robot actions. This post walks through how we built it, why the architecture works, and how you can start experimenting with your own offline AI simulators today. Architecture The system uses four specialised agents orchestrated by the Microsoft Agent Framework: Agent What It Does Speed PlannerAgent Sends user command to Foundry Local LLM → JSON action plan 4–45 s SafetyAgent Validates against workspace bounds + schema < 1 ms ExecutorAgent Dispatches actions to PyBullet (IK, gripper) < 2 s NarratorAgent Template summary (LLM opt-in via env var) < 1 ms User (text / voice) │ ▼ ┌──────────────┐ │ Orchestrator │ └──────┬───────┘ │ ┌────┴────┐ ▼ ▼ Planner Narrator │ ▼ Safety │ ▼ Executor │ ▼ PyBullet Setting Up Foundry Local from foundry_local import FoundryLocalManager import openai manager = FoundryLocalManager("qwen2.5-coder-0.5b") client = openai.OpenAI( base_url=manager.endpoint, api_key=manager.api_key, ) resp = client.chat.completions.create( model=manager.get_model_info("qwen2.5-coder-0.5b").id, messages=[{"role": "user", "content": "pick up the cube"}], max_tokens=128, stream=True, ) The SDK auto-selects the best hardware backend (CUDA GPU → QNN NPU → CPU). No configuration needed. How the LLM Drives the Simulator Understanding the interaction between the language model and the physics simulator is central to the project. The two never communicate directly. Instead, a structured JSON contract forms the bridge between natural language and physical motion. From Words to JSON When a user says “pick up the cube”, the PlannerAgent sends the command to the Foundry Local LLM alongside a compact system prompt. The prompt lists every permitted tool and shows the expected JSON format. The LLM responds with a structured plan: { "type": "plan", "actions": [ {"tool": "describe_scene", "args": {}}, {"tool": "pick", "args": {"object": "cube_1"}} ] } The planner parses this response, validates it against the action schema, and retries once if the JSON is malformed.
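As a rough illustration of that parse-validate-retry loop, here is a minimal Python sketch. The allowed tool list, function names, and retry limit are assumptions for the sketch, not the repository's exact code:

import json

ALLOWED_TOOLS = {"describe_scene", "move_ee", "pick", "place"}  # illustrative subset

def parse_plan(raw_text: str) -> dict:
    """Parse the LLM output and check it against the expected plan shape."""
    plan = json.loads(raw_text)  # raises a ValueError subclass on malformed JSON
    if plan.get("type") != "plan" or not isinstance(plan.get("actions"), list):
        raise ValueError("response is not a plan with an actions list")
    for action in plan["actions"]:
        if action.get("tool") not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {action.get('tool')!r}")
    return plan

def plan_with_retry(ask_llm, command: str, retries: int = 1) -> dict:
    """Call the model, retrying once with an error hint if the JSON is invalid."""
    prompt = command
    for attempt in range(retries + 1):
        try:
            return parse_plan(ask_llm(prompt))
        except ValueError as err:
            if attempt == retries:
                raise
            prompt = f"{command}\n\nYour previous reply was invalid ({err}). Return valid JSON only."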
This constrained output format is what makes small models (0.5B parameters) viable: the response space is narrow enough that even a compact model can produce correct JSON reliably. From JSON to Motion Once the SafetyAgent approves the plan, the ExecutorAgent maps each action to concrete PyBullet calls: move_ee(target_xyz) : The target position in Cartesian coordinates is passed to PyBullet's inverse kinematics solver, which computes the seven joint angles needed to place the end-effector at that position. The robot then interpolates smoothly from its current joint state to the target, stepping the physics simulation at each increment. pick(object) : This triggers a multi-step grasp sequence. The controller looks up the object's position in the scene, moves the end-effector above the object, descends to grasp height, closes the gripper fingers with a configurable force, and lifts. At every step, PyBullet resolves contact forces and friction so that the object behaves realistically. place(target_xyz) : The reverse of a pick. The robot carries the grasped object to the target coordinates and opens the gripper, allowing the physics engine to drop the object naturally. describe_scene() : Rather than moving the robot, this action queries the simulation state and returns the position, orientation, and name of every object on the table, along with the current end-effector pose. The Abstraction Boundary The critical design choice is that the LLM knows nothing about joint angles, inverse kinematics, or physics. It operates purely at the level of high-level tool calls ( pick , move_ee ). The ActionExecutor translates those tool calls into the low-level API that PyBullet provides. This separation means the LLM prompt stays simple, the safety layer can validate plans without understanding kinematics, and the executor can be swapped out without retraining or re-prompting the model. Voice Input Pipeline Voice commands follow three stages: Browser capture: MediaRecorder captures audio, client-side resamples to 16 kHz mono WAV Server transcription: Foundry Local Whisper (ONNX, cached after first load) with automatic 30 s chunking Command execution: transcribed text goes through the same Planner → Safety → Executor pipeline The mic button (🎤) only appears when a Whisper model is cached or loaded. Whisper models are filtered out of the LLM dropdown. Web UI in Action Pick command Describe command Move command Reset command Performance: Model Choice Matters Model Params Inference Pipeline Total qwen2.5-coder-0.5b 0.5 B ~4 s ~5 s phi-4-mini 3.6 B ~35 s ~36 s qwen2.5-coder-7b 7 B ~45 s ~46 s For interactive robot control, qwen2.5-coder-0.5b is the clear winner: valid JSON for a 7-tool schema in under 5 seconds. The Simulator in Action Here is the Panda robot arm performing a pick-and-place sequence in PyBullet. Each frame is rendered by the simulator's built-in camera and streamed to the web UI in real time. Overview Reaching Above the cube Gripper detail Front interaction Side layout Get Running in Five Minutes You do not need a GPU, a cloud account, or any prior robotics experience. The entire stack runs on a standard development machine. # 1. Install Foundry Local winget install Microsoft.FoundryLocal # Windows brew install foundrylocal # macOS # 2. Download models (one-time, cached locally) foundry model run qwen2.5-coder-0.5b # Chat brain (~4 s inference) foundry model run whisper-base # Voice input (194 MB) # 3. 
Clone and set up the project git clone https://github.com/leestott/robot-simulator-foundrylocal cd robot-simulator-foundrylocal .\setup.ps1 # or ./setup.sh on macOS/Linux # 4. Launch the web UI python -m src.app --web --no-gui # → http://localhost:8080 Once the server starts, open your browser and try these commands in the chat box: "pick up the cube": the robot grasps the blue cube and lifts it "describe the scene": returns every object's name and position "move to 0.3 0.2 0.5": sends the end-effector to specific coordinates "reset": returns the arm to its neutral pose If you have a microphone connected, hold the mic button and speak your command instead of typing. Voice input uses a local Whisper model, so your audio never leaves the machine. Experiment and Build Your Own The project is deliberately simple so that you can modify it quickly. Here are some ideas to get started. Add a new robot action The robot currently understands seven tools. Adding an eighth takes four steps: Define the schema in TOOL_SCHEMAS ( src/brain/action_schema.py ). Write a _do_<tool> handler in src/executor/action_executor.py . Register it in ActionExecutor._dispatch . Add a test in tests/test_executor.py . For example, you could add a rotate_ee tool that spins the end-effector to a given roll/pitch/yaw without changing position. Add a new agent Every agent follows the same pattern: an async run(context) method that reads from and writes to a shared dictionary. Create a new file in src/agents/ , register it in orchestrator.py , and the pipeline will call it in sequence. Ideas for new agents: VisionAgent: analyse a camera frame to detect objects and update the scene state before planning. CostEstimatorAgent: predict how many simulation steps an action plan will take and warn the user if it is expensive. ExplanationAgent: generate a step-by-step natural language walkthrough of the plan before execution, allowing the user to approve or reject it. Swap the LLM python -m src.app --web --model phi-4-mini Or use the model dropdown in the web UI; no restart is needed. Try different models and compare accuracy against inference speed. Smaller models are faster but may produce malformed JSON more often. Larger models are more accurate but slower. The retry logic in the planner compensates for occasional failures, so even a small model works well in practice. Swap the simulator PyBullet is one option, but the architecture does not depend on it. You could replace the simulation layer with: MuJoCo: a high-fidelity physics engine popular in reinforcement learning research. Isaac Sim: NVIDIA's GPU-accelerated robotics simulator with photorealistic rendering. Gazebo: the standard ROS simulator, useful if you plan to move to real hardware through ROS 2. The only requirement is that your replacement implements the same interface as PandaRobot and GraspController . Build something completely different The pattern at the heart of this project (LLM produces structured JSON, safety layer validates, executor dispatches to a domain-specific engine) is not limited to robotics. You could apply the same architecture to: Home automation: "turn off the kitchen lights and set the thermostat to 19 degrees" translated into MQTT or Zigbee commands. Game AI: natural language control of characters in a game engine, with the safety agent preventing invalid moves. CAD automation: voice-driven 3D modelling where the LLM generates geometry commands for OpenSCAD or FreeCAD. 
Lab instrumentation: controlling scientific equipment (pumps, stages, spectrometers) via natural language, with the safety agent enforcing hardware limits. From Simulator to Real Robot One of the most common questions about projects like this is whether it could control a real robot. The answer is yes, and the architecture is designed to make that transition straightforward. What Stays the Same The entire upper half of the pipeline is hardware-agnostic: The LLM planner generates the same JSON action plans regardless of whether the target is simulated or physical. It has no knowledge of the underlying hardware. The safety agent validates workspace bounds and tool schemas. For a real robot, you would tighten the bounds to match the physical workspace and add checks for obstacle clearance using sensor data. The orchestrator coordinates agents in the same sequence. No changes are needed. The narrator reports what happened. It works with any result data the executor returns. What Changes The only component that must be replaced is the executor layer, specifically the PandaRobot class and the GraspController . In simulation, these call PyBullet's inverse kinematics solver and step the physics engine. On a real robot, they would instead call the hardware driver. For a Franka Emika Panda (the same robot modelled in the simulation), the replacement options include: libfranka: Franka's C++ real-time control library, which accepts joint position or torque commands at 1 kHz. ROS 2 with MoveIt: A robotics middleware stack that provides motion planning, collision avoidance, and hardware abstraction. The move_ee action would become a MoveIt goal, and the framework would handle trajectory planning and execution. Franka ROS 2 driver: Combines libfranka with ROS 2 for a drop-in replacement of the simulation controller. The ActionExecutor._dispatch method maps tool names to handler functions. Replacing _do_move_ee , _do_pick , and _do_place with calls to a real robot driver is the only code change required. Key Considerations for Real Hardware Safety: A simulated robot cannot cause physical harm; a real robot can. The safety agent would need to incorporate real-time collision checking against sensor data (point clouds from depth cameras, for example) rather than relying solely on static workspace bounds. Perception: In simulation, object positions are known exactly. On a real robot, you would need a perception system (cameras with object detection or fiducial markers) to locate objects before grasping. Calibration: The simulated robot's coordinate frame matches the URDF model perfectly. A real robot requires hand-eye calibration to align camera coordinates with the robot's base frame. Latency: Real actuators have physical response times. The executor would need to wait for motion completion signals from the hardware rather than stepping a simulation loop. Gripper feedback: In PyBullet, grasp success is determined by contact forces. A real gripper would provide force or torque feedback to confirm whether an object has been securely grasped. The Simulation as a Development Tool This is precisely why simulation-first development is valuable. You can iterate on the LLM prompts, agent logic, and command pipeline without risk to hardware. Once the pipeline reliably produces correct action plans in simulation, moving to a real robot is a matter of swapping the lowest layer of the stack. Key Takeaways for Developers On-device AI is production-ready. 
Foundry Local serves models through a standard OpenAI-compatible API. If your code already uses the OpenAI SDK, switching to local inference is a one-line change to base_url . Small models are surprisingly capable. A 0.5B parameter model produces valid JSON action plans in under 5 seconds. For constrained output schemas, you do not need a 70B model. Multi-agent pipelines are more reliable than monolithic prompts. Splitting planning, validation, execution, and narration across four agents makes each one simpler to test, debug, and replace. Simulation is the safest way to iterate. You can refine LLM prompts, agent logic, and tool schemas without risking real hardware. When the pipeline is reliable, swapping the executor for a real robot driver is the only change needed. The pattern generalises beyond robotics. Structured JSON output from an LLM, validated by a safety layer, dispatched to a domain-specific engine: that pattern works for home automation, game AI, CAD, lab equipment, and any other domain where you need safe, structured control. You can start building today. The entire project runs on a standard laptop with no GPU, no cloud account, and no API keys. Clone the repository, run the setup script, and you will have a working voice-controlled robot simulator in under five minutes. Ready to start building? Clone the repository, try the commands, and then start experimenting. Fork it, add your own agents, swap in a different simulator, or apply the pattern to an entirely different domain. The best way to learn how local AI can solve real-world problems is to build something yourself. Source code: github.com/leestott/robot-simulator-foundrylocal Built with Foundry Local, Microsoft Agent Framework, PyBullet, and FastAPI.
Building a Multi-Agent On-Call Copilot with Microsoft Agent Framework
Four AI agents, one incident payload, structured triage in under 60 seconds powered by Microsoft Agent Framework and Foundry Hosted Agents. Multi-Agent Microsoft Agent Framework Foundry Hosted Agents Python SRE / Incident Response When an incident fires at 3 AM, every second the on-call engineer spends piecing together alerts, logs, and metrics is a second not spent fixing the problem. What if an AI system could ingest the raw incident signals and hand you a structured triage, a Slack update, a stakeholder brief, and a draft post-incident report, all in under 10 seconds? That’s exactly what On-Call Copilot does. In this post, we’ll walk through how we built it using the Microsoft Agent Framework and deployed it as a Foundry Hosted Agent, and discuss the key design decisions that make multi-agent orchestration practical for production workloads. The full source code is open-source on GitHub. You can deploy your own instance with a single azd up . Why Multi-Agent? The Problem with Single-Prompt Triage Early AI incident assistants used a single large prompt: “Here is the incident. Give me root causes, actions, a Slack message, and a post-incident report.” This approach has two fundamental problems: Context overload. A real incident may have 800 lines of logs, 10 alert lines, and dense metrics. Asking one model to process everything and produce four distinct output formats in a single turn pushes token limits and degrades quality. Conflicting concerns. Triage reasoning and communication drafting are cognitively different tasks. A model optimised for structured JSON analysis often produces stilted Slack messages—and vice versa. The fix is specialisation: decompose the task into focused agents, give each agent a narrow instruction set, and run them in parallel. This is the core pattern that the Microsoft Agent Framework makes easy. Architecture: Four Agents Running Concurrently On-Call Copilot is deployed as a Foundry Hosted Agent—a containerised Python service running on Microsoft Foundry’s managed infrastructure. The core orchestrator uses ConcurrentBuilder from the Microsoft Agent Framework SDK to run four specialist agents in parallel via asyncio.gather() . All four panels populated simultaneously: Triage (red), Summary (blue), Comms (green), PIR (purple). Architecture: The orchestrator runs four specialist agents concurrently via asyncio.gather(), then merges their JSON fragments into a single response. All four agents in the solution share a single Azure OpenAI Model Router deployment. Rather than hardcoding gpt-4o or gpt-4o-mini , Model Router analyses request complexity and routes automatically. A simple triage prompt costs less; a long post-incident synthesis uses a more capable model. One deployment name, zero model-selection code. Meet the Four Agents 🔍 Triage Agent Root cause analysis, immediate actions, missing data identification, and runbook alignment. suspected_root_causes · immediate_actions · missing_information · runbook_alignment 📋 Summary Agent Concise incident narrative: what happened and current status (ONGOING / MITIGATED / RESOLVED). summary.what_happened · summary.current_status 📢 Comms Agent Audience-appropriate communications: Slack channel update with emoji conventions, plus a non-technical stakeholder brief. comms.slack_update · comms.stakeholder_update 📝 PIR Agent Post-incident report: chronological timeline, quantified customer impact, and specific prevention actions.
post_incident_report.timeline · .customer_impact · .prevention_actions The Code: Building the Orchestrator The entry point is remarkably concise. ConcurrentBuilder handles all the async wiring—you just declare the agents and let the framework handle parallelism, error propagation, and response merging. main.py — Orchestrator from agent_framework import ConcurrentBuilder from agent_framework.azure import AzureOpenAIChatClient from azure.ai.agentserver.agentframework import from_agent_framework from azure.identity import DefaultAzureCredential, get_bearer_token_provider from app.agents.triage import TRIAGE_INSTRUCTIONS from app.agents.comms import COMMS_INSTRUCTIONS from app.agents.pir import PIR_INSTRUCTIONS from app.agents.summary import SUMMARY_INSTRUCTIONS _credential = DefaultAzureCredential() _token_provider = get_bearer_token_provider( _credential, "https://cognitiveservices.azure.com/.default" ) def create_workflow_builder(): """Create 4 specialist agents and wire them into a ConcurrentBuilder.""" triage = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=TRIAGE_INSTRUCTIONS, name="triage-agent", ) summary = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=SUMMARY_INSTRUCTIONS, name="summary-agent", ) comms = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=COMMS_INSTRUCTIONS, name="comms-agent", ) pir = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=PIR_INSTRUCTIONS, name="pir-agent", ) return ConcurrentBuilder().participants([triage, summary, comms, pir]) def main(): builder = create_workflow_builder() from_agent_framework(builder.build).run() # starts on port 8088 if __name__ == "__main__": main() Key insight: DefaultAzureCredential means there are no API keys anywhere in the codebase. The container uses managed identity in production; local development uses your az login session. The same code runs in both environments without modification. Agent Instructions: Prompts as Configuration Each agent receives a tightly scoped system prompt that defines its output schema and guardrails. Here’s the Triage Agent—the most complex of the four: app/agents/triage.py TRIAGE_INSTRUCTIONS = """\ You are the **Triage Agent**, an expert Site Reliability Engineer specialising in root cause analysis and incident response. ## Task Analyse the incident data and return a single JSON object with ONLY these keys: { "suspected_root_causes": [ { "hypothesis": "string – concise root cause hypothesis", "evidence": ["string – supporting evidence from the input"], "confidence": 0.0 // 0-1, how confident you are } ], "immediate_actions": [ { "step": "string – concrete action with runnable command if applicable", "owner_role": "oncall-eng | dba | infra-eng | platform-eng", "priority": "P0 | P1 | P2 | P3" } ], "missing_information": [ { "question": "string – what data is missing", "why_it_matters": "string – why this data would help" } ], "runbook_alignment": { "matched_steps": ["string – runbook steps that match the situation"], "gaps": ["string – gaps or missing runbook coverage"] } } ## Guardrails 1. **No secrets** – redact any credential-like material as [REDACTED]. 2. **No hallucination** – if data is insufficient, set confidence to 0 and add entries to missing_information. 3. **Diagnostic suggestions** – when data is sparse, include diagnostic steps in immediate_actions. 4. **Structured output only** – return ONLY valid JSON, no prose. 
""" The Comms Agent follows the same pattern but targets a different audience: app/agents/comms.py COMMS_INSTRUCTIONS = """\ You are the **Comms Agent**, an expert incident communications writer. ## Task Return a single JSON object with ONLY this key: { "comms": { "slack_update": "Slack-formatted message with emoji, severity, status, impact, next steps, and ETA", "stakeholder_update": "Non-technical summary for executives. Focus on business impact and resolution." } } ## Guidelines - Slack: Use :rotating_light: for active SEV1/2, :warning: for degraded, :white_check_mark: for resolved. - Stakeholder: No jargon. Translate to business impact. - Tone: Calm, factual, action-oriented. Never blame individuals. - Structured output only – return ONLY valid JSON, no prose. """ Instructions as config, not code. Agent behaviour is defined entirely by instruction text strings. A non-developer can refine agent behaviour by editing the prompt and redeploying no Python changes needed. The Incident Envelope: What Goes In The agent accepts a single JSON envelope. It can come from a monitoring alert webhook, a PagerDuty payload, or a manual CLI invocation: Incident Input (JSON) { "incident_id": "INC-20260217-002", "title": "DB connection pool exhausted — checkout-api degraded", "severity": "SEV1", "timeframe": { "start": "2026-02-17T14:02:00Z", "end": null }, "alerts": [ { "name": "DatabaseConnectionPoolNearLimit", "description": "Connection pool at 99.7% on orders-db-primary", "timestamp": "2026-02-17T14:03:00Z" } ], "logs": [ { "source": "order-worker", "lines": [ "ERROR: connection timeout after 30s (attempt 3/3)", "WARN: pool exhausted, queueing request (queue_depth=847)" ] } ], "metrics": [ { "name": "db_connection_pool_utilization_pct", "window": "5m", "values_summary": "Jumped from 22% to 99.7% at 14:03Z" } ], "runbook_excerpt": "Step 1: Check DB connection dashboard...", "constraints": { "max_time_minutes": 15, "environment": "production", "region": "swedencentral" } } Declaring the Hosted Agent The agent is registered with Microsoft Foundry via a declarative agent.yaml file. This tells Foundry how to discover and route requests to the container: agent.yaml kind: hosted name: oncall-copilot description: | Multi-agent hosted agent that ingests incident signals and runs 4 specialist agents concurrently via Microsoft Agent Framework ConcurrentBuilder. metadata: tags: - Azure AI AgentServer - Microsoft Agent Framework - Multi-Agent - Model Router protocols: - protocol: responses environment_variables: - name: AZURE_OPENAI_ENDPOINT value: ${AZURE_OPENAI_ENDPOINT} - name: AZURE_OPENAI_CHAT_DEPLOYMENT_NAME value: model-router The protocols: [responses] declaration exposes the agent via the Foundry Responses API on port 8088. Clients can invoke it with a standard HTTP POST no custom API needed. 
Invoking the Agent Once deployed, you can invoke the agent with the project’s built-in scripts or directly via curl : CLI / curl # Using the included invoke script python scripts/invoke.py --demo 2 # multi-signal SEV1 demo python scripts/invoke.py --scenario 1 # Redis cluster outage # Or with curl directly TOKEN=$(az account get-access-token \ --resource https://ai.azure.com --query accessToken -o tsv) curl -X POST \ "$AZURE_AI_PROJECT_ENDPOINT/openai/responses?api-version=2025-05-15-preview" \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d '{ "input": [ {"role": "user", "content": "<incident JSON here>"} ], "agent": { "type": "agent_reference", "name": "oncall-copilot" } }' The Browser UI The project includes a zero-dependency browser UI built with plain HTML, CSS, and vanilla JavaScript—no React, no bundler. A Python http.server backend proxies requests to the Foundry endpoint. The empty state. Quick-load buttons pre-populate the JSON editor with demo incidents or scenario files. Demo 1 loaded: API Gateway 5xx spike, SEV3. The JSON is fully editable before submitting. Agent Output Panels Triage: Root causes ranked by confidence. Evidence is collapsed under each hypothesis. Triage: Immediate actions with P0/P1/P2 priority badges and owner roles. Comms: Slack card with emoji substitution and a stakeholder executive summary. PIR: Chronological timeline with an ONGOING marker, customer impact in a red-bordered box. Performance: Parallel Execution Matters Incident Type Complexity Parallel Latency Sequential (est.) Single alert, minimal context (SEV4) Low 4–6 s ~16 s Multi-signal, logs + metrics (SEV2) Medium 7–10 s ~28 s Full SEV1 with long log lines High 10–15 s ~40 s Post-incident synthesis (resolved) High 10–14 s ~38 s asyncio.gather() running four independent agents cuts total latency by 3–4× compared to sequential execution. For a SEV1 at 3 AM, that’s the difference between a 10-second AI-powered head start and a 40-second wait. Five Key Design Decisions Parallel over sequential Each agent is independent and processes the full incident payload in isolation. ConcurrentBuilder with asyncio.gather() is the right primitive—no inter-agent dependencies, no shared state. JSON-only agent instructions Every agent returns only valid JSON with a defined schema. The orchestrator merges fragments with merged.update(agent_output) . No parsing, no extraction, no post-processing. No hardcoded model names AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=model-router is the only model reference. Model Router selects the best model at runtime based on prompt complexity. When new models ship, the agent gets better for free. DefaultAzureCredential everywhere No API keys. No token management code. Managed identity in production, az login in development. Same code, both environments. Instructions as configuration Each agent’s system prompt is a plain Python string. Behaviour changes are text edits, not code logic. A non-developer can refine prompts and redeploy. Guardrails: Built into the Prompts The agent instructions include explicit guardrails that don’t require external filtering: No hallucination: When data is insufficient, the agent sets confidence: 0 and populates missing_information rather than inventing facts. Secret redaction: Each agent is instructed to redact credential-like patterns as [REDACTED] in its output. Mark unknowns: Undeterminable fields use the literal string "UNKNOWN" rather than plausible-sounding guesses. 
Diagnostic suggestions: When signal is sparse, immediate_actions includes diagnostic steps that gather missing information before prescribing a fix. Model Router: Automatic Model Selection One of the most powerful aspects of this architecture is Model Router. Instead of choosing between gpt-4o , gpt-4o-mini , or o3-mini per agent, you deploy a single model-router endpoint. Model Router analyses each request’s complexity and routes it to the most cost-effective model that can handle it. Model Router insights: models selected per request with associated costs. Model Router telemetry from Microsoft Foundry: request distribution and cost analysis. This means you get optimal cost-performance without writing any model-selection logic. A simple Summary Agent prompt may route to gpt-4o-mini , while a complex Triage Agent prompt with 800 lines of logs routes to gpt-4o all automatically. Deployment: One Command The repo includes both azure.yaml and agent.yaml , so deployment is a single command: Deploy to Foundry # Deploy everything: infra + container + Model Router + Hosted Agent azd up This provisions the Foundry project resources, builds the Docker image, pushes to Azure Container Registry, deploys a Model Router instance, and creates the Hosted Agent. For more control, you can use the SDK deploy script: Manual Docker + SDK deploy # Build and push (must be linux/amd64) docker build --platform linux/amd64 -t oncall-copilot:v1 . docker tag oncall-copilot:v1 $ACR_IMAGE docker push $ACR_IMAGE # Create the hosted agent python scripts/deploy_sdk.py Getting Started Quickstart # Clone git clone https://github.com/microsoft-foundry/oncall-copilot cd oncall-copilot # Install python -m venv .venv source .venv/bin/activate # .venv\Scripts\activate on Windows pip install -r requirements.txt # Set environment variables export AZURE_OPENAI_ENDPOINT="https://<account>.openai.azure.com/" export AZURE_OPENAI_CHAT_DEPLOYMENT_NAME="model-router" export AZURE_AI_PROJECT_ENDPOINT="https://<account>.services.ai.azure.com/api/projects/<project>" # Validate schemas locally (no Azure needed) MOCK_MODE=true python scripts/validate.py # Deploy to Foundry azd up # Invoke the deployed agent python scripts/invoke.py --demo 1 # Start the browser UI python ui/server.py # → http://localhost:7860 Extending: Add Your Own Agent Adding a fifth agent is straightforward. Follow this pattern: Create app/agents/<name>.py with a *_INSTRUCTIONS constant following the existing pattern. Add the agent’s output keys to app/schemas.py . Register it in main.py : main.py — Adding a 5th agent from app.agents.my_new_agent import NEW_INSTRUCTIONS new_agent = AzureOpenAIChatClient( ad_token_provider=_token_provider ).create_agent( instructions=NEW_INSTRUCTIONS, name="new-agent", ) workflow = ConcurrentBuilder().participants( [triage, summary, comms, pir, new_agent] ) Ideas for extensions: a ticket auto-creation agent that creates Jira or Azure DevOps items from the PIR output, a webhook adapter agent that normalises PagerDuty or Datadog payloads, or a human-in-the-loop agent that surfaces missing_information as an interactive form. Key Takeaways for AI Engineers The multi-agent pattern isn’t just for chatbots. Any task that can be decomposed into independent subtasks with distinct output schemas is a candidate. Incident response, document processing, code review, data pipeline validation—the pattern transfers. 
Microsoft Agent Framework gives you ConcurrentBuilder for parallel execution and AzureOpenAIChatClient for Azure-native auth—you write the prompts, the framework handles the plumbing.

Foundry Hosted Agents let you deploy containerised agents with managed infrastructure, automatic scaling, and built-in telemetry. No Kubernetes, no custom API gateway.

Model Router eliminates the model selection problem. One deployment name handles all scenarios with optimal cost-performance tradeoffs.

Prompt-as-config means your agents are iterable by anyone who can edit text. The feedback loop from "this output could be better" to "deployed improvement" is minutes, not sprints.

Resources

Microsoft Agent Framework – SDK powering the multi-agent orchestration
Model Router – Automatic model selection based on prompt complexity
Foundry Hosted Agents – Deploy containerised agents on managed infrastructure
ConcurrentBuilder Samples – Official agents-in-workflow sample this project follows
DefaultAzureCredential – Zero-config auth chain used throughout
Hosted Agents Concepts – Architecture overview of Foundry Hosted Agents

The On-Call Copilot sample is open source under the MIT licence. Contributions, scenario files, and agent instruction improvements are welcome via pull request.

Using Azure API Management with Azure Front Door for Global, Multi‑Region Architectures
Modern API‑driven applications demand global reach, high availability, and predictable latency. Azure provides two complementary services that help achieve this: Azure API Management (APIM) as the API gateway and Azure Front Door (AFD) as the global entry point and load balancer.

Going over the available documentation, my team and I found this article on how to front a single-region APIM with an Azure Front Door, but we wanted to extend this to a multi-region APIM as well. That led us to design the solution detailed in this article, which explains how to configure a multi‑regional, active‑active APIM behind Azure Front Door using Custom origins and regional gateway endpoints. (I have also covered topics like why organizations commonly pair APIM with Front Door, when to use internal vs. external APIM modes, etc., but main topic first! Scroll down to the bottom for more info.)

Configuring Multi‑Regional APIM with Azure Front Door

WHAT TO KNOW: If using APIM Premium with multi‑region gateways, each region exposes its own regional gateway endpoint, formatted as:

https://<service-name>-<region>-01.regional.azure-api.net

Examples:

https://mydemo-apim-westeurope-01.regional.azure-api.net
https://mydemo-apim-eastus-01.regional.azure-api.net

where 'mydemo' is the name of the APIM instance. You will use these regional endpoints and configure each of them as a separate origin in Azure Front Door—using the Custom origin type.

Solution Architecture

Azure Front Door Configuration Steps

1. Create an Origin Group

Inside your Front Door profile, define a group (Settings -> Origin Groups -> Add -> Add an origin) that will contain all APIM regional gateways. See images below:

2. Add Each APIM Region as a Custom Origin

Use the Custom origin type:

Origin type: Custom
Host name: Use the APIM regional endpoint. Example: mydemo-apim-westeurope-01.regional.azure-api.net
Origin host header: Same as the host name.
Enable certificate subject name validation (recommended when private link or TLS integrity is required).
Priority: Lower value = higher priority (for failover).
Weight: Controls how traffic is distributed across equally prioritized origins.
Status: Enable origin.

Repeat the same steps for additional APIM regions, assigning priorities and weights as you feel appropriate.

How to Know Which Region Is Being Invoked

To test this setup, create two Virtual Machines (VMs) in Azure, one for each region. For this guide, we chose to create the VMs in West Europe and East US. Open a Command Prompt on each VM and run curl against the sample Echo API that comes with every new APIM deployment. Example:

curl -v "afd-blah.b01.azurefd.net/echo/resource?param1=sample"

Your results should show the region being hit, as shown below:

How AFD Routes Traffic Across Multiple APIM Regions

AFD evaluates origins in this order:

Available instances — the health probe removes unhealthy origins
Priority — selects the highest‑priority available origins
Latency — optionally selects the lowest‑latency pool
Weight — round‑robin distribution across the selected origins

Example

When origins are configured as below:

West Europe (priority 1, weight 1000)
East US (priority 1, weight 500)
Central US (priority 2, weight 1000)

AFD will:

Use West Europe + East US in a 1000:500 (2:1) ratio.
Only use Central US if both West Europe and East US become unavailable.

For more information on this algorithm, see: Traffic routing methods to origin - Azure Front Door | Microsoft Learn
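To make that evaluation order concrete, here is a minimal JavaScript sketch of the selection logic. It is only an illustration of the documented behaviour, not Azure Front Door's actual implementation; the origin list, latency values, and the latency sensitivity parameter are assumptions.

```javascript
// Simplified illustration of the documented origin-selection order:
// health probe -> priority -> latency sensitivity -> weight.
const origins = [
  { name: "westeurope", healthy: true, priority: 1, weight: 1000, latencyMs: 18 },
  { name: "eastus",     healthy: true, priority: 1, weight: 500,  latencyMs: 95 },
  { name: "centralus",  healthy: true, priority: 2, weight: 1000, latencyMs: 110 },
];

function selectOrigin(origins, latencySensitivityMs = 500) {
  // 1. Health probe: drop unhealthy origins
  const healthy = origins.filter(o => o.healthy);
  if (healthy.length === 0) throw new Error("No healthy origins available");

  // 2. Priority: keep only the highest-priority (lowest number) group
  const topPriority = Math.min(...healthy.map(o => o.priority));
  let pool = healthy.filter(o => o.priority === topPriority);

  // 3. Latency: keep origins within the sensitivity window of the fastest one
  const fastest = Math.min(...pool.map(o => o.latencyMs));
  pool = pool.filter(o => o.latencyMs <= fastest + latencySensitivityMs);

  // 4. Weight: weighted distribution across the remaining pool
  const totalWeight = pool.reduce((sum, o) => sum + o.weight, 0);
  let ticket = Math.random() * totalWeight;
  for (const origin of pool) {
    ticket -= origin.weight;
    if (ticket <= 0) return origin;
  }
  return pool[pool.length - 1];
}

console.log(selectOrigin(origins).name); // "westeurope" roughly 2/3 of the time, "eastus" 1/3
```

With the weights above, roughly two out of every three requests land on West Europe while both priority-1 origins are healthy, matching the 1000:500 split described earlier.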
More Info (as promised)

Why Use Azure API Management?

Azure API Management is a fully managed service providing:

1. Centralized API Gateway
Enforces policies such as authentication, rate limiting, transformations, and caching. Acts as a single façade for backend services, enabling modernization without breaking existing clients.

2. Security & Governance
Integrates with Azure AD, OAuth2, and mTLS (mutual TLS). Provides threat protection and schema validation.

3. Developer Ecosystem
Developer portal, API documentation, testing console, versioning, and releases.

4. Multi‑Region Gateways (Premium Tier)
Allows deployment of additional regional gateways for active‑active, low‑latency global experiences.

APIM Deployment Modes: Internal vs. External

External Mode
The APIM gateway is reachable publicly over the internet. Common when exposing APIs to partners, mobile apps, or public clients. You can easily front this with an Azure Front Door for the reasons listed in the next section.

Internal Mode
The APIM gateway is deployed inside a VNet and is accessible only privately. Used when APIs must stay private to an enterprise network and only internal consumers/VPN/VNet-peered systems need access.

To make an internal-mode APIM publicly accessible, you need to front it with both an Application Gateway and an Azure Front Door because:

Azure Front Door (AFD) cannot directly reach an internal‑mode APIM because AFD requires a publicly routable origin.
Application Gateway is a Layer‑7 reverse proxy that can expose a public frontend while still reaching internal private backends (like the APIM gateway). [Ref]

But Why Put Azure Front Door in Front of API Management?

Azure Front Door provides capabilities that APIM alone does not offer:

1. Global Load Balancing
As discussed above.

2. Edge Security
Web Application Firewall, TLS termination at the edge, DDoS absorption. Reduces load on API gateways.

3. Faster Global Performance
Anycast networking and global POPs reduce round‑trip latency before requests hit APIM. A POP (Point of Presence) is an Azure Front Door edge location—a physical site in Microsoft's global network where incoming user traffic first lands. Azure Front Door uses numerous global and local POPs strategically placed close to end‑users (both enterprise and consumer) to improve performance. Anycast is a network routing technique Azure Front Door uses to improve global connectivity. Ref: Traffic acceleration - Azure Front Door | Microsoft Learn

4. Unified Global Endpoint
A single public endpoint (e.g., https://api.contoso.com) that intelligently distributes traffic across multiple APIM regions.

With all of the above features, it is best to pair API Management with a Front Door, especially when dealing with multi-region architectures.

Credits:
Junee Singh, Senior Solution Engineer at Microsoft
Isiah Hudson, Senior Solution Engineer at Microsoft

On-Premises Manufacturing Intelligence
Manufacturing facilities face a fundamental dilemma in the AI era: how to harness artificial intelligence for predictive maintenance, equipment diagnostics, and operational insights while keeping sensitive production data entirely on-premises. Industrial environments generate proprietary information (CNC machining parameters, quality control thresholds, equipment performance signatures, maintenance histories) that represents competitive advantage accumulated over decades of process optimization. Sending this data to cloud APIs risks intellectual property exposure, regulatory non-compliance, and operational dependencies that manufacturing operations cannot accept.

Traditional cloud-based AI introduces unacceptable vulnerabilities. Network latency of 100-500ms makes real-time decision support impossible for time-sensitive manufacturing processes. Internet dependency creates single points of failure in environments where connectivity is unreliable or deliberately restricted for security. API pricing models become prohibitively expensive when analyzing thousands of sensor readings and maintenance logs continuously. Most critically, data residency requirements for aerospace, defense, pharmaceutical, and automotive industries make cloud AI architectures non-compliant by design: ITAR, FDA 21 CFR Part 11, and customer-specific mandates require that data never leave facility boundaries.

This article demonstrates a sample solution for manufacturing asset intelligence that runs entirely on-premises using Microsoft Foundry Local, Node.js, and JavaScript. The FoundryLocal-IndJSsample repository provides a production-ready implementation with an Express backend, an HTML/JavaScript frontend, and comprehensive Foundry Local SDK integration. Facilities can deploy sophisticated AI-powered monitoring without external dependencies, cloud costs, data exposure risks, or network requirements. Every inference happens locally on facility hardware with predictable performance and zero data egress.

Why On-Premises AI Matters for Industrial Operations

The case for local AI inference in manufacturing extends beyond simple preference: it addresses fundamental operational, security, and compliance requirements that cloud solutions cannot satisfy. Understanding these constraints shapes architectural decisions that prioritize reliability, data sovereignty, and cost predictability.

Data Sovereignty and Intellectual Property Protection

Manufacturing processes represent years of proprietary research, optimization, and competitive advantage. Equipment configurations, cycle times, quality thresholds, and maintenance schedules contain intelligence that competitors would value highly. Sending this data to third-party cloud services, even with contractual protections, introduces risks that manufacturing operations cannot accept.

On-premises AI ensures that production data never leaves the facility network perimeter. Telemetry from CNC machines, hydraulic systems, conveyor networks, and control systems remains within air-gapped environments where physical access controls and network isolation provide demonstrable data protection. This architectural guarantee of data locality satisfies both internal security policies and external audit requirements without relying on contractual assurances or encryption alone.

Operational Resilience and Network Independence

Factory floors frequently operate in environments with limited, unreliable, or intentionally restricted internet connectivity.
Remote facilities, secure manufacturing zones, and legacy industrial networks cannot depend on continuous cloud access for critical monitoring functions. When network failures occur, whether from ISP outages, DDoS attacks, or infrastructure damage, AI capabilities must continue operating to prevent production losses.

Local inference provides true operational independence. Equipment health monitoring, anomaly detection, and maintenance prioritization continue functioning during network disruptions. This resilience is essential for 24/7 manufacturing operations where downtime costs can exceed tens of thousands of dollars per hour. By eliminating external dependencies, on-premises AI becomes as reliable as the local power supply and computing infrastructure.

Latency Requirements for Real-Time Decision Making

Manufacturing processes involve precise timing where milliseconds determine quality outcomes. Automated inspection systems must classify defects before products leave the production line. Safety interlocks must respond to hazardous conditions before injuries occur. Predictive maintenance alerts must trigger before catastrophic equipment failures cascade through production lines.

Cloud-based AI introduces latency that is incompatible with these requirements. Network round-trips to cloud endpoints typically add 100-500 milliseconds, which is already unacceptable for many real-time applications. Local inference with Foundry Local delivers sub-50ms response times by eliminating network hops, enabling true real-time AI integration with SCADA systems, PLCs, and manufacturing execution systems.

Cost Predictability at Industrial Scale

Manufacturing facilities generate enormous volumes of time-series data from thousands of sensors, producing millions of data points daily. Cloud AI services charge per API call or per token processed, creating unpredictable costs that scale linearly with data volume. High-throughput industrial applications can quickly accumulate tens of thousands of dollars in monthly API fees.

On-premises AI transforms this variable operational expense into a fixed capital infrastructure cost. After the initial hardware investment, inference costs remain constant regardless of query volume. For facilities analyzing equipment telemetry, maintenance logs, and operator notes continuously, this economic model provides cost certainty and eliminates budget surprises.

Regulatory Compliance and Audit Requirements

Regulated industries face strict data handling requirements. Aerospace manufacturers must comply with ITAR controls on technical data. Pharmaceutical facilities must satisfy FDA 21 CFR Part 11 requirements for electronic records. Automotive suppliers must meet customer-specific data residency mandates. Cloud AI services complicate compliance by introducing third-party data processors, cross-border data transfers, and shared infrastructure concerns.

Local AI simplifies regulatory compliance by eliminating external data flows. Audit trails remain within the facility. Data handling procedures avoid third-party agreements. Compliance demonstrations become straightforward when AI infrastructure resides entirely within auditable physical and network boundaries.

Architecture: Manufacturing Intelligence Without Cloud Dependencies

The manufacturing asset intelligence system demonstrates a practical architecture for deploying AI capabilities entirely on-premises.
The design prioritizes operational reliability, straightforward integration patterns, and maintainable code structure that facilities can adapt to their specific requirements. System Components and Technology Stack The implementation consists of three primary layers that separate concerns and enable independent scaling: Foundry Local Layer: Provides the local AI inference runtime. Foundry Local manages model loading, execution, and resource allocation. It supports multiple model families (Phi-3.5, Phi-4, Qwen2.5) with automatic hardware acceleration detection for NVIDIA GPUs (CUDA), Intel GPUs (OpenVINO), ARM Qualcomm (QNN) and optimized CPU inference. The service exposes a REST API on localhost that the backend layer consumes for completions. Backend Service Layer: An Express Node.js application that serves as the integration point between the AI runtime and the manufacturing data systems. This layer implements business logic for equipment monitoring, maintenance log classification, and conversational interfaces. It formats prompts with equipment context, calls Foundry Local for inference, and structures responses for the frontend. The backend persists chat history and provides RESTful endpoints for all AI operations. Frontend Interface Layer: A standalone HTML/JavaScript application that provides operator interfaces for equipment monitoring, maintenance management, and AI assistant interactions. The UI fetches data from the backend service and renders dashboards, equipment status views, and chat interfaces. No framework dependencies or build steps are required, the frontend operates as static files that any web server or file system can serve. Data Flow for Equipment Analysis Understanding how data moves through the system clarifies integration points and extension opportunities. When an operator requests AI analysis of equipment status, the following sequence occurs: The frontend collects equipment context including asset ID, current telemetry values, alert status, and recent maintenance history. It constructs an HTTP request to the backend's equipment summary endpoint, passing this context as query parameters or request body. The backend retrieves additional context from the equipment database, including specifications, normal operating ranges, and historical performance patterns. The backend constructs a detailed prompt that provides the AI model with comprehensive context: equipment specifications, current telemetry with alarming conditions highlighted, recent maintenance notes, and specific questions about operational status. This prompt engineering is critical, the model's accuracy depends entirely on the context provided. Generic prompts produce generic responses; detailed, structured prompts yield actionable insights. The backend calls Foundry Local's completion API with the formatted prompt, specifying temperature, max tokens, and other generation parameters. Foundry Local loads the configured model (if not already in memory) and generates a response analyzing the equipment's condition. The inference occurs locally with no network traffic leaving the facility. Response time typically ranges from 500ms to 3 seconds depending on prompt complexity and model size. Foundry Local returns the generated text to the backend, which parses the response for structured information if required (equipment health classifications, priority levels, recommended actions). The backend formats this analysis as JSON and returns it to the frontend. 
The frontend renders the AI-generated summary in the equipment health dashboard, highlighting critical findings and recommended operator actions. Prompt Engineering for Maintenance Log Classification The maintenance log classification feature demonstrates effective prompt engineering for extracting structured decisions from language models. Manufacturing facilities accumulate thousands of maintenance notes, operator observations, technician reports, and automated system logs. Automatically classifying these entries by severity enables priority-based work scheduling without manual review of every log entry. The classification prompt provides the model with clear instructions, classification categories with definitions, and the maintenance note text to analyze: const classificationPrompt = `You are a manufacturing maintenance expert analyzing equipment log entries. Classify the following maintenance note into one of these categories: CRITICAL: Immediate safety hazard, equipment failure, or production stoppage HIGH: Degraded performance, abnormal readings requiring same-shift attention MEDIUM: Scheduled maintenance items or routine inspections LOW: Informational notes, normal operations logs Provide your response in JSON format: { "classification": "CRITICAL|HIGH|MEDIUM|LOW", "reasoning": "Brief explanation of classification decision", "recommended_action": "Specific next steps for maintenance team" } Maintenance Note: ${maintenanceNote} Classification:`; const response = await foundryClient.chat.completions.create({ model: currentModelAlias, messages: [{ role: 'user', content: classificationPrompt }], temperature: 0.1, // Low temperature for consistent classification max_tokens: 300 }); Key aspects of this prompt design: Role definition: Establishing the model as a "manufacturing maintenance expert" activates relevant knowledge and reasoning patterns in the model's training data. Clear categories: Explicit classification options with definitions prevent ambiguous outputs and enable consistent decision-making across thousands of logs. Structured output format: Requesting JSON responses with specific fields enables automated parsing and integration with maintenance management systems without fragile text parsing. Temperature control: Setting temperature to 0.1 reduces randomness in classifications, ensuring consistent severity assessments for similar maintenance conditions. Context isolation: Separating the maintenance note text from the instructions with clear delimiters prevents prompt injection attacks where malicious log entries might attempt to manipulate classification logic. This classification runs locally for every maintenance log entry without API costs or network delays. Facilities processing hundreds of maintenance notes daily benefit from immediate, consistent classification that routes critical issues to technicians automatically while filtering routine informational logs. Model Selection and Performance Trade-offs Foundry Local supports multiple model families with different memory requirements, inference speeds, and accuracy characteristics. Choosing appropriate models for manufacturing environments requires balancing these trade-offs against hardware constraints and operational requirements: Qwen2.5-0.5b (500MB memory): The smallest available model provides extremely fast inference (100-200ms responses) on limited hardware. Suitable for simple classification tasks, keyword extraction, and high-throughput scenarios where response speed matters more than nuanced understanding. 
Works well on older servers or edge devices with constrained resources. Phi-3.5-mini (2.1GB memory): The recommended default model balances accuracy with reasonable memory requirements. Provides strong reasoning capabilities for equipment analysis, maintenance prioritization, and conversational assistance. Response times of 1-3 seconds on modern CPUs are acceptable for interactive dashboards. This model handles complex prompts with detailed equipment context effectively. Phi-4-mini (3.6GB memory): Increased model capacity improves understanding of technical terminology and complex equipment relationships. Best choice when analyzing detailed maintenance histories, interpreting sensor correlation patterns, or providing nuanced operational recommendations. Requires more memory but delivers noticeably improved analysis quality for complex scenarios. Qwen2.5-7b (4.7GB memory): The largest supported model provides maximum accuracy and sophisticated reasoning. Ideal for facilities with modern server hardware where best-possible analysis quality justifies longer inference times (3-5 seconds). Consider this model for critical applications where operator decisions depend heavily on AI recommendations. Facilities can download all models during initial setup and switch between them based on specific use cases. Use faster models for real-time dashboard updates and automated classification. Deploy larger models for detailed equipment analysis and maintenance planning where operators can wait several seconds for comprehensive insights. Implementation: Equipment Monitoring and AI Analysis The practical implementation reveals how straightforward on-premises AI integration can be with modern JavaScript tooling and proper architectural separation. The backend service manages all AI interactions, shielding the frontend from inference complexity and providing clean REST interfaces. Backend Service Architecture with Express The Node.js backend initializes the Foundry Local SDK client and exposes endpoints for equipment operations: const express = require('express'); const { FoundryLocalClient } = require('foundry-local-sdk'); const cors = require('cors'); const app = express(); const PORT = process.env.PORT || 3000; // Initialize Foundry Local client const foundryClient = new FoundryLocalClient({ baseURL: 'http://localhost:8008', // Default Foundry Local endpoint timeout: 30000 }); // Middleware configuration app.use(cors()); // Enable cross-origin requests from frontend app.use(express.json()); // Parse JSON request bodies // Health check endpoint for monitoring app.get('/api/health', (req, res) => { res.json({ ok: true, service: 'manufacturing-ai-backend' }); }); // Start server app.listen(PORT, () => { console.log(`Manufacturing AI backend running on port ${PORT}`); console.log(`Foundry Local endpoint: http://localhost:8008`); }); This foundational structure establishes the Express application with CORS support for browser-based frontends and JSON request handling. The Foundry Local client connects to the local inference service running on port 8008, no external network configuration required. 
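Before wiring up the equipment routes, it can help to confirm the backend is actually reachable. Below is a minimal smoke test, assuming Node 18+ for the global fetch API; the script name and BACKEND_URL variable are illustrative and not part of the sample repository.

```javascript
// scripts/smoke-test.js — minimal sketch assuming Node 18+ (global fetch available)
// Verifies the Express backend's /api/health endpoint before relying on the dashboards.
const BACKEND_URL = process.env.BACKEND_URL || 'http://localhost:3000';

async function smokeTest() {
  try {
    const res = await fetch(`${BACKEND_URL}/api/health`);
    const body = await res.json();
    if (res.ok && body.ok) {
      console.log(`Backend healthy: ${body.service}`);
    } else {
      console.error('Backend responded but reported a problem:', body);
      process.exitCode = 1;
    }
  } catch (err) {
    console.error(`Backend unreachable at ${BACKEND_URL}: ${err.message}`);
    process.exitCode = 1;
  }
}

smokeTest();
```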
Equipment Summary Generation with Context-Rich Prompts The equipment summary endpoint demonstrates effective context injection for accurate AI analysis: app.get('/api/assets/:id/summary', async (req, res) => { try { const assetId = req.params.id; const asset = equipmentDatabase.find(a => a.id === assetId); if (!asset) { return res.status(404).json({ error: 'Asset not found' }); } // Construct detailed equipment context const contextPrompt = buildEquipmentContext(asset); // Generate AI analysis const completion = await foundryClient.chat.completions.create({ model: 'phi-3.5-mini', messages: [{ role: 'user', content: contextPrompt }], temperature: 0.3, max_tokens: 500 }); const analysis = completion.choices[0].message.content; res.json({ assetId: asset.id, assetName: asset.name, analysis: analysis, generatedAt: new Date().toISOString() }); } catch (error) { console.error('Equipment summary error:', error); res.status(500).json({ error: 'AI analysis failed', details: error.message }); } }); The equipment context builder assembles comprehensive information for accurate analysis: function buildEquipmentContext(asset) { const alerts = asset.alerts.filter(a => a.severity !== 'INFO'); const telemetry = asset.currentTelemetry; return `Analyze the following manufacturing equipment status: Equipment: ${asset.name} (${asset.id}) Type: ${asset.type} Location: ${asset.location} Current Telemetry: - Temperature: ${telemetry.temperature}°C (Normal range: ${asset.specs.tempRange}) - Vibration: ${telemetry.vibration} mm/s (Threshold: ${asset.specs.vibrationThreshold}) - Pressure: ${telemetry.pressure} PSI (Normal: ${asset.specs.pressureRange}) - Runtime: ${telemetry.runHours} hours (Next maintenance due: ${asset.nextMaintenance}) Active Alerts: ${alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Recent Maintenance History: ${asset.recentMaintenance.slice(0, 3).map(m => `- ${m.date}: ${m.description}`).join('\n')} Provide a concise operational summary focusing on: 1. Current equipment health status 2. Any concerning trends or anomalies 3. Recommended operator actions if applicable 4. Maintenance priority level Summary:`; } This context-rich approach produces accurate, actionable analysis because the model receives equipment specifications, current telemetry with context, alert history, maintenance patterns, and structured output guidance. The model can identify abnormal conditions accurately rather than guessing what values seem unusual. Conversational AI Assistant with Manufacturing Context The chat endpoint enables natural language queries about equipment status and operational questions: app.post('/api/chat', async (req, res) => { try { const { message, conversationId } = req.body; // Retrieve conversation history for context const history = conversationStore.get(conversationId) || []; // Build plant-wide context for the query const plantContext = buildPlantOperationsContext(); // Construct system message with domain knowledge const systemMessage = { role: 'system', content: `You are an AI assistant for a manufacturing facility's operations team. You have access to real-time equipment data and maintenance records. Current Plant Status: ${plantContext} Provide specific, actionable responses based on actual equipment data. If you don't have information to answer a query, clearly state that. 
Never speculate about equipment conditions beyond available data.` }; // Include conversation history for multi-turn context const messages = [ systemMessage, ...history, { role: 'user', content: message } ]; const completion = await foundryClient.chat.completions.create({ model: 'phi-3.5-mini', messages: messages, temperature: 0.4, max_tokens: 600 }); const assistantResponse = completion.choices[0].message.content; // Update conversation history history.push( { role: 'user', content: message }, { role: 'assistant', content: assistantResponse } ); conversationStore.set(conversationId, history); res.json({ response: assistantResponse, conversationId: conversationId, timestamp: new Date().toISOString() }); } catch (error) { console.error('Chat error:', error); res.status(500).json({ error: 'Chat request failed', details: error.message }); } }); The conversational interface enables operators to ask natural language questions and receive grounded responses based on actual equipment data, citing specific asset IDs, current metric values, and alert statuses rather than speculating. Deployment and Production Operations Deploying on-premises AI in industrial settings requires consideration of hardware placement, network architecture, integration patterns, and operational procedures that differ from typical web application deployments. Hardware and Infrastructure Requirements The system runs on standard server hardware without specialized AI accelerators, though GPU availability improves performance significantly. Minimum requirements include 8GB RAM for the Phi-3.5-mini model, 4-core CPU, and 50GB storage for model files and application data. Production deployments benefit from 16GB+ RAM to support larger models and concurrent analysis requests. For facilities with NVIDIA GPUs, Foundry Local automatically utilizes CUDA acceleration, reducing inference times by 3-5x compared to CPU-only execution. Deploy the backend service on dedicated server hardware within the factory network. Avoid running AI workloads on the same systems that host critical SCADA or MES applications due to resource contention concerns. Network Architecture and SCADA Integration The AI backend should reside on the manufacturing operations network with firewall rules permitting connections from operator workstations and monitoring systems. Do not expose the backend service directly to the internet, all access should occur through the facility's internal network with authentication via existing directory services. Integrate with SCADA systems through standard industrial protocols. Configure OPC-UA clients to subscribe to equipment telemetry topics and forward readings to the AI backend via REST API calls. Modbus TCP gateways can bridge legacy PLCs to modern APIs by polling register values and POSTing updates to the backend's telemetry ingestion endpoints. Security and Compliance Considerations Many manufacturing facilities operate air-gapped networks where physical separation prevents internet connectivity entirely. Deploy Foundry Local and the AI application in these environments by transferring model files and application packages via removable media during controlled maintenance windows. Implement role-based access control (RBAC) using Active Directory integration. Configure the backend to validate user credentials against LDAP before serving AI analysis requests. Maintain detailed audit logs of all AI invocations including user identity, timestamp, equipment queried, and model version used. 
Store these logs in immutable append-only databases for compliance audits. Key Takeaways Building production-ready AI systems for industrial environments requires architectural decisions that prioritize operational reliability, data sovereignty, and integration simplicity: Data locality by architectural design: On-premises AI ensures proprietary production data never leaves facility networks through fundamental architectural guarantees rather than configuration options Model selection impacts deployment feasibility: Smaller models (0.5B-2B parameters) enable deployment on commodity hardware without specialized accelerators while maintaining acceptable accuracy Fallback logic preserves operational continuity: AI capabilities enhance but don't replace core monitoring functions, ensuring equipment dashboards display raw telemetry even when AI analysis is unavailable Context-rich prompts determine accuracy: Effective prompts include equipment specifications, normal operating ranges, alert thresholds, and maintenance history to enable grounded recommendations Structured outputs enable automation: JSON response formats allow automated systems to parse classifications and route work orders without fragile text parsing Integration patterns bridge legacy systems: OPC-UA and Modbus TCP gateways connect decades-old PLCs and SCADA systems to modern AI without replacing functional control infrastructure Resources and Further Exploration The complete implementation with extensive comments and documentation is available in the GitHub repository. Additional resources help facilities customize and extend the system for their specific requirements. FoundryLocal-IndJSsample GitHub Repository – Full source code with JavaScript backend, HTML frontend, and sample data files Quick Start Guide and Documentation – Installation instructions, API documentation, and troubleshooting guidance Microsoft Foundry Local Documentation – Official SDK reference, model catalog, and deployment guidance Sample Manufacturing Data – Example equipment telemetry, maintenance logs, and alert structures Backend Implementation Reference – Express server code with Foundry Local SDK integration patterns OPC Foundation – Industrial communication standards for SCADA and PLC integration Edge AI for Beginners - Online FREE course and resources for learning more about using AI on Edge Devices Why On-Premises AI Cloud AI services offer convenience, but they fundamentally conflict with manufacturing operational requirements. Understanding these conflicts explains why local AI isn't just preferable, it's mandatory for production environments. Data privacy and intellectual property protection stand paramount. A CNC machining program represents years of optimization, feed rates, tool paths, thermal compensation algorithms. Quality control measurements reveal product specifications competitors would pay millions to access. Sending this data to external APIs, even with encryption, creates unacceptable exposure risk. Every API call generates logs on third-party servers, potentially subject to subpoenas, data breaches, or regulatory compliance failures. Latency requirements eliminate cloud viability for real-time decisions. When a thermal sensor detects bearing temperature exceeding safe thresholds, the control system needs AI analysis in under 50 milliseconds to prevent catastrophic failure. Cloud APIs introduce 100-500ms baseline latency from network round-trips alone, before queue times and processing. 
For safety systems, quality inspection, and process control, this latency is operationally unacceptable. Network dependency creates operational fragility. Factory floors frequently have limited connectivity, legacy equipment, RF interference, isolated production cells. Critical AI capabilities cannot fail because internet service drops. Moreover, many defense, aerospace, and pharmaceutical facilities operate air-gapped networks for security compliance. Cloud AI is simply non-operational in these environments. Regulatory requirements mandate data residency. ITAR (International Traffic in Arms Regulations) prohibits certain manufacturing data from leaving approved facilities. FDA 21 CFR Part 11 requires strict data handling controls for pharmaceutical manufacturing. GDPR demands data residency in approved jurisdictions. On-premises AI simplifies compliance by eliminating cross-border data transfers. Cost predictability at scale favors local deployment. A high-volume facility generating 10,000 equipment events per day, each requiring AI analysis, would incur significant cloud API costs. Local models have fixed infrastructure costs that scale economically with usage, making AI economically viable for continuous monitoring. Application Architecture: Web UI + Local AI Backend The FoundryLocal-IndJSsample implements a clean separation between data presentation and AI inference. This architecture ensures the UI remains responsive while AI operations run independently, enabling real-time dashboard updates without blocking user interactions. The web frontend serves a single-page application with vanilla HTML, CSS, and JavaScript, no frameworks, no build tools. This simplicity is intentional: factory IT teams need to audit code, customize interfaces, and deploy on legacy systems. The UI presents four main interfaces: Plant Asset Overview (real-time health cards for all equipment), Asset Health (AI-generated summaries and trend analysis), Maintenance Logs (classification and priority routing), and AI Assistant (natural language interface for operations queries). The Node.js backend runs Express as the HTTP server, handling static file serving, API routing, and WebSocket connections for real-time updates. It loads sample manufacturing data from JSON files, equipment telemetry, maintenance logs, historical events, simulating the data streams that would come from SCADA systems, PLCs, and MES platforms in production. Foundry Local provides the AI inference layer. The backend uses foundry-local-sdk to communicate with the locally running service. All model loading, prompt processing, and response generation happens on-device. The application detects Foundry Local automatically and falls back to rule-based analysis if unavailable, ensuring core functionality persists even when AI is offline. 
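The WebSocket-based real-time updates mentioned above are worth sketching. The example below assumes the ws npm package; it is a simplified illustration only, and the sample repository's actual push implementation, endpoint path, and helper names may differ.

```javascript
// Minimal sketch of push-based dashboard updates, assuming the 'ws' package (npm install ws).
// broadcastTelemetry and the /ws/telemetry path are illustrative names, not from the sample repo.
const http = require('http');
const express = require('express');
const { WebSocket, WebSocketServer } = require('ws');

const app = express();
const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: '/ws/telemetry' });

// Push an equipment update to every connected dashboard
function broadcastTelemetry(update) {
  const payload = JSON.stringify({ type: 'telemetry', ...update, ts: Date.now() });
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}

// Example: simulate a sensor reading every 5 seconds
setInterval(() => {
  broadcastTelemetry({ assetId: 'CNC-L2-M03', temperature: 90 + Math.random() * 4 });
}, 5000);

server.listen(3000, () => console.log('Backend with WebSocket push listening on port 3000'));
```

Dashboards subscribe once and receive pushed telemetry instead of polling the REST endpoints on a timer.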
Here's the architectural flow for asset health analysis: User Request (Web UI) ↓ Express API Route (/api/assets/:id/summary) ↓ Load Equipment Data (from JSON/database) ↓ Build Analysis Prompt (Equipment ID, telemetry, alerts) ↓ Foundry Local SDK Call (local AI inference) ↓ Parse AI Response (structured insights) ↓ Return JSON Result (with metadata: model, latency, confidence) ↓ Display in UI (formatted health summary) This architecture demonstrates several industrial system design principles: Offline-first operation: Core functionality works without internet connectivity, with AI as an enhancement rather than dependency Graceful degradation: If AI fails, fall back to rule-based logic rather than crashing operations Minimal external dependencies: Simple stack reduces attack surface and simplifies air-gapped deployment Data locality: All processing happens on-premises, no external API calls Real-time updates: WebSocket connections enable push-based event streaming for dashboard updates Setting Up Foundry Local for Industrial Applications Industrial deployments require careful model selection that balances accuracy, speed, and hardware constraints. Factory edge devices often run on limited hardware—industrial PCs with modest GPUs or CPU-only configurations. Model choice significantly impacts deployment feasibility. Install Foundry Local on the industrial edge device: # Windows (most common for industrial PCs) winget install Microsoft.FoundryLocal # Verify installation foundry --version For manufacturing asset intelligence, model selection trades off speed versus quality: # Fast option: Qwen 0.5B (500MB, <100ms inference) foundry model load qwen2.5-0.5b # Balanced option: Phi-3.5 Mini (2.1GB, ~200ms inference) foundry model load phi-3.5-mini # High quality option: Phi-4 Mini (3.6GB, ~500ms inference) foundry model load phi-4 # Check which model is currently loaded foundry model list For real-time monitoring dashboards where hundreds of assets update continuously, qwen2.5-0.5b provides sufficient quality at speeds that don't bottleneck refresh cycles. For detailed root cause analysis or maintenance report generation where quality matters most, phi-4 justifies the slightly longer inference time. 
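One way to encode that speed-versus-quality trade-off in the backend is a small task-to-model routing helper. This is a sketch only: the model aliases match the ones used in this article, but the task names and latency figures are assumptions rather than values from the sample repository.

```javascript
// Sketch: route requests to a model alias based on task type.
// Aliases mirror those discussed above; profiles and task names are illustrative assumptions.
const MODEL_PROFILES = {
  'qwen2.5-0.5b': { typicalLatencyMs: 100, quality: 'basic' },
  'phi-3.5-mini': { typicalLatencyMs: 200, quality: 'balanced' },
  'phi-4':        { typicalLatencyMs: 500, quality: 'high' },
};

function pickModel(task) {
  switch (task) {
    case 'dashboard-refresh':     // hundreds of assets updating, speed matters most
    case 'log-classification':
      return 'qwen2.5-0.5b';
    case 'asset-health-summary':  // interactive, a couple of seconds is acceptable
    case 'chat-assistant':
      return 'phi-3.5-mini';
    case 'root-cause-analysis':   // quality matters more than latency
    case 'maintenance-report':
      return 'phi-4';
    default:
      return 'phi-3.5-mini';
  }
}

console.log(pickModel('dashboard-refresh'), MODEL_PROFILES[pickModel('dashboard-refresh')]);
// → qwen2.5-0.5b { typicalLatencyMs: 100, quality: 'basic' }
```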
Industrial systems benefit from proactive model caching during downtime: # During maintenance windows, pre-download models foundry model download phi-3.5-mini foundry model download qwen2.5-0.5b # Models cache locally, eliminating runtime downloads The backend automatically detects Foundry Local and selects the loaded model: // backend/services/foundry-service.js import { FoundryLocalClient } from 'foundry-local-sdk'; class FoundryService { constructor() { this.client = null; this.modelAlias = null; this.initializeClient(); } async initializeClient() { try { // Detect Foundry Local endpoint const endpoint = process.env.FOUNDRY_LOCAL_ENDPOINT || 'http://127.0.0.1:5272'; this.client = new FoundryLocalClient({ endpoint }); // Query which model is currently loaded const models = await this.client.models.list(); this.modelAlias = models.data[0]?.id || 'phi-3.5-mini'; console.log(`✅ Foundry Local connected: ${this.modelAlias}`); } catch (error) { console.warn('⚠️ Foundry Local not available, using rule-based fallback'); this.client = null; } } async generateCompletion(prompt, options = {}) { if (!this.client) { // Fallback to rule-based analysis return this.ruleBasedAnalysis(prompt); } try { const startTime = Date.now(); const completion = await this.client.chat.completions.create({ model: this.modelAlias, messages: [ { role: 'system', content: 'You are an industrial asset intelligence assistant analyzing manufacturing equipment.' }, { role: 'user', content: prompt } ], temperature: 0.3, // Low temperature for factual analysis max_tokens: 400, ...options }); const latency = Date.now() - startTime; return { content: completion.choices[0].message.content, model: this.modelAlias, latency_ms: latency, tokens: completion.usage?.total_tokens }; } catch (error) { console.error('Foundry inference error:', error); return this.ruleBasedAnalysis(prompt); } } ruleBasedAnalysis(prompt) { // Fallback logic for when AI is unavailable // Pattern matching and heuristics return { content: '(Rule-based analysis) Equipment status: Monitoring...', model: 'rule-based-fallback', latency_ms: 5, tokens: 0 }; } } export default new FoundryService(); This service layer demonstrates critical production patterns: Automatic endpoint detection: Tries environment variable first, falls back to default Model auto-discovery: Queries Foundry Local for currently loaded model rather than hardcoding Robust error handling: Every API call wrapped in try-catch with fallback logic Performance tracking: Latency measurement enables monitoring and capacity planning Conservative temperature: 0.3 temperature reduces hallucination for factual equipment analysis Implementing AI-Powered Asset Health Analysis Equipment health monitoring forms the core use case, synthesizing telemetry from multiple sources into actionable insights. Traditional monitoring systems show raw metrics (temperature, vibration, pressure) but require expert interpretation. AI transforms this into natural language summaries that any operator can understand and act upon. 
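Because every generateCompletion call in the service layer above returns latency_ms, routes can also feed those values into a simple aggregator to support the capacity planning mentioned in the performance-tracking pattern. The sketch below is illustrative; LatencyTracker and the /api/metrics/inference route are not part of the sample repository.

```javascript
// Sketch: aggregate latency_ms values returned by the Foundry service for capacity planning.
// LatencyTracker and the metrics route are illustrative names, not from the sample repo.
class LatencyTracker {
  constructor() { this.samples = []; }

  record(ms) {
    this.samples.push(ms);
    if (this.samples.length > 1000) this.samples.shift(); // keep a rolling window
  }

  stats() {
    if (this.samples.length === 0) return { count: 0 };
    const sorted = [...this.samples].sort((a, b) => a - b);
    const pct = p => sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))];
    return {
      count: sorted.length,
      p50_ms: pct(50),
      p95_ms: pct(95),
      max_ms: sorted[sorted.length - 1],
    };
  }
}

const inferenceLatency = new LatencyTracker();

// In a route handler:
//   const analysis = await foundryService.generateCompletion(prompt);
//   inferenceLatency.record(analysis.latency_ms);

// Expose the rolling stats for dashboards or monitoring agents:
//   app.get('/api/metrics/inference', (req, res) => res.json(inferenceLatency.stats()));
```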
Here's the API endpoint that generates asset health summaries: // backend/routes/assets.js import express from 'express'; import foundryService from '../services/foundry-service.js'; import { getAssetData } from '../data/asset-loader.js'; const router = express.Router(); router.get('/api/assets/:id/summary', async (req, res) => { try { const assetId = req.params.id; // Load equipment data const asset = await getAssetData(assetId); if (!asset) { return res.status(404).json({ error: 'Asset not found' }); } // Build analysis prompt with context const prompt = buildHealthAnalysisPrompt(asset); // Generate AI summary const analysis = await foundryService.generateCompletion(prompt); // Structure response res.json({ asset_id: assetId, asset_name: asset.name, summary: analysis.content, model_used: analysis.model, latency_ms: analysis.latency_ms, timestamp: new Date().toISOString(), telemetry_snapshot: { temperature: asset.telemetry.temperature, vibration: asset.telemetry.vibration, runtime_hours: asset.telemetry.runtime_hours }, active_alerts: asset.alerts.filter(a => a.active).length }); } catch (error) { console.error('Asset summary error:', error); res.status(500).json({ error: 'Analysis failed' }); } }); function buildHealthAnalysisPrompt(asset) { return ` Analyze the health of this manufacturing equipment and provide a concise summary: Equipment: ${asset.name} (${asset.id}) Type: ${asset.type} Location: ${asset.location} Current Telemetry: - Temperature: ${asset.telemetry.temperature}°C (Normal: ${asset.specs.normal_temp_range}) - Vibration: ${asset.telemetry.vibration} mm/s (Threshold: ${asset.specs.vibration_threshold}) - Operating Pressure: ${asset.telemetry.pressure} PSI - Runtime: ${asset.telemetry.runtime_hours} hours - Last Maintenance: ${asset.maintenance.last_service_date} Active Alerts: ${asset.alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Recent Events: ${asset.recent_events.slice(0, 3).map(e => `- ${e.timestamp}: ${e.description}`).join('\n')} Provide a 3-4 sentence summary covering: 1. Overall equipment health status 2. Any concerning trends or anomalies 3. Recommended actions or monitoring focus Be factual and specific. Do not speculate beyond the provided data. `.trim(); } export default router; This prompt construction demonstrates several best practices for industrial AI: Structured data presentation: Organize telemetry, specs, and alerts in clear sections with labels Context enrichment: Include normal operating ranges so AI can assess abnormality Explicit constraints: Instruction to avoid speculation reduces hallucination risk Output formatting guidance: Request specific structure (3-4 sentences, covering key points) Temporal context: Include recent events so AI understands trend direction Example AI-generated asset summary: { "asset_id": "CNC-L2-M03", "asset_name": "CNC Mill #3", "summary": "Equipment is operating outside normal parameters with elevated temperature at 92°C, significantly above the 75-80°C normal range. Thermal Alert indicates possible coolant flow issue. Vibration levels remain acceptable at 2.8 mm/s. 
Recommend immediate inspection of coolant system and thermal throttling may impact throughput until resolved.", "model_used": "phi-3.5-mini", "latency_ms": 243, "timestamp": "2026-01-30T14:23:18Z", "telemetry_snapshot": { "temperature": 92, "vibration": 2.8, "runtime_hours": 12847 }, "active_alerts": 2 } This summary transforms raw telemetry into actionable intelligence—operations staff immediately understand the problem, its severity, and the appropriate response, without requiring deep equipment expertise. Maintenance Log Classification with AI Maintenance departments generate hundreds of logs daily, technician notes, operator observations, inspection reports. Manually categorizing and prioritizing these logs consumes significant time. AI classification automatically routes logs to appropriate teams, identifies urgent issues, and extracts key information. The classification endpoint processes maintenance notes: // backend/routes/maintenance.js router.post('/api/logs/classify', async (req, res) => { try { const { log_text, equipment_id } = req.body; if (!log_text || log_text.length < 10) { return res.status(400).json({ error: 'Log text required (min 10 chars)' }); } const classificationPrompt = ` Classify this maintenance log entry into appropriate categories and priority: Equipment: ${equipment_id || 'Unknown'} Log Text: "${log_text}" Classify into EXACTLY ONE primary category: - MECHANICAL: Physical components, bearings, belts, motors - ELECTRICAL: Power systems, sensors, controllers, wiring - HYDRAULIC: Pumps, fluid systems, pressure issues - THERMAL: Cooling, heating, temperature control - SOFTWARE: PLC programming, HMI issues, control logic - ROUTINE: Scheduled maintenance, inspections, calibration Assign priority level: - CRITICAL: Immediate action required, safety or production impact - HIGH: Resolve within 24 hours, performance degradation - MEDIUM: Schedule within 1 week, minor issues - LOW: Routine maintenance, cosmetic issues Extract key details: - Symptoms described - Suspected root cause (if mentioned) - Recommended actions Return ONLY a JSON object with this exact structure: { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise", "vibration above 5mm/s"], "suspected_cause": "bearing wear", "recommended_actions": ["inspect bearings", "order replacement parts"] } `.trim(); const analysis = await foundryService.generateCompletion(classificationPrompt); // Parse AI response as JSON let classification; try { // Extract JSON from response (AI might add explanation text) const jsonMatch = analysis.content.match(/\{[\s\S]*\}/); classification = JSON.parse(jsonMatch[0]); } catch (parseError) { // Fallback parsing if JSON extraction fails classification = parseClassificationText(analysis.content); } // Validate classification const validCategories = ['MECHANICAL', 'ELECTRICAL', 'HYDRAULIC', 'THERMAL', 'SOFTWARE', 'ROUTINE']; const validPriorities = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']; if (!validCategories.includes(classification.category)) { classification.category = 'ROUTINE'; } if (!validPriorities.includes(classification.priority)) { classification.priority = 'MEDIUM'; } res.json({ original_log: log_text, classification, model_used: analysis.model, latency_ms: analysis.latency_ms, timestamp: new Date().toISOString() }); } catch (error) { console.error('Classification error:', error); res.status(500).json({ error: 'Classification failed' }); } }); function parseClassificationText(text) { // Fallback parser for when AI doesn't return valid JSON // Extract 
category, priority, and details using regex patterns const categoryMatch = text.match(/category[":]\s*(MECHANICAL|ELECTRICAL|HYDRAULIC|THERMAL|SOFTWARE|ROUTINE)/i); const priorityMatch = text.match(/priority[":]\s*(CRITICAL|HIGH|MEDIUM|LOW)/i); return { category: categoryMatch ? categoryMatch[1].toUpperCase() : 'ROUTINE', priority: priorityMatch ? priorityMatch[1].toUpperCase() : 'MEDIUM', symptoms: [], suspected_cause: 'Unknown', recommended_actions: [] }; } This implementation demonstrates several critical patterns for structured AI outputs: Explicit output format requirements: Prompt specifies exact JSON structure to encourage parseable responses Defensive parsing: Try JSON extraction first, fall back to text parsing if that fails Validation with sensible defaults: Validate categories and priorities against allowed values, default to safe values on mismatch Constrained classification vocabulary: Limit categories to predefined set rather than open-ended categories Priority inference rules: Guide AI to assess urgency based on safety, production impact, and timeline Example classification output: POST /api/logs/classify { "log_text": "Hydraulic pump PUMP-L1-H01 making grinding noise during startup. Vibration readings spiked to 5.2 mm/s this morning. Possible bearing wear. Recommend inspection.", "equipment_id": "PUMP-L1-H01" } Response: { "original_log": "Hydraulic pump PUMP-L1-H01 making grinding noise...", "classification": { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise during startup", "vibration spike to 5.2 mm/s"], "suspected_cause": "bearing wear", "recommended_actions": ["inspect bearings", "schedule replacement if confirmed worn"] }, "model_used": "phi-3.5-mini", "latency_ms": 187, "timestamp": "2026-01-30T14:35:22Z" } This classification automatically routes the log to the mechanical maintenance team, marks it high priority for same-day attention, and extracts actionable details, all without human intervention. Building the Natural Language Operations Assistant The AI Assistant interface enables operations staff to query equipment status, ask diagnostic questions, and get contextual guidance using natural language. This interface bridges the gap between complex SCADA systems and operators who need quick answers without navigating multiple screens. The chat endpoint implements contextual conversation: // backend/routes/chat.js router.post('/api/chat', async (req, res) => { try { const { message, conversation_id } = req.body; if (!message || message.length < 3) { return res.status(400).json({ error: 'Message required (min 3 chars)' }); } // Load conversation history if exists const history = conversation_id ? await loadConversationHistory(conversation_id) : []; // Build context from current plant state const plantContext = await buildPlantContext(); // Construct system prompt with operational context const systemPrompt = ` You are an operations assistant for a manufacturing facility. Answer questions about equipment status, maintenance, and operational issues. Current Plant Status: ${plantContext} Guidelines: - Provide specific, actionable answers based on current data - Reference specific equipment IDs when relevant - Suggest appropriate next steps for issues - If information is unavailable, say so clearly - Use concise language suitable for busy operators Do not speculate about issues without data to support it. 
`.trim(); // Build message chain with history const messages = [ { role: 'system', content: systemPrompt }, ...history.map(h => ({ role: h.role, content: h.content })), { role: 'user', content: message } ]; // Generate response const response = await foundryService.generateCompletion( message, { messages } // Pass full conversation context ); // Save conversation const newConversationId = conversation_id || generateConversationId(); await saveConversationTurn(newConversationId, message, response.content); res.json({ response: response.content, conversation_id: newConversationId, model_used: response.model, latency_ms: response.latency_ms, timestamp: new Date().toISOString() }); } catch (error) { console.error('Chat error:', error); res.status(500).json({ error: 'Chat failed' }); } }); async function buildPlantContext() { const assets = await getAllAssets(); // Summarize plant-wide status const criticalAlerts = assets.flatMap(a => a.alerts.filter(alert => alert.severity === 'CRITICAL') ); const equipmentStatus = assets.map(a => `${a.id}: ${a.status} (Temp: ${a.telemetry.temperature}°C, Alerts: ${a.alerts.length})` ); return ` Total Equipment: ${assets.length} Critical Alerts: ${criticalAlerts.length} Equipment Status Summary: ${equipmentStatus.join('\n')} `.trim(); } Example conversational interactions: User: "What's wrong with Line 2?" AI: "Line 2 has elevated issues. CNC-L2-M03 is running 12°C above normal temperature at 92°C with a thermal alert active. This has caused 28% below-target throughput as the equipment throttles to prevent damage. The coolant system requires immediate inspection—flow rate dropped to 3.2 L/min from the normal 5.5 L/min, indicating a potential blockage or pump failure. Other Line 2 equipment (CONV-L2-B01, PLC-L2-MAIN) is operating normally." User: "How's the hydraulic pump performing?" AI: "PUMP-L1-H01 has a high-priority maintenance alert. Grinding noise was reported during startup with vibration at 5.2 mm/s, exceeding the 4.0 mm/s threshold. This pattern suggests bearing wear. Current operational status shows the pump is running but degraded. Recommend inspection today to prevent catastrophic failure. Bearing replacement typically requires 2-4 hours downtime." User: "What should I prioritize this shift?" AI: "Priority order for this shift: 1) CNC-L2-M03 coolant inspection (CRITICAL - impacting production, safety risk), 2) PUMP-L1-H01 bearing check (HIGH - prevent failure progression), 3) Routine inspections on CONV-L1-B02 (scheduled maintenance). Focus immediate attention on Line 2 thermal issues. Line 1 is stable." The assistant provides grounded, specific responses citing actual equipment IDs, current metrics, and actionable priorities. This data-grounded approach maintains operational credibility, operators quickly learn to trust responses that reference specific equipment conditions they can independently verify on SCADA systems. Real-World Industrial Scenario: Thermal Event Response To understand the system's practical value, consider how it supports operators during an actual equipment incident. This scenario simulates realistic manufacturing conditions where multiple issues develop simultaneously and operators need rapid situational awareness. Scenario Setup: Assembly Line Degradation During Shift B operations on a Tuesday evening at a precision manufacturing facility, multiple equipment anomalies develop on production lines. The plant overview dashboard begins displaying elevated alert counts. 
Telemetry monitoring detects concerning patterns across Assembly Line 2 equipment. CNC mill CNC-L2-M03 reports sustained high temperature at 92°C compared to normal operating range of 65-75°C. Coolant flow sensors indicate 40% reduction from baseline levels—dropping from 5.5 L/min to 3.2 L/min. Equipment throughput has decreased 28% as thermal management systems automatically throttle spindle speeds to prevent damage. The mill has accumulated 12,847 runtime hours with next scheduled maintenance in 153 hours. Simultaneously, hydraulic pump PUMP-L1-H01 on Line 1 logged a maintenance note during startup: "Grinding noise observed during pump startup, vibration spike detected." Automated vibration sensors registered 5.2 mm/s compared to the 4.0 mm/s alarm threshold. The pump's condition monitoring system flagged bearing wear as probable cause based on frequency analysis of vibration signatures. The shift supervisor, monitoring the plant overview dashboard, sees both alerts flagged as high priority. Traditional response would require accessing SCADA screens, checking telemetry history, reviewing maintenance logs across multiple systems, correlating patterns, and consulting equipment experts, consuming 15-20 minutes of investigation time while equipment conditions potentially worsen. AI-Assisted Response Workflow Instead, the supervisor uses the AI Assistant interface to rapidly assess the situation. The workflow demonstrates how local AI transforms incident response: Step 1 - Initial Assessment Query: Supervisor types "What's the current plant status?" The AI assistant analyzes the complete equipment database and returns: "Plant status as of 22:14: 14 assets monitored, 2 high-priority alerts active. Critical issues: CNC-L2-M03 thermal event (92°C, coolant flow reduced 40%, throughput down 28%). High-priority: PUMP-L1-H01 mechanical degradation (bearing wear suspected, 5.2 mm/s vibration). All other equipment operating within normal parameters. Recommend immediate attention to Line 2 thermal issue—production impact and potential equipment damage risk." Step 2 - Detailed Equipment Analysis: Supervisor navigates to Asset Health tab and selects CNC-L2-M03 for detailed analysis. Within 1.8 seconds, the AI summary generates: "CNC-L2-M03 Health Summary: DEGRADED - IMMEDIATE ACTION REQUIRED Current Status: Elevated thermal condition with active coolant flow issue. Operating temperature 92°C exceeds normal range (65-75°C) by 22%. Coolant flow rate 3.2 L/min indicates 42% reduction from baseline 5.5 L/min. Impact Assessment: Throughput reduced 28% due to automatic thermal throttling. Spindle speed limited to 65% capacity to prevent thermal damage. Current production rate: 18 parts/hour vs. target 25 parts/hour. Root Cause Analysis: Coolant flow reduction suggests blockage in cooling circuit or coolant pump degradation. Temperature rise pattern correlates with gradual flow reduction observed over past 6 hours. Recommended Actions: 1. Immediate: Verify coolant reservoir level and pump operation 2. Inspect: Check cooling circuit for blockages or debris 3. Monitor: Temperature trend every 15 minutes until resolved 4. Consider: Production pause if temperature exceeds 95°C Maintenance Priority: HIGH - Production impact and equipment damage risk if unresolved. Estimated repair time: 1-2 hours for coolant system service." 
Step 3 - Maintenance Log Classification: When the pump maintenance note is entered, the AI classification system analyzes it and returns a structured classification within 240 ms:

```json
{
  "classification": {
    "category": "MECHANICAL",
    "priority": "HIGH",
    "symptoms": [
      "grinding noise during startup",
      "vibration spike to 5.2 mm/s",
      "frequency pattern indicates bearing wear"
    ],
    "suspected_cause": "bearing wear - inner race degradation probable",
    "recommended_actions": [
      "schedule bearing inspection within 8 hours",
      "prepare replacement bearings (6208-2RS)",
      "plan maintenance window: 2-4 hours",
      "order backup pump if repair extended"
    ],
    "estimated_downtime": "2-4 hours for bearing replacement"
  }
}
```
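A classification result like this can be produced by an endpoint that asks the model for JSON only and parses the reply defensively. The sketch below is illustrative rather than the repository's exact implementation; the category values beyond MECHANICAL and the helper names are assumptions:

```javascript
// Sketch of a maintenance log classification endpoint requesting structured
// JSON output. Helper names and category values are assumed for illustration.
app.post('/api/maintenance/classify', async (req, res) => {
  try {
    const { log_text } = req.body;
    if (!log_text) {
      return res.status(400).json({ error: 'log_text is required' });
    }

    const prompt = `
You are a maintenance triage assistant. Classify the log entry below.
Respond with ONLY valid JSON using these fields:
category (MECHANICAL | ELECTRICAL | HYDRAULIC | OTHER),
priority (LOW | MEDIUM | HIGH | CRITICAL),
symptoms (array of strings), suspected_cause (string),
recommended_actions (array of strings), estimated_downtime (string).
If the log does not contain enough information, use "UNKNOWN" rather than guessing.

Log entry: "${log_text}"
`.trim();

    const response = await foundryService.generateCompletion(prompt, { temperature: 0.1 });

    // Parse defensively: extract the first JSON object in the response text.
    // If parsing fails, the outer catch returns a generic error.
    const match = response.content.match(/\{[\s\S]*\}/);
    if (!match) {
      return res.status(502).json({ error: 'Model did not return JSON', raw: response.content });
    }

    res.json({
      classification: JSON.parse(match[0]),
      model_used: response.model,
      latency_ms: response.latency_ms
    });
  } catch (error) {
    console.error('Classification error:', error);
    res.status(500).json({ error: 'Classification failed' });
  }
});
```

Because the reply is parsed into a plain object, downstream systems such as work-order routing or dashboards can consume it without any natural language parsing.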
Step 4 - Decision Support and Prioritization: The supervisor asks "What should I prioritize tonight?" The AI synthesizes all context:

```
Priority recommendations for Shift B:

IMMEDIATE (next 30 minutes):
- CNC-L2-M03: Dispatch maintenance to coolant system NOW. Production
  impacted, equipment damage risk if temperature continues rising.
  Impact: $2,400/hour revenue loss at current throughput reduction.

HIGH (within 4 hours):
- PUMP-L1-H01: Schedule bearing inspection for line changeover at 02:00.
  Pump operational but degraded; plan bearing replacement parts and backup
  pump. Risk: Catastrophic failure if bearing fully degrades.

STANDARD (this week):
- CONV-L1-B02: Scheduled inspection per maintenance calendar
- Routine: Daily checklist completion for all lines

Resource allocation: Focus maintenance team lead on CNC thermal issue.
Assign technician to prepare pump bearing replacement kit during current
production run.
```

Outcome: Rapid Response Enabled by Local AI

The AI-assisted workflow reduced incident assessment time from 15-20 minutes to under 3 minutes. The supervisor immediately dispatched maintenance to investigate CNC-L2-M03's coolant system, identified as the root cause. Technicians found debris blocking a cooling circuit junction; a 5-minute clearance procedure restored coolant flow to 5.3 L/min. Equipment temperature dropped to 73°C within 10 minutes, and full production throughput resumed.

For the hydraulic pump, bearing inspection was scheduled during the planned line changeover at 02:00, preventing an emergency production stoppage. Bearings were replaced preemptively, avoiding the catastrophic pump failure that would have caused 6-8 hours of unplanned downtime.

Total downtime avoided: 8+ hours. Revenue protection: ~$48,000 based on the facility's production value. All decisions were made with AI running entirely on a local edge device: no cloud dependency, no data exposure, no network latency impact. The complete incident response workflow operated on facility-controlled infrastructure with full data sovereignty.

Key Takeaways for Manufacturing AI Deployment

Building production-ready AI systems for industrial environments requires architectural decisions that prioritize operational reliability, data sovereignty, and integration pragmatism over cutting-edge model sophistication. Several critical lessons emerge from implementing on-premises manufacturing intelligence:

Data locality through architectural guarantee: On-premises AI ensures proprietary production data never leaves facility networks, not through configuration but through fundamental architecture. There are no cloud API calls to misconfigure, no data upload features to accidentally enable, no external endpoints to compromise. This physical data boundary satisfies security audits and competitive protection requirements with demonstrable certainty rather than contractual assurance.

Model selection determines deployment feasibility: Smaller models (0.5B-2B parameters) enable deployment on commodity server hardware without specialized AI accelerators. These models provide sufficient accuracy for industrial classification, summarization, and conversational assistance while maintaining the sub-3-second response times essential for operator acceptance. Larger models improve nuance but require GPU infrastructure and longer inference times that may not justify marginal accuracy gains for operational decision-making.

Graceful degradation preserves operations: AI capabilities enhance but never replace core monitoring functions. Equipment dashboards must display raw telemetry, alert states, and historical trends even when AI analysis is unavailable. This architectural separation ensures operations continue during AI service maintenance, model updates, or system failures. AI becomes value-add intelligence rather than a critical dependency. (A sketch of this pattern follows these takeaways.)

Context-rich prompts determine accuracy: Generic prompts produce generic responses unsuitable for operational decisions. Effective industrial prompts include equipment specifications, normal operating ranges, alert thresholds, maintenance history, and temporal context. This structured context enables models to provide grounded, specific recommendations citing actual equipment conditions rather than hallucinated speculation. Prompt engineering matters more than model size for operational accuracy.

Structured outputs enable automation: JSON response formats with predefined fields allow automated systems to parse classifications, severity levels, and recommended actions without fragile natural language parsing. Maintenance management systems can automatically route work orders, trigger alerts, and update dashboards based on AI classification results. This structured integration scales AI beyond human-read summaries into automated workflow systems.

Integration patterns bridge legacy and modern: OPC-UA clients and Modbus TCP gateways connect decades-old PLCs and SCADA systems to modern AI backends without replacing functional control infrastructure. This evolutionary approach enables AI adoption without massive capital equipment replacement. Manufacturing facilities can augment existing investments rather than ripping and replacing proven systems.

Responsible AI through grounding and constraints: Industrial AI must acknowledge limits and avoid speculation beyond available data. System prompts should explicitly instruct models: "If you don't have information to answer, clearly state that" and "Do not speculate about equipment conditions beyond provided data." This reduces hallucination risk and maintains operator trust. Operators must verify AI recommendations against domain expertise; position AI as decision support augmenting human judgment, not replacing it.
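As referenced in the graceful degradation takeaway above, the pattern can be as simple as returning telemetry unconditionally and attaching the AI summary only when the model answers in time. A minimal sketch, again assuming the same hypothetical helpers used earlier:

```javascript
// Sketch of the graceful degradation pattern: telemetry and alerts are
// always returned; the AI summary is strictly best-effort. Helper names
// (getAssetById, foundryService) are assumed for illustration.
app.get('/api/assets/:id', async (req, res) => {
  const asset = await getAssetById(req.params.id);
  if (!asset) {
    return res.status(404).json({ error: 'Asset not found' });
  }

  // Core monitoring data is never gated on AI availability
  const payload = {
    id: asset.id,
    status: asset.status,
    telemetry: asset.telemetry,
    alerts: asset.alerts,
    ai_summary: null,
    ai_available: false
  };

  try {
    // Bound the AI call so a slow or stopped model never blocks the dashboard
    const summary = await Promise.race([
      foundryService.generateCompletion(
        `Summarize the condition of ${asset.id} using only this data: ${JSON.stringify(asset.telemetry)}`
      ),
      new Promise((_, reject) => setTimeout(() => reject(new Error('AI timeout')), 5000))
    ]);
    payload.ai_summary = summary.content;
    payload.ai_available = true;
  } catch (err) {
    console.warn(`AI summary unavailable for ${asset.id}: ${err.message}`);
  }

  res.json(payload);
});
```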
Getting Started: Installation and Deployment

Implementing the manufacturing intelligence system requires Foundry Local installation, Node.js backend deployment, and frontend hosting, achievable within a few hours for facilities with existing IT infrastructure and server hardware.

Prerequisites and System Requirements

Hardware requirements depend on the selected AI models. The minimum configuration supports the Phi-3.5-mini model (2.1GB):

- 8GB RAM
- 4-core CPU (Intel Core i5/AMD Ryzen 5 or better)
- 50GB available storage for model files and application data
- Windows 11 or Windows Server 2025

Recommended production configuration:

- 16GB+ RAM (supports larger models and concurrent requests)
- 8-core CPU or NVIDIA GPU (RTX 3060/4060 or better for 3-5x inference acceleration)
- 100GB SSD storage
- Gigabit network interface for intra-facility communication

Software prerequisites:

- Node.js 18 or newer (download from nodejs.org or install via a system package manager)
- Git for repository cloning
- A modern web browser (Chrome, Edge, Firefox) for frontend access
- Windows: PowerShell 5.1+

Foundry Local Installation and Model Setup

Install Foundry Local using the system-appropriate package manager:

```
# Windows installation via winget
winget install Microsoft.FoundryLocal

# Verify installation
foundry --version

# macOS installation via Homebrew
brew install microsoft/foundrylocal/foundrylocal
```

Download AI models based on hardware capabilities and accuracy requirements:

```
# Fast option: Qwen 0.5B (500MB, 100-200ms inference)
foundry model download qwen2.5-0.5b

# Balanced option: Phi-3.5 Mini (2.1GB, 1-3 second inference)
foundry model download phi-3.5-mini

# High quality option: Phi-4 Mini (3.6GB, 2-5 second inference)
foundry model download phi-4-mini

# Check downloaded models
foundry model list
```

Load a model into the Foundry Local service:

```
# Load default recommended model
foundry model run phi-3.5-mini

# Verify service is running and model is loaded
foundry service status
```

The Foundry Local service will start automatically and expose a REST API on localhost:8008 (default port). The backend application connects to this endpoint for all AI inference operations.

Backend Service Deployment

Clone the repository and install dependencies:

```
# Clone from GitHub
git clone https://github.com/leestott/FoundryLocal-IndJSsample.git
cd FoundryLocal-IndJSsample

# Navigate to backend directory
cd backend

# Install Node.js dependencies
npm install

# Start the backend service
npm start
```
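Under the hood, the backend's foundryService wrapper is the only component that talks to the model. Assuming Foundry Local's OpenAI-compatible REST API on the localhost:8008 port noted above (the sample repository may use the Foundry Local SDK directly instead), a minimal version of that wrapper could look like:

```javascript
// Sketch of a thin Foundry Local client wrapper. Assumes an OpenAI-compatible
// API on the default local port; the repository's foundryService may differ.
const { OpenAI } = require('openai');

const client = new OpenAI({
  baseURL: 'http://localhost:8008/v1', // local endpoint only, no cloud calls
  apiKey: 'not-needed-for-local'       // placeholder; the local service requires no key
});

async function generateCompletion(prompt, options = {}) {
  const started = Date.now();
  const completion = await client.chat.completions.create({
    model: options.model || 'phi-3.5-mini',
    messages: options.messages || [{ role: 'user', content: prompt }],
    temperature: options.temperature ?? 0.3
  });
  return {
    content: completion.choices[0].message.content,
    model: completion.model,
    latency_ms: Date.now() - started
  };
}

module.exports = { generateCompletion };
```

This is also the practical meaning of the data locality guarantee discussed earlier: the only network destination the AI code knows about is localhost.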
The backend server will initialize and display startup messages:

```
Manufacturing AI Backend Starting...
✓ Foundry Local client initialized: http://localhost:8008
✓ Model detected: phi-3.5-mini
✓ Sample data loaded: 6 assets, 12 maintenance logs
✓ Server running on port 3000
✓ Frontend accessible at: http://localhost:3000
Health check: http://localhost:3000/api/health
```

Verify backend health:

```
# Test backend API
curl http://localhost:3000/api/health
# Expected response: {"ok":true,"service":"manufacturing-ai-backend"}

# Test Foundry Local integration
curl http://localhost:3000/api/models/status
# Expected response: {"serviceRunning":true,"model":"phi-3.5-mini"}
```

Frontend Access and Validation

Open the web interface by navigating to web/index.html in a browser or starting from the backend URL:

```
# Windows: Open frontend directly
start http://localhost:3000

# macOS/Linux: Open frontend
open http://localhost:3000
# or
xdg-open http://localhost:3000
```

The web interface displays a navigation bar with four main sections:

- Overview: Plant-wide dashboard showing all equipment with health status cards, alert counts, and a "Load Scenario" button to populate sample data
- Asset Health: Equipment selector dropdown, telemetry display, active alerts list, and a "Generate AI Summary" button for detailed analysis
- Maintenance: Text area for maintenance log entry, a "Classify Log" button, and a classification result display showing category, priority, and recommendations
- AI Assistant: Chat interface with message input, conversation history, and natural language query capabilities

Running the Sample Scenario

Test the complete system with the included sample data:

1. Load scenario data: Click the "Load Scenario Inputs" button in the Overview tab. This populates the equipment database with the CNC-L2-M03 thermal event, the PUMP-L1-H01 vibration alert, and baseline telemetry for all assets.
2. Generate asset summary: Navigate to the Asset Health tab, select "CNC-L2-M03" from the dropdown, and click "Generate AI Analysis". Within 2-3 seconds, a detailed health summary appears explaining the thermal condition, coolant flow issue, impacts, and recommended actions.
3. Classify maintenance note: Go to the Maintenance tab and enter the text: "Grinding noise on startup, vibration 5.2 mm/s, suspect bearing wear". Click "Classify Log". The AI categorizes it as MECHANICAL/HIGH priority with specific repair recommendations.
4. Ask operational questions: Open the AI Assistant tab and type "What's wrong with Line 2?" or "Which equipment needs attention?" The AI responds with specific equipment IDs, current conditions, and a prioritized action list.

Production Deployment Considerations

For actual manufacturing facility deployment, several additional configurations apply:

Hardware placement: Deploy the backend service on a dedicated server within the manufacturing network zone. Avoid co-locating AI workloads with critical SCADA/MES systems due to resource contention. Use a physical server or VM with direct hardware access for GPU acceleration.

Network configuration: The backend should reside behind the facility firewall with access restricted to internal networks. Do not expose the AI service directly to the internet; use a VPN for remote access if required. Implement authentication via Active Directory/LDAP integration. Configure firewall rules permitting connections from operator workstations and monitoring systems only.

Data integration: Replace the sample JSON data with connections to actual data sources. Implement an OPC-UA client for SCADA integration, connect to the MES database for production schedules, and integrate with the CMMS for maintenance history. The code includes placeholder functions for external data source integration; customize them for facility-specific systems. (A minimal OPC-UA read sketch follows these considerations.)

Model selection: Choose an appropriate model based on hardware and accuracy requirements. Start with phi-3.5-mini for production deployment. Upgrade to phi-4-mini if analysis quality needs improvement and the hardware supports it. Use qwen2.5-0.5b for high-throughput scenarios where speed matters more than nuanced understanding. Test all models against validation scenarios before promoting them to production.

Monitoring and maintenance: Implement health checks covering Foundry Local service status, backend API responsiveness, model inference latency, and error rates. Set up alerting for when inference latency exceeds thresholds or the service is unavailable. Establish procedures for model updates during planned maintenance windows. Keep audit logs of all AI invocations for compliance and troubleshooting.
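For the data integration consideration above, the OPC-UA path might start with something like the following read, using the node-opcua package. The endpoint URL and node ID are placeholders that would be replaced with facility-specific values:

```javascript
// Sketch of a single OPC-UA read with the node-opcua package, as one way to
// replace sample JSON data with live SCADA values. Endpoint and node ID are
// placeholders; real values are facility-specific.
const { OPCUAClient, AttributeIds } = require('node-opcua');

async function readSpindleTemperature() {
  const client = OPCUAClient.create({ endpointMustExist: false });
  await client.connect('opc.tcp://plc-gateway.local:4840'); // placeholder endpoint
  const session = await client.createSession();
  try {
    const dataValue = await session.read({
      nodeId: 'ns=2;s=Line2.CNC_M03.SpindleTemperature', // placeholder node ID
      attributeId: AttributeIds.Value
    });
    return dataValue.value.value; // numeric temperature reading
  } finally {
    await session.close();
    await client.disconnect();
  }
}
```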
Resources and Further Learning

The complete implementation with detailed comments, sample data, and documentation provides a foundation for building custom manufacturing intelligence systems. Additional resources support extension and adaptation to specific facility requirements.

- FoundryLocal-IndJSsample GitHub Repository – Complete source code with the JavaScript backend, HTML/CSS/JS frontend, sample manufacturing data, and a comprehensive README
- Installation and Configuration Guide – Detailed setup instructions, API documentation, troubleshooting procedures, and deployment guidance
- Microsoft Foundry Local Documentation – Official SDK reference, model catalog, hardware requirements, and performance tuning guidance
- Sample Manufacturing Data Format – JSON structure examples for equipment telemetry, maintenance logs, alert definitions, and operational events
- Backend Implementation Reference – Express server architecture, Foundry Local SDK integration patterns, API endpoint implementations, and error handling
- OPC Foundation – Industrial communication standards (OPC-UA, OPC DA) for SCADA system integration and PLC connectivity
- ISA Standards – International Society of Automation standards for industrial systems, SCADA architecture, and manufacturing execution systems
- EdgeAI for Beginners – Learn more about Edge AI using these course materials

The manufacturing intelligence implementation demonstrates that sophisticated AI capabilities can run entirely on-premises without compromising operational requirements. Facilities gain predictive maintenance insights, natural language operational support, and automated equipment analysis while maintaining complete data sovereignty, zero network dependency, and the deterministic performance characteristics essential for production environments.