ai foundry
90 TopicsMaking Sense of Azure AI Foundry IQ
As enterprise teams build AI agents, the hardest design decisions often have nothing to do with models. Instead, they revolve around a more fundamental question: How should an agent access organizational knowledge in a way that is accurate, secure, and sustainable over time? Azure AI Foundry IQ is designed to address a specific version of that problem. It is not a general‑purpose data access layer, and it is not a replacement for every retrieval pattern. Understanding where it fits and where it does not is key to using it effectively. This post explores those boundaries and grounds them in concrete, enterprise‑relevant scenarios, before showing how Foundry IQ can be implemented directly via Azure AI Search APIs and SDKs. What Azure AI Foundry IQ Is (and Is Not): Azure AI Foundry IQ is a managed knowledge layer built on Azure AI Search. It allows you to define a knowledge base that spans multiple content sources such as SharePoint, Azure Blob Storage, OneLake, existing Azure AI Search indexes, and selected external sources and expose them through a single, permission‑aware endpoint. When an agent queries a knowledge base, Foundry IQ: Plans how the query should be executed Selects relevant knowledge sources Runs retrieval (optionally in multiple steps) Enforces user permissions Returns grounded results with citations A single knowledge base can be reused across multiple agents or applications, avoiding duplicated indexing and inconsistent retrieval logic. What Foundry IQ is not: It does not execute SQL queries, perform aggregations, or provide real‑time numeric accuracy. Foundry IQ retrieves unstructured text, not transactional or analytical data. Where Foundry IQ Is a Good Fit 1. Multi‑Source, Distributed Knowledge Foundry IQ is most valuable when relevant knowledge is spread across multiple systems. It removes the need for each agent to manage source‑specific routing and retrieval logic. This benefit increases as the number of sources grows; with a single source, the overhead is rarely justified. 2. Complex or Multi‑Part Questions Foundry IQ’s agentic retrieval model is designed for questions that require: Decomposition into sub‑questions Retrieval from multiple documents Synthesis across sources Its multi‑step retrieval approach is especially effective when a single document cannot answer the question on its own. 3. Reduced Custom Retrieval Engineering Foundry IQ automates indexing, chunking, vectorization, and orchestration across sources. This makes it a strong choice for teams that want to focus on agent behavior rather than building and maintaining custom RAG pipelines. 4. Enterprise Security and Governance Foundry IQ integrates with Microsoft Entra ID and supports document‑level permissions and Purview sensitivity labels where the underlying source allows it. This makes it suitable for internal or regulated scenarios where permission trimming is a hard requirement. 5. Shared Knowledge Across Multiple Agents A single knowledge base can serve multiple agents or applications, reducing operational overhead and ensuring consistent retrieval behavior across experiences. 6. High Emphasis on Answer Quality and Trust For scenarios where correctness, grounding, and citations matter more than latency or cost, Foundry IQ’s multi‑step retrieval consistently outperforms basic RAG approaches. Example Scenarios Where Foundry IQ Works Well Scenario A: Internal Policy and Operations Assistant An enterprise builds an internal assistant for store managers. Relevant information lives in: • HR policies in SharePoint • Safety procedures in Blob Storage • Operations manuals in OneLake Questions often span multiple documents. A single Foundry IQ knowledge base unifies these sources and enforces permissions automatically. Scenario B: Compliance or Regulatory Knowledge Assistant A compliance team needs answers strictly grounded in approved documents, with citations and access control. Foundry IQ ensures only authorized content is retrieved, reducing the risk of accidental data exposure. Scenario C: Shared Knowledge Layer for Multiple Internal Agents Multiple internal agents like chat assistants, workflow helpers, embedded copilots rely on the same procedural content. A shared knowledge base avoids duplicate indexing and centralizes governance. Where Foundry IQ Is Not a Good Fit 1. Simple or Single‑Source Q&A For a single, well‑defined source, Foundry IQ’s orchestration adds complexity without proportional benefit. 2. Structured or Analytical Data Queries Foundry IQ does not execute live queries or calculations. It retrieves text, not metrics. 3. Ultra‑Low Latency or High‑Throughput Requirements Agentic retrieval introduces LLM‑in‑the‑loop latency and token costs. For sub‑second responses at scale, simpler retrieval pipelines are more appropriate. 4. Highly Customized Retrieval Logic Foundry IQ abstracts the retrieval pipeline. If you require fine‑grained control over scoring or transformations, a fully custom search pipeline may be preferable. Example Scenarios Where Foundry IQ Is the Wrong Tool Scenario D: Sales and Inventory Analytics Agent Questions like “What were Q4 sales by region?” require live data queries. Indexing reports leads to stale answers. A direct SQL or analytics tool is the correct solution. Scenario E: High‑Volume, Low‑Latency Assistant Voice‑based assistants requiring sub‑second responses cannot tolerate the latency of agentic retrieval. A Common Architecture Pattern Most successful implementations combine: Foundry IQ for unstructured documents and policies Structured data tools for analytics and live queries An application or agent layer that routes questions based on intent This avoids forcing a single tool to solve every problem. Querying Foundry IQ Knowledge Bases Directly via Azure AI Search SDK You can query Azure AI Foundry IQ knowledge bases directly using the azure-search-documents Python SDK without using Foundry Agent Service. Your App → Azure AI Search SDK → Foundry IQ Knowledge Base → Grounded Results Ideal when you want full orchestration control while still benefiting from managed, agentic retrieval. How this works Note:It is a reference implementation Install pip install --pre azure-search-documents azure-identity Setup (High Level) Provision Azure AI Search (Basic or higher) Enable Azure AD and API key authentication Enable a system‑assigned managed identity Ingest Content via Knowledge Sources Blob Storage, SharePoint, or OneLake Index, indexer, data source, and skillset are created automatically Knowledge sources and KBs are created via REST API (2025‑11‑01‑preview) Create a Knowledge Base minimal reasoning → semantic retrieval only (no LLM) low / medium reasoning → requires Azure OpenAI model Search service MI needs Cognitive Services User Querying the Knowledge Base (Python) Initialize the Client from azure.identity import DefaultAzureCredential from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient client = KnowledgeBaseRetrievalClient( endpoint="https://<search-service>.search.windows.net", knowledge_base_name="<kb-name>", credential=DefaultAzureCredential(), ) Minimal Reasoning (Fast, No LLM) from azure.search.documents.knowledgebases.models import ( KnowledgeBaseRetrievalRequest, KnowledgeRetrievalSemanticIntent, KnowledgeRetrievalMinimalReasoningEffort, KnowledgeRetrievalOutputMode, ) request = KnowledgeBaseRetrievalRequest( intents=[KnowledgeRetrievalSemanticIntent(search="your question here")], retrieval_reasoning_effort=KnowledgeRetrievalMinimalReasoningEffort(), output_mode=KnowledgeRetrievalOutputMode.EXTRACTIVE_DATA, ) response = client.retrieve(retrieval_request=request) Conversational Reasoning (LLM‑Backed) from azure.search.documents.knowledgebases.models import ( KnowledgeBaseRetrievalRequest, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeRetrievalLowReasoningEffort, KnowledgeRetrievalOutputMode, ) request = KnowledgeBaseRetrievalRequest( messages=[ KnowledgeBaseMessage( role="user", content=[KnowledgeBaseMessageTextContent(text="<first user question>")] ), KnowledgeBaseMessage( role="assistant", content=[KnowledgeBaseMessageTextContent(text="<assistant response>")] ), KnowledgeBaseMessage( role="user", content=[KnowledgeBaseMessageTextContent(text="<follow-up question>")] ), ], retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort(), output_mode=KnowledgeRetrievalOutputMode.EXTRACTIVE_DATA, ) response = client.retrieve(retrieval_request=request) Keep in mind: intents → minimal reasoning only messages → low / medium reasoning only They are not interchangeable. Processing the Response # Extracted content for msg in (response.response or []): for item in (msg.content or []): print(item.text) # Citations (handles blob, SharePoint, OneLake, and search index references) for ref in (response.references or []): ref_id = getattr(ref, "id", None) url = getattr(ref, "blob_url", None) or getattr(ref, "url", None) print(f"[{ref_id}] {url}") # Retrieval diagnostics for record in (response.activity or []): elapsed = getattr(record, "elapsed_ms", None) or "" print(f"{record.type}: {elapsed}ms") Output Modes Mode When to Use extractiveData Feed grounded chunks into your own LLM answerSynthesis Return a ready‑made answer with citations (LLM required) Security & Permissions RBAC: Search Index Data Reader with DefaultAzureCredential Permission trimming Must be enabled at ingestion (ingestionPermissionOptions) Enforced at query time by passing the user’s bearer token response = client.retrieve( retrieval_request=request, x_ms_query_source_authorization="Bearer <user-token>" ) Foundry IQ won't solve every retrieval problem. But when your agents need grounded, permission-aware answers from content scattered across SharePoint, Blob Storage, and OneLake, it handles the hard parts — so you can focus on what your agent actually does.AI Toolkit Extension Pack for Visual Studio Code: Ignite 2025 Update
Unlock the Latest Agentic App Capabilities The Ignite 2025 update delivers a major leap forward for the AI Toolkit extension pack in VS Code, introducing a unified, end-to-end environment for building, visualizing, and deploying agentic applications to Microsoft Foundry, and the addition of Anthropic’s frontier Claude models in the Model Catalog! This release enables developers to build and debug locally in VS Code, then deploy to the cloud with a single click. Seamlessly switch between VS Code and the Foundry portal for visualization, orchestration, and evaluation, creating a smooth roundtrip workflow that accelerates innovation and delivers a truly unified AI development experience. Download the http://aka.ms/aitoolkit today and start building next-generation agentic apps in VS Code! What Can You Do with the AI Toolkit Extension Pack? Access Anthropic models in the Model Catalog Following the Microsoft, NVIDIA and Anthropic strategic partnerships announcement today, we are excited to share that Anthropic’s frontier Claude models including Claude Sonnet 4.5, Claude Opus 4.1, and Claude Haiku 4.5, are now integrated into the AI Toolkit, providing even more choices and flexibility when building intelligent applications and AI agents. Build AI Agents Using GitHub Copilot Scaffold agent applications using best-practice patterns, tool-calling examples, tracing hooks, and test scaffolds, all powered by Copilot and aligned with the Microsoft Agent Framework. Generate agent code in Python or .NET, giving you flexibility to target your preferred runtime. Build and Customize YAML Workflows Design YAML-based workflows in the Foundry portal, then continue editing and testing directly in VS Code. To customize your YAML-based workflows, instantly convert it to Agent Framework code using GitHub Copilot. Upgrade from declarative design to code-first customization without starting from scratch. Visualize Multi-Agent Workflows Envision your code-based agent workflows with an interactive graph visualizer that reveals each component and how they connect Watch in real-time how each node lights up as you run your agent. Use the visualizer to understand and debug complex agent graphs, making iteration fast and intuitive. Experiment, Debug, and Evaluate Locally Use the Hosted Agents Playground to quickly interact with your agents on your development machine. Leverage local tracing support to debug reasoning steps, tool calls, and latency hotspots—so you can quickly diagnose and fix issues. Define metrics, tasks, and datasets for agent evaluation, then implement metrics using the Foundry Evaluation SDK and orchestrate evaluations runs with the help of Copilot. Seamless Integration Across Environments Jump from Foundry Portal to VS Code Web for a development environment in your preferred code editor setting. Open YAML workflows, playgrounds, and agent templates directly in VS Code for editing and deployment. How to Get Started Install the AI Toolkit extension pack from the VS Code marketplace. Check out documentation. Get started with building workflows with Microsoft Foundry in VS Code 1. Work with Hosted (Pro-code) Agent workflows in VS Code 2. Work with Declarative (Low-code) Agent workflows in VS Code Feedback & Support Try out the extensions and let us know what you think! File issues or feedback on our GitHub repo for Foundry extension and AI Toolkit extension. Your input helps us make continuous improvements.3.1KViews4likes0CommentsUnderstanding Small Language Modes
Small Language Models (SLMs) bring AI from the cloud to your device. Unlike Large Language Models that require massive compute and energy, SLMs run locally, offering speed, privacy, and efficiency. They’re ideal for edge applications like mobile, robotics, and IoT.Agents League: Meet the Winners
Agents League brought together developers from around the world to build AI agents using Microsoft's developer tools. With 100+ submissions across three tracks, choosing winners was genuinely difficult. Today, we're proud to announce the category champions. 🎨 Creative Apps Winner: CodeSonify View project CodeSonify turns source code into music. As a genuinely thoughtful system, its functions become ascending melodies, loops create rhythmic patterns, conditionals trigger chord changes, and bugs produce dissonant sounds. It supports 7 programming languages and 5 musical styles, with each language mapped to its own key signature and code complexity directly driving the tempo. What makes CodeSonify stand out is the depth of execution. CodeSonify team delivered three integrated experiences: a web app with real-time visualization and one-click MIDI export, an MCP server exposing 5 tools inside GitHub Copilot in VS Code Agent Mode, and a diff sonification engine that lets you hear a code review. A clean refactor sounds harmonious. A messy one sounds chaotic. The team even built the MIDI generator from scratch in pure TypeScript with zero external dependencies. Built entirely with GitHub Copilot assistance, this is one of those projects that makes you think about code differently. 🧠 Reasoning Agents Winner: CertPrep Multi-Agent System View project CertPrep Multi-Agent System team built a production-grade 8-agent system for personalized Microsoft certification exam preparation, supporting 9 exam families including AI-102, AZ-204, AZ-305, and more. Each agent has a distinct responsibility: profiling the learner, generating a week-by-week study schedule, curating learning paths, tracking readiness, running mock assessments, and issuing a GO / CONDITIONAL GO / NOT YET booking recommendation. The engineering behind the scene here is impressive. A 3-tier LLM fallback chain ensures the system runs reliably even without Azure credentials, with the full pipeline completing in under 1 second in mock mode. A 17-rule guardrail pipeline validates every agent boundary. Study time allocation uses the Largest Remainder algorithm to guarantee no domain is silently zeroed out. 342 automated tests back it all up. This is what thoughtful multi-agent architecture looks like in practice. 💼 Enterprise Agents Winner: Whatever AI Assistant (WAIA) View project WAIA is a production-ready multi-agent system for Microsoft 365 Copilot Chat and Microsoft Teams. A workflow agent routes queries to specialized HR, IT, or Fallback agents, transparently to the user, handling both RAG-pattern Q&A and action automation — including IT ticket submission via a SharePoint list. Technically, it's a showcase of what serious enterprise agent development looks like: a custom MCP server secured with OAuth Identity Passthrough, streaming responses via the OpenAI Responses API, Adaptive Cards for human-in-the-loop approval flows, a debug mode accessible directly from Teams or Copilot, and full OpenTelemetry integration visible in the Foundry portal. Franck also shipped end-to-end automated Bicep deployment so the solution can land in any Azure environment. It's polished, thoroughly documented, and built to be replicated. Thank you To every developer who submitted and shipped projects during Agents League: thank you 💜 Your creativity and innovation brought Agents League to life! 👉 Browse all submissions on GitHubBuilding a Smart Building HVAC Digital Twin with AI Copilot Using Foundry Local
Introduction Building operations teams face a constant challenge: optimizing HVAC systems for energy efficiency while maintaining occupant comfort and air quality. Traditional building management systems display raw sensor data, temperatures, pressures, CO₂ levels—but translating this into actionable insights requires deep HVAC expertise. What if operators could simply ask "Why is the third floor so warm?" and get an intelligent answer grounded in real building state? This article demonstrates building a sample smart building digital twin with an AI-powered operations copilot, implemented using DigitalTwin, React, Three.js, and Microsoft Foundry Local. You'll learn how to architect physics-based simulators that model thermal dynamics, implement 3D visualizations of building systems, integrate natural language AI control, and design fault injection systems for testing and training. Whether you're building IoT platforms for commercial real estate, designing energy management systems, or implementing predictive maintenance for building automation, this sample provides proven patterns for intelligent facility operations. Why Digital Twins Matter for Building Operations Physical buildings generate enormous operational data but lack intelligent interpretation layers. A 50,000 square foot office building might have 500+ sensors streaming metrics every minute, zone temperatures, humidity levels, equipment runtimes, energy consumption. Traditional BMS (Building Management Systems) visualize this data as charts and gauges, but operators must manually correlate patterns, diagnose issues, and predict failures. Digital twins solve this through physics-based simulation coupled with AI interpretation. Instead of just displaying current temperature readings, a digital twin models thermal dynamics, heat transfer rates, HVAC response characteristics, occupancy impacts. When conditions deviate from expectations, the twin compares observed versus predicted states, identifying root causes. Layer AI on top, and operators get natural language explanations: "The conference room is 3 degrees too warm because the VAV damper is stuck at 40% open, reducing airflow by 60%." This application focuses on HVAC, the largest building energy consumer, typically 40-50% of total usage. Optimizing HVAC by just 10% through better controls can save thousands of dollars monthly while improving occupant satisfaction. The digital twin enables "what-if" scenarios before making changes: "What happens to energy consumption and comfort if we raise the cooling setpoint by 2 degrees during peak demand response events?" Architecture: Three-Tier Digital Twin System The application implements a clean three-tier architecture separating visualization, simulation, and state management: The frontend uses React with Three.js for 3D visualization. Users see an interactive 3D model of the three-floor building with color-coded zones indicating temperature and CO₂ levels. Click any equipment, AHUs, VAVs, chillers, to see detailed telemetry. The control panel enables adjusting setpoints, running simulation steps, and activating demand response scenarios. Real-time charts display KPIs: energy consumption, comfort compliance, air quality levels. The backend Node.js/Express server orchestrates simulation and state management. It maintains the digital twin state as JSON, the single source of truth for all equipment, zones, and telemetry. REST API endpoints handle control requests, simulation steps, and AI copilot queries. WebSocket connections push real-time updates to the frontend for live monitoring. The HVAC simulator implements physics-based models: 1R1C thermal models for zones, affinity laws for fan power, chiller COP calculations, CO₂ mass balance equations. Foundry Local provides AI copilot capabilities. The backend uses foundry-local-sdk to query locally running models. Natural language queries ("How's the lobby temperature?") get answered with building state context. The copilot can explain anomalies, suggest optimizations, and even execute commands when explicitly requested. Implementing Physics-Based HVAC Simulation Accurate simulation requires modeling actual HVAC physics. The simulator implements several established building energy models: // backend/src/simulator/thermal-model.js class ZoneThermalModel { // 1R1C (one resistance, one capacitance) thermal model static calculateTemperatureChange(zone, delta_t_seconds) { const C_thermal = zone.volume * 1.2 * 1000; // Heat capacity (J/K) const R_thermal = zone.r_value * zone.envelope_area; // Thermal resistance // Internal heat gains (occupancy, equipment, lighting) const Q_internal = zone.occupancy * 100 + // 100W per person zone.equipment_load + zone.lighting_load; // Cooling/heating from HVAC const airflow_kg_s = zone.vav.airflow_cfm * 0.0004719; // CFM to kg/s const c_p_air = 1006; // Specific heat of air (J/kg·K) const Q_hvac = airflow_kg_s * c_p_air * (zone.vav.supply_temp - zone.temperature); // Envelope losses const Q_envelope = (zone.outdoor_temp - zone.temperature) / R_thermal; // Net energy balance const Q_net = Q_internal + Q_hvac + Q_envelope; // Temperature change: Q = C * dT/dt const dT = (Q_net / C_thermal) * delta_t_seconds; return zone.temperature + dT; } } This model captures essential thermal dynamics while remaining computationally fast enough for real-time simulation. It accounts for internal heat generation from occupants and equipment, HVAC cooling/heating contributions, and heat loss through the building envelope. The CO₂ model uses mass balance equations: class AirQualityModel { static calculateCO2Change(zone, delta_t_seconds) { // CO₂ generation from occupants const G_co2 = zone.occupancy * 0.0052; // L/s per person at rest // Outdoor air ventilation rate const V_oa = zone.vav.outdoor_air_cfm * 0.000471947; // CFM to m³/s // CO₂ concentration difference (indoor - outdoor) const delta_CO2 = zone.co2_ppm - 400; // Outdoor ~400ppm // Mass balance: dC/dt = (G - V*ΔC) / Volume const dCO2_dt = (G_co2 - V_oa * delta_CO2) / zone.volume; return zone.co2_ppm + (dCO2_dt * delta_t_seconds); } } These models execute every simulation step, updating the entire building state: async function simulateStep(twin, timestep_minutes) { const delta_t = timestep_minutes * 60; // Convert to seconds // Update each zone for (const zone of twin.zones) { zone.temperature = ZoneThermalModel.calculateTemperatureChange(zone, delta_t); zone.co2_ppm = AirQualityModel.calculateCO2Change(zone, delta_t); } // Update equipment based on zone demands for (const vav of twin.vavs) { updateVAVOperation(vav, twin.zones); } for (const ahu of twin.ahus) { updateAHUOperation(ahu, twin.vavs); } updateChillerOperation(twin.chiller, twin.ahus); updateBoilerOperation(twin.boiler, twin.ahus); // Calculate system KPIs twin.kpis = calculateSystemKPIs(twin); // Detect alerts twin.alerts = detectAnomalies(twin); // Persist updated state await saveTwinState(twin); return twin; } 3D Visualization with React and Three.js The frontend renders an interactive 3D building view that updates in real-time as conditions change. Using React Three Fiber simplifies Three.js integration with React's component model: // frontend/src/components/BuildingView3D.jsx import { Canvas } from '@react-three/fiber'; import { OrbitControls } from '@react-three/drei'; export function BuildingView3D({ twinState }) { return ( {/* Render building floors */} {twinState.zones.map(zone => ( selectZone(zone.id)} /> ))} {/* Render equipment */} {twinState.ahus.map(ahu => ( ))} ); } function ZoneMesh({ zone, onClick }) { const color = getTemperatureColor(zone.temperature, zone.setpoint); return ( ); } function getTemperatureColor(current, setpoint) { const deviation = current - setpoint; if (Math.abs(deviation) < 1) return '#00ff00'; // Green: comfortable if (Math.abs(deviation) < 3) return '#ffff00'; // Yellow: acceptable return '#ff0000'; // Red: uncomfortable } This visualization immediately shows building state at a glance, operators see "hot spots" in red, comfortable zones in green, and can click any area for detailed metrics. Integrating AI Copilot for Natural Language Control The AI copilot transforms building data into conversational insights. Instead of navigating multiple screens, operators simply ask questions: // backend/src/routes/copilot.js import { FoundryLocalClient } from 'foundry-local-sdk'; const foundry = new FoundryLocalClient({ endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT }); router.post('/api/copilot/chat', async (req, res) => { const { message } = req.body; // Load current building state const twin = await loadTwinState(); // Build context for AI const context = buildBuildingContext(twin); const completion = await foundry.chat.completions.create({ model: 'phi-4', messages: [ { role: 'system', content: `You are an HVAC operations assistant for a 3-floor office building. Current Building State: ${context} Answer questions about equipment status, comfort conditions, and energy usage. Provide specific, actionable information based on the current data. Do not speculate beyond provided information.` }, { role: 'user', content: message } ], temperature: 0.3, max_tokens: 300 }); res.json({ response: completion.choices[0].message.content, model: 'phi-4', timestamp: new Date().toISOString() }); }); function buildBuildingContext(twin) { const alerts = twin.alerts.filter(a => a.active); const zoneStatus = twin.zones.map(z => `${z.name}: ${z.temperature.toFixed(1)}°F (setpoint: ${z.setpoint}°F), CO₂: ${z.co2_ppm}ppm, ${z.occupancy} people` ); return ` Zones: ${zoneStatus.join('\n')} Active Alerts: ${alerts.length} ${alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Energy: ${twin.kpis.total_power_kw.toFixed(1)} kW Chiller: ${twin.chiller.percent_load}% load, ${twin.chiller.cop.toFixed(2)} COP `.trim(); } Example interactions: User: "Why is the third floor conference room warm?" AI: "The Executive Suite on Floor 3 is currently 75.2°F, 3 degrees above the 72°F setpoint. This is due to high occupancy (8 people) and VAV-F3-EXEC operating at reduced airflow (285 CFM vs 450 CFM capacity). AHU-2 is supplying adequate cool air at 55°F. Recommend checking VAV damper position—it may be partially stuck." User: "What's our current energy usage?" AI: "Total building power consumption is 47.3 kW. Breakdown: Chiller (24.1 kW, 51%), AHU fans (12.8 kW, 27%), Pumps (7.2 kW, 15%), Other (3.2 kW, 7%). Chiller is operating at 62% load with a COP of 4.8, which is efficient for current conditions." User: "Activate demand response level 2" AI: "I cannot directly control building systems for safety reasons. To activate DR Level 2, use the Control Panel and click 'Demand Response' → 'Level 2'. This will raise cooling setpoints by 3°F and reduce auxiliary loads, targeting 15% energy reduction." The AI provides grounded, specific answers citing actual equipment IDs and metrics. It refuses to directly execute control commands, instead guiding operators to explicit control interfaces, a critical safety pattern for building systems. Fault Injection for Testing and Training Real building operations experience equipment failures, stuck dampers, sensor drift, communication losses. The digital twin includes comprehensive fault injection capabilities to train operators and test control logic: // backend/src/simulator/fault-injector.js const FAULT_CATALOG = { chillerFailure: { description: 'Chiller compressor failure', apply: (twin) => { twin.chiller.status = 'FAULT'; twin.chiller.cooling_output = 0; twin.alerts.push({ id: 'chiller-fault', severity: 'CRITICAL', message: 'Chiller compressor failure - no cooling available', equipment: 'CHILLER-01' }); } }, stuckVAVDamper: { description: 'VAV damper stuck at current position', apply: (twin, vavId) => { const vav = twin.vavs.find(v => v.id === vavId); vav.damper_stuck = true; vav.damper_position_fixed = vav.damper_position; twin.alerts.push({ id: `vav-stuck-${vavId}`, severity: 'HIGH', message: `VAV ${vavId} damper stuck at ${vav.damper_position}%`, equipment: vavId }); } }, sensorDrift: { description: 'Temperature sensor reading 5°F high', apply: (twin, zoneId) => { const zone = twin.zones.find(z => z.id === zoneId); zone.sensor_drift = 5.0; zone.temperature_measured = zone.temperature_actual + 5.0; } }, communicationLoss: { description: 'Equipment communication timeout', apply: (twin, equipmentId) => { const equipment = findEquipmentById(twin, equipmentId); equipment.comm_status = 'OFFLINE'; equipment.stale_data = true; twin.alerts.push({ id: `comm-loss-${equipmentId}`, severity: 'MEDIUM', message: `Lost communication with ${equipmentId}`, equipment: equipmentId }); } } }; router.post('/api/twin/fault', async (req, res) => { const { faultType, targetEquipment } = req.body; const twin = await loadTwinState(); const fault = FAULT_CATALOG[faultType]; if (!fault) { return res.status(400).json({ error: 'Unknown fault type' }); } fault.apply(twin, targetEquipment); await saveTwinState(twin); res.json({ message: `Applied fault: ${fault.description}`, affectedEquipment: targetEquipment, timestamp: new Date().toISOString() }); }); Operators can inject faults to practice diagnosis and response. Training scenarios might include: "The chiller just failed during a heat wave, how do you maintain comfort?" or "Multiple VAV dampers are stuck, which zones need immediate attention?" Key Takeaways and Production Deployment Building a physics-based digital twin with AI capabilities requires balancing simulation accuracy with computational performance, providing intuitive visualization while maintaining technical depth, and enabling AI assistance without compromising safety. Key architectural lessons: Physics models enable prediction: Comparing predicted vs observed behavior identifies anomalies that simple thresholds miss 3D visualization improves spatial understanding: Operators immediately see which floors or zones need attention AI copilots accelerate diagnosis: Natural language queries get answers in seconds vs. minutes of manual data examination Fault injection validates readiness: Testing failure scenarios prepares operators for real incidents JSON state enables integration: Simple file-based state makes connecting to real BMS systems straightforward For production deployment, connect the twin to actual building systems via BACnet, Modbus, or MQTT integrations. Replace simulated telemetry with real sensor streams. Calibrate model parameters against historical building performance. Implement continuous learning where the twin's predictions improve as it observes actual building behavior. The complete implementation with simulation engine, 3D visualization, AI copilot, and fault injection system is available at github.com/leestott/DigitalTwin. Clone the repository and run the startup scripts to explore the digital twin, no building hardware required. Resources and Further Reading Smart Building HVAC Digital Twin Repository - Complete source code and simulation engine Setup and Quick Start Guide - Installation instructions and usage examples Microsoft Foundry Local Documentation - AI integration reference HVAC Simulation Documentation - Physics model details and calibration Three.js Documentation - 3D visualization framework ASHRAE Standards - Building energy modeling standardsOn‑Device AI with Windows AI Foundry and Foundry Local
From “waiting” to “instant”- without sending data away AI is everywhere, but speed, privacy, and reliability are critical. Users expect instant answers without compromise. On-device AI makes that possible: fast, private and available, even when the network isn’t - empowering apps to deliver seamless experiences. Imagine an intelligent assistant that works in seconds, without sending a text to the cloud. This approach brings speed and data control to the places that need it most; while still letting you tap into cloud power when it makes sense. Windows AI Foundry: A Local Home for Models Windows AI Foundry is a developer toolkit that makes it simple to run AI models directly on Windows devices. It uses ONNX Runtime under the hood and can leverage CPU, GPU (via DirectML), or NPU acceleration, without requiring you to manage those details. The principle is straightforward: Keep the model and the data on the same device. Inference becomes faster, and data stays local by default unless you explicitly choose to use the cloud. Foundry Local Foundry Local is the engine that powers this experience. Think of it as local AI runtime - fast, private, and easy to integrate into an app. Why Adopt On‑Device AI? Faster, more responsive apps: Local inference often reduces perceived latency and improves user experience. Privacy‑first by design: Keep sensitive data on the device; avoid cloud round trips unless the user opts in. Offline capability: An app can provide AI features even without a network connection. Cost control: Reduce cloud compute and data costs for common, high‑volume tasks. This approach is especially useful in regulated industries, field‑work tools, and any app where users expect quick, on‑device responses. Hybrid Pattern for Real Apps On-device AI doesn’t replace the cloud, it complements it. Here’s how: Standalone On‑Device: Quick, private actions like document summarization, local search, and offline assistants. Cloud‑Enhanced (Optional): Large-context models, up-to-date knowledge, or heavy multimodal workloads. Design an app to keep data local by default and surface cloud options transparently with user consent and clear disclosures. Windows AI Foundry supports hybrid workflows: Use Foundry Local for real-time inference. Sync with Azure AI services for model updates, telemetry, and advanced analytics. Implement fallback strategies for resource-intensive scenarios. Application Workflow Code Example using Foundry Local: 1. Only On-Device: Tries Foundry Local first, falls back to ONNX if foundry_runtime.check_foundry_available(): # Use on-device Foundry Local models try: answer = foundry_runtime.run_inference(question, context) return answer, source="Foundry Local (On-Device)" except Exception as e: logger.warning(f"Foundry failed: {e}, trying ONNX...") if onnx_model.is_loaded(): # Fallback to local BERT ONNX model try: answer = bert_model.get_answer(question, context) return answer, source="BERT ONNX (On-Device)" except Exception as e: logger.warning(f"ONNX failed: {e}") return "Error: No local AI available" 2. Hybrid approach: On-device first, cloud as last resort def get_answer(question, context): """ Priority order: 1. Foundry Local (best: advanced + private) 2. ONNX Runtime (good: fast + private) 3. Cloud API (fallback: requires internet, less private) # in case of Hybrid approach, based on real-time scenario """ if foundry_runtime.check_foundry_available(): # Use on-device Foundry Local models try: answer = foundry_runtime.run_inference(question, context) return answer, source="Foundry Local (On-Device)" except Exception as e: logger.warning(f"Foundry failed: {e}, trying ONNX...") if onnx_model.is_loaded(): # Fallback to local BERT ONNX model try: answer = bert_model.get_answer(question, context) return answer, source="BERT ONNX (On-Device)" except Exception as e: logger.warning(f"ONNX failed: {e}, trying cloud...") # Last resort: Cloud API (requires internet) if network_available(): try: import requests response = requests.post( '{BASE_URL_AI_CHAT_COMPLETION}', headers={'Authorization': f'Bearer {API_KEY}'}, json={ 'model': '{MODEL_NAME}', 'messages': [{ 'role': 'user', 'content': f'Context: {context}\n\nQuestion: {question}' }] }, timeout=10 ) answer = response.json()['choices'][0]['message']['content'] return answer, source="Cloud API (Online)" except Exception as e: return "Error: No AI runtime available", source="Failed" else: return "Error: No internet and no local AI available", source="Offline" Demo Project Output: Foundry Local answering context-based questions offline : The Foundry Local engine ran the Phi-4-mini model offline and retrieved context-based data. : The Foundry Local engine ran the Phi-4-mini model offline and mentioned that there is no answer. Practical Use Cases Privacy-First Reading Assistant: Summarize documents locally without sending text to the cloud. Healthcare Apps: Analyze medical data on-device for compliance. Financial Tools: Risk scoring without exposing sensitive financial data. IoT & Edge Devices: Real-time anomaly detection without network dependency. Conclusion On-device AI isn’t just a trend - it’s a shift toward smarter, faster, and more secure applications. With Windows AI Foundry and Foundry Local, developers can deliver experiences that respect user specific data, reduce latency, and work even when connectivity fails. By combining local inference with optional cloud enhancements, you get the best of both worlds: instant performance and scalable intelligence. Whether you’re creating document summarizers, offline assistants, or compliance-ready solutions, this approach ensures your apps stay responsive, reliable, and user-centric. References Get started with Foundry Local - Foundry Local | Microsoft Learn What is Windows AI Foundry? | Microsoft Learn https://devblogs.microsoft.com/foundry/unlock-instant-on-device-ai-with-foundry-local/I want to show my agent a picture—Can I?
Welcome to Agent Support—a developer advice column for those head-scratching moments when you’re building an AI agent! Each post answers a question inspired by real conversations in the AI developer community, offering practical advice and tips. To kick things off, we’re tackling a common challenge for anyone experimenting with multimodal agents: working with image input. Let’s dive in! Dear Agent Support, I’m building an AI agent, and I’d like to include screenshots or product photos as part of the input. But I’m not sure if that’s even possible, or if I need to use a different kind of model altogether. Can I actually upload an image and have the agent process it? Great question, and one that trips up a lot of people early on! The short answer is: yes, some models can process images—but not all of them. Let’s break that down a bit. 🧠 Understanding Image Input When we talk about image input or image attachments, we’re talking about the ability to send a non-text file (like a .png, .jpg, or screenshot) into your prompt and have the model analyze or interpret it. That could mean describing what’s in the image, extracting text from it, answering questions about a chart, or giving feedback on a design layout. 🚫 Not All Models Support Image Input That said, this isn’t something every model can do. Most base language models are trained on text data only, they’re not designed to interpret non-text inputs like images. In most tools and interfaces, the option to upload an image only appears if the selected model supports it, since platforms typically hide or disable features that aren't compatible with a model's capabilities. So, if your current chat interface doesn’t mention anything about vision or image input, it’s likely because the model itself isn’t equipped to handle it. That’s where multimodal models come in. These are models that have been trained (or extended) to understand both text and images, and sometimes other data types too. Think of them as being fluent in more than one language, except in this case, one of those “languages” is visual. 🔎 How to Find Image-Supporting Models If you’re trying to figure out which models support images, the AI Toolkit is a great place to start! The extension includes a built-in Model Catalog where you can filter models by Feature—like Image Attachment—so you can skip the guesswork. Here’s how to do it: Open the Model Catalog from the AI Toolkit panel in Visual Studio Code. Click the Feature filter near the search bar. Select Image Attachment. Browse the filtered results to see which models can accept visual input. Once you've got your filtered list, you can check out the model details or try one in the Playground to test how it handles image-based prompts. 🧪 Test Before You Build Before you plug a model into your agent and start wiring things together, it’s a good idea to test how the model handles image input on its own. This gives you a quick feel for the model’s behavior and helps you catch any limitations before you're deep into building. You can do this in the Playground, where you can upload an image and pair it with a simple prompt like: “Describe the contents of this image.” OR “Summarize what’s happening in this screenshot.” If the model supports image input, you’ll be able to attach a file and get a response based on its visual analysis. If you don’t see the option to upload an image, double-check that the model you’ve selected has image capabilities—this is usually a model issue, not a UI bug. 🔁 Recap Here’s a quick rundown of what we covered: Not all models support image input—you’ll need a multimodal model specifically built to handle visual data. Most platforms won’t let you upload an image unless the model supports it, so if you don’t see that option, it’s probably a model limitation. You can use the AI Toolkit’s Model Catalog to filter models by capability—just check the box for Image Attachment. Test the model in the Playground before integrating it into your agent to make sure it behaves the way you expect. 📺 Want to Go Deeper? Check out my latest video on how to choose the right model for your agent—it’s part of the Build an Agent Series, where I walk through the building blocks of turning an idea into a working AI agent. And if you’re looking to sharpen your model instincts, don’t miss Model Mondays—a weekly series that helps developers like you build your Model IQ, one spotlight at a time. Whether you’re just starting out or already building AI-powered apps, it’s a great way to stay current and confident in your model choices. 👉 Explore the series and catch the next episode: aka.ms/model-mondays/rsvp If you're just getting started with building agents, check out our Agents for Beginners curriculum. And for all your general AI and AI agent questions, join us in the Azure AI Foundry Discord! You can find me hanging out there answering your questions about the AI Toolkit. I'm looking forward to chatting with you there! Whatever you're building, the right model is out there—and with the right tools, you'll know exactly how to find it.If You're Building AI on Azure, ECS 2026 is Where You Need to Be
Let me be direct: there's a lot of noise in the conference calendar. Generic cloud events. Vendor showcases dressed up as technical content. Sessions that look great on paper but leave you with nothing you can actually ship on Monday. ECS 2026 isn't that. As someone who will be on stage at Cologne this May, I can tell you the European Collaboration Summit combined with the European AI & Cloud Summit and European Biz Apps Summit is one of the few events I've seen where engineers leave with real, production-applicable knowledge. Three days. Three summits. 3,000+ attendees. One of the largest Microsoft-focused events in Europe, and it keeps getting better. If you're building AI systems on Azure, designing cloud-native architectures, or trying to figure out how to take your AI experiments to production — this is where the conversation is happening. What ECS 2026 Actually Is ECS 2026 runs May 5–7 at Confex in Cologne, Germany. It brings together three co-located summits under one roof: European Collaboration Summit — Microsoft 365, Teams, Copilot, and governance European AI & Cloud Summit — Azure architecture, AI agents, cloud security, responsible AI European BizApps Summit — Power Platform, Microsoft Fabric, Dynamics For Azure engineers and AI developers, the European AI & Cloud Summit is your primary destination. But don't ignore the overlap, some of the most interesting AI conversations happen at the intersection of collaboration tooling and cloud infrastructure. The scale matters here: 3,000+ attendees, 100+ sessions, multiple deep-dive tracks, and a speaker lineup that includes Microsoft executives, Regional Directors, and MVPs who have built, broken, and rebuilt production systems. The Azure + AI Track - What's Actually On the Agenda The AI & Cloud Summit agenda is built around real technical depth. Not "intro to AI" content, actual architecture decisions, patterns that work, and lessons from things that didn't. Here's what you can expect: AI Agents and Agentic Systems This is where the energy is right now, and ECS is leaning in. Expect sessions covering how to design agent workflows, chain reasoning steps, handle memory and state, and integrate with Azure AI services. Marco Casalaina, VP of Products for Azure AI at Microsoft, is speaking if you want to understand the direction of the Azure AI platform from the people building it, this is a direct line. Azure Architecture at Scale Cloud-native patterns, microservices, containers, and the architectural decisions that determine whether your system holds up under real load. These sessions go beyond theory you'll hear from engineers who've shipped these designs at enterprise scale. Observability, DevOps, and Production AI Getting AI to production is harder than the demos suggest. Sessions here cover monitoring AI systems, integrating LLMs into CI/CD pipelines, and building the operational practices that keep AI in production reliable and governable. Cloud Security and Compliance Security isn't optional when you're putting AI in front of users or connecting it to enterprise data. Tracks cover identity, access patterns, responsible AI governance, and how to design systems that satisfy compliance requirements without becoming unmaintainable. Pre-Conference Deep Dives One underrated part of ECS: the pre-conference workshops. These are extended, hands-on sessions typically 3–6 hours that let you go deep on a single topic with an expert. Think of them as intensive short courses where you can actually work through the material, not just watch slides. If you're newer to a particular area of Azure AI, or you want to build fluency in a specific pattern before the main conference sessions, these are worth the early travel. The Speaker Quality Is Different Here The ECS speaker roster includes Microsoft executives, Microsoft MVPs, and Regional Directors, people who have real accountability for the products and patterns they're presenting. You'll hear from over 20 Microsoft speakers: Marco Casalaina — VP of Products, Azure AI at Microsoft Adam Harmetz — VP of Product at Microsoft, Enterprise Agent And dozens of MVPs and Regional Directors who are in the field every day, solving the same problems you are. These aren't keynote-only speakers — they're in the session rooms, at the hallway track, available for real conversations. The Hallway Track Is Not a Cliché I know "networking" sounds like a corporate afterthought. At ECS it genuinely isn't. When you put 3,000 practitioners, engineers, architects, DevOps leads, security specialists in one venue for three days, the conversations between sessions are often more valuable than the sessions themselves. You get candid answers to "how are you actually handling X in production?" that you won't find in documentation. The European Microsoft community is tight-knit and collaborative. ECS is where that community concentrates. Why This Matters Right Now We're in a period where AI development is moving fast but the engineering discipline around it is still maturing. Most teams are figuring out: How to move from AI prototype to production system How to instrument and observe AI behaviour reliably How to design agent systems that don't become unmaintainable How to satisfy security and compliance requirements in AI-integrated architectures ECS 2026 is one of the few places where you can get direct answers to these questions from people who've solved them — not theoretically, but in production, on Azure, in the last 12 months. If you go, you'll come back with practical patterns you can apply immediately. That's the bar I hold events to. ECS consistently clears it. Register and Explore the Agenda Register for ECS 2026: ecs.events Explore the AI & Cloud Summit agenda: cloudsummit.eu/en/agenda Dates: May 5–7, 2026 | Location: Confex, Cologne, Germany Early registration is worth it the pre-conference workshops fill up. And if you're coming, find me, I'll be the one talking too much about AI agents and Azure deployments. See you in Cologne.Vectorless Reasoning-Based RAG: A New Approach to Retrieval-Augmented Generation
Introduction Retrieval-Augmented Generation (RAG) has become a widely adopted architecture for building AI applications that combine Large Language Models (LLMs) with external knowledge sources. Traditional RAG pipelines rely heavily on vector embeddings and similarity search to retrieve relevant documents. While this works well for many scenarios, it introduces challenges such as: Requires chunking documents into small segments Important context can be split across chunks Embedding generation and vector databases add infrastructure complexity A new paradigm called Vectorless Reasoning-Based RAG is emerging to address these challenges. One framework enabling this approach is PageIndex, an open-source document indexing system that organizes documents into a hierarchical tree structure and allows Large Language Models (LLMs) to perform reasoning-based retrieval over that structure. Vectorless Reasoning-Based RAG Instead of vectors, this approach uses structured document navigation. User Query ->Document Tree Structure ->LLM Reasoning ->Relevant Nodes Retrieved ->LLM Generates Answer This mimics how humans read documents: Look at the table of contents Identify relevant sections Read the relevant content Answer the question Core features No Vector Database: It relies on document structure and LLM reasoning for retrieval. It does not depend on vector similarity search. No Chunking: Documents are not split into artificial chunks. Instead, they are organized using their natural structure, such as pages and sections. Human-like Retrieval: The system mimics how human experts read documents. It navigates through sections and extracts information from relevant parts. Better Explainability and Traceability: Retrieval is based on reasoning. The results can be traced back to specific pages and sections. This makes the process easier to interpret. It avoids opaque and approximate vector search, often called “vibe retrieval.” When to Use Vectorless RAG Vectorless RAG works best when: Data is structured or semi-structured Documents have clear metadata Knowledge sources are well organized Queries require reasoning rather than semantic similarity Examples: enterprise knowledge bases internal documentation systems compliance and policy search healthcare documentation financial reporting Implementing Vectorless RAG with Azure AI Foundry Step 1 : Install Pageindex using pip command, from pageindex import PageIndexClient import pageindex.utils as utils # Get your PageIndex API key from https://dash.pageindex.ai/api-keys PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY" pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY) Step 2 : Set up your LLM Example using Azure OpenAI: from openai import AsyncAzureOpenAI client = AsyncAzureOpenAI( api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version=AZURE_OPENAI_API_VERSION ) async def call_llm(prompt, temperature=0): response = await client.chat.completions.create( model=AZURE_DEPLOYMENT_NAME, messages=[{"role": "user", "content": prompt}], temperature=temperature ) return response.choices[0].message.content.strip() Step 3: Page Tree Generation import os, requests pdf_url = "https://arxiv.org/pdf/2501.12948.pdf" //give the pdf url for tree generation, here given one for example pdf_path = os.path.join("../data", pdf_url.split('/')[-1]) os.makedirs(os.path.dirname(pdf_path), exist_ok=True) response = requests.get(pdf_url) with open(pdf_path, "wb") as f: f.write(response.content) print(f"Downloaded {pdf_url}") doc_id = pi_client.submit_document(pdf_path)["doc_id"] print('Document Submitted:', doc_id) Step 4 : Print the generated pageindex tree structure if pi_client.is_retrieval_ready(doc_id): tree = pi_client.get_tree(doc_id, node_summary=True)['result'] print('Simplified Tree Structure of the Document:') utils.print_tree(tree) else: print("Processing document, please try again later...") Step 5 : Use LLM for tree search and identify nodes that might contain relevant context import json query = "What are the conclusions in this document?" tree_without_text = utils.remove_fields(tree.copy(), fields=['text']) search_prompt = f""" You are given a question and a tree structure of a document. Each node contains a node id, node title, and a corresponding summary. Your task is to find all nodes that are likely to contain the answer to the question. Question: {query} Document tree structure: {json.dumps(tree_without_text, indent=2)} Please reply in the following JSON format: {{ "thinking": "<Your thinking process on which nodes are relevant to the question>", "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"] }} Directly return the final JSON structure. Do not output anything else. """ tree_search_result = await call_llm(search_prompt) Step 6 : Print retrieved nodes and reasoning process node_map = utils.create_node_mapping(tree) tree_search_result_json = json.loads(tree_search_result) print('Reasoning Process:') utils.print_wrapped(tree_search_result_json['thinking']) print('\nRetrieved Nodes:') for node_id in tree_search_result_json["node_list"]: node = node_map[node_id] print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}") Step 7: Answer generation node_list = json.loads(tree_search_result)["node_list"] relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list) print('Retrieved Context:\n') utils.print_wrapped(relevant_content[:1000] + '...') answer_prompt = f""" Answer the question based on the context: Question: {query} Context: {relevant_content} Provide a clear, concise answer based only on the context provided. """ print('Generated Answer:\n') answer = await call_llm(answer_prompt) utils.print_wrapped(answer) When to Use Each Approach Both vector-based RAG and vectorless RAG have their strengths. Choosing the right approach depends on the nature of the documents and the type of retrieval required. When to Use Vector Database–Based RAG Vector-based retrieval works best when dealing with large collections of unrelated or loosely structured documents. In such cases, semantic similarity is often sufficient to identify relevant information quickly. Use vector RAG when: Searching across many independent documents Semantic similarity is sufficient to locate relevant content Real-time retrieval is required over very large datasets Common use cases include: Customer support knowledge bases Conversational chatbots Product and content search systems When to Use Vectorless RAG Vectorless approaches such as PageIndex are better suited for long, structured documents where understanding the logical organization of the content is important. Use vectorless RAG when: Documents contain clear hierarchical structure Logical reasoning across sections is required High retrieval accuracy is critical Typical examples include: Financial filings and regulatory reports Legal documents and contracts Technical manuals and documentation Academic and research papers In these scenarios, navigating the document structure allows the system to identify the exact section that logically contains the answer, rather than relying only on semantic similarity. Conclusion Vector databases significantly advanced RAG architectures by enabling scalable semantic search across large datasets. However, they are not the optimal solution for every type of document. Vectorless approaches such as PageIndex introduce a different philosophy: instead of retrieving text that is merely semantically similar, they retrieve text that is logically relevant by reasoning over the structure of the document. As RAG architectures continue to evolve, the future will likely combine the strengths of both approaches. Hybrid systems that integrate vector search for broad retrieval and reasoning-based navigation for precision may offer the best balance of scalability and accuracy for enterprise AI applications.9.3KViews2likes0CommentsBuild a Fully Offline AI App with Foundry Local and CAG
A hands-on guide to building an on-device AI support agent using Context-Augmented Generation, JavaScript, and Foundry Local. You have probably heard the AI pitch: "just call our API." But what happens when your application needs to work without an internet connection? Perhaps your users are field engineers standing next to a pipeline in the middle of nowhere, or your organisation has strict data privacy requirements, or you simply want to build something that works without a cloud bill. This post walks you through how to build a fully offline, on-device AI application using Foundry Local and a pattern called Context-Augmented Generation (CAG). By the end, you will have a clear understanding of what CAG is, how it compares to RAG, and the practical steps to build your own solution. The finished application: a browser-based AI support agent that runs entirely on your machine. What Is Context-Augmented Generation? Context-Augmented Generation (CAG) is a pattern for making AI models useful with your own domain-specific content. Instead of hoping the model "knows" the answer from its training data, you pre-load your entire knowledge base into the model's context window at startup. Every query the model handles has access to all of your documents, all of the time. The flow is straightforward: Load your documents into memory when the application starts. Inject the most relevant documents into the prompt alongside the user's question. Generate a response grounded in your content. There is no retrieval pipeline, no vector database, and no embedding model. Your documents are read from disc, held in memory, and selected per query using simple keyword scoring. The model generates answers grounded in your content rather than relying on what it learnt during training. CAG vs RAG: Understanding the Trade-offs If you have explored AI application patterns before, you have likely encountered Retrieval-Augmented Generation (RAG). Both CAG and RAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations. CAG (Context-Augmented Generation) How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt. Strengths: Drastically simpler architecture with no vector database, no embeddings, and no retrieval pipeline Works fully offline with no external services Minimal dependencies (just two npm packages in this sample) Near-instant document selection with no embedding latency Easy to set up, debug, and reason about Limitations: Constrained by the model's context window size Best suited to small, curated document sets (tens of documents, not thousands) Keyword scoring is less precise than semantic similarity for ambiguous queries Adding documents requires an application restart RAG (Retrieval-Augmented Generation) How it works: Documents are chunked, embedded into vectors, and stored in a database. At query time, the most semantically similar chunks are retrieved and injected into the prompt. Strengths: Scales to thousands or millions of documents Semantic search finds relevant content even when the user's wording differs from the source material Documents can be added or updated dynamically without restarting Fine-grained retrieval (chunk-level) can be more token-efficient for large collections Limitations: More complex architecture: requires an embedding model, a vector database, and a chunking strategy Retrieval quality depends heavily on chunking, embedding model choice, and tuning Additional latency from the embedding and search steps More dependencies and infrastructure to manage Want to compare these patterns hands-on? There is a RAG-based implementation of the same gas field scenario using vector search and embeddings. Clone both repositories, run them side by side, and see how the architectures differ in practice. When Should You Choose Which? Consideration Choose CAG Choose RAG Document count Tens of documents Hundreds or thousands Offline requirement Essential Optional (can run locally too) Setup complexity Minimal Moderate to high Document updates Infrequent (restart to reload) Frequent or dynamic Query precision Good for keyword-matchable content Better for semantically diverse queries Infrastructure None beyond the runtime Vector database, embedding model For the sample application in this post (20 gas engineering procedure documents on a local machine), CAG is the clear winner. If your use case grows to hundreds of documents or requires real-time ingestion, RAG becomes the better choice. Both patterns can run offline using Foundry Local. Foundry Local: Your On-Device AI Runtime Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download). In this sample, your application is responsible for deciding which model to use, and it does that through the foundry-local-sdk . The app creates a FoundryLocalManager , asks the SDK for the local model catalogue, and then runs a small selection policy from src/modelSelector.js . That policy looks at the machine's available RAM, filters out models that are too large, ranks the remaining chat models by preference, and then returns the best fit for that device. Why does it work this way? Because shipping one fixed model would either exclude lower-spec machines or underuse more capable ones. A 14B model may be perfectly reasonable on a 32 GB workstation, but the same choice would be slow or unusable on an 8 GB laptop. By selecting at runtime, the same codebase can run across a wider range of developer machines without manual tuning. What makes it particularly useful for developers: No GPU required — runs on CPU or NPU, making it accessible on standard laptops and desktops Native SDK bindings — in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server Automatic model management — downloads, caches, and loads models automatically Dynamic model selection — the SDK can evaluate your device's available RAM and pick the best model from the catalogue Real-time progress callbacks — ideal for building loading UIs that show download and initialisation progress The integration code is refreshingly minimal. Here is the core pattern: import { FoundryLocalManager } from "foundry-local-sdk"; // Create a manager and get the model catalogue const manager = FoundryLocalManager.create({ appName: "my-app" }); // Auto-select the best model for this device based on available RAM const models = await manager.catalog.getModels(); const model = selectBestModel(models); // Download if not cached, then load into memory if (!model.isCached) { await model.download((progress) => { console.log(`Download: ${progress.toFixed(0)}%`); }); } await model.load(); // Create a chat client for direct in-process inference const chatClient = model.createChatClient(); const response = await chatClient.completeChat([ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ]); That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application. The download step matters for a simple reason: offline inference only works once the model files exist locally. The SDK checks whether the chosen model is already cached on the machine. If it is not, the application asks Foundry Local to download it once, store it locally, and then load it into memory. After that first run, the cached model can be reused, which is why subsequent launches are much faster and can operate without any network dependency. Put another way, there are two cooperating pieces here. Your application chooses which model is appropriate for the device and the scenario. Foundry Local and its SDK handle the mechanics of making that model available locally, caching it, loading it, and exposing a chat client for inference. That separation keeps the application logic clear whilst letting the runtime handle the heavy lifting. The Technology Stack The sample application is deliberately simple. No frameworks, no build steps, no Docker: Layer Technology Purpose AI Model Foundry Local + auto-selected model Runs locally via native SDK bindings; best model chosen for your device Back end Node.js + Express Lightweight HTTP server, everyone knows it Context Markdown files pre-loaded at startup No vector database, no embeddings, no retrieval step Front end Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is two npm packages: express and foundry-local-sdk . Architecture Overview The four-layer architecture, all running on a single machine. The system has four layers, all running in a single process on your device: Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface Server layer: Express.js starts immediately and serves the UI plus an SSE status endpoint; API routes handle chat (streaming and non-streaming), context listing, and health checks CAG engine: loads all domain documents at startup, selects the most relevant ones per query using keyword scoring, and injects them into the prompt AI layer: Foundry Local runs the auto-selected model on CPU/NPU via native SDK bindings (in-process inference, no HTTP round-trips) Building the Solution Step by Step Prerequisites You need two things installed on your machine: Node.js 20 or later: download from nodejs.org Foundry Local: Microsoft's on-device AI runtime: winget install Microsoft.FoundryLocal Foundry Local will automatically select and download the best model for your device the first time you run the application. You can override this by setting the FOUNDRY_MODEL environment variable to a specific model alias. Getting the Code Running # Clone the repository git clone https://github.com/leestott/local-cag.git cd local-cag # Install dependencies npm install # Start the server npm start Open http://127.0.0.1:3000 in your browser. You will see a loading overlay with a progress bar whilst the model downloads (first run only) and loads into memory. Once the model is ready, the overlay fades away and you can start chatting. Desktop view Mobile view How the CAG Pipeline Works Let us trace what happens when a user asks: "How do I detect a gas leak?" The query flow from browser to model and back. 1 Server starts and loads documents When you run npm start , the Express server starts on port 3000. All .md files in the docs/ folder are read, parsed (with optional YAML front-matter for title, category, and ID), and grouped by category. A document index is built listing all available topics. 2 Model is selected and loaded The model selector evaluates your system's available RAM and picks the best model from the Foundry Local catalogue. If the model is not already cached, it downloads it (with progress streamed to the browser via SSE). The model is then loaded into memory for in-process inference. 3 User sends a question The question arrives at the Express server. The chat engine selects the top 3 most relevant documents using keyword scoring. 4 Prompt is constructed The engine builds a messages array containing: the system prompt (with safety-first instructions), the document index (so the model knows all available topics), the 3 selected documents (approximately 6,000 characters), the conversation history, and the user's question. 5 Model generates a grounded response The prompt is sent to the locally loaded model via the Foundry Local SDK's native bindings. The response streams back token by token through Server-Sent Events to the browser. A response with safety warnings and step-by-step guidance The sources panel shows which documents were used Key Code Walkthrough Loading Documents (the Context Module) The context module reads all markdown files from the docs/ folder at startup. Each document can have optional YAML front-matter for metadata: // src/context.js export function loadDocuments() { const files = fs.readdirSync(config.docsDir) .filter(f => f.endsWith(".md")) .sort(); const docs = []; for (const file of files) { const raw = fs.readFileSync(path.join(config.docsDir, file), "utf-8"); const { meta, body } = parseFrontMatter(raw); docs.push({ id: meta.id || path.basename(file, ".md"), title: meta.title || file, category: meta.category || "General", content: body.trim(), }); } return docs; } There is no chunking, no vector computation, and no database. The documents are held in memory as plain text. Dynamic Model Selection Rather than hard-coding a model, the application evaluates your system at runtime: // src/modelSelector.js const totalRamMb = os.totalmem() / (1024 * 1024); const budgetMb = totalRamMb * 0.6; // Use up to 60% of system RAM // Filter to models that fit, rank by quality, boost cached models const candidates = allModels.filter(m => m.task === "chat-completion" && m.fileSizeMb <= budgetMb ); // Returns the best model: e.g. phi-4 on a 32 GB machine, // or phi-3.5-mini on a laptop with 8 GB RAM This means the same application runs on a powerful workstation (selecting a 14B parameter model) or a constrained laptop (selecting a 3.8B model), with no code changes required. This is worth calling out because it is one of the most practical parts of the sample. Developers do not have to decide up front which single model every user should run. The application makes that decision at startup based on the hardware budget you set, then asks Foundry Local to fetch the model if it is missing. The result is a smoother first-run experience and fewer support headaches when the same app is used on mixed hardware. The System Prompt For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses: // src/prompts.js export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers. Behaviour Rules: - Always prioritise safety. If a procedure involves risk, explicitly call it out. - Do not hallucinate procedures, measurements, or tolerances. - If the answer is not in the provided context, say: "This information is not available in the local knowledge base." Response Format: - Summary (1-2 lines) - Safety Warnings (if applicable) - Step-by-step Guidance - Reference (document name + section)`; This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling. Adapting This for Your Own Domain The sample project is designed to be forked and adapted. Here is how to make it yours in three steps: 1. Replace the documents Delete the gas engineering documents in docs/ and add your own markdown files. The context module handles any markdown content with optional YAML front-matter: --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the system prompt Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations. 3. Override the model (optional) By default the application auto-selects the best model. To force a specific model: # See available models foundry model list # Force a smaller, faster model FOUNDRY_MODEL=phi-3.5-mini npm start # Or a larger, higher-quality model FOUNDRY_MODEL=phi-4 npm start Smaller models give faster responses on constrained devices. Larger models give better quality. The auto-selector picks the largest model that fits within 60% of your system RAM. Building a Field-Ready UI The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy. Design decisions that matter for field use: Dark, high-contrast theme with 18px base font size for readability in bright sunlight Large touch targets (minimum 48px) for operation with gloves or PPE Quick-action buttons for common questions, so engineers do not need to type on a phone Responsive layout that works from 320px to 1920px+ screen widths Streaming responses via SSE, so the user sees tokens arriving in real time The mobile chat experience, optimised for field use. Visual Startup Progress with SSE A standout feature of this application is the loading experience. When the user opens the browser, they see a progress overlay showing exactly what the application is doing: Loading domain documents Initialising the Foundry Local SDK Selecting the best model for the device Downloading the model (with a percentage progress bar, first run only) Loading the model into memory This works because the Express server starts before the model finishes loading. The browser connects immediately and receives real-time status updates via Server-Sent Events. Chat endpoints return 503 whilst the model is loading, so the UI cannot send queries prematurely. // Server-side: broadcast status to all connected browsers function broadcastStatus(state) { initState = state; const payload = `data: ${JSON.stringify(state)}\n\n`; for (const client of statusClients) { client.write(payload); } } // During initialisation: broadcastStatus({ stage: "downloading", message: "Downloading phi-4...", progress: 42 }); This pattern is worth adopting in any application where model loading takes more than a few seconds. Users should never stare at a blank screen wondering whether something is broken. Testing The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed: # Run all tests npm test Tests cover configuration, server endpoints, and document loading. Use them as a starting point when you adapt the project for your own domain. Ideas for Extending the Project Once you have the basics running, there are plenty of directions to explore: Conversation memory: persist chat history across sessions using local storage or a lightweight database Hybrid CAG + RAG: add a vector retrieval step for larger document collections that exceed the context window Multi-modal support: add image-based queries (photographing a fault code, for example) PWA packaging: make it installable as a standalone offline application on mobile devices Custom model fine-tuning: fine-tune a model on your domain data for even better answers Ready to Build Your Own? Clone the CAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the RAG approach to see which pattern suits your use case best. Get the CAG Sample Get the RAG Sample Summary Building a local AI application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and a set of domain documents, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own content. The key takeaways: CAG is ideal for small, curated document sets where simplicity and offline capability matter most. No vector database, no embeddings, no retrieval pipeline. RAG scales further when you have hundreds or thousands of documents, or need semantic search for ambiguous queries. See the local-rag sample to compare. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required. The architecture is transferable. Replace the gas engineering documents with your own content, update the system prompt, and you have a domain-specific AI agent for any field. Start simple, iterate outwards. Begin with CAG and a handful of documents. If your needs outgrow the context window, graduate to RAG. Both patterns can run entirely offline. Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code. This project is open source under the MIT licence. It is a scenario sample for learning and experimentation, not production medical or safety advice. local-cag on GitHub · local-rag on GitHub · Foundry Local