learning
746 TopicsBuild a Fully Offline AI App with Foundry Local and CAG
A hands-on guide to building an on-device AI support agent using Context-Augmented Generation, JavaScript, and Foundry Local. You have probably heard the AI pitch: "just call our API." But what happens when your application needs to work without an internet connection? Perhaps your users are field engineers standing next to a pipeline in the middle of nowhere, or your organisation has strict data privacy requirements, or you simply want to build something that works without a cloud bill. This post walks you through how to build a fully offline, on-device AI application using Foundry Local and a pattern called Context-Augmented Generation (CAG). By the end, you will have a clear understanding of what CAG is, how it compares to RAG, and the practical steps to build your own solution. The finished application: a browser-based AI support agent that runs entirely on your machine. What Is Context-Augmented Generation? Context-Augmented Generation (CAG) is a pattern for making AI models useful with your own domain-specific content. Instead of hoping the model "knows" the answer from its training data, you pre-load your entire knowledge base into the model's context window at startup. Every query the model handles has access to all of your documents, all of the time. The flow is straightforward: Load your documents into memory when the application starts. Inject the most relevant documents into the prompt alongside the user's question. Generate a response grounded in your content. There is no retrieval pipeline, no vector database, and no embedding model. Your documents are read from disc, held in memory, and selected per query using simple keyword scoring. The model generates answers grounded in your content rather than relying on what it learnt during training. CAG vs RAG: Understanding the Trade-offs If you have explored AI application patterns before, you have likely encountered Retrieval-Augmented Generation (RAG). Both CAG and RAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations. CAG (Context-Augmented Generation) How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt. Strengths: Drastically simpler architecture with no vector database, no embeddings, and no retrieval pipeline Works fully offline with no external services Minimal dependencies (just two npm packages in this sample) Near-instant document selection with no embedding latency Easy to set up, debug, and reason about Limitations: Constrained by the model's context window size Best suited to small, curated document sets (tens of documents, not thousands) Keyword scoring is less precise than semantic similarity for ambiguous queries Adding documents requires an application restart RAG (Retrieval-Augmented Generation) How it works: Documents are chunked, embedded into vectors, and stored in a database. At query time, the most semantically similar chunks are retrieved and injected into the prompt. Strengths: Scales to thousands or millions of documents Semantic search finds relevant content even when the user's wording differs from the source material Documents can be added or updated dynamically without restarting Fine-grained retrieval (chunk-level) can be more token-efficient for large collections Limitations: More complex architecture: requires an embedding model, a vector database, and a chunking strategy Retrieval quality depends heavily on chunking, embedding model choice, and tuning Additional latency from the embedding and search steps More dependencies and infrastructure to manage Want to compare these patterns hands-on? There is a RAG-based implementation of the same gas field scenario using vector search and embeddings. Clone both repositories, run them side by side, and see how the architectures differ in practice. When Should You Choose Which? Consideration Choose CAG Choose RAG Document count Tens of documents Hundreds or thousands Offline requirement Essential Optional (can run locally too) Setup complexity Minimal Moderate to high Document updates Infrequent (restart to reload) Frequent or dynamic Query precision Good for keyword-matchable content Better for semantically diverse queries Infrastructure None beyond the runtime Vector database, embedding model For the sample application in this post (20 gas engineering procedure documents on a local machine), CAG is the clear winner. If your use case grows to hundreds of documents or requires real-time ingestion, RAG becomes the better choice. Both patterns can run offline using Foundry Local. Foundry Local: Your On-Device AI Runtime Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download). In this sample, your application is responsible for deciding which model to use, and it does that through the foundry-local-sdk . The app creates a FoundryLocalManager , asks the SDK for the local model catalogue, and then runs a small selection policy from src/modelSelector.js . That policy looks at the machine's available RAM, filters out models that are too large, ranks the remaining chat models by preference, and then returns the best fit for that device. Why does it work this way? Because shipping one fixed model would either exclude lower-spec machines or underuse more capable ones. A 14B model may be perfectly reasonable on a 32 GB workstation, but the same choice would be slow or unusable on an 8 GB laptop. By selecting at runtime, the same codebase can run across a wider range of developer machines without manual tuning. What makes it particularly useful for developers: No GPU required — runs on CPU or NPU, making it accessible on standard laptops and desktops Native SDK bindings — in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server Automatic model management — downloads, caches, and loads models automatically Dynamic model selection — the SDK can evaluate your device's available RAM and pick the best model from the catalogue Real-time progress callbacks — ideal for building loading UIs that show download and initialisation progress The integration code is refreshingly minimal. Here is the core pattern: import { FoundryLocalManager } from "foundry-local-sdk"; // Create a manager and get the model catalogue const manager = FoundryLocalManager.create({ appName: "my-app" }); // Auto-select the best model for this device based on available RAM const models = await manager.catalog.getModels(); const model = selectBestModel(models); // Download if not cached, then load into memory if (!model.isCached) { await model.download((progress) => { console.log(`Download: ${progress.toFixed(0)}%`); }); } await model.load(); // Create a chat client for direct in-process inference const chatClient = model.createChatClient(); const response = await chatClient.completeChat([ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ]); That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application. The download step matters for a simple reason: offline inference only works once the model files exist locally. The SDK checks whether the chosen model is already cached on the machine. If it is not, the application asks Foundry Local to download it once, store it locally, and then load it into memory. After that first run, the cached model can be reused, which is why subsequent launches are much faster and can operate without any network dependency. Put another way, there are two cooperating pieces here. Your application chooses which model is appropriate for the device and the scenario. Foundry Local and its SDK handle the mechanics of making that model available locally, caching it, loading it, and exposing a chat client for inference. That separation keeps the application logic clear whilst letting the runtime handle the heavy lifting. The Technology Stack The sample application is deliberately simple. No frameworks, no build steps, no Docker: Layer Technology Purpose AI Model Foundry Local + auto-selected model Runs locally via native SDK bindings; best model chosen for your device Back end Node.js + Express Lightweight HTTP server, everyone knows it Context Markdown files pre-loaded at startup No vector database, no embeddings, no retrieval step Front end Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is two npm packages: express and foundry-local-sdk . Architecture Overview The four-layer architecture, all running on a single machine. The system has four layers, all running in a single process on your device: Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface Server layer: Express.js starts immediately and serves the UI plus an SSE status endpoint; API routes handle chat (streaming and non-streaming), context listing, and health checks CAG engine: loads all domain documents at startup, selects the most relevant ones per query using keyword scoring, and injects them into the prompt AI layer: Foundry Local runs the auto-selected model on CPU/NPU via native SDK bindings (in-process inference, no HTTP round-trips) Building the Solution Step by Step Prerequisites You need two things installed on your machine: Node.js 20 or later: download from nodejs.org Foundry Local: Microsoft's on-device AI runtime: winget install Microsoft.FoundryLocal Foundry Local will automatically select and download the best model for your device the first time you run the application. You can override this by setting the FOUNDRY_MODEL environment variable to a specific model alias. Getting the Code Running # Clone the repository git clone https://github.com/leestott/local-cag.git cd local-cag # Install dependencies npm install # Start the server npm start Open http://127.0.0.1:3000 in your browser. You will see a loading overlay with a progress bar whilst the model downloads (first run only) and loads into memory. Once the model is ready, the overlay fades away and you can start chatting. Desktop view Mobile view How the CAG Pipeline Works Let us trace what happens when a user asks: "How do I detect a gas leak?" The query flow from browser to model and back. 1 Server starts and loads documents When you run npm start , the Express server starts on port 3000. All .md files in the docs/ folder are read, parsed (with optional YAML front-matter for title, category, and ID), and grouped by category. A document index is built listing all available topics. 2 Model is selected and loaded The model selector evaluates your system's available RAM and picks the best model from the Foundry Local catalogue. If the model is not already cached, it downloads it (with progress streamed to the browser via SSE). The model is then loaded into memory for in-process inference. 3 User sends a question The question arrives at the Express server. The chat engine selects the top 3 most relevant documents using keyword scoring. 4 Prompt is constructed The engine builds a messages array containing: the system prompt (with safety-first instructions), the document index (so the model knows all available topics), the 3 selected documents (approximately 6,000 characters), the conversation history, and the user's question. 5 Model generates a grounded response The prompt is sent to the locally loaded model via the Foundry Local SDK's native bindings. The response streams back token by token through Server-Sent Events to the browser. A response with safety warnings and step-by-step guidance The sources panel shows which documents were used Key Code Walkthrough Loading Documents (the Context Module) The context module reads all markdown files from the docs/ folder at startup. Each document can have optional YAML front-matter for metadata: // src/context.js export function loadDocuments() { const files = fs.readdirSync(config.docsDir) .filter(f => f.endsWith(".md")) .sort(); const docs = []; for (const file of files) { const raw = fs.readFileSync(path.join(config.docsDir, file), "utf-8"); const { meta, body } = parseFrontMatter(raw); docs.push({ id: meta.id || path.basename(file, ".md"), title: meta.title || file, category: meta.category || "General", content: body.trim(), }); } return docs; } There is no chunking, no vector computation, and no database. The documents are held in memory as plain text. Dynamic Model Selection Rather than hard-coding a model, the application evaluates your system at runtime: // src/modelSelector.js const totalRamMb = os.totalmem() / (1024 * 1024); const budgetMb = totalRamMb * 0.6; // Use up to 60% of system RAM // Filter to models that fit, rank by quality, boost cached models const candidates = allModels.filter(m => m.task === "chat-completion" && m.fileSizeMb <= budgetMb ); // Returns the best model: e.g. phi-4 on a 32 GB machine, // or phi-3.5-mini on a laptop with 8 GB RAM This means the same application runs on a powerful workstation (selecting a 14B parameter model) or a constrained laptop (selecting a 3.8B model), with no code changes required. This is worth calling out because it is one of the most practical parts of the sample. Developers do not have to decide up front which single model every user should run. The application makes that decision at startup based on the hardware budget you set, then asks Foundry Local to fetch the model if it is missing. The result is a smoother first-run experience and fewer support headaches when the same app is used on mixed hardware. The System Prompt For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses: // src/prompts.js export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers. Behaviour Rules: - Always prioritise safety. If a procedure involves risk, explicitly call it out. - Do not hallucinate procedures, measurements, or tolerances. - If the answer is not in the provided context, say: "This information is not available in the local knowledge base." Response Format: - Summary (1-2 lines) - Safety Warnings (if applicable) - Step-by-step Guidance - Reference (document name + section)`; This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling. Adapting This for Your Own Domain The sample project is designed to be forked and adapted. Here is how to make it yours in three steps: 1. Replace the documents Delete the gas engineering documents in docs/ and add your own markdown files. The context module handles any markdown content with optional YAML front-matter: --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the system prompt Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations. 3. Override the model (optional) By default the application auto-selects the best model. To force a specific model: # See available models foundry model list # Force a smaller, faster model FOUNDRY_MODEL=phi-3.5-mini npm start # Or a larger, higher-quality model FOUNDRY_MODEL=phi-4 npm start Smaller models give faster responses on constrained devices. Larger models give better quality. The auto-selector picks the largest model that fits within 60% of your system RAM. Building a Field-Ready UI The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy. Design decisions that matter for field use: Dark, high-contrast theme with 18px base font size for readability in bright sunlight Large touch targets (minimum 48px) for operation with gloves or PPE Quick-action buttons for common questions, so engineers do not need to type on a phone Responsive layout that works from 320px to 1920px+ screen widths Streaming responses via SSE, so the user sees tokens arriving in real time The mobile chat experience, optimised for field use. Visual Startup Progress with SSE A standout feature of this application is the loading experience. When the user opens the browser, they see a progress overlay showing exactly what the application is doing: Loading domain documents Initialising the Foundry Local SDK Selecting the best model for the device Downloading the model (with a percentage progress bar, first run only) Loading the model into memory This works because the Express server starts before the model finishes loading. The browser connects immediately and receives real-time status updates via Server-Sent Events. Chat endpoints return 503 whilst the model is loading, so the UI cannot send queries prematurely. // Server-side: broadcast status to all connected browsers function broadcastStatus(state) { initState = state; const payload = `data: ${JSON.stringify(state)}\n\n`; for (const client of statusClients) { client.write(payload); } } // During initialisation: broadcastStatus({ stage: "downloading", message: "Downloading phi-4...", progress: 42 }); This pattern is worth adopting in any application where model loading takes more than a few seconds. Users should never stare at a blank screen wondering whether something is broken. Testing The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed: # Run all tests npm test Tests cover configuration, server endpoints, and document loading. Use them as a starting point when you adapt the project for your own domain. Ideas for Extending the Project Once you have the basics running, there are plenty of directions to explore: Conversation memory: persist chat history across sessions using local storage or a lightweight database Hybrid CAG + RAG: add a vector retrieval step for larger document collections that exceed the context window Multi-modal support: add image-based queries (photographing a fault code, for example) PWA packaging: make it installable as a standalone offline application on mobile devices Custom model fine-tuning: fine-tune a model on your domain data for even better answers Ready to Build Your Own? Clone the CAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the RAG approach to see which pattern suits your use case best. Get the CAG Sample Get the RAG Sample Summary Building a local AI application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and a set of domain documents, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own content. The key takeaways: CAG is ideal for small, curated document sets where simplicity and offline capability matter most. No vector database, no embeddings, no retrieval pipeline. RAG scales further when you have hundreds or thousands of documents, or need semantic search for ambiguous queries. See the local-rag sample to compare. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required. The architecture is transferable. Replace the gas engineering documents with your own content, update the system prompt, and you have a domain-specific AI agent for any field. Start simple, iterate outwards. Begin with CAG and a handful of documents. If your needs outgrow the context window, graduate to RAG. Both patterns can run entirely offline. Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code. This project is open source under the MIT licence. It is a scenario sample for learning and experimentation, not production medical or safety advice. local-cag on GitHub · local-rag on GitHub · Foundry LocalAgents League: Meet the Winners
Agents League brought together developers from around the world to build AI agents using Microsoft's developer tools. With 100+ submissions across three tracks, choosing winners was genuinely difficult. Today, we're proud to announce the category champions. 🎨 Creative Apps Winner: CodeSonify View project CodeSonify turns source code into music. As a genuinely thoughtful system, its functions become ascending melodies, loops create rhythmic patterns, conditionals trigger chord changes, and bugs produce dissonant sounds. It supports 7 programming languages and 5 musical styles, with each language mapped to its own key signature and code complexity directly driving the tempo. What makes CodeSonify stand out is the depth of execution. CodeSonify team delivered three integrated experiences: a web app with real-time visualization and one-click MIDI export, an MCP server exposing 5 tools inside GitHub Copilot in VS Code Agent Mode, and a diff sonification engine that lets you hear a code review. A clean refactor sounds harmonious. A messy one sounds chaotic. The team even built the MIDI generator from scratch in pure TypeScript with zero external dependencies. Built entirely with GitHub Copilot assistance, this is one of those projects that makes you think about code differently. 🧠 Reasoning Agents Winner: CertPrep Multi-Agent System View project CertPrep Multi-Agent System team built a production-grade 8-agent system for personalized Microsoft certification exam preparation, supporting 9 exam families including AI-102, AZ-204, AZ-305, and more. Each agent has a distinct responsibility: profiling the learner, generating a week-by-week study schedule, curating learning paths, tracking readiness, running mock assessments, and issuing a GO / CONDITIONAL GO / NOT YET booking recommendation. The engineering behind the scene here is impressive. A 3-tier LLM fallback chain ensures the system runs reliably even without Azure credentials, with the full pipeline completing in under 1 second in mock mode. A 17-rule guardrail pipeline validates every agent boundary. Study time allocation uses the Largest Remainder algorithm to guarantee no domain is silently zeroed out. 342 automated tests back it all up. This is what thoughtful multi-agent architecture looks like in practice. 💼 Enterprise Agents Winner: Whatever AI Assistant (WAIA) View project WAIA is a production-ready multi-agent system for Microsoft 365 Copilot Chat and Microsoft Teams. A workflow agent routes queries to specialized HR, IT, or Fallback agents, transparently to the user, handling both RAG-pattern Q&A and action automation — including IT ticket submission via a SharePoint list. Technically, it's a showcase of what serious enterprise agent development looks like: a custom MCP server secured with OAuth Identity Passthrough, streaming responses via the OpenAI Responses API, Adaptive Cards for human-in-the-loop approval flows, a debug mode accessible directly from Teams or Copilot, and full OpenTelemetry integration visible in the Foundry portal. Franck also shipped end-to-end automated Bicep deployment so the solution can land in any Azure environment. It's polished, thoroughly documented, and built to be replicated. Thank you To every developer who submitted and shipped projects during Agents League: thank you 💜 Your creativity and innovation brought Agents League to life! 👉 Browse all submissions on GitHubBuilding Your First Local RAG Application with Foundry Local
A developer's guide to building an offline, mobile-responsive AI support agent using Retrieval-Augmented Generation, the Foundry Local SDK, and JavaScript. Imagine you are a gas field engineer standing beside a pipeline in a remote location. There is no Wi-Fi, no mobile signal, and you need a safety procedure right now. What do you do? This is the exact problem that inspired this project: a fully offline RAG-powered support agent that runs entirely on your machine. No cloud. No API keys. No outbound network calls. Just a local language model, a local vector store, and your own documents, all accessible from a browser on any device. In this post, you will learn how it works, how to build your own, and the key architectural decisions behind it. If you have ever wanted to build an AI application that runs locally and answers questions grounded in your own data, this is the place to start. The finished application: a browser-based AI support agent that runs entirely on your machine. What Is Retrieval-Augmented Generation? Retrieval-Augmented Generation (RAG) is a pattern that makes AI models genuinely useful for domain-specific tasks. Rather than hoping the model "knows" the answer from its training data, you: Retrieve relevant chunks from your own documents using a vector store Augment the model's prompt with those chunks as context Generate a response grounded in your actual data The result is fewer hallucinations, traceable answers with source attribution, and an AI that works with your content rather than relying on general knowledge. If you are building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want. RAG vs CAG: Understanding the Trade-offs If you have explored AI application patterns before, you have likely encountered Context-Augmented Generation (CAG). Both RAG and CAG solve the same core problem: grounding an AI model's answers in your own content. They take different approaches, and each has genuine strengths and limitations. RAG (Retrieval-Augmented Generation) How it works: Documents are split into chunks, vectorised, and stored in a database. At query time, the most relevant chunks are retrieved and injected into the prompt. Strengths: Scales to thousands or millions of documents Fine-grained retrieval at chunk level with source attribution Documents can be added or updated dynamically without restarting Token-efficient: only relevant chunks are sent to the model Supports runtime document upload via the web UI Limitations: More complex architecture: requires a vector store and chunking strategy Retrieval quality depends on chunking parameters and scoring method May miss relevant content if the retrieval step does not surface it CAG (Context-Augmented Generation) How it works: All documents are loaded at startup. The most relevant ones are selected per query using keyword scoring and injected into the prompt. Strengths: Drastically simpler architecture with no vector database or embeddings All information is always available to the model Minimal dependencies and easy to set up Near-instant document selection Limitations: Constrained by the model's context window size Best suited to small, curated document sets (tens of documents) Adding documents requires an application restart Want to compare these patterns hands-on? There is a CAG-based implementation of the same gas field scenario using whole-document context injection. Clone both repositories, run them side by side, and see how the architectures differ in practice. When Should You Choose Which? Consideration Choose RAG Choose CAG Document count Hundreds or thousands Tens of documents Document updates Frequent or dynamic (runtime upload) Infrequent (restart to reload) Source attribution Per-chunk with relevance scores Per-document Setup complexity Moderate (ingestion step required) Minimal Query precision Better for large or diverse collections Good for keyword-matchable content Infrastructure SQLite vector store (single file) None beyond the runtime For the sample application in this post (20 gas engineering procedure documents with runtime upload), RAG is the clear winner. If your document set is small and static, CAG may be simpler. Both patterns run fully offline using Foundry Local. Foundry Local: Your On-Device AI Runtime Foundry Local is a lightweight runtime from Microsoft that downloads, manages, and serves language models entirely on your device. No cloud account, no API keys, no outbound network calls (after the initial model download). What makes it particularly useful for developers: No GPU required: runs on CPU or NPU, making it accessible on standard laptops and desktops Native SDK bindings: in-process inference via the foundry-local-sdk npm package, with no HTTP round-trips to a local server Automatic model management: downloads, caches, and loads models automatically Hardware-optimised variant selection: the SDK picks the best variant for your hardware (GPU, NPU, or CPU) Real-time progress callbacks: ideal for building loading UIs that show download and initialisation progress The integration code is refreshingly minimal: import { FoundryLocalManager } from "foundry-local-sdk"; // Create a manager and discover models via the catalogue const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" }); const model = await manager.catalog.getModel("phi-3.5-mini"); // Download if not cached, then load into memory if (!model.isCached) { await model.download((progress) => { console.log(`Download: ${Math.round(progress * 100)}%`); }); } await model.load(); // Create a chat client for direct in-process inference const chatClient = model.createChatClient(); const response = await chatClient.completeChat([ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "How do I detect a gas leak?" } ]); That is it. No server configuration, no authentication tokens, no cloud provisioning. The model runs in the same process as your application. The Technology Stack The sample application is deliberately simple. No frameworks, no build steps, no Docker: Layer Technology Purpose AI Model Foundry Local + Phi-3.5 Mini Runs locally via native SDK bindings, no GPU required Back end Node.js + Express Lightweight HTTP server, everyone knows it Vector Store SQLite (via better-sqlite3 ) Zero infrastructure, single file on disc Retrieval TF-IDF + cosine similarity No embedding model required, fully offline Front end Single HTML file with inline CSS No build step, mobile-responsive, field-ready The total dependency footprint is three npm packages: express , foundry-local-sdk , and better-sqlite3 . Architecture Overview The five-layer architecture, all running on a single machine. The system has five layers, all running on a single machine: Client layer: a single HTML file served by Express, with quick-action buttons and a responsive chat interface Server layer: Express.js starts immediately and serves the UI plus SSE status and chat endpoints RAG pipeline: the chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorisation; the prompts module provides safety-first system instructions Data layer: SQLite stores document chunks and their TF-IDF vectors; documents live as .md files in the docs/ folder AI layer: Foundry Local runs Phi-3.5 Mini on CPU or NPU via native SDK bindings Building the Solution Step by Step Prerequisites You need two things installed on your machine: Node.js 20 or later: download from nodejs.org Foundry Local: Microsoft's on-device AI runtime: winget install Microsoft.FoundryLocal The SDK will automatically download the Phi-3.5 Mini model (approximately 2 GB) the first time you run the application. Getting the Code Running # Clone the repository git clone https://github.com/leestott/local-rag.git cd local-rag # Install dependencies npm install # Ingest the 20 gas engineering documents into the vector store npm run ingest # Start the server npm start Open http://127.0.0.1:3000 in your browser. You will see the status indicator whilst the model loads. Once the model is ready, the status changes to "Offline Ready" and you can start chatting. Desktop view Mobile view How the RAG Pipeline Works Let us trace what happens when a user asks: "How do I detect a gas leak?" The query flow from browser to model and back. 1 Documents are ingested and indexed When you run npm run ingest , every .md file in the docs/ folder is read, parsed (with optional YAML front-matter for title, category, and ID), split into overlapping chunks of approximately 200 tokens, and stored in SQLite with TF-IDF vectors. 2 Model is loaded via the SDK The Foundry Local SDK discovers the model in the local catalogue and loads it into memory. If the model is not already cached, it downloads it first (with progress streamed to the browser via SSE). 3 User sends a question The question arrives at the Express server. The chat engine converts it into a TF-IDF vector, uses an inverted index to find candidate chunks, and scores them using cosine similarity. The top 3 chunks are returned in under 1 ms. 4 Prompt is constructed The engine builds a messages array containing: the system prompt (with safety-first instructions), the retrieved chunks as context, the conversation history, and the user's question. 5 Model generates a grounded response The prompt is sent to the locally loaded model via the Foundry Local SDK's native chat client. The response streams back token by token through Server-Sent Events to the browser. Source references with relevance scores are included. A response with safety warnings and step-by-step guidance The sources panel shows which chunks were used and their relevance Key Code Walkthrough The Vector Store (TF-IDF + SQLite) The vector store uses SQLite to persist document chunks alongside their TF-IDF vectors. At query time, an inverted index finds candidate chunks that share terms with the query, then cosine similarity ranks them: // src/vectorStore.js search(query, topK = 5) { const queryTf = termFrequency(query); this._ensureCache(); // Build in-memory cache on first access // Use inverted index to find candidates sharing at least one term const candidateIndices = new Set(); for (const term of queryTf.keys()) { const indices = this._invertedIndex.get(term); if (indices) { for (const idx of indices) candidateIndices.add(idx); } } // Score only candidates, not all rows const scored = []; for (const idx of candidateIndices) { const row = this._rowCache[idx]; const score = cosineSimilarity(queryTf, row.tf); if (score > 0) scored.push({ ...row, score }); } scored.sort((a, b) => b.score - a.score); return scored.slice(0, topK); } The inverted index, in-memory row cache, and prepared SQL statements bring retrieval time to sub-millisecond for typical query loads. Why TF-IDF Instead of Embeddings? Most RAG tutorials use embedding models for retrieval. This project uses TF-IDF because: Fully offline: no embedding model to download or run Zero latency: vectorisation is instantaneous (it is just maths on word frequencies) Good enough: for 20 domain-specific documents, TF-IDF retrieves the right chunks reliably Transparent: you can inspect the vocabulary and weights, unlike neural embeddings For larger collections or when semantic similarity matters more than keyword overlap, you would swap in an embedding model. For this use case, TF-IDF keeps the stack simple and dependency-free. The System Prompt For safety-critical domains, the system prompt is engineered to prioritise safety, prevent hallucination, and enforce structured responses: // src/prompts.js export const SYSTEM_PROMPT = `You are a local, offline support agent for gas field inspection and maintenance engineers. Behaviour Rules: - Always prioritise safety. If a procedure involves risk, explicitly call it out. - Do not hallucinate procedures, measurements, or tolerances. - If the answer is not in the provided context, say: "This information is not available in the local knowledge base." Response Format: - Summary (1-2 lines) - Safety Warnings (if applicable) - Step-by-step Guidance - Reference (document name + section)`; This pattern is transferable to any safety-critical domain: medical devices, electrical work, aviation maintenance, or chemical handling. Runtime Document Upload Unlike the CAG approach, RAG supports adding documents without restarting the server. Click the upload button to add new .md or .txt files. They are chunked, vectorised, and indexed immediately. The upload modal with the complete list of indexed documents. Adapting This for Your Own Domain The sample project is designed to be forked and adapted. Here is how to make it yours in four steps: 1. Replace the documents Delete the gas engineering documents in docs/ and add your own markdown files. The ingestion pipeline handles any markdown content with optional YAML front-matter: --- title: Troubleshooting Widget Errors category: Support id: KB-001 --- # Troubleshooting Widget Errors ...your content here... 2. Edit the system prompt Open src/prompts.js and rewrite the system prompt for your domain. Keep the structure (summary, safety, steps, reference) and update the language to match your users' expectations. 3. Tune the retrieval In src/config.js : chunkSize: 200 : smaller chunks give more precise retrieval, less context per chunk chunkOverlap: 25 : prevents information falling between chunks topK: 3 : how many chunks to retrieve per query (more gives more context but slower generation) 4. Swap the model Change config.model in src/config.js to any model available in the Foundry Local catalogue. Smaller models give faster responses on constrained devices; larger models give better quality. Building a Field-Ready UI The front end is a single HTML file with inline CSS. No React, no build tooling, no bundler. This keeps the project accessible to beginners and easy to deploy. Design decisions that matter for field use: Dark, high-contrast theme with 18px base font size for readability in bright sunlight Large touch targets (minimum 44px) for operation with gloves or PPE Quick-action buttons that wrap on mobile so all options are visible without scrolling Responsive layout that works from 320px to 1920px+ screen widths Streaming responses via SSE, so the user sees tokens arriving in real time The mobile chat experience, optimised for field use. Testing The project includes unit tests using the built-in Node.js test runner, with no extra test framework needed: # Run all tests npm test Tests cover the chunker, vector store, configuration, and server endpoints. Use them as a starting point when you adapt the project for your own domain. Ideas for Extending the Project Once you have the basics running, there are plenty of directions to explore: Embedding-based retrieval: use a local embedding model for better semantic matching on diverse queries Conversation memory: persist chat history across sessions using local storage or a lightweight database Multi-modal support: add image-based queries (photographing a fault code, for example) PWA packaging: make it installable as a standalone offline application on mobile devices Hybrid retrieval: combine TF-IDF keyword search with semantic embeddings for best results Try the CAG approach: compare with the local-cag sample to see which pattern suits your use case Ready to Build Your Own? Clone the RAG sample, swap in your own documents, and have an offline AI agent running in minutes. Or compare it with the CAG approach to see which pattern suits your use case best. Get the RAG Sample Get the CAG Sample Summary Building a local RAG application does not require a PhD in machine learning or a cloud budget. With Foundry Local, Node.js, and SQLite, you can create a fully offline, mobile-responsive AI agent that answers questions grounded in your own documents. The key takeaways: RAG is ideal for scalable, dynamic document sets where you need fine-grained retrieval with source attribution. Documents can be added at runtime without restarting. CAG is simpler when you have a small, stable set of documents that fit in the context window. See the local-cag sample to compare. Foundry Local makes on-device AI accessible: native SDK bindings, in-process inference, automatic model selection, and no GPU required. TF-IDF + SQLite is a viable vector store for small-to-medium collections, with sub-millisecond retrieval thanks to inverted indexing and in-memory caching. Start simple, iterate outwards. Begin with RAG and a handful of documents. If your needs are simpler, try CAG. Both patterns run entirely offline. Clone the repository, swap in your own documents, and start building. The best way to learn is to get your hands on the code. This project is open source under the MIT licence. It is a scenario sample for learning and experimentation, not production medical or safety advice. local-rag on GitHub · local-cag on GitHub · Foundry LocalMicrosoft Foundry Labs: A Practical Fast Lane from Research to Real Developer Work
Why developers need a fast lane from research → prototypes AI engineering has a speed problem, but it is not a shortage of announcements. The hard part is turning research into a useful prototype before the next wave of models, tools, or agent patterns shows up. That gap matters. AI engineers want to compare quality, latency, and cost before they wire a model into a product. Full-stack teams want to test whether an agent workflow is real or just demo. Platform and operations teams want to know when an experiment can graduate into something observable and supportable. Microsoft makes that case directly in introducing Microsoft Foundry Labs: breakthroughs are arriving faster, and time from research to product has compressed from years to months. If you build real systems, the question is not "What is the coolest demo?" It is "Which experiments are worth my next hour, and how do I evaluate them without creating demo-ware?" That is where Microsoft Foundry Labs becomes interesting. What is Microsoft Foundry Labs? Microsoft Foundry Labs is a place to explore early-stage experiments and prototypes from Microsoft, with an explicit focus on research-driven innovation. The homepage describes it as a way to get a glimpse of potential future directions for AI through experimental technologies from Microsoft Research and more. The announcement adds the operating idea: Labs is a single access point for developers to experiment with new models from Microsoft, explore frameworks, and share feedback. That framing matters. Labs is not just a gallery of flashy ideas. It is a developer-facing exploration surface for projects that are still close to research: models, agent systems, UX ideas, and tool experiments. Here's some things you can do on Labs: Play with tomorrow’s AI, today: 30+ experimental projects—from models to agents—are openly available to fork and build upon, alongside direct access to breakthrough research from Microsoft. Go from prototype to production, fast: Seamless integration with Microsoft Foundry gives you access to 11,000+ models with built-in compute, safety, observability, and governance—so you can move from local experimentation to full-scale production without complex containerization or switching platforms. Build with the people shaping the future of AI: Join a thriving community of 25,000+ developers across Discord and GitHub with direct access to Microsoft researchers and engineers to share feedback and help shape the most promising technologies. What Labs is not: it is not a promise that every project has a production deployment path today, a long-term support commitment, or a hardened enterprise operating model. Spotlight: a few Labs experiments worth a developer's attention Phi-4-Reasoning-Vision-15B: A compact open-weight multimodal reasoning model that is interesting if you care about the quality-versus-efficiency tradeoff in smaller reasoning systems. BitNet: A native 1-bit large language model that is compelling for engineers who care about memory, compute, and energy efficiency. Fara-7B: An ultra-compact agentic small language model designed for computer use, which makes it relevant for builders exploring UI automation and on-device agents. OmniParser V2: A screen parsing module that turns interfaces into actionable elements, directly relevant to computer-use and UI-interaction agents. If you want to inspect actual code, the Labs project pages also expose official repository links for some of these experiments, including OmniParser, Magentic-UI, and BitNet. Labs vs. Foundry: how to think about the boundary The simplest mental model is this: Labs is the exploration edge; Foundry is the platform layer. The Microsoft Foundry documentation describes the broader platform as "the AI app and agent factory" to build, optimize, and govern AI apps and agents at scale. That is a different promise from Labs. Foundry is where you move from curiosity to implementation: model access, agent services, SDKs, observability, evaluation, monitoring, and governance. Labs helps you explore what might matter next. Foundry helps you build, optimize, and govern what matters now. Labs is where you test a research-shaped idea. Foundry is where you decide whether that idea can survive integration, evaluation, tracing, cost controls, and production scrutiny. That also means Labs is not a replacement for the broader Foundry workflow. If an experiment catches your attention, the next question is not "Can I ship this tomorrow?" It is "What is the integration path, and how will I measure whether it deserves promotion?" What's real today vs. what's experimental Real today: Labs is live as an official exploration hub, and Foundry is the broader platform for building, evaluating, monitoring, and governing AI apps and agents. Experimental by design: Labs projects are presented as experiments and prototypes, so they still need validation for your use case. A developer's lens: Models, Agents, Observability What makes Labs useful is not that it shows new things. It is that it gives developers a way to inspect those things through the same three concerns that matter in every serious AI system: model choice, agent design, and observability. Diagram description: imagine a loop with three boxes in a row: Models, Agents, and Observability. A forward arrow runs across the row, and a feedback arrow loops from Observability back to Models. The point is that evaluation data should change both model choices and agent design, instead of arriving too late. Models: what to look for in Labs experiments If you are model-curious, Labs should trigger an evaluation mindset, not a fandom mindset. When you see something like Phi-4-Reasoning-Vision-15B or BitNet on the Labs homepage, ask three things: what capability is being demonstrated, what constraints are obvious, and what the integration path would look like. This is where the Microsoft Foundry Playgrounds mindset is useful even if you started in Labs. The documentation emphasizes model comparison, prompt iteration, parameter tuning, tools, safety guardrails, and code export. It also pushes the right pre-production questions: price-to-performance, latency, tool integration, and code readiness. That is how I would use Labs for models: not to choose winners, but to generate hypotheses worth testing. If a Labs experiment looks promising, move quickly into a small evaluation matrix around capability, latency, cost, and integration friction. Agents: what Labs unlocks for agent builders Labs is especially interesting for agent builders because many of the projects point toward orchestration and tool-use patterns that matter in practice. The official announcement highlights projects across models and agentic frameworks, including Magentic-One and OmniParser v2. On the homepage, projects such as Fara-7B, OmniParser V2, TypeAgent, and Magentic-UI point in a similar direction: agents get more useful when they can reason over tools, interfaces, plans, and human feedback loops. For working developers, that means Labs can act as a scouting surface for agent patterns rather than just agent demos. Look for UI or computer-use style agents when your system needs to act through an interface rather than an API. Look for planning or tool-selection patterns when orchestration matters more than raw model quality. My suggestion: when a Labs project looks relevant to agent work, do not ask "Can I copy this architecture?" Ask "Which agent pattern is being explored here, and under what constraints would it be useful in my system?" Observability: how to experiment responsibly and measure what matters Observability is where prototypes usually go to die, because teams postpone it until after they have something flashy. That is backwards. If you care about real systems, tracing, evaluation, monitoring, and governance should start during prototyping. The Microsoft Foundry documentation already puts that operating model in plain view through guidance for tracing applications, evaluating agentic workflows, and monitoring generative AI apps. The Microsoft Foundry Playgrounds page is also explicit that the agents playground supports tracing and evaluation through AgentOps. At the governance layer, the AI gateway in Azure API Management documentation reinforces why this matters beyond demos. It covers monitoring and logging AI interactions, tracking token metrics, logging prompts and completions, managing quotas, applying safety policies, and governing models, agents, and tools. You do not need every one of those controls on day one, but you do need the habit: if a prototype cannot tell you what it did, why it failed, and what it cost, it is not ready to influence a roadmap. "Pick one and try it": a 20-minute hands-on path Keep this lightweight and tool-agnostic. The point is not to memorize a product UI. The point is to run a disciplined experiment. Browse Labs and pick an experiment aligned to your work. Start at Microsoft Foundry Labs and choose one project that is adjacent to a real problem you have: model efficiency, multimodal reasoning, UI agents, debugging workflows, or human-in-the-loop design. Read the project page and jump to the repo or paper if available. Use the Labs entry to understand the claim being made. Then read the supporting material, not just the summary sentence. Define one small test task and explicit success criteria. Keep it concrete: latency budget, accuracy target, cost ceiling, acceptable safety behavior, or failure rate under a narrow scenario. Capture telemetry from the start. At minimum, keep prompts or inputs, outputs, intermediate decisions, and failures. If the experiment involves tools or agents, include tool choices and obvious reasons for failure or recovery. Make a hard call. Decide whether to keep exploring or wait for a stronger production-grade path. "Interesting" is not the same as "ready for integration." Minimal experiment logger (my suggestion): if you want a lightweight way to avoid demo-ware, even a local JSONL log is enough to capture prompts, outputs, decisions, failures, and latency while you compare ideas from Labs. import json import time from pathlib import Path LOG_PATH = Path("experiment-log.jsonl") def record_event(name, payload): # Append one event per line so runs are easy to diff and analyze later. with LOG_PATH.open("a", encoding="utf-8") as handle: handle.write(json.dumps({"event": name, **payload}) + "\n") def run_experiment(user_input): started = time.time() try: # Replace this stub with your real model or agent call. output = user_input.upper() decision = "keep exploring" if len(output) < 80 else "wait" record_event( "experiment_result", { "input": user_input, "output": output, "decision": decision, "latency_ms": round((time.time() - started) * 1000, 2), "failure": None, }, ) except Exception as error: record_event( "experiment_result", { "input": user_input, "output": None, "decision": "failed", "latency_ms": round((time.time() - started) * 1000, 2), "failure": str(error), }, ) raise if __name__ == "__main__": run_experiment("Summarize the constraints of this Labs project.") That script is intentionally boring. That is the point. It gives you a repeatable, runnable starting point for comparing experiments without pretending you already have a full observability stack. Practical tips: how I evaluate Labs experiments before betting a roadmap on them Separate the idea from the implementation path. A strong research direction can still have a weak near-term integration story. Test one workload, not ten. Pick a narrow task that resembles your production reality and see whether the experiment moves the needle. Track cost and latency as first-class metrics. A novel capability that breaks your budget or response-time envelope is still a failed fit. Treat agent demos skeptically unless you can inspect behavior. Tool calls, traces, failure cases, and recovery paths matter more than polished output. Common pitfalls are predictable here. Do not confuse a research win with a deployment path. Labs is for exploration, so you still need to validate integration, safety, and operations. Do not evaluate with vague prompts. Use a narrow task and explicit success criteria, or you will end up comparing vibes instead of outcomes. Do not skip telemetry because the prototype is small. If you cannot inspect failures early, the prototype will teach you very little. Do not ignore known limitations. For example, the Fara-7B project page explicitly notes challenges on more complex tasks, instruction-following mistakes, and hallucinations, which is exactly the kind of constraint you should carry into evaluation. What to explore next Azure AI Foundry Labs matters because it gives developers a practical way to explore research-shaped ideas before they harden into mainstream patterns. The smart move is to use Labs as an input into better platform decisions: explore in Labs, validate with the discipline encouraged by Foundry playgrounds, and then bring the learnings back into the broader Foundry workflow. Takeaway 1: Labs is an exploration surface for early-stage, research-driven experiments and prototypes, not a blanket promise of production readiness. Takeaway 2: The right workflow is Labs for discovery, then Microsoft Foundry for implementation, optimization, evaluation, monitoring, and governance. Takeaway 3: Tracing, evaluations, and telemetry should start during prototyping, because that is how you avoid confusing a compelling demo with a viable system. If you are curious, start with Microsoft Foundry Labs, read the official context in Introducing Microsoft Foundry Labs, and then map what you learn into the platform guidance in Microsoft Foundry documentation. Try this next Open Microsoft Foundry Labs and choose one experiment that matches a real workload you care about. Use the mindset from Microsoft Foundry Playgrounds to define a small validation task around quality, latency, cost, and safety. Write down the minimum telemetry you need before continuing: inputs, outputs, decisions, failures, and token or cost signals. Read the relevant operating guidance in AI gateway in Azure API Management if your experiment may eventually need monitoring, quotas, safety policies, or governance. Promote only the experiments that can explain their value clearly in a Foundry-shaped build, evaluation, and observability workflow.What’s New in Microsoft Sentinel and XDR: AI Automation, Data Lake Innovation, and Unified SecOps
The most consequential “new” Microsoft Sentinel / Defender XDR narrative for a deeply technical Microsoft Tech Community article is the operational and engineering shift to unified security operations in the Microsoft Defender portal, including an explicit Azure portal retirement/sunset timeline and concrete migration implications (data tiering, correlation engine changes, schema differences, and automation behavior changes). Official sources now align on March 31, 2027 as the sunset date for managing Microsoft Sentinel in the Azure portal, with customers being redirected to the Defender portal after that date. The “headline” feature announcements to anchor your article around (because they create new engineering patterns, not just UI changes) are: AI playbook generator (preview): Natural-language-driven authoring of Python playbooks in an embedded VS Code environment (Cline), using Integration Profiles for dynamic API calls and an Enhanced Alert Trigger for broader automation triggering across Microsoft Sentinel, Defender, and XDR alert sources. CCF Push (public preview): A push-based connector model built on the Azure Monitor Logs Ingestion API, where deploying via Content Hub can automate provisioning of the typical plumbing (DCR/DCE/app registration/RBAC), enabling near-real-time ingestion plus ingestion-time transformations and (per announcement) direct delivery into certain system tables. Data lake tier ingestion for Advanced Hunting tables (GA): Direct ingestion of specific Microsoft XDR Advanced Hunting tables into the Microsoft Sentinel data lake without requiring analytics-tier ingestion—explicitly positioned for long-retention, cost-effective storage and retrospective investigations at scale. Microsoft 365 Copilot data connector (public preview): Ingests Copilot-related audit/activity events via the Purview Unified Audit Log feed into a dedicated table (CopilotActivity) with explicit admin-role requirements and cost notes. Multi-tenant content distribution expansion: Adds support for distributing analytics rules, automation rules, workbooks, and built-in alert tuning rules across tenants via distribution profiles, with stated limitations (for example, automation rules that trigger a playbook cannot currently be distributed). Alert schema differences for “standalone vs XDR connector”: A must-cite engineering artifact documenting breaking/behavioral differences (CompromisedEntity semantics, field mapping changes, alert filtering differences) when moving to the consolidated Defender XDR connector path. What’s new and when Feature and release matrix The table below consolidates officially documented Sentinel and Defender XDR features that are relevant to a “new announcements” technical article. If a source does not explicitly state GA/preview or a specific date, it is marked “unspecified.” Feature Concise description Status (official) Announcement / release date Azure portal Sentinel retirement / redirection Sentinel management experience shifts to Defender portal; sunset date extended; post-sunset redirection expected Date explicitly stated Mar 31, 2027 sunset (date stated) extension published Jan 29, 2026 Sentinel in Defender portal (core GA) Sentinel is GA in Defender portal, including for customers without Defender XDR/E5; unified SecOps surface GA Doc updated Sep 30, 2025; retirement note reiterated 2026 AI playbook generator Natural language → Python playbook, documentation, and a visual flow diagram; VS Code + Cline experience Preview Feb 23, 2026 Integration Profiles (playbook generator) Centralized configuration objects (base URL, auth method, credentials) used by generated playbooks to call external APIs dynamically Preview feature component Feb 23, 2026 Enhanced Alert Trigger (generated playbooks) Tenant-level trigger designed to target alerts across Sentinel + Defender + XDR sources and apply granular conditions Preview feature component Feb 23, 2026 CCF Push Push-based ingestion model that reduces setup friction (DCR/DCE/app reg/RBAC), built on Logs Ingestion API; supports transformations and high-throughput ingestion Public preview Feb 12–13, 2026 Legacy custom data collection API retirement Retirement of legacy custom data collection API noted as part of connector modernization Retirement date stated Sep 2026 (retirement) Data lake tier ingestion for Microsoft XDR Advanced Hunting tables Ingest selected Advanced Hunting tables from MDE/MDO/MDA directly into Sentinel data lake; supports long retention and lake-first analytics GA Feb 10, 2026 Microsoft 365 Copilot data connector Ingests Copilot activities/audit logs; data lands in CopilotActivity; requires specific tenant roles to enable; costs apply Public preview Feb 3, 2026 Multi-tenant content distribution: expanded content types Adds support for analytics rules, automation rules, workbooks, and built-in alert tuning rules; includes limitations and prerequisites Stated as “supported”; feature described as part of public preview experience in monthly update Jan 29, 2026 GKE dedicated connector Dedicated connector built on CCF; ingests GKE cluster activity/workload/security events into GKEAudit; supports DCR transformations and lake-only ingestion GA Mar 4, 2026 UEBA behaviors layer “Who did what to whom” behavior abstraction from raw logs; newer sources state GA; other page sections still label Preview GA and Preview labels appear in official sources (inconsistent) Feb 2026 (GA statement) UEBA widget in Defender portal home Home-page widget to surface anomalous user behavior and accelerate workflows Preview Jan 2026 Alert schema differences: standalone vs XDR connector Documents field mapping differences, CompromisedEntity behavior changes, and alert filtering/scoping differences Doc (behavioral/change reference) Feb 4, 2026 (last updated) Defender incident investigation: Blast radius analysis Graph visualization built on Sentinel data lake + graph for propagation path analysis Preview (per Defender XDR release notes) Sep 2025 (release notes section) Advanced hunting: Hunting graph Graph rendering of predefined threat scenarios in advanced hunting Preview (per Defender XDR release notes) Sep 2025 (release notes section) Sentinel repositories API version retirement “Call to action” to update API versions: older versions retired June 1, 2026; enforcement June 15, 2026 for actions Dates explicitly stated March 2026 (noticed); Jun 1 / Jun 15, 2026 (deadline/enforcement) Technical architecture and integrations Unified reference architecture Microsoft’s official integration documentation describes two “centers of gravity” depending on how you operate: In Defender portal mode, Sentinel data is ingested alongside organizational data into the Defender portal, enabling SOC teams to analyze and respond from a unified surface. In Azure portal mode, Defender XDR incidents/alerts flow via Sentinel connectors and analysts work across both experiences. Integration model: Defender suite and third-party security tools The Defender XDR integration doc is explicit about: Supported Defender components whose alerts appear through the integration (Defender for Endpoint, Identity, Office 365, Cloud Apps), plus other services such as Purview DLP and Entra ID Protection. Behavior when onboarding Sentinel to the Defender portal with Defender XDR licensing: the Defender XDR connector is automatically set up and component alert-provider connectors are disconnected. Expected latency: Defender XDR incidents typically appear in Sentinel UI/API within ~5 minutes, with additional lag before securityIncident ingestion is complete. Cost model: Defender XDR alerts and incidents that populate SecurityAlert / SecurityIncident are synchronized at no charge, while other data types (for example, Advanced Hunting tables) are charged. For third-party tools, Microsoft’s monthly “What’s new” explicitly calls out new GA out-of-the-box connectors/solutions (examples include Mimecast audit logs, Vectra AI XDR, and Proofpoint POD email security) as part of an expanding connector ecosystem intended to unify visibility across cloud, SaaS, and on-premises environments. Telemetry, schemas, analytics, automation, and APIs Data flows and ingestion engineering CCF Push and the “push connector” ingestion path Microsoft’s CCF Push announcement frames the “old” model as predominantly polling-based (Sentinel periodically fetching from partner/customer APIs) and introduces push-based connectors where partners/customers send data directly to a Sentinel workspace, emphasizing that “Deploy” can auto-provision the typical prerequisites: DCE, DCR, Entra app registration + secrets, and RBAC assignments. Microsoft also states that CCF Push is built on the Logs Ingestion API, with benefits including throughput, ingestion-time transformation, and system-table targeting. A precise engineering description of the underlying Logs Ingestion API components (useful for your article even if your readers never build a connector) is documented in Azure Monitor: Sender app authenticates via an app registration that has access to a DCR. Sender sends JSON matching the DCR’s expected structure to a DCR endpoint or a DCE (DCE required for Private Link scenarios). The DCR can apply a transformation to map/filter/enrich before writing to the target table. DCR transformation (KQL) Microsoft documents “transformations in Azure Monitor” and provides concrete sample KQL snippets for common needs such as cost reduction and enrichment. // Keep only Critical events source | where severity == "Critical" // Drop a noisy/unneeded column source | project-away RawData // Enrich with a simple internal/external IP classification (example) source | extend IpLocation = iff(split(ClientIp,".")[0] in ("10","192"), "Internal", "External") These are direct examples from Microsoft’s sample transformations guidance; they are especially relevant because ingestion-time filtering is one of the primary levers for both performance and cost management in Sentinel pipelines. A Sentinel-specific nuance: Microsoft states that Sentinel-enabled Log Analytics workspaces are not subject to Azure Monitor’s filtering ingestion charge, regardless of how much data a transformation filters (while other Azure Monitor transformation cost rules still exist in general). Telemetry schemas and key tables you should call out A “new announcements” article aimed at detection engineers should explicitly name the tables that are impacted by new features: Copilot connector → CopilotActivity table, with a published list of record types (for example, CopilotInteraction and related plugin/workspace/prompt-book operations) and explicit role requirements to enable (Global Administrator or Security Administrator). Defender XDR incident/alert sync → SecurityAlert and SecurityIncident populated at no charge; other Defender data types (Advanced Hunting event tables such as DeviceInfo/EmailEvents) are charged. Sentinel onboarding to Defender advanced hunting: Sentinel alerts tied to incidents are ingested into AlertInfo and accessible in Advanced hunting; SecurityAlert is queryable even if not shown in the schema list in Defender (notable for KQL portability). UEBA “core” tables (engineering relevance: query joins and tuning): IdentityInfo, BehaviorAnalytics, UserPeerAnalytics, Anomalies. UEBA behaviors layer tables (new behavior abstraction): SentinelBehaviorInfo and SentinelBehaviorEntities, created only if behaviors layer is enabled. Microsoft XDR Advanced Hunting lake tier ingestion GA: explicit supported tables from MDE/MDO/MDA (for example DeviceProcessEvents, DeviceNetworkEvents, EmailEvents, UrlClickEvents, CloudAppEvents) and an explicit note that MDI support will follow. Detection and analytics: UEBA and graph UEBA operating model and scoring Microsoft’s UEBA documentation gives you citeable technical detail: UEBA uses machine learning to build behavioral profiles and detect anomalies versus baselines, incorporating peer group analysis and “blast radius evaluation” concepts. Risk scoring is described with two different scoring models: BehaviorAnalytics.InvestigationPriority (0–10) vs Anomalies.AnomalyScore (0–1), with different processing characteristics (near-real-time/event-level vs batch/behavior-level). UEBA Essentials is positioned as a maintained pack of prebuilt queries (including multi-cloud anomaly detection), and Microsoft’s February 2026 update adds detail about expanded anomaly detection across Azure/AWS/GCP/Okta and the anomalies-table-powered queries. Sentinel data lake and graph as the new “analytics substrate” Microsoft’s data lake overview frames a two-tier model: Analytics tier: high-performance, real-time analytics supporting alerting/incident management. Data lake tier: centralized long-term storage for querying and Python-based analytics, designed for retention up to 12 years, with “single-copy” mirroring (data in analytics tier mirrored to lake tier). Microsoft’s graph documentation states that if you already have Sentinel data lake, the required graph is auto-provisioned when you sign into the Defender portal, enabling experiences like hunting graph and blast radius. Microsoft also notes that while the experiences are included in existing licensing, enabling data sources can incur ingestion/processing/storage costs. Automation: AI playbook generator details that matter technically The playbook generator doc contains unusually concrete engineering constraints and required setup. Key technical points to carry into your article: Prerequisites: Security Copilot must be enabled with SCUs available (Microsoft states SCUs aren’t billed for playbook generation but are required), and the Sentinel workspace must be onboarded to Defender. Roles: Sentinel Contributor is required for authoring Automation Rules, and a Detection tuning role in Entra is required to use the generator; permissions may take up to two hours to take effect. Integration Profiles: explicitly defined as Base URL + auth method + required credentials; cannot change API URL/auth method after creation; supports multiple auth methods including OAuth2 client credentials, API key, AWS auth, Bearer/JWT, etc. Enhanced Alert Trigger: designed for broader coverage across Sentinel, Defender, and XDR alerts and tenant-level automation consistency. Limitations: Python only, alerts as the sole input type, no external libraries, max 100 playbooks/tenant, 10-minute runtime, line limits, and separation of enhanced trigger rules from standard alert trigger rules (no automatic migration). APIs and code/CLI (official) Create/update a DCR with Azure CLI (official) Microsoft documents an az monitor data-collection rule create workflow to create/update a DCR from a JSON file, which is directly relevant if your readers build their own “push ingestion” paths outside of CCF Push or need transformations not supported via a guided connector UI. az monitor data-collection rule create \ --location 'eastus' \ --resource-group 'my-resource-group' \ --name 'my-dcr' \ --rule-file 'C:\MyNewDCR.json' \ --description 'This is my new DCR' Send logs via Azure Monitor Ingestion client (Python) (official) Microsoft’s Azure SDK documentation provides a straightforward LogsIngestionClient pattern (and the repo samples document the required environment variables such as DCE, rule immutable ID, and stream name). import os from azure.identity import DefaultAzureCredential from azure.monitor.ingestion import LogsIngestionClient endpoint = os.environ["DATA_COLLECTION_ENDPOINT"] rule_id = os.environ["LOGS_DCR_RULE_ID"] # DCR immutable ID stream_name = os.environ["LOGS_DCR_STREAM_NAME"] # stream name in DCR credential = DefaultAzureCredential() client = LogsIngestionClient(endpoint=endpoint, credential=credential) body = [ {"Time": "2026-03-18T00:00:00Z", "Computer": "host1", "AdditionalContext": "example"} ] # Actual upload method name/details depend on SDK version and sample specifics. # Refer to official ingestion samples and README for the exact call. The repo sample and README explicitly define the environment variables and the use of LogsIngestionClient + DefaultAzureCredential. Sentinel repositories API version retirement (engineering risk) Microsoft’s Sentinel release notes contain an explicit “call to action” that older REST API versions used for Sentinel Repositories will be retired (June 1, 2026) and that Source Control actions using older versions will stop being supported (starting June 15, 2026), recommending migration to specific versions. This is critical for “content-as-code” SOC engineering pipelines. Migration and implementation guidance Prerequisites and planning gates A technically rigorous migration section should treat this as a set of gating checks. Microsoft’s transition guidance highlights several that can materially block or change behavior: Portal transition has no extra cost: Microsoft explicitly states transitioning to the Defender portal has no extra cost (billing remains Sentinel consumption). Data storage and privacy policies change: after onboarding, Defender XDR policies apply even when working with Sentinel data (data retention/sharing differences). Customer-managed keys constraint for data lake: CMK is not supported for data stored in Sentinel data lake; even broader, Sentinel data lake onboarding doc warns that CMK-enabled workspaces aren’t accessible via data lake experiences and that data ingested into the lake is encrypted with Microsoft-managed keys. Region and data residency implications: data lake is provisioned in the primary workspace’s region and onboarding may require consent to ingest Microsoft 365 data into that region if it differs. Data appearance lag when switching tiers: enabling ingestion for the first time or switching between tiers can take 90–120 minutes for data to appear in tables. Step-by-step configuration tasks for the most “new” capabilities Enable lake-tier ingestion for Advanced Hunting tables (GA) Microsoft’s GA announcement provides direct UI steps in the Defender portal: Defender portal → Microsoft Sentinel → Configuration → Tables Select an Advanced Hunting table (from the supported list) Data Retention Settings → choose “Data lake tier” + set retention + save Microsoft states that this allows Defender data to remain accessible in the Advanced Hunting table for 30 days while a copy is sent to Sentinel data lake for long-term retention (up to 12 years) and graph/MCP-related scenarios. Deploy the Microsoft 365 Copilot data connector (public preview) Microsoft’s connector post provides the operational steps and requirements: Install via Content Hub in the Defender portal (search “Copilot”, install solution, open connector page). Enablement requires tenant-level Global Administrator or Security Administrator roles. Data lands in CopilotActivity. Ingestion costs apply based on Sentinel workspace settings or Sentinel data lake tier pricing. Configure multi-tenant content distribution (expanded content types) Microsoft documents: Navigate to “Content Distribution” in Defender multi-tenant management portal. Create/select a distribution profile; choose content types; select content; choose up to 100 workspaces per tenant; save and monitor sync results. Limitations: automation rules that trigger a playbook cannot currently be distributed; alert tuning rules limited to built-in rules (for now). Prerequisites: access to more than one tenant via delegated access; subscription to Microsoft 365 E5 or Office E5. Prepare for Defender XDR connector–driven changes Microsoft explicitly warns that incident creation rules are turned off for Defender XDR–integrated products to avoid duplicates and suggests compensating controls using Defender portal alert tuning or automation rules. It also warns that incident titles will be governed by Defender XDR correlation and recommends avoiding “incident name” conditions in automation rules (tags recommended). Common pitfalls and “what breaks” A strong engineering article should include a “what breaks” section, grounded in Microsoft’s own lists: Schema and field semantics drift: The “standalone vs XDR connector” schema differences doc calls out CompromisedEntity behavior differences, field mapping changes, and alert filtering differences (for example, Defender for Cloud informational alerts not ingested; Entra ID below High not ingested by default). Automation delays and unsupported actions post-onboarding: Transition guidance states automation rules might run up to 10 minutes after alert/incident changes due to forwarding, and that some playbook actions (like adding/removing alerts from incidents) are not supported after onboarding—breaking certain playbook patterns. Incident synchronization boundaries: incidents created in Sentinel via API/Logic App playbook/manual Azure portal aren’t synchronized to Defender portal (per transition doc). Advanced hunting differences after data lake enablement: auxiliary log tables are no longer available in Defender Advanced hunting once data lake is enabled; they must be accessed via data lake exploration KQL experiences. CI/CD failures from API retirement: repository connection create/manage tooling that calls older API versions must migrate by June 1, 2026 to avoid action failures. Performance and cost considerations Microsoft’s cost model is now best explained using tiering and retention: Sentinel data lake tier is designed for cost-effective long retention up to 12 years, with analytics-tier data mirrored to the lake tier as a single copy. For Defender XDR threat hunting data, Microsoft states it is available in analytics tier for 30 days by default; retaining beyond that and moving beyond free windows drives ingestion and/or storage costs depending on whether you extend analytics retention or store longer in lake tier. Ingesting data directly to data lake tier incurs ingestion, storage, and processing costs; retaining in lake beyond analytics retention incurs storage costs. Ingestion-time transformations are a first-class cost lever, and Microsoft explicitly frames filtering as a way to reduce ingestion costs in Log Analytics. Sample deployment checklist Phase Task Acceptance criteria (engineering) Governance Confirm target portal strategy and dates Internal cutover plan aligns with March 31, 2027 retirement; CI/CD deadlines tracked Identity/RBAC Validate roles for onboarding + automation Required Entra roles + Sentinel roles assigned; propagation delays accounted for Data lake readiness Decide whether to onboard to Sentinel data lake CMK policy alignment confirmed; billing subscription owner identified; region implications reviewed Defender XDR integration Choose integration mode and test incident sync Incidents visible within expected latency; bi-directional sync fields behave as expected Schema regression Validate queries/rules against XDR connector schema KQL regression tests pass; CompromisedEntity and filtering changes handled Connector modernization Inventory connectors; plan CCF / CCF Push transitions Function-based connectors migration plan; legacy custom data collection API retirement addressed Automation Pilot AI playbook generator + enhanced triggers Integration Profiles created; generated playbooks reviewed; enhanced trigger scopes correct Multi-tenant operations Configure content distribution if needed Distribution profiles sync reliably; limitations documented; rollback/override plan exists Outage-proofing Update Sentinel repos tooling for API retirement All source-control actions use recommended API versions before June 1, 2026 Use cases and customer impact Detection and response scenarios that map to the new announcements Copilot governance and misuse detection The Copilot connector’s published record types enable detections for scenarios such as unauthorized plugin/workspace/prompt-book operations and anomalous Copilot interactions. Data is explicitly positioned for analytic rules, workbooks, automation, and threat hunting within Sentinel and Sentinel data lake. Long-retention hunting on high-volume Defender telemetry (lake-first approach) Lake-tier ingestion for Advanced Hunting tables (GA) is explicitly framed around scale, cost containment, and retrospective investigations beyond “near-real-time” windows, while keeping 30-day availability in the Advanced Hunting tables themselves. Faster automation authoring and customization (SOAR engineering productivity) Microsoft positions the playbook generator as eliminating rigid templates and enabling dynamic API calls across Microsoft and third-party tools via Integration Profiles, with preview-customer feedback claiming faster automation development (vendor-stated). Multi-tenant SOC standardization (MSSP / large enterprise) Multi-tenant content distribution is explicitly designed to replicate detections, automation, and dashboards across tenants, reducing drift and accelerating onboarding, while keeping execution local to target tenants. Measurable benefit dimensions (how to discuss rigorously) Most Microsoft sources in this announcement set are descriptive (not benchmark studies). A rigorous article should therefore describe what you can measure, and label any numeric claims as vendor-stated. Recommended measurable dimensions grounded in the features as documented: Time-to-detect / time-to-ingest: CCF Push is positioned as real-time, event-driven delivery vs polling-based ingestion. Time-to-triage / time-to-investigate: UEBA layers (Anomalies + Behaviors) are designed to summarize and prioritize activity, with explicit scoring models and tables for query enrichment. Incident queue pressure: Defender XDR grouping/enrichment is explicitly described as reducing SOC queue size and time to resolve. Cost-per-retained-GB and query cost: tiering rules and retention windows define cost tradeoffs; ingestion-time transformations reduce cost by dropping unneeded rows/columns. Vendor-stated metrics: Microsoft’s March 2026 “What’s new” roundup references an external buyer’s guide and reports “44% reduction in total cost of ownership” and “93% faster deployment times” as outcomes for organizations using Sentinel (treat as vendor marketing unless corroborated by an independent study in your environment). Comparison of old vs new Microsoft capabilities and competitor XDR positioning Old vs new (Microsoft) Capability “Older” operating model (common patterns implied by docs) “New” model emphasized in announcements/release notes Primary SOC console Split experience (Azure portal Sentinel + Defender portal XDR) Defender portal as the primary unified SecOps surface; Azure portal sunset Incident correlation engine Sentinel correlation features (e.g., Fusion in Azure portal) Defender XDR correlation engine replaces Fusion for incident creation after onboarding; incident provider always “Microsoft XDR” in Defender portal mode Automation authoring Logic Apps playbooks + automation rules Adds AI playbook generator (Python) + Enhanced Alert Trigger, with explicit constraints/limits Custom ingestion Data Collector API legacy patterns + manual DCR/DCE plumbing CCF Push built on Logs Ingestion API; emphasizes automated provisioning and transformation support Long retention Primarily analytics-tier retention strategies Data lake tier supports up to 12 years; lake-tier ingestion for AH tables GA; explicit tier/cost model Graph-driven investigations Basic incident graphs Blast radius analysis + hunting graph experiences built on Sentinel data lake + graph Competitor XDR offerings (high-level, vendor pages) The table below is intentionally “high-level” and marks details as unspecified unless explicitly stated on the cited vendor pages. Vendor Positioning claims (from official vendor pages) Notes / unspecified items CrowdStrike Falcon Insight XDR is positioned as “AI-native XDR” for “endpoint and beyond,” emphasizing detection/response and threat intelligence. Data lake architecture, ingestion transformation model, and multi-tenant content distribution specifics are unspecified in cited sources. Palo Alto Networks Cortex XDR is positioned as integrated endpoint security with AI-driven operations and broader visibility; vendor site highlights outcome metrics in customer stories and “AI-driven endpoint security.” Lake/graph primitives, connector framework model, and schema parity details are unspecified in cited sources. SentinelOne Singularity XDR is positioned as AI-powered response with automated workflows across the environment; emphasizes machine-speed incident response. Specific SIEM-style retention tiering and documented ingestion-time transformations are unspecified in cited sources.567Views10likes0CommentsIntune device compliance status not evaluated
Has anyone encountered devices taking absolutely forever to evaluate overall compliance after user enrollment ESP? (pre-provisioned devices). They just sit there in "not evaluated" and get blocked by CA policy. Most come good eventually, but some literally are taking employees offline for the whole day. These are all Win11 AAD-joined. Microsoft has only offered me the standard "may take up to 8 hours, goodbye" response but I am pulling my hair out trying to figure out if this is just an Intune thing, or is there a trick I am missing? Some of them take so long that I give up and swap out the device so they can start working. The individual policies are evaluating just fine, but the overall status is way behind. I'd even prefer them to be non-compliant because at least then the grace period would kick in. I have had very limited success with rebooting and kicking off all the syncs / check access buttons, but I have a feeling those buttons have just been a placebo. It happens very sporadically too on about half of devices the user doesn't even notice it's that quick. Thanks for any advice8KViews0likes5CommentsIntegrate Defender for Cloud Apps w/ Azure Firewall or VPN Gateway
Hello, Recently I have been tasked with securing our openAI implementation. I would like to marry the Defender for Cloud Apps with the sanctioning feature and the Blocking unsanctioned traffic like the Defender for Endpoint capability. To do this, I was only able to come up with: creating a windows 2019/2022 server, with RRAS, and two interfaces in Azure, one Public, and one private. Then I add Defender for Endpoint, Optimized to act as a traffic moderator, integrated the solution with Defender for cloud apps, with BLOCK integration enabled. I can then sanction each of the desired applications, closing my environment and only allowing sanctioned traffic to sanctioned locations. This solution seemed : difficult to create, not the best performer, and the solution didn't really take into account the ability of the router to differentiate what solution was originating the traffic, which would allow for selective profiles depending on the originating source. Are there any plans on having similar solutions available in the future from: VPN gateway (integration with Defender for Cloud Apps), or Azure Firewall -> with advanced profile. The Compliance interface with the sanctioning traffic feature seems very straight forward .186Views0likes1CommentGetting Started with Behave: Writing Cucumber Tests in VS Code
What is Behave? Behave is a BDD test framework for Python that allows you to write tests in plain English using Given–When–Then syntax, backed by Python step definitions. Key benefits: Human‑readable test scenarios using Gherkin Strong alignment between business requirements and test automation Easy integration with CI/CD pipelines Lightweight and IDE‑friendly Prerequisites Before getting started, ensure you have the following installed: Python 3.10+ Visual Studio Code Basic understanding of Python Familiarity with BDD concepts (Given / When / Then) Steps Download the sample demo zip from github download Step 1: Create a Virtual Environment and activate it. python -m venv venv .venv\Scripts\activate Install Dependencies pip install behave requests Step 2: Install VS Code Extensions To get a first‑class experience in VS Code, install the following extensions: Python (Microsoft) Gherkin (for .feature syntax highlighting) Behave VSC (optional but recommended) The Behave VSC extension enables: Running tests directly from VS Code Step definition navigation Gherkin auto‑completion Test explorer integration Folder Structure Why This Structure? features/ – contains all Gherkin feature files steps/ – contains Python step implementations environment.py – optional hooks for setup/teardown config/configuration.py - for lifecycle hooks behave.ini – configuration file for Behave Step 3: Write Your First Feature File Feature: Login functionality Login Scenario: Successful login Given the application is running When the user enters valid credentials Then the user should see the dashboard Step 4: Writing Step Definitions from behave import given, when, then @given('the user is on the login page') def step_user_on_login_page(context): print("User navigates to login page") @when('the user enters valid credentials') def step_user_enters_credentials(context): print("User enters username and password") @then('the user should be redirected to the dashboard') def step_user_redirected(context): print("User is redirected to dashboard") Step 5: Adding Test Configuration (configuration.py) Create config/configuration.py to centralize environment-specific settings. This helps avoid hardcoding values across test files. class TestConfig: BASE_URL = "https://example.com" TIMEOUT = 30 BROWSER = "chrome" Step 6: Using Fixtures with environment.py The environment.py file is Behave’s hook mechanism. It runs before and after tests, similar to fixtures in pytest. Create features/environment.py: from config.configuration import TestConfig def before_all(context): print("Setting up test environment") context.config_data = TestConfig() def before_scenario(context, scenario): print(f"Starting scenario: {scenario.name}") def after_scenario(context, scenario): print(f"Finished scenario: {scenario.name}") def after_all(context): print("Tearing down test environment") Common Use Cases Initialize browsers or API clients Load environment variables Clean up test data Open/close DB connections Step 7: Optional Behave Configuration File Create behave.ini for execution settings. This helps during debugging by showing logs directly in the console. [behave] stdout_capture = false stderr_capture = false log_capture = false Step 8: Running Tests From the project root, run: behave To run a specific feature: behave features/login.feature Run by tag behave -t Login Best Practices ✔ Keep feature files business-readable ✔ Avoid logic in feature files ✔ Reuse steps wherever possible ✔ Centralize configs and fixtures ✔ Use tags for selective executionPhi-4-Reasoning-Vision-15B: Use Cases In-Depth
Phi-4-Reasoning-vision-15B is Microsoft's latest vision reasoning model released on Microsoft Foundry. It combines high-resolution visual perception with selective, task-aware reasoning, making it the first model in the Phi-4 family to simultaneously achieve both "seeing clearly" and "thinking deeply" as a small language model (SLM). Traditional vision models only perform passive perception — recognizing "what's in" an image. Phi-4-Reasoning-Vision-15B goes further by performing structured, multi-step reasoning: understanding visual structure in images, connecting it with textual context, and reaching actionable conclusions. This enables developers to build intelligent applications ranging from chart analysis to GUI automation. Core Design Features 2.1 Selective Reasoning The model's most critical design feature is its hybrid reasoning behavior. It can switch between "reasoning mode" and "non-reasoning mode" based on the prompt: When deep reasoning is needed (e.g., math problems, logical analysis) → Multi-step reasoning chain is activated When fast perception is sufficient (e.g., OCR, element localization) → Direct output with reduced latency 2.2 Three Thinking Modes (from Notebook Examples) Developers can precisely control reasoning behavior via the thinking_mode parameter: Mode Trigger Description Best For hybrid (Mixed) Default Model autonomously decides whether deep reasoning is needed General use, balancing speed and accuracy think (Deep Thinking) Appends <think> token Forces full reasoning chain Complex math / science / logic problems nothink (Fast Response) Appends <nothink> token Skips reasoning chain, outputs directly Low-latency perception tasks, simple Q&A The corresponding code implementation: def run_inference(processor, model, prompt, image, thinking_mode="hybrid"): ## FORM MESSAGE AND LOAD IMAGE messages = [ { "role": "user", "content": prompt, } ] ## PROCESS INPUTS prompt = processor.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, return_dict=False, ) if thinking_mode == "think": prompt = str(prompt) + "<think>" elif thinking_mode == "nothink": prompt = str(prompt) + "<|dummy_84|>" print(f"Prompt: {prompt}") inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device) ## GENERATE RESPONSE output_ids = model.generate( **inputs, max_new_tokens=1024, temperature=None, top_p=None, do_sample=False, use_cache=False, ) ## DECODE RESPONSE sequence_length = inputs["input_ids"].shape[1] sequence_length -= 1 if thinking_mode == "think" else 0 # remove the extra token for nothink mode new_output_ids = output_ids[:, sequence_length:] model_output = processor.batch_decode( new_output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False )[0] return model_output This design allows developers to dynamically balance latency and accuracy at runtime — essential for real-time interactive applications. Key Use Cases Use Case 1: GUI Agents (Computer Use Agents) This is one of the model's most important application areas.The model receives a screenshot and a natural language instruction, then outputs the normalized bounding box coordinates for the target UI element. The Notebook also provides a plot_boxes() visualization function that compares model predictions (red box) against ground truth annotations (green box). Real-World Example — E-Commerce Shopping Agent: As described in the official documentation, in retail scenarios the model serves as the perception layer for computer-use agents: Screen comprehension: Identifies products, prices, filters, promotions, buttons, and cart states Grounded output: Produces actionable coordinates for upstream agent models (e.g., Fara-7B) to execute clicks, scrolls, and other interactions Real-time decision support: Compact model size and low-latency inference, suitable for navigating dense product listings and comparing options Use Case 2: Mathematical and Scientific Visual Reasoning Typical applications: Interpreting geometric figures and function graphs for problem-solving Analyzing scientific experiment diagrams and data charts Education: Students photograph and upload problems; the model shows the complete reasoning process and solution steps Use Case 3: Document, Chart, and Table Understanding Typical applications: IT Operations: Interpreting monitoring dashboards, performance charts, and incident reports to assist diagnosis and decision-making Financial Analysis: Extracting metrics from report screenshots and interpreting trends Enterprise Report Automation: Processing scanned documents and tables to generate structured summaries Samples 1. Using Phi-4-Reasoning-Vision-15B to detect jaywalking Go to - Sample Code 2. Using Phi-4-Reasoning-Vision-15B to math Go to - Sample Code 3. Using Phi-4-Reasoning-Vision-15B for GUI Agent Go to - Sample Code Model Comparison at a Glance Below is a comparison of Phi-4-Reasoning-Vision-15B against comparable models on key tasks: No Thinking Mode Thinking Mode Phi-4-Reasoning-Vision-15B shows clear advantages in math reasoning and GUI grounding tasks while remaining competitive in general multimodal understanding. Summary Phi-4-Reasoning-Vision-15B represents a significant milestone for small vision reasoning models: Sees clearly: High-resolution visual perception supporting documents, charts, UI screenshots, and more Thinks deeply: Selective multi-step reasoning chains that rival larger models on complex tasks Runs fast: 15B parameters + NoThink mode, suitable for real-time interactive applications Adapts flexibly: Three thinking modes switchable on the fly, letting developers dynamically balance accuracy and latency at runtime Whether building e-commerce shopping agents, IT operations assistants, or educational tutoring tools, this model provides a complete capability chain from "seeing" to "understanding" to "acting." Resources 1. Read official Blog - Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model 2. Learn more about Phi-4-reasoning-vision in Huggingface - https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B 3. Learn more about Microsoft Phi Family - Microsoft Phi CookBook606Views0likes0Comments