javascript


Implementing the Backend-for-Frontend (BFF) / Curated API Pattern Using Azure API Management
Modern digital applications rarely serve a single type of client. Web portals, mobile apps, partner integrations, and internal tools often consume the same backend services, yet each has different performance, payload, and UX requirements. Exposing backend APIs directly to all clients frequently leads to over-fetching, chatty networks, and tight coupling between UI and backend domain models. This is where a Curated API or Backend-for-Frontend API design pattern becomes useful.

What Is the Backend-for-Frontend (BFF) Pattern?

The Backend-for-Frontend (BFF) pattern, also known as the Curated API pattern, solves this problem by introducing a client-specific API layer that shapes, aggregates, and optimizes data specifically for the consuming experience. The Azure Architecture Center has very good architectural guidance on this [see the 1st link in the Citations section].

The BFF pattern introduces a dedicated backend layer for each frontend experience. Instead of exposing generic backend services directly, the BFF:
- Aggregates data from multiple backend services
- Filters and reshapes responses
- Optimizes payloads for a specific client
- Shields clients from backend complexity and change

Each frontend (web, mobile, partner) can evolve independently, without forcing backend services to accommodate UI-specific concerns.

Why Azure API Management Is a Natural Fit for BFF

Azure API Management is commonly used as an API gateway, but its policy engine enables much more than routing and security. Using APIM policies, you can:
- Call multiple backend services (sequentially or in parallel)
- Transform request and response payloads to provide a uniform experience
- Apply caching, rate limiting, authentication, and resiliency policies

All of this can be achieved without modifying backend code, making APIM an excellent place to implement the BFF pattern.

When Should You Use a Curated API in APIM?
Using APIM as a BFF makes sense when:
- Frontend clients require optimized, experience-specific payloads
- Backend services must remain generic and reusable
- You want to reduce round trips from mobile or low-bandwidth clients
- You want to implement uniform policies for cross-cutting concerns such as authentication/authorization, caching, rate limiting, and logging
- You want to avoid building and operating a separate aggregation service
- You need strong governance, security, and observability at the API layer

How the BFF Pattern Works in Azure API Management

There is a GitHub repository [see the 2nd link in the Citations section] that provides a wealth of information and samples on how to create complex APIM policies. I recently contributed to this repository with a sample policy for Curated APIs [see the 3rd link in the Citations section].

At a high level, the policy follows this flow:
- APIM receives a single client request
- APIM issues parallel calls to multiple backend services, as shown below:

    <wait for="all">
        <send-request mode="copy" response-variable-name="operation1" timeout="{{bff-timeout}}" ignore-error="false">
            <set-url>@("{{bff-baseurl}}/operation1?param1=" + context.Request.Url.Query.GetValueOrDefault("param1", "value1"))</set-url>
        </send-request>
        <send-request mode="copy" response-variable-name="operation2" timeout="{{bff-timeout}}" ignore-error="false">
            <set-url>{{bff-baseurl}}/operation2</set-url>
        </send-request>
        <send-request mode="copy" response-variable-name="operation3" timeout="{{bff-timeout}}" ignore-error="false">
            <set-url>{{bff-baseurl}}/operation3</set-url>
        </send-request>
        <send-request mode="copy" response-variable-name="operation4" timeout="{{bff-timeout}}" ignore-error="false">
            <set-url>{{bff-baseurl}}/operation4</set-url>
        </send-request>
    </wait>

A few things to consider:
- The wait policy allows us to make multiple requests in parallel using nested send-request policies.
- The for="all" attribute value means that policy execution awaits all the nested send-request calls before moving on to the next policy.
- {{bff-baseurl}}: This example assumes a single base URL for all endpoints, but that is not required. The calls can be made to any endpoint.
- response-variable-name: This attribute sets a unique variable name to hold the response object from each of the parallel calls. The variable is used later in the policy to transform the responses and produce the curated result.
- timeout: This example assumes a uniform timeout for every endpoint, but it can vary per call.
- ignore-error: Set this to true only when you are not concerned about the response from the backend (as in a fire-and-forget request); otherwise keep it false so that the response variable captures the response together with its error code.

Once responses from all the requests have been received (or timed out), policy execution moves to the next policy. The responses from all requests are then collected and transformed into a single response:

    <set-variable name="finalResponseData" value="@{
        JObject finalResponse = new JObject();
        // This assumes the final success status (if all backend calls succeed) is 200 OK; it can be customized.
        int finalStatus = 200;
string finalStatusReason = "OK"; void ParseBody(JObject element, string propertyName, IResponse response){ string body = ""; if(response!=null){ body = response.Body.As<string>(); try{ var jsonBody = JToken.Parse(body); element.Add(propertyName, jsonBody); } catch(Exception ex){ element.Add(propertyName, body); } } else{ element.Add(propertyName, body); //Add empty body if the response was not captured } } JObject PrepareResponse(string responseVariableName){ JObject responseElement = new JObject(); responseElement.Add("operation", responseVariableName); IResponse response = context.Variables.GetValueOrDefault<IResponse>(responseVariableName); if(response == null){ finalStatus = 207; // if any of the responses are null; the final status will be 207 finalStatusReason = "Multi Status"; ParseBody(responseElement, "error", response); return responseElement; } int status = response.StatusCode; responseElement.Add("status", status); if(status == 200){ // This assumes all the backend APIs return 200, if they return other success responses (e.g. 201) add them here ParseBody(responseElement, "body", response); } else{ // if any of the response codes are non success, the final status will be 207 finalStatus = 207; finalStatusReason = "Multi Status"; ParseBody(responseElement, "error", response); } return responseElement; } // Gather responses into JSON Array // Pass on the each of the response variable names here. JArray finalResponseBody = new JArray(); finalResponseBody.Add(PrepareResponse("operation1")); finalResponseBody.Add(PrepareResponse("operation2")); finalResponseBody.Add(PrepareResponse("operation3")); finalResponseBody.Add(PrepareResponse("operation4")); // Populate finalResponse with aggregated body and status information finalResponse.Add("body", finalResponseBody); finalResponse.Add("status", finalStatus); finalResponse.Add("reason", finalStatusReason); return finalResponse; }" /> What this code does is prepare the response into a single JSON Object. 
It does so with the help of the PrepareResponse function. The JSON not only collects the response body from each response variable, but also captures the individual status codes and determines the final status code from them. For the purpose of this example, I have assumed all operations are GET operations: if every operation returns 200, the overall response is 200 OK; otherwise it is 207 Multi-Status, matching the finalStatus value set in the policy code above. This can be customized to the actual scenario as needed.

Once the final response variable is ready, construct and return a single response based on the above calculation:

    <return-response>
        <set-status code="@((int)((JObject)context.Variables["finalResponseData"]).SelectToken("status"))" reason="@(((JObject)context.Variables["finalResponseData"]).SelectToken("reason").ToString())" />
        <set-body>@(((JObject)context.Variables["finalResponseData"]).SelectToken("body").ToString(Newtonsoft.Json.Formatting.None))</set-body>
    </return-response>

This effectively turns APIM into an experience-specific backend tailored to frontend needs.

When Not to Use APIM for a BFF Implementation

While this approach works well when you want to curate a few responses together and apply a unified set of policies, there are some cases where you might want to rethink it:
- When the need for transformation is complex. Maintaining a lot of code in APIM is not fun. If the response transformation requires a lot of code that needs to be unit tested and that might change over time, it might be better to stand up a curation service. Azure Functions and Azure Container Apps are well suited for this.
- When each backend endpoint requires very complex request transformation. That also increases the amount of policy code and indicates a need for an independent curation service.
- When you are not already using APIM. This use case alone does not warrant adding APIM to your architecture just to implement BFF.
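To make the aggregated response shape concrete from the consumer's side, here is a minimal TypeScript sketch of how a frontend might split the returned array into successes and failures. It assumes the payload is the JSON array built by the policy above, with `operation`, `status`, and either `body` or `error` on each entry. The names `CuratedEntry`, `CuratedResult`, and `splitCuratedResponse`, as well as the sample payload, are hypothetical and not part of the published policy.

```typescript
// One entry of the aggregated array, mirroring the properties the policy sets:
// "operation", "status", and either "body" (on 200) or "error" (otherwise).
interface CuratedEntry {
  operation: string;
  status?: number; // absent when no response was captured for the call
  body?: unknown;
  error?: unknown;
}

// Hypothetical client-side view: successes and failures keyed by operation name.
interface CuratedResult {
  succeeded: Record<string, unknown>;
  failed: Record<string, { status: number | undefined; error: unknown }>;
}

function splitCuratedResponse(entries: CuratedEntry[]): CuratedResult {
  const result: CuratedResult = { succeeded: {}, failed: {} };
  for (const entry of entries) {
    if (entry.status === 200) {
      // The policy stores the parsed backend payload under "body" for 200 responses.
      result.succeeded[entry.operation] = entry.body;
    } else {
      // Anything else (non-200 or missing response) is reported under "error".
      result.failed[entry.operation] = { status: entry.status, error: entry.error ?? "" };
    }
  }
  return result;
}

// Made-up payload in the shape a 207 response from the policy would carry.
const sampleEntries: CuratedEntry[] = [
  { operation: "operation1", status: 200, body: { id: 1 } },
  { operation: "operation2", status: 500, error: "Internal Server Error" },
  { operation: "operation3", error: "" },
];

const split = splitCuratedResponse(sampleEntries);
console.log(Object.keys(split.succeeded)); // [ 'operation1' ]
console.log(Object.keys(split.failed));    // [ 'operation2', 'operation3' ]
```

Keying the result by operation name lets each UI component read only the slice it needs and decide for itself how to render a partial failure.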
Conclusion

Using APIM is one of many approaches you can take to create a BFF layer on top of your existing endpoints. Let me know in the comments what you think of this approach.

Citations
1. Azure Architecture Center – Backend-for-Frontends Pattern
2. Azure API Management Policy Snippets (GitHub)
3. Curated APIs Policy Example (GitHub)
4. Send-request Policy Reference
SajalMukherjee
Mar 11, 2026, Microsoft Developer Community Blog
Building a Privacy-First Hybrid AI Briefing Tool with Foundry Local and Azure OpenAI
Introduction Management consultants face a critical challenge: they need instant AI-powered insights from sensitive client documents, but traditional cloud-only AI solutions create unacceptable data privacy risks. Every document uploaded to a cloud API potentially exposes confidential client information, violates data residency requirements, and creates compliance headaches. The solution lies in a hybrid architecture that combines the speed and privacy of on-device AI with the sophistication of cloud models—but only when explicitly requested. This article walks through building a production-ready briefing assistant that runs AI inference locally first, then optionally refines outputs using Azure OpenAI for executive-quality presentations. We'll explore a sample implementation using FL-Client-Briefing-Assistant, built with Next.js 14, TypeScript, and Microsoft Foundry Local. You'll learn how to architect privacy-first AI applications, implement sub-second local inference, and design transparent hybrid workflows that give users complete control over their data. Why Hybrid AI Architecture Matters for Enterprise Applications Before diving into implementation details, let's understand why a hybrid approach is essential for enterprise AI applications, particularly in consulting and professional services. Cloud-only AI services like OpenAI's GPT-4 offer remarkable capabilities, but they introduce several critical challenges. First, every API call sends your data to external servers, creating audit trails and potential exposure points. For consultants handling merger documents, financial reports, or strategic plans, this is often a non-starter. Second, cloud APIs introduce latency, typically 2-5 seconds per request due to network round-trips and queue times. Third, costs scale linearly with usage, making high-volume document analysis expensive at scale. Local-only AI solves privacy and latency concerns but sacrifices quality. 
Small language models (SLMs) running on laptops produce quick summaries, but they lack the nuanced reasoning and polish needed for C-suite presentations. You get fast, private results that may require significant manual refinement. The hybrid approach gives you the best of both worlds: instant, private local processing as the default, with optional cloud refinement only when quality matters most. This architecture respects data privacy by default while maintaining the flexibility to produce executive-grade outputs when needed. Architecture Overview: Three-Layer Design for Privacy and Performance The FL-Client-Briefing-Assistant implements a clean three-layer architecture that separates concerns and ensures privacy at every level. At the frontend, a Next.js 14 application provides the user interface with strong TypeScript typing throughout. Users interact with four quick-action templates: document summarization, talking points generation, risk analysis, and executive summaries. The UI clearly indicates which model (local or cloud) processed each request, ensuring transparency. The middle tier consists of Next.js API routes that act as orchestration endpoints. These routes validate requests using Zod schemas, route to appropriate inference services, and enforce privacy settings. Critically, the API layer never persists user content unless explicitly opted in via privacy settings. The inference layer contains two distinct services. The local service uses Foundry Local SDK to communicate with a locally running Phi-4 model (or similar SLM). This provides sub-second inference, typical 500ms-1s response times, completely offline. The cloud service connects to Azure OpenAI using the official JavaScript SDK, accessed via Managed Identity or API keys, with proper timeout and retry logic. Setting Up Foundry Local for On-Device Inference Foundry Local is Microsoft's runtime for running AI models entirely on your device—no internet required, no data leaving your machine. 
Here's how to get it running for this application. First, install Foundry Local on Windows using Windows Package Manager:

    winget install Microsoft.FoundryLocal

After installation, start the service and verify that it is ready:

    foundry service start
    foundry service status

The status command shows the service endpoint, which runs on a dynamically assigned port, for example http://127.0.0.1:5272. This port can change between restarts, so your application should query it programmatically rather than hard-code it.

Next, load an appropriate model. For briefing tasks, Phi-4 Mini provides an excellent balance of quality and speed. The command below uses the model alias phi-4 as given in the sample; check that the alias matches the model you intend to run:

    foundry model load phi-4

The model downloads (approximately 3.6GB) and loads into memory. The first run takes 2-5 minutes, and the downloaded model is kept for later sessions. Once loaded, inference is fast: most requests complete in under 1 second.

In your application, configure the connection in .env.local. Because the Foundry Local port is dynamic, make sure you enter the port that the service currently reports:

    FOUNDRY_LOCAL_ENDPOINT=http://127.0.0.1:****

The application uses the Foundry Local SDK to query the running service:

    // Client class and options as used in the sample; verify them against the
    // foundry-local-sdk version you install.
    import { FoundryLocalClient } from 'foundry-local-sdk';

    const client = new FoundryLocalClient({
      endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT
    });

    const response = await client.chat.completions.create({
      model: 'phi-4',
      messages: [
        { role: 'system', content: 'You are a professional consultant assistant.' },
        { role: 'user', content: 'Summarize this document: ...' }
      ],
      max_tokens: 500,
      temperature: 0.3
    });

This code demonstrates several best practices:
- Explicit model specification: Always name the model to ensure consistency across environments
- System message framing: Set the appropriate professional context for consulting use cases
- Conservative temperature: Use 0.3 for factual summarization tasks to reduce hallucination
- Token limits: Cap outputs to prevent excessive generation times and costs

Implementing Privacy-First API Routes

The Next.js API routes form the security boundary of the application.
Every request must be validated, sanitized, and routed according to privacy settings before reaching inference services. Here's the core local inference route (app/api/briefing/local/route.ts):

    import { NextRequest, NextResponse } from 'next/server';
    import { z } from 'zod';
    import { FoundryLocalClient } from 'foundry-local-sdk';

    const RequestSchema = z.object({
      prompt: z.string().min(10).max(5000),
      template: z.enum(['summary', 'talking-points', 'risk-analysis', 'executive']),
      context: z.string().optional()
    });

    export async function POST(request: NextRequest) {
      try {
        // Validate and parse request body
        const body = await request.json();
        const validated = RequestSchema.parse(body);

        // Initialize Foundry Local client
        const client = new FoundryLocalClient({
          endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT!
        });

        // Build system prompt based on template
        const systemPrompts = {
          'summary': 'You are a consultant creating concise document summaries.',
          'talking-points': 'You are preparing structured talking points for meetings.',
          'risk-analysis': 'You are analyzing risks and opportunities systematically.',
          'executive': 'You are crafting executive-level briefing notes.'
        };

        // Execute local inference
        const startTime = Date.now();
        const completion = await client.chat.completions.create({
          model: 'phi-4',
          messages: [
            { role: 'system', content: systemPrompts[validated.template] },
            { role: 'user', content: validated.prompt }
          ],
          temperature: 0.3,
          max_tokens: 500
        });
        const latency = Date.now() - startTime;

        // Return structured response with metadata
        return NextResponse.json({
          content: completion.choices[0].message.content,
          model: 'phi-4 (local)',
          latency_ms: latency,
          tokens: completion.usage?.total_tokens,
          timestamp: new Date().toISOString()
        });
      } catch (error) {
        if (error instanceof z.ZodError) {
          return NextResponse.json(
            { error: 'Invalid request format', details: error.issues },
            { status: 400 }
          );
        }
        console.error('Local inference error:', error);
        // A caught value is typed as unknown, so narrow it before reading .message
        const message = error instanceof Error ? error.message : 'Unknown error';
        return NextResponse.json(
          { error: 'Inference failed', message },
          { status: 500 }
        );
      }
    }

This implementation demonstrates several critical security and quality patterns:
- Request validation with Zod: Every field is type-checked and bounded before processing, so malformed or oversized inputs are rejected early (validation alone does not stop prompt injection)
- Template-based system prompts: Different use cases get optimized prompts, improving output quality and consistency
- Comprehensive error handling: Validation errors, inference failures, and network issues are caught and reported with appropriate HTTP status codes
- Performance tracking: Latency measurement enables monitoring and helps users understand response times
- Metadata enrichment: Responses include model attribution, token usage, and timestamps for auditing

The cloud refinement route follows a similar pattern but adds privacy checks:

    // Other imports and RequestSchema are the same as in the local route
    import OpenAI from 'openai';

    export async function POST(request: NextRequest) {
      try {
        const body = await request.json();
        const validated = RequestSchema.parse(body);

        // Check privacy settings from cookie/header
        const confidentialMode = request.cookies.get('confidential-mode')?.value === 'true';
        if (confidentialMode) {
          return NextResponse.json(
            { error: 'Cloud refinement disabled in confidential mode' },
            { status: 403 }
          );
        }

        // Proceed with Azure OpenAI call only if privacy allows
        const client = new OpenAI({
          apiKey: process.env.AZURE_OPENAI_KEY,
          baseURL: process.env.AZURE_OPENAI_ENDPOINT,
          defaultHeaders: { 'api-key': process.env.AZURE_OPENAI_KEY }
        });

        const completion = await client.chat.completions.create({
          model: process.env.AZURE_OPENAI_DEPLOYMENT!,
          messages: [/* ... */],
          temperature: 0.5, // Slightly higher for creative refinement
          max_tokens: 800
        });

        return NextResponse.json({
          content: completion.choices[0].message.content,
          model: `${process.env.AZURE_OPENAI_DEPLOYMENT} (cloud)`,
          privacy_notice: 'Content processed by Azure OpenAI',
          // ... metadata
        });
      } catch (error) {
        // Error handling
      }
    }

The confidential mode check is crucial: it ensures that even if a user accidentally clicks the refinement button, no data leaves the device when privacy mode is enabled. This fail-safe design prevents data leakage through UI mistakes or automated workflows. Note that the route as written treats a missing cookie as non-confidential, so the client must actually set the confidential-mode cookie for the check to take effect.

Building the Frontend: Transparent Privacy Controls

The user interface must make privacy decisions explicit and visible. Users need to understand which AI service processed their content and make informed choices about cloud refinement.
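The cloud route above reads its privacy setting from a `confidential-mode` cookie. The following is a minimal sketch, not the repository's implementation, of a stricter variant of that check: it permits cloud refinement only when the cookie explicitly says `false`, so a missing or malformed cookie keeps the request confidential. The helper names `buildConfidentialCookie`, `readCookie`, and `isCloudAllowed`, and the cookie attributes, are assumptions made for illustration.

```typescript
// Cookie name read by the cloud refinement route.
const CONFIDENTIAL_COOKIE = "confidential-mode";

// Serialize the preference into a cookie string (attributes are illustrative).
function buildConfidentialCookie(
  enabled: boolean,
  maxAgeSeconds: number = 60 * 60 * 24 * 30
): string {
  return `${CONFIDENTIAL_COOKIE}=${enabled}; Path=/; Max-Age=${maxAgeSeconds}; SameSite=Strict`;
}

// Read a single cookie value out of a raw Cookie header.
function readCookie(
  cookieHeader: string | null | undefined,
  name: string
): string | undefined {
  if (!cookieHeader) return undefined;
  for (const part of cookieHeader.split(";")) {
    const separator = part.indexOf("=");
    if (separator === -1) continue;
    if (part.slice(0, separator).trim() === name) {
      return part.slice(separator + 1).trim();
    }
  }
  return undefined;
}

// Fail closed: cloud refinement is allowed only when the cookie is exactly "false".
function isCloudAllowed(cookieHeader: string | null | undefined): boolean {
  return readCookie(cookieHeader, CONFIDENTIAL_COOKIE) === "false";
}

console.log(isCloudAllowed(null));                                  // false
console.log(isCloudAllowed("theme=dark; confidential-mode=false")); // true
console.log(isCloudAllowed("confidential-mode=true"));              // false
```

In a browser, the preference could be written with `document.cookie = buildConfidentialCookie(value)` whenever the user changes the setting, so that the server-side check and the UI state stay in agreement.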
The main briefing interface (app/page.tsx) implements this transparency through clear visual indicators:

    'use client';

    import { useState, useEffect } from 'react';
    import { PrivacySettings } from '@/components/PrivacySettings';

    // Minimal shape of a result as used by this component
    type BriefingResult = {
      content: string;
      model: string;
      latency_ms: number;
      template: string;
      processedBy: 'local' | 'cloud';
    };

    export default function BriefingAssistant() {
      const [confidentialMode, setConfidentialMode] = useState(true); // Privacy by default
      const [content, setContent] = useState('');
      const [result, setResult] = useState<BriefingResult | null>(null);
      const [loading, setLoading] = useState(false);

      // Load privacy preference from localStorage
      useEffect(() => {
        const saved = localStorage.getItem('confidential-mode');
        if (saved !== null) {
          setConfidentialMode(saved === 'true');
        }
      }, []);

      async function generateBriefing(template: string, useCloud: boolean = false) {
        if (useCloud && confidentialMode) {
          alert('Cloud refinement is disabled in confidential mode. Adjust settings to enable.');
          return;
        }

        setLoading(true);
        const endpoint = useCloud ? '/api/briefing/cloud' : '/api/briefing/local';

        try {
          const response = await fetch(endpoint, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ prompt: content, template })
          });
          const data = await response.json();
          // Keep the template so the refine button can reuse it
          setResult({ ...data, template, processedBy: useCloud ? 'cloud' : 'local' });
        } catch (error) {
          console.error('Briefing generation failed:', error);
        } finally {
          setLoading(false);
        }
      }

      return (
        <div className="briefing-assistant">
          <header>
            <h1>Client Briefing Assistant</h1>
            <div className="status-bar">
              <span className={confidentialMode ? 'confidential' : 'standard'}>
                {confidentialMode ? '🔒 Confidential Mode' : '🌐 Standard Mode'}
              </span>
              <PrivacySettings confidentialMode={confidentialMode} onChange={setConfidentialMode} />
            </div>
          </header>

          <div className="quick-actions">
            <button onClick={() => generateBriefing('summary')}>📄 Summarize Document</button>
            <button onClick={() => generateBriefing('talking-points')}>💬 Generate Talking Points</button>
            <button onClick={() => generateBriefing('risk-analysis')}>🎯 Risk Analysis</button>
            <button onClick={() => generateBriefing('executive')}>📊 Executive Summary</button>
          </div>

          <textarea
            value={content}
            onChange={(e) => setContent(e.target.value)}
            placeholder="Paste client document or meeting notes here..."
          />

          {result && (
            <div className="result-card">
              <div className="result-header">
                <span className="model-badge">{result.model}</span>
                <span className="latency">{result.latency_ms}ms</span>
              </div>
              <div className="result-content">{result.content}</div>
              {result.processedBy === 'local' && !confidentialMode && (
                <button onClick={() => generateBriefing(result.template, true)} className="refine-btn">
                  ✨ Refine for Executive Presentation
                </button>
              )}
            </div>
          )}
        </div>
      );
    }

This interface design embodies several principles of responsible AI UX:
- Privacy by default: Confidential mode is enabled unless explicitly changed, so cloud usage requires multiple intentional actions
- Clear attribution: Every result shows which model generated it and how long it took, building user trust through transparency
- Conditional refinement: The cloud refinement button only appears when privacy allows and local inference has completed, preventing premature cloud requests
- Persistent settings: Privacy preferences save to localStorage, respecting user choices across sessions
- Visual status indicators: The header always shows the current privacy mode with recognizable icons (🔒 for confidential, 🌐 for standard)

Testing Privacy and Performance Requirements

A privacy-first application demands rigorous testing to ensure data never
leaks unintentionally. The project includes comprehensive test suites using Vitest for unit tests and Playwright for end-to-end scenarios. Here's a critical privacy test ( tests/privacy.test.ts ): import { describe, it, expect, beforeEach } from 'vitest'; import { TestUtils } from './utils/test-helpers'; describe('Privacy Controls', () => { let testUtils: TestUtils; beforeEach(() => { testUtils = new TestUtils(); testUtils.enableConfidentialMode(); }); it('should prevent cloud API calls when confidential mode is enabled', async () => { const response = await testUtils.requestBriefing({ template: 'summary', prompt: 'Confidential merger document...', cloud: true }); expect(response.status).toBe(403); expect(response.error).toContain('disabled in confidential mode'); }); it('should allow local inference in confidential mode', async () => { const response = await testUtils.requestBriefing({ template: 'summary', prompt: 'Confidential merger document...', cloud: false }); expect(response.status).toBe(200); expect(response.model).toContain('local'); expect(response.content).toBeTruthy(); }); it('should not persist sensitive content without opt-in', async () => { await testUtils.requestBriefing({ template: 'executive', prompt: 'Strategic acquisition plan...', cloud: false }); const history = await testUtils.getConversationHistory(); expect(history).toHaveLength(0); // No storage by default }); it('should support opt-in history with explicit consent', async () => { testUtils.enableHistorySaving(); await testUtils.requestBriefing({ template: 'executive', prompt: 'Strategic acquisition plan...', cloud: false }); const history = await testUtils.getConversationHistory(); expect(history).toHaveLength(1); expect(history[0].prompt).toContain('acquisition'); }); }); Performance testing ensures local inference meets the sub-second requirement: describe('Performance SLA', () => { it('should complete local inference in under 1 second', async () => { const samples = []; for (let i = 0; 
i < 10; i++) {
          const start = Date.now();
          await testUtils.requestBriefing({
            template: 'summary',
            prompt: 'Standard 500-word document...',
            cloud: false
          });
          samples.push(Date.now() - start);
        }
        // calculatePercentile is assumed to come from the project's test helpers
        const p95 = calculatePercentile(samples, 95);
        expect(p95).toBeLessThan(1000); // 95th percentile under 1s
      });

      it('should handle 5 concurrent requests without degradation', async () => {
        const requests = Array(5).fill(null).map(() =>
          testUtils.requestBriefing({
            template: 'talking-points',
            prompt: 'Meeting agenda...',
            cloud: false
          })
        );
        const results = await Promise.all(requests);
        expect(results.every(r => r.status === 200)).toBe(true);
        expect(results.every(r => r.latency_ms < 2000)).toBe(true);
      });
    });

These tests validate the core promise: local inference is fast, private, and reliable under realistic loads.

Deployment Considerations and Production Readiness

Moving from development to production requires addressing several operational concerns: model distribution, environment configuration, monitoring, and incident response.

For Foundry Local deployment, ensure IT teams pre-install the runtime and required models on consultant laptops. Use MDM (Mobile Device Management) systems or Group Policy to automate model downloads during onboarding. Models can be cached in shared network locations to avoid redundant downloads across teams.

Environment configuration should separate local and cloud credentials cleanly:

    # .env.local (local development)
    FOUNDRY_LOCAL_ENDPOINT=http://127.0.0.1:5272
    AZURE_OPENAI_ENDPOINT=https://your-org.openai.azure.com
    AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
    AZURE_OPENAI_KEY=your-key-here

    # For production, use Azure Managed Identity instead of API keys
    USE_MANAGED_IDENTITY=true

Managed Identity eliminates API key management: the application authenticates with Microsoft Entra ID (formerly Azure AD), and permissions are controlled through Azure role assignments. This prevents key leakage and simplifies credential rotation.

Monitoring should track both local and cloud usage patterns.
Implement structured logging with clear privacy labels: logger.info('Briefing generated', { model: 'local', template: 'summary', latency_ms: 847, tokens: 312, privacy_mode: 'confidential', user_id: hash(userId), // Never log raw user IDs timestamp: new Date().toISOString() }); This approach enables operational insights (average latency, most-used templates, error rates) without exposing sensitive content or user identities. For incident response, establish clear escalation paths. If Foundry Local fails, the application should gracefully degrade—inform users that local inference is unavailable and offer cloud-only mode (with explicit consent). If cloud services fail, local inference continues uninterrupted, ensuring the application remains useful even during Azure outages. Key Takeaways and Next Steps Building a privacy-first hybrid AI application requires careful architectural decisions that prioritize user data protection while maintaining high-quality outputs. The FL-Client-Briefing-Assistant demonstrates that you can achieve sub-second local inference, transparent privacy controls, and optional cloud refinement in a production-ready package. 
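The graceful-degradation rules from the incident-response paragraph above can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not code from the repository: the `ServiceState` fields, the `chooseInferencePath` name, and the idea that availability comes from some health check are all hypothetical.

```typescript
type InferencePath = "local" | "cloud" | "unavailable";

interface ServiceState {
  localAvailable: boolean;    // e.g. outcome of a Foundry Local health check (assumed)
  cloudAvailable: boolean;    // e.g. outcome of an Azure OpenAI health check (assumed)
  confidentialMode: boolean;  // current privacy setting
  cloudConsentGiven: boolean; // user explicitly agreed to cloud-only mode
}

// Local inference is always preferred. Cloud is used as a fallback only when
// the user is not in confidential mode and has explicitly consented.
function chooseInferencePath(state: ServiceState): InferencePath {
  if (state.localAvailable) return "local";
  const cloudPermitted = !state.confidentialMode && state.cloudConsentGiven;
  if (state.cloudAvailable && cloudPermitted) return "cloud";
  return "unavailable";
}

// Local is down and the user consented outside confidential mode: fall back to cloud.
console.log(chooseInferencePath({
  localAvailable: false, cloudAvailable: true, confidentialMode: false, cloudConsentGiven: true,
})); // cloud

// Cloud outage: local inference continues unaffected.
console.log(chooseInferencePath({
  localAvailable: true, cloudAvailable: false, confidentialMode: true, cloudConsentGiven: false,
})); // local
```

Keeping the decision in one pure function makes the fallback policy easy to unit test and keeps the consent requirement in a single place rather than scattered across route handlers.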
Key lessons from this implementation: Privacy must be the default, not an opt-in feature—confidential mode should require explicit action to disable Transparency builds trust—always show users which model processed their data and how long it took Fallback strategies ensure reliability—graceful degradation when services fail keeps the application useful Testing validates promises—comprehensive tests for privacy, performance, and functionality are non-negotiable Operational visibility without privacy leaks—structured logging enables monitoring without exposing sensitive content To extend this application, consider adding: Document parsing: Integrate PDF, DOCX, and PPTX extractors to analyze file uploads directly Multi-document synthesis: Combine insights from multiple client documents into unified briefings Custom templates: Allow consultants to define their own briefing formats and save them for reuse Offline mode indicators: Detect network connectivity and disable cloud features automatically Audit logging: For regulated industries, implement immutable audit trails showing when cloud refinement was used The full implementation, including all code, tests, and deployment guides, is available at github.com/leestott/FL-Client-Briefing-Assistant. Clone the repository, follow the setup guide, and experience privacy-first AI in action. Resources and Further Reading FL-Client-Briefing-Assistant Repository - Complete source code and documentation Microsoft Foundry Local Documentation - Official runtime documentation and API reference Azure OpenAI Service - Cloud refinement integration guide Project Specification - Detailed requirements and acceptance criteria Implementation Guide - Architecture decisions and design patterns Testing Guide - How to run and interpret comprehensive test suites
Lee_Stott
Feb 26, 2026, Microsoft Developer Community Blog
The JavaScript AI Build-a-thon Season 2 starts March 2!
The JavaScript AI Build-a-thon is a free, hands-on program designed to close that gap. Over the course of four weeks (March 2 - March 31, 2026), you'll move from running AI 100% on-device (Local AI), to designing multi-service, multi-agentic systems, all in JavaScript/ TypeScript and using tools you are already familiar with. The series will culminate in a hackathon, where you will create, compete and turn what you'll have learnt into working projects you can point to, talk about and extend.
Julia_Muiruri
Feb 25, 2026, Microsoft Developer Community Blog
On-Premises Manufacturing Intelligence
Manufacturing facilities face a fundamental dilemma in the AI era: how to harness artificial intelligence for predictive maintenance, equipment diagnostics, and operational insights while keeping sensitive production data entirely on-premises. Industrial environments generate proprietary information (CNC machining parameters, quality control thresholds, equipment performance signatures, maintenance histories) that represents competitive advantage accumulated over decades of process optimization. Sending this data to cloud APIs risks intellectual property exposure, regulatory non-compliance, and operational dependencies that manufacturing operations cannot accept.

Traditional cloud-based AI introduces unacceptable vulnerabilities. Network latency of 100-500ms makes real-time decision support impractical for time-sensitive manufacturing processes. Internet dependency creates single points of failure in environments where connectivity is unreliable or deliberately restricted for security. API pricing models become prohibitively expensive when analyzing thousands of sensor readings and maintenance logs continuously. Most critically, data residency requirements for aerospace, defense, pharmaceutical, and automotive industries can make cloud AI architectures non-compliant by design: ITAR, FDA 21 CFR Part 11, and customer-specific mandates may require that data never leave facility boundaries.

This article demonstrates a sample solution for manufacturing asset intelligence that runs entirely on-premises using Microsoft Foundry Local, Node.js, and JavaScript. The FoundryLocal-IndJSsample repository provides a production-ready implementation with an Express backend, an HTML/JavaScript frontend, and comprehensive Foundry Local SDK integration. Facilities can deploy sophisticated AI-powered monitoring without external dependencies, cloud costs, data exposure risks, or network requirements. Every inference happens locally on facility hardware with predictable performance and zero data egress.
Why On-Premises AI Matters for Industrial Operations
The case for local AI inference in manufacturing extends beyond simple preference: it addresses fundamental operational, security, and compliance requirements that cloud solutions cannot satisfy. Understanding these constraints shapes architectural decisions that prioritize reliability, data sovereignty, and cost predictability.

Data Sovereignty and Intellectual Property Protection
Manufacturing processes represent years of proprietary research, optimization, and competitive advantage. Equipment configurations, cycle times, quality thresholds, and maintenance schedules contain intelligence that competitors would value highly. Sending this data to third-party cloud services, even with contractual protections, introduces risks that manufacturing operations cannot accept.

On-premises AI ensures that production data never leaves the facility network perimeter. Telemetry from CNC machines, hydraulic systems, conveyor networks, and control systems remains within air-gapped environments where physical access controls and network isolation provide demonstrable data protection. This architectural guarantee of data locality satisfies both internal security policies and external audit requirements without relying on contractual assurances or encryption alone.

Operational Resilience and Network Independence
Factory floors frequently operate in environments with limited, unreliable, or intentionally restricted internet connectivity. Remote facilities, secure manufacturing zones, and legacy industrial networks cannot depend on continuous cloud access for critical monitoring functions. When network failures occur, whether from ISP outages, DDoS attacks, or infrastructure damage, AI capabilities must continue operating to prevent production losses.

Local inference provides true operational independence. Equipment health monitoring, anomaly detection, and maintenance prioritization continue functioning during network disruptions.
This resilience is essential for 24/7 manufacturing operations where downtime costs can exceed tens of thousands of dollars per hour. By eliminating external dependencies, on-premises AI becomes as reliable as the local power supply and computing infrastructure.

Latency Requirements for Real-Time Decision Making
Manufacturing processes involve precise timing where milliseconds determine quality outcomes. Automated inspection systems must classify defects before products leave the production line. Safety interlocks must respond to hazardous conditions before injuries occur. Predictive maintenance alerts must trigger before catastrophic equipment failures cascade through production lines.

Cloud-based AI introduces latency that is incompatible with these requirements. Network round-trips to cloud endpoints typically take 100-500 milliseconds, which in some cases is unacceptable for real-time applications. Local inference with Foundry Local eliminates those network hops, so response time depends only on model size and local hardware, which is what makes real-time AI integration with SCADA systems, PLCs, and manufacturing execution systems feasible.

Cost Predictability at Industrial Scale
Manufacturing facilities generate enormous volumes of time-series data from thousands of sensors, producing millions of data points daily. Cloud AI services charge per API call or per token processed, creating costs that scale linearly with data volume and are hard to budget in advance. High-throughput industrial applications can quickly accumulate tens of thousands of dollars in monthly API fees.

On-premises AI transforms this variable operational expense into a fixed capital infrastructure cost. After the initial hardware investment, inference costs remain constant regardless of query volume. For facilities analyzing equipment telemetry, maintenance logs, and operator notes continuously, this economic model provides cost certainty and eliminates budget surprises.
Regulatory Compliance and Audit Requirements
Regulated industries face strict data handling requirements. Aerospace manufacturers must comply with ITAR controls on technical data. Pharmaceutical facilities must satisfy FDA 21 CFR Part 11 requirements for electronic records. Automotive suppliers must meet customer-specific data residency mandates. Cloud AI services complicate compliance by introducing third-party data processors, cross-border data transfers, and shared infrastructure concerns.

Local AI simplifies regulatory compliance by eliminating external data flows. Audit trails remain within the facility. Data handling procedures avoid third-party agreements. Compliance demonstrations become straightforward when AI infrastructure resides entirely within auditable physical and network boundaries.

Architecture: Manufacturing Intelligence Without Cloud Dependencies
The manufacturing asset intelligence system demonstrates a practical architecture for deploying AI capabilities entirely on-premises. The design prioritizes operational reliability, straightforward integration patterns, and a maintainable code structure that facilities can adapt to their specific requirements.

System Components and Technology Stack
The implementation consists of three primary layers that separate concerns and enable independent scaling:

Foundry Local Layer: Provides the local AI inference runtime. Foundry Local manages model loading, execution, and resource allocation. It supports multiple model families (Phi-3.5, Phi-4, Qwen2.5) with automatic hardware acceleration detection for NVIDIA GPUs (CUDA), Intel GPUs (OpenVINO), and Qualcomm Arm devices (QNN), plus optimized CPU inference. The service exposes a REST API on localhost that the backend layer consumes for completions.

Backend Service Layer: An Express Node.js application that serves as the integration point between the AI runtime and the manufacturing data systems. This layer implements business logic for equipment monitoring, maintenance log classification, and conversational interfaces. It formats prompts with equipment context, calls Foundry Local for inference, and structures responses for the frontend. The backend persists chat history and provides RESTful endpoints for all AI operations.

Frontend Interface Layer: A standalone HTML/JavaScript application that provides operator interfaces for equipment monitoring, maintenance management, and AI assistant interactions. The UI fetches data from the backend service and renders dashboards, equipment status views, and chat interfaces. No framework dependencies or build steps are required: the frontend operates as static files that any web server or file system can serve.

Data Flow for Equipment Analysis
Understanding how data moves through the system clarifies integration points and extension opportunities. When an operator requests AI analysis of equipment status, the following sequence occurs:

The frontend collects equipment context including asset ID, current telemetry values, alert status, and recent maintenance history. It constructs an HTTP request to the backend's equipment summary endpoint, passing this context as query parameters or request body.

The backend retrieves additional context from the equipment database, including specifications, normal operating ranges, and historical performance patterns.

The backend constructs a detailed prompt that provides the AI model with comprehensive context: equipment specifications, current telemetry with alarming conditions highlighted, recent maintenance notes, and specific questions about operational status. This prompt engineering is critical because the model can only reason about the context it is given. Generic prompts produce generic responses; detailed, structured prompts yield actionable insights.
The backend calls Foundry Local's completion API with the formatted prompt, specifying temperature, max tokens, and other generation parameters.

Foundry Local loads the configured model (if not already in memory) and generates a response analyzing the equipment's condition. The inference occurs locally with no network traffic leaving the facility. Response time typically ranges from 500ms to 3 seconds depending on prompt complexity and model size.

Foundry Local returns the generated text to the backend, which parses the response for structured information if required (equipment health classifications, priority levels, recommended actions). The backend formats this analysis as JSON and returns it to the frontend.

The frontend renders the AI-generated summary in the equipment health dashboard, highlighting critical findings and recommended operator actions.

Prompt Engineering for Maintenance Log Classification
The maintenance log classification feature demonstrates effective prompt engineering for extracting structured decisions from language models. Manufacturing facilities accumulate thousands of maintenance notes, operator observations, technician reports, and automated system logs. Automatically classifying these entries by severity enables priority-based work scheduling without manual review of every log entry.

The classification prompt provides the model with clear instructions, classification categories with definitions, and the maintenance note text to analyze:

    // foundryClient, currentModelAlias, and maintenanceNote are defined elsewhere in the backend
    const classificationPrompt = `You are a manufacturing maintenance expert analyzing equipment log entries.

    Classify the following maintenance note into one of these categories:

    CRITICAL: Immediate safety hazard, equipment failure, or production stoppage
    HIGH: Degraded performance, abnormal readings requiring same-shift attention
    MEDIUM: Scheduled maintenance items or routine inspections
    LOW: Informational notes, normal operations logs

    Provide your response in JSON format:
    {
      "classification": "CRITICAL|HIGH|MEDIUM|LOW",
      "reasoning": "Brief explanation of classification decision",
      "recommended_action": "Specific next steps for maintenance team"
    }

    Maintenance Note: ${maintenanceNote}

    Classification:`;

    const response = await foundryClient.chat.completions.create({
      model: currentModelAlias,
      messages: [{ role: 'user', content: classificationPrompt }],
      temperature: 0.1, // Low temperature for consistent classification
      max_tokens: 300
    });

Key aspects of this prompt design:

Role definition: Establishing the model as a "manufacturing maintenance expert" steers it toward the relevant domain knowledge and reasoning patterns.

Clear categories: Explicit classification options with definitions reduce ambiguous outputs and enable consistent decision-making across thousands of logs.

Structured output format: Requesting JSON responses with specific fields enables automated parsing and integration with maintenance management systems without fragile text parsing.

Temperature control: Setting temperature to 0.1 reduces randomness in classifications, making severity assessments for similar maintenance conditions more consistent.

Context isolation: Separating the maintenance note text from the instructions with clear delimiters reduces the risk of prompt injection, where a malicious log entry attempts to manipulate the classification logic. Delimiters alone do not eliminate that risk, so the output should still be validated.

This classification runs locally for every maintenance log entry without API costs or network delays.
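Even with a JSON-only instruction, a small local model may wrap the object in extra text or return a label outside the four categories, so the response is worth validating before it drives work-order routing. The sketch below is one possible approach and is not taken from the sample repository: `parseClassification` is a hypothetical helper, and the policy of falling back to HIGH for manual review when parsing fails is an assumption.

```javascript
// Hypothetical helper: validates model output before it is used for routing.
const ALLOWED_LABELS = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW'];

function parseClassification(rawText) {
  // Assumed policy: anything that cannot be parsed goes to a human at HIGH priority.
  const fallback = {
    classification: 'HIGH',
    reasoning: 'Model output could not be parsed; manual review required',
    recommended_action: 'Review maintenance note manually',
    parsed: false
  };

  if (typeof rawText !== 'string') return fallback;

  // Models often surround JSON with prose, so take the span from the
  // first "{" to the last "}" and try to parse only that.
  const start = rawText.indexOf('{');
  const end = rawText.lastIndexOf('}');
  if (start === -1 || end <= start) return fallback;

  let data;
  try {
    data = JSON.parse(rawText.slice(start, end + 1));
  } catch (err) {
    return fallback;
  }

  // Normalize the label and reject anything outside the allowed set.
  const label = String(data.classification || '').trim().toUpperCase();
  if (!ALLOWED_LABELS.includes(label)) return fallback;

  return {
    classification: label,
    reasoning: String(data.reasoning || ''),
    recommended_action: String(data.recommended_action || ''),
    parsed: true
  };
}
```

In the classification flow above, the helper would be called on `response.choices[0].message.content`. The `parsed` flag lets the caller count how often the model deviates from the requested format.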
Facilities processing hundreds of maintenance notes daily benefit from immediate, consistent classification that routes critical issues to technicians automatically while filtering routine informational logs.

Model Selection and Performance Trade-offs
Foundry Local supports multiple model families with different memory requirements, inference speeds, and accuracy characteristics. Choosing appropriate models for manufacturing environments requires balancing these trade-offs against hardware constraints and operational requirements:

Qwen2.5-0.5b (500MB memory): The smallest available model provides extremely fast inference (100-200ms responses) on limited hardware. Suitable for simple classification tasks, keyword extraction, and high-throughput scenarios where response speed matters more than nuanced understanding. Works well on older servers or edge devices with constrained resources.

Phi-3.5-mini (2.1GB memory): The recommended default model balances accuracy with reasonable memory requirements. Provides strong reasoning capabilities for equipment analysis, maintenance prioritization, and conversational assistance. Response times of 1-3 seconds on modern CPUs are acceptable for interactive dashboards. This model handles complex prompts with detailed equipment context effectively.

Phi-4-mini (3.6GB memory): Increased model capacity improves understanding of technical terminology and complex equipment relationships. Best choice when analyzing detailed maintenance histories, interpreting sensor correlation patterns, or providing nuanced operational recommendations. Requires more memory but delivers noticeably improved analysis quality for complex scenarios.

Qwen2.5-7b (4.7GB memory): The largest supported model provides maximum accuracy and sophisticated reasoning. Ideal for facilities with modern server hardware where best-possible analysis quality justifies longer inference times (3-5 seconds).
Consider this model for critical applications where operator decisions depend heavily on AI recommendations.

Facilities can download all models during initial setup and switch between them based on specific use cases. Use faster models for real-time dashboard updates and automated classification. Deploy larger models for detailed equipment analysis and maintenance planning where operators can wait several seconds for comprehensive insights.

Implementation: Equipment Monitoring and AI Analysis
The practical implementation reveals how straightforward on-premises AI integration can be with modern JavaScript tooling and proper architectural separation. The backend service manages all AI interactions, shielding the frontend from inference complexity and providing clean REST interfaces.

Backend Service Architecture with Express
The Node.js backend initializes the Foundry Local SDK client and exposes endpoints for equipment operations:

    const express = require('express');
    const { FoundryLocalClient } = require('foundry-local-sdk');
    const cors = require('cors');

    const app = express();
    const PORT = process.env.PORT || 3000;

    // Initialize Foundry Local client
    const foundryClient = new FoundryLocalClient({
      baseURL: 'http://localhost:8008', // Default Foundry Local endpoint
      timeout: 30000
    });

    // Middleware configuration
    app.use(cors());          // Enable cross-origin requests from frontend
    app.use(express.json());  // Parse JSON request bodies

    // Health check endpoint for monitoring
    app.get('/api/health', (req, res) => {
      res.json({ ok: true, service: 'manufacturing-ai-backend' });
    });

    // Start server
    app.listen(PORT, () => {
      console.log(`Manufacturing AI backend running on port ${PORT}`);
      console.log(`Foundry Local endpoint: http://localhost:8008`);
    });

This foundational structure establishes the Express application with CORS support for browser-based frontends and JSON request handling.
The Foundry Local client connects to the local inference service running on port 8008; no external network configuration is required.

Equipment Summary Generation with Context-Rich Prompts
The equipment summary endpoint demonstrates effective context injection for accurate AI analysis:

    app.get('/api/assets/:id/summary', async (req, res) => {
      try {
        const assetId = req.params.id;
        const asset = equipmentDatabase.find(a => a.id === assetId);

        if (!asset) {
          return res.status(404).json({ error: 'Asset not found' });
        }

        // Construct detailed equipment context
        const contextPrompt = buildEquipmentContext(asset);

        // Generate AI analysis
        const completion = await foundryClient.chat.completions.create({
          model: 'phi-3.5-mini',
          messages: [{ role: 'user', content: contextPrompt }],
          temperature: 0.3,
          max_tokens: 500
        });

        const analysis = completion.choices[0].message.content;

        res.json({
          assetId: asset.id,
          assetName: asset.name,
          analysis: analysis,
          generatedAt: new Date().toISOString()
        });
      } catch (error) {
        console.error('Equipment summary error:', error);
        res.status(500).json({ error: 'AI analysis failed', details: error.message });
      }
    });

The equipment context builder assembles comprehensive information for accurate analysis:

    function buildEquipmentContext(asset) {
      // Informational alerts add noise, so only pass on warnings and above
      const alerts = asset.alerts.filter(a => a.severity !== 'INFO');
      const telemetry = asset.currentTelemetry;

      return `Analyze the following manufacturing equipment status:

    Equipment: ${asset.name} (${asset.id})
    Type: ${asset.type}
    Location: ${asset.location}

    Current Telemetry:
    - Temperature: ${telemetry.temperature}°C (Normal range: ${asset.specs.tempRange})
    - Vibration: ${telemetry.vibration} mm/s (Threshold: ${asset.specs.vibrationThreshold})
    - Pressure: ${telemetry.pressure} PSI (Normal: ${asset.specs.pressureRange})
    - Runtime: ${telemetry.runHours} hours (Next maintenance due: ${asset.nextMaintenance})

    Active Alerts:
    ${alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n') || '- None'}

    Recent Maintenance History:
    ${asset.recentMaintenance.slice(0, 3).map(m => `- ${m.date}: ${m.description}`).join('\n')}

    Provide a concise operational summary focusing on:
    1. Current equipment health status
    2. Any concerning trends or anomalies
    3. Recommended operator actions if applicable
    4. Maintenance priority level

    Summary:`;
    }

This context-rich approach produces more accurate, actionable analysis because the model receives equipment specifications, current telemetry with normal ranges, alert history, maintenance patterns, and structured output guidance. The model can compare readings against the stated ranges rather than guessing which values seem unusual.

Conversational AI Assistant with Manufacturing Context
The chat endpoint enables natural language queries about equipment status and operational questions:

    app.post('/api/chat', async (req, res) => {
      try {
        const { message, conversationId } = req.body;

        // Retrieve conversation history for context
        const history = conversationStore.get(conversationId) || [];

        // Build plant-wide context for the query
        const plantContext = buildPlantOperationsContext();

        // Construct system message with domain knowledge
        const systemMessage = {
          role: 'system',
          content: `You are an AI assistant for a manufacturing facility's operations team.
    You have access to real-time equipment data and maintenance records.

    Current Plant Status:
    ${plantContext}

    Provide specific, actionable responses based on actual equipment data.
    If you don't have information to answer a query, clearly state that.
    Never speculate about equipment conditions beyond available data.`
        };

        // Include conversation history for multi-turn context
        const messages = [
          systemMessage,
          ...history,
          { role: 'user', content: message }
        ];

        const completion = await foundryClient.chat.completions.create({
          model: 'phi-3.5-mini',
          messages: messages,
          temperature: 0.4,
          max_tokens: 600
        });

        const assistantResponse = completion.choices[0].message.content;

        // Update conversation history
        history.push(
          { role: 'user', content: message },
          { role: 'assistant', content: assistantResponse }
        );
        conversationStore.set(conversationId, history);

        res.json({
          response: assistantResponse,
          conversationId: conversationId,
          timestamp: new Date().toISOString()
        });
      } catch (error) {
        console.error('Chat error:', error);
        res.status(500).json({ error: 'Chat request failed', details: error.message });
      }
    });

The conversational interface enables operators to ask natural language questions and receive grounded responses based on actual equipment data, citing specific asset IDs, current metric values, and alert statuses rather than speculating.

Deployment and Production Operations
Deploying on-premises AI in industrial settings requires consideration of hardware placement, network architecture, integration patterns, and operational procedures that differ from typical web application deployments.

Hardware and Infrastructure Requirements
The system runs on standard server hardware without specialized AI accelerators, though GPU availability improves performance significantly. Minimum requirements include 8GB RAM for the Phi-3.5-mini model, a 4-core CPU, and 50GB storage for model files and application data. Production deployments benefit from 16GB+ RAM to support larger models and concurrent analysis requests. For facilities with NVIDIA GPUs, Foundry Local automatically utilizes CUDA acceleration, reducing inference times by 3-5x compared to CPU-only execution.
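A setup script can check these memory figures before choosing a model. The sketch below is illustrative only: `pickModelForMemory` is a hypothetical helper, the model aliases are assumed to match the names used in this article, and the rule that a model should occupy at most 30% of total RAM is an assumption chosen only because it reproduces the 8GB and 16GB guidance above. It is not a Foundry Local requirement.

```javascript
const os = require('node:os');

// Approximate memory footprints quoted earlier in this article (GB),
// ordered from smallest to largest. Aliases are assumptions.
const MODEL_MEMORY_GB = [
  { alias: 'qwen2.5-0.5b', memoryGb: 0.5 },
  { alias: 'phi-3.5-mini', memoryGb: 2.1 },
  { alias: 'phi-4-mini', memoryGb: 3.6 },
  { alias: 'qwen2.5-7b', memoryGb: 4.7 }
];

// Returns the largest model whose footprint fits within the assumed
// share of total RAM, or null if even the smallest model does not fit.
function pickModelForMemory(totalGb, maxShare = 0.3) {
  const budgetGb = totalGb * maxShare;
  const fitting = MODEL_MEMORY_GB.filter(m => m.memoryGb <= budgetGb);
  return fitting.length > 0 ? fitting[fitting.length - 1].alias : null;
}

// Example: evaluate the machine this script runs on.
const totalGb = os.totalmem() / 1024 ** 3;
console.log(`Total RAM: ${totalGb.toFixed(1)} GB, suggested model: ${pickModelForMemory(totalGb)}`);
```

Leaving most of the memory unallocated keeps room for the operating system, the Express backend, and concurrent requests. The right share for a given facility would need to be measured on its own hardware.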
Deploy the backend service on dedicated server hardware within the factory network. Avoid running AI workloads on the same systems that host critical SCADA or MES applications, because inference can contend with them for CPU and memory.

Network Architecture and SCADA Integration
The AI backend should reside on the manufacturing operations network with firewall rules permitting connections from operator workstations and monitoring systems. Do not expose the backend service directly to the internet; all access should occur through the facility's internal network with authentication via existing directory services.

Integrate with SCADA systems through standard industrial protocols. Configure OPC-UA clients to subscribe to equipment telemetry nodes and forward readings to the AI backend via REST API calls. Modbus TCP gateways can bridge legacy PLCs to modern APIs by polling register values and POSTing updates to the backend's telemetry ingestion endpoints.

Security and Compliance Considerations
Many manufacturing facilities operate air-gapped networks where physical separation prevents internet connectivity entirely. Deploy Foundry Local and the AI application in these environments by transferring model files and application packages via removable media during controlled maintenance windows.

Implement role-based access control (RBAC) using Active Directory integration. Configure the backend to validate user credentials against LDAP before serving AI analysis requests. Maintain detailed audit logs of all AI invocations including user identity, timestamp, equipment queried, and model version used. Store these logs in immutable append-only databases for compliance audits.
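One way to make such audit records tamper-evident is to chain them with hashes, so that altering an earlier entry invalidates every entry after it. The sketch below illustrates the idea under stated assumptions and is not the sample repository's logging code: the field names, the in-memory array, and the helpers `appendAuditEntry` and `verifyAuditChain` are hypothetical, and a real deployment would persist entries to write-once storage.

```javascript
const crypto = require('node:crypto');

// Hash an entry's fields together with the previous entry's hash.
function hashEntry(fields, prevHash) {
  return crypto
    .createHash('sha256')
    .update(JSON.stringify(fields) + prevHash)
    .digest('hex');
}

// Append one AI invocation record to the log (an array in this sketch).
function appendAuditEntry(log, { user, assetId, modelVersion, timestamp }) {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : 'GENESIS';
  const fields = { user, assetId, modelVersion, timestamp };
  const entry = { ...fields, prevHash, hash: hashEntry(fields, prevHash) };
  log.push(entry);
  return entry;
}

// Recompute every hash in order; any edited or reordered entry breaks the chain.
function verifyAuditChain(log) {
  let prevHash = 'GENESIS';
  for (const entry of log) {
    const { user, assetId, modelVersion, timestamp } = entry;
    const expected = hashEntry({ user, assetId, modelVersion, timestamp }, prevHash);
    if (entry.prevHash !== prevHash || entry.hash !== expected) return false;
    prevHash = entry.hash;
  }
  return true;
}
```

A route handler would call `appendAuditEntry` after each inference, and an auditor would run `verifyAuditChain` over the stored log. Hash chaining detects modification but does not prevent deletion of the newest entries, which is why the storage layer still needs to be append-only.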
Key Takeaways
Building production-ready AI systems for industrial environments requires architectural decisions that prioritize operational reliability, data sovereignty, and integration simplicity:

Data locality by architectural design: On-premises AI ensures proprietary production data never leaves facility networks through fundamental architectural guarantees rather than configuration options.

Model selection impacts deployment feasibility: Smaller models (0.5B-2B parameters) enable deployment on commodity hardware without specialized accelerators while maintaining acceptable accuracy.

Fallback logic preserves operational continuity: AI capabilities enhance but don't replace core monitoring functions, ensuring equipment dashboards display raw telemetry even when AI analysis is unavailable.

Context-rich prompts determine accuracy: Effective prompts include equipment specifications, normal operating ranges, alert thresholds, and maintenance history to enable grounded recommendations.

Structured outputs enable automation: JSON response formats allow automated systems to parse classifications and route work orders without fragile text parsing.

Integration patterns bridge legacy systems: OPC-UA and Modbus TCP gateways connect decades-old PLCs and SCADA systems to modern AI without replacing functional control infrastructure.

Resources and Further Exploration
The complete implementation with extensive comments and documentation is available in the GitHub repository. Additional resources help facilities customize and extend the system for their specific requirements.
FoundryLocal-IndJSsample GitHub Repository – Full source code with JavaScript backend, HTML frontend, and sample data files
Quick Start Guide and Documentation – Installation instructions, API documentation, and troubleshooting guidance
Microsoft Foundry Local Documentation – Official SDK reference, model catalog, and deployment guidance
Sample Manufacturing Data – Example equipment telemetry, maintenance logs, and alert structures
Backend Implementation Reference – Express server code with Foundry Local SDK integration patterns
OPC Foundation – Industrial communication standards for SCADA and PLC integration
Edge AI for Beginners – Free online course and resources for learning more about using AI on edge devices

Why On-Premises AI
Cloud AI services offer convenience, but they fundamentally conflict with manufacturing operational requirements. Understanding these conflicts explains why local AI isn't just preferable but mandatory for many production environments.

Data privacy and intellectual property protection are paramount. A CNC machining program represents years of optimization: feed rates, tool paths, thermal compensation algorithms. Quality control measurements reveal product specifications competitors would pay millions to access. Sending this data to external APIs, even with encryption, creates unacceptable exposure risk. Every API call generates logs on third-party servers, potentially subject to subpoenas, data breaches, or regulatory compliance failures.

Latency requirements eliminate cloud viability for real-time decisions. When a thermal sensor detects bearing temperature exceeding safe thresholds, the control system needs AI analysis in under 50 milliseconds to prevent catastrophic failure. Cloud APIs introduce 100-500ms baseline latency from network round-trips alone, before queue times and processing. For safety systems, quality inspection, and process control, this latency is operationally unacceptable.
Network dependency creates operational fragility. Factory floors frequently have limited connectivity because of legacy equipment, RF interference, and isolated production cells. Critical AI capabilities must not fail when internet service drops. Moreover, many defense, aerospace, and pharmaceutical facilities operate air-gapped networks for security compliance. Cloud AI is simply non-operational in these environments.

Regulatory requirements constrain where data can go. ITAR (International Traffic in Arms Regulations) restricts access to and export of controlled technical data. FDA 21 CFR Part 11 requires strict controls on electronic records in pharmaceutical manufacturing. GDPR restricts transfers of personal data outside approved jurisdictions. On-premises AI simplifies compliance by eliminating cross-border data transfers.

Cost predictability at scale favors local deployment. A high-volume facility generating 10,000 equipment events per day, each requiring AI analysis, would incur significant cloud API costs. Local models have fixed infrastructure costs that do not grow with each additional query, making AI economically viable for continuous monitoring.

Application Architecture: Web UI + Local AI Backend
The FoundryLocal-IndJSsample implements a clean separation between data presentation and AI inference. This architecture ensures the UI remains responsive while AI operations run independently, enabling real-time dashboard updates without blocking user interactions.

The web frontend serves a single-page application built with vanilla HTML, CSS, and JavaScript, with no frameworks and no build tools. This simplicity is intentional: factory IT teams need to audit code, customize interfaces, and deploy on legacy systems. The UI presents four main interfaces: Plant Asset Overview (real-time health cards for all equipment), Asset Health (AI-generated summaries and trend analysis), Maintenance Logs (classification and priority routing), and AI Assistant (natural language interface for operations queries).
The Node.js backend runs Express as the HTTP server, handling static file serving, API routing, and WebSocket connections for real-time updates. It loads sample manufacturing data (equipment telemetry, maintenance logs, historical events) from JSON files, simulating the data streams that would come from SCADA systems, PLCs, and MES platforms in production.

Foundry Local provides the AI inference layer. The backend uses foundry-local-sdk to communicate with the locally running service. All model loading, prompt processing, and response generation happens on-device. The application detects Foundry Local automatically and falls back to rule-based analysis if it is unavailable, ensuring core functionality persists even when AI is offline.

Here's the architectural flow for asset health analysis:

    User Request (Web UI)
      ↓
    Express API Route (/api/assets/:id/summary)
      ↓
    Load Equipment Data (from JSON/database)
      ↓
    Build Analysis Prompt (Equipment ID, telemetry, alerts)
      ↓
    Foundry Local SDK Call (local AI inference)
      ↓
    Parse AI Response (structured insights)
      ↓
    Return JSON Result (with metadata: model, latency, confidence)
      ↓
    Display in UI (formatted health summary)

This architecture demonstrates several industrial system design principles:

Offline-first operation: Core functionality works without internet connectivity, with AI as an enhancement rather than a dependency.

Graceful degradation: If AI fails, fall back to rule-based logic rather than crashing operations.

Minimal external dependencies: A simple stack reduces attack surface and simplifies air-gapped deployment.

Data locality: All processing happens on-premises, with no external API calls.

Real-time updates: WebSocket connections enable push-based event streaming for dashboard updates.

Setting Up Foundry Local for Industrial Applications
Industrial deployments require careful model selection that balances accuracy, speed, and hardware constraints.
Factory edge devices often run on limited hardware, such as industrial PCs with modest GPUs or CPU-only configurations. Model choice significantly impacts deployment feasibility.

Install Foundry Local on the industrial edge device:

    # Windows (most common for industrial PCs)
    winget install Microsoft.FoundryLocal

    # Verify installation
    foundry --version

For manufacturing asset intelligence, model selection trades off speed versus quality:

    # Fast option: Qwen 0.5B (500MB, <100ms inference)
    foundry model load qwen2.5-0.5b

    # Balanced option: Phi-3.5 Mini (2.1GB, ~200ms inference)
    foundry model load phi-3.5-mini

    # High quality option: Phi-4 Mini (3.6GB, ~500ms inference)
    foundry model load phi-4

    # List the models available to Foundry Local
    foundry model list

For real-time monitoring dashboards where hundreds of assets update continuously, qwen2.5-0.5b provides sufficient quality at speeds that don't bottleneck refresh cycles. For detailed root cause analysis or maintenance report generation where quality matters most, phi-4 justifies the longer inference time.
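Whether a model bottlenecks a refresh cycle can be estimated with simple arithmetic: if every asset needs one inference per refresh and requests run one after another, the per-inference budget is the refresh interval divided by the number of assets. The sketch below applies that estimate to the approximate latencies quoted in the comments above. `pickModelForDashboard` is a hypothetical helper, the sequential-request assumption is a simplification, and real latencies depend on hardware and prompt length.

```javascript
// Approximate latencies quoted above, ordered from highest to lowest quality.
const MODELS_BY_QUALITY = [
  { alias: 'phi-4', latencyMs: 500 },
  { alias: 'phi-3.5-mini', latencyMs: 200 },
  { alias: 'qwen2.5-0.5b', latencyMs: 100 }
];

// Time available per inference if all assets are analyzed sequentially
// within one refresh interval.
function latencyBudgetMs(assetCount, refreshSeconds) {
  if (assetCount <= 0) {
    throw new RangeError('assetCount must be positive');
  }
  return (refreshSeconds * 1000) / assetCount;
}

// Returns the highest-quality model that fits the budget, or null if
// none does (in which case not every asset can get AI analysis per cycle).
function pickModelForDashboard(assetCount, refreshSeconds) {
  const budget = latencyBudgetMs(assetCount, refreshSeconds);
  const match = MODELS_BY_QUALITY.find(m => m.latencyMs <= budget);
  return match ? match.alias : null;
}
```

For example, 200 assets on a 30-second refresh leave 150ms per inference, which only the smallest model meets, while 50 assets on the same interval leave 600ms and allow the largest one. A `null` result suggests analyzing only assets with active alerts or lengthening the refresh interval.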
Industrial systems benefit from proactive model caching during downtime:

    # During maintenance windows, pre-download models
    foundry model download phi-3.5-mini
    foundry model download qwen2.5-0.5b

    # Models cache locally, eliminating runtime downloads

The backend automatically detects Foundry Local and selects the loaded model:

    // backend/services/foundry-service.js
    import { FoundryLocalClient } from 'foundry-local-sdk';

    class FoundryService {
      constructor() {
        this.client = null;
        this.modelAlias = null;
        // Keep the promise so callers can wait for initialization to finish
        this.ready = this.initializeClient();
      }

      async initializeClient() {
        try {
          // Detect Foundry Local endpoint
          const endpoint = process.env.FOUNDRY_LOCAL_ENDPOINT || 'http://127.0.0.1:5272';
          this.client = new FoundryLocalClient({ endpoint });

          // Query which model is currently loaded
          const models = await this.client.models.list();
          this.modelAlias = models.data[0]?.id || 'phi-3.5-mini';

          console.log(`✅ Foundry Local connected: ${this.modelAlias}`);
        } catch (error) {
          console.warn('⚠️ Foundry Local not available, using rule-based fallback');
          this.client = null;
        }
      }

      async generateCompletion(prompt, options = {}) {
        // Avoid racing the constructor: wait until detection has completed
        await this.ready;

        if (!this.client) {
          // Fallback to rule-based analysis
          return this.ruleBasedAnalysis(prompt);
        }

        try {
          const startTime = Date.now();

          const completion = await this.client.chat.completions.create({
            model: this.modelAlias,
            messages: [
              {
                role: 'system',
                content: 'You are an industrial asset intelligence assistant analyzing manufacturing equipment.'
              },
              { role: 'user', content: prompt }
            ],
            temperature: 0.3, // Low temperature for factual analysis
            max_tokens: 400,
            ...options
          });

          const latency = Date.now() - startTime;

          return {
            content: completion.choices[0].message.content,
            model: this.modelAlias,
            latency_ms: latency,
            tokens: completion.usage?.total_tokens
          };
        } catch (error) {
          console.error('Foundry inference error:', error);
          return this.ruleBasedAnalysis(prompt);
        }
      }

      ruleBasedAnalysis(prompt) {
        // Fallback logic for when AI is unavailable
        // Pattern matching and heuristics
        return {
          content: '(Rule-based analysis) Equipment status: Monitoring...',
          model: 'rule-based-fallback',
          latency_ms: 5,
          tokens: 0
        };
      }
    }

    export default new FoundryService();

This service layer demonstrates critical production patterns:

Automatic endpoint detection: Tries the environment variable first, then falls back to the default.

Model auto-discovery: Queries Foundry Local for the currently loaded model rather than hardcoding it.

Robust error handling: Every API call is wrapped in try-catch with fallback logic.

Performance tracking: Latency measurement enables monitoring and capacity planning.

Conservative temperature: A temperature of 0.3 reduces hallucination for factual equipment analysis.

Implementing AI-Powered Asset Health Analysis
Equipment health monitoring forms the core use case, synthesizing telemetry from multiple sources into actionable insights. Traditional monitoring systems show raw metrics (temperature, vibration, pressure) but require expert interpretation. AI transforms this into natural language summaries that any operator can understand and act upon.
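The `ruleBasedAnalysis` stub in the service above returns a fixed string. For comparison with the AI-generated summaries, a rule-based check could compare raw metrics against limits directly. The sketch below is hypothetical and is not the sample's fallback logic: the function name `ruleBasedHealthCheck`, the `limits` field names, and the example threshold values are assumptions, and unlike the stub it takes an asset object rather than a prompt string.

```javascript
// Hypothetical rule-based check: compares telemetry against fixed limits.
function ruleBasedHealthCheck(asset) {
  const { telemetry, limits } = asset;
  const findings = [];

  if (telemetry.temperature > limits.maxTemperature) {
    findings.push(
      `Temperature ${telemetry.temperature}°C exceeds limit of ${limits.maxTemperature}°C`
    );
  }
  if (telemetry.vibration > limits.maxVibration) {
    findings.push(
      `Vibration ${telemetry.vibration} mm/s exceeds limit of ${limits.maxVibration} mm/s`
    );
  }

  const status = findings.length === 0 ? 'NORMAL' : 'ATTENTION';
  const content = findings.length === 0
    ? '(Rule-based analysis) All monitored values within limits.'
    : `(Rule-based analysis) ${findings.join('; ')}.`;

  // Same shape as the service's fallback result, plus a status field.
  return { content, status, model: 'rule-based-fallback', latency_ms: 0, tokens: 0 };
}

// Example with illustrative values: an overheating mill with normal vibration.
const example = ruleBasedHealthCheck({
  telemetry: { temperature: 92, vibration: 2.8 },
  limits: { maxTemperature: 80, maxVibration: 4.5 }
});
console.log(example.content);
```

Rules like these flag that a limit was crossed but cannot relate the reading to recent maintenance or suggest a likely cause, which is the gap the AI summary is meant to fill.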
Here's the API endpoint that generates asset health summaries: // backend/routes/assets.js import express from 'express'; import foundryService from '../services/foundry-service.js'; import { getAssetData } from '../data/asset-loader.js'; const router = express.Router(); router.get('/api/assets/:id/summary', async (req, res) => { try { const assetId = req.params.id; // Load equipment data const asset = await getAssetData(assetId); if (!asset) { return res.status(404).json({ error: 'Asset not found' }); } // Build analysis prompt with context const prompt = buildHealthAnalysisPrompt(asset); // Generate AI summary const analysis = await foundryService.generateCompletion(prompt); // Structure response res.json({ asset_id: assetId, asset_name: asset.name, summary: analysis.content, model_used: analysis.model, latency_ms: analysis.latency_ms, timestamp: new Date().toISOString(), telemetry_snapshot: { temperature: asset.telemetry.temperature, vibration: asset.telemetry.vibration, runtime_hours: asset.telemetry.runtime_hours }, active_alerts: asset.alerts.filter(a => a.active).length }); } catch (error) { console.error('Asset summary error:', error); res.status(500).json({ error: 'Analysis failed' }); } }); function buildHealthAnalysisPrompt(asset) { return ` Analyze the health of this manufacturing equipment and provide a concise summary: Equipment: ${asset.name} (${asset.id}) Type: ${asset.type} Location: ${asset.location} Current Telemetry: - Temperature: ${asset.telemetry.temperature}°C (Normal: ${asset.specs.normal_temp_range}) - Vibration: ${asset.telemetry.vibration} mm/s (Threshold: ${asset.specs.vibration_threshold}) - Operating Pressure: ${asset.telemetry.pressure} PSI - Runtime: ${asset.telemetry.runtime_hours} hours - Last Maintenance: ${asset.maintenance.last_service_date} Active Alerts: ${asset.alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Recent Events: ${asset.recent_events.slice(0, 3).map(e => `- ${e.timestamp}: 
${e.description}`).join('\n')} Provide a 3-4 sentence summary covering: 1. Overall equipment health status 2. Any concerning trends or anomalies 3. Recommended actions or monitoring focus Be factual and specific. Do not speculate beyond the provided data. `.trim(); } export default router; This prompt construction demonstrates several best practices for industrial AI: Structured data presentation: Organize telemetry, specs, and alerts in clear sections with labels Context enrichment: Include normal operating ranges so AI can assess abnormality Explicit constraints: Instruction to avoid speculation reduces hallucination risk Output formatting guidance: Request specific structure (3-4 sentences, covering key points) Temporal context: Include recent events so AI understands trend direction Example AI-generated asset summary: { "asset_id": "CNC-L2-M03", "asset_name": "CNC Mill #3", "summary": "Equipment is operating outside normal parameters with elevated temperature at 92°C, significantly above the 75-80°C normal range. Thermal Alert indicates possible coolant flow issue. Vibration levels remain acceptable at 2.8 mm/s. Recommend immediate inspection of coolant system and thermal throttling may impact throughput until resolved.", "model_used": "phi-3.5-mini", "latency_ms": 243, "timestamp": "2026-01-30T14:23:18Z", "telemetry_snapshot": { "temperature": 92, "vibration": 2.8, "runtime_hours": 12847 }, "active_alerts": 2 } This summary transforms raw telemetry into actionable intelligence: operations staff immediately understand the problem, its severity, and the appropriate response, without requiring deep equipment expertise. Maintenance Log Classification with AI Maintenance departments generate hundreds of logs daily: technician notes, operator observations, and inspection reports. Manually categorizing and prioritizing these logs consumes significant time. 
AI classification automatically routes logs to appropriate teams, identifies urgent issues, and extracts key information. The classification endpoint processes maintenance notes: // backend/routes/maintenance.js router.post('/api/logs/classify', async (req, res) => { try { const { log_text, equipment_id } = req.body; if (!log_text || log_text.length < 10) { return res.status(400).json({ error: 'Log text required (min 10 chars)' }); } const classificationPrompt = ` Classify this maintenance log entry into appropriate categories and priority: Equipment: ${equipment_id || 'Unknown'} Log Text: "${log_text}" Classify into EXACTLY ONE primary category: - MECHANICAL: Physical components, bearings, belts, motors - ELECTRICAL: Power systems, sensors, controllers, wiring - HYDRAULIC: Pumps, fluid systems, pressure issues - THERMAL: Cooling, heating, temperature control - SOFTWARE: PLC programming, HMI issues, control logic - ROUTINE: Scheduled maintenance, inspections, calibration Assign priority level: - CRITICAL: Immediate action required, safety or production impact - HIGH: Resolve within 24 hours, performance degradation - MEDIUM: Schedule within 1 week, minor issues - LOW: Routine maintenance, cosmetic issues Extract key details: - Symptoms described - Suspected root cause (if mentioned) - Recommended actions Return ONLY a JSON object with this exact structure: { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise", "vibration above 5mm/s"], "suspected_cause": "bearing wear", "recommended_actions": ["inspect bearings", "order replacement parts"] } `.trim(); const analysis = await foundryService.generateCompletion(classificationPrompt); // Parse AI response as JSON let classification; try { // Extract JSON from response (AI might add explanation text) const jsonMatch = analysis.content.match(/\{[\s\S]*\}/); classification = JSON.parse(jsonMatch[0]); } catch (parseError) { // Fallback parsing if JSON extraction fails classification = 
parseClassificationText(analysis.content); } // Validate classification const validCategories = ['MECHANICAL', 'ELECTRICAL', 'HYDRAULIC', 'THERMAL', 'SOFTWARE', 'ROUTINE']; const validPriorities = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']; if (!validCategories.includes(classification.category)) { classification.category = 'ROUTINE'; } if (!validPriorities.includes(classification.priority)) { classification.priority = 'MEDIUM'; } res.json({ original_log: log_text, classification, model_used: analysis.model, latency_ms: analysis.latency_ms, timestamp: new Date().toISOString() }); } catch (error) { console.error('Classification error:', error); res.status(500).json({ error: 'Classification failed' }); } }); function parseClassificationText(text) { // Fallback parser for when AI doesn't return valid JSON // Extract category, priority, and details using regex patterns const categoryMatch = text.match(/category[":]\s*(MECHANICAL|ELECTRICAL|HYDRAULIC|THERMAL|SOFTWARE|ROUTINE)/i); const priorityMatch = text.match(/priority[":]\s*(CRITICAL|HIGH|MEDIUM|LOW)/i); return { category: categoryMatch ? categoryMatch[1].toUpperCase() : 'ROUTINE', priority: priorityMatch ? 
priorityMatch[1].toUpperCase() : 'MEDIUM', symptoms: [], suspected_cause: 'Unknown', recommended_actions: [] }; } This implementation demonstrates several critical patterns for structured AI outputs: Explicit output format requirements: Prompt specifies exact JSON structure to encourage parseable responses Defensive parsing: Try JSON extraction first, fall back to text parsing if that fails Validation with sensible defaults: Validate categories and priorities against allowed values, default to safe values on mismatch Constrained classification vocabulary: Limit categories to predefined set rather than open-ended categories Priority inference rules: Guide AI to assess urgency based on safety, production impact, and timeline Example classification output: POST /api/logs/classify { "log_text": "Hydraulic pump PUMP-L1-H01 making grinding noise during startup. Vibration readings spiked to 5.2 mm/s this morning. Possible bearing wear. Recommend inspection.", "equipment_id": "PUMP-L1-H01" } Response: { "original_log": "Hydraulic pump PUMP-L1-H01 making grinding noise...", "classification": { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise during startup", "vibration spike to 5.2 mm/s"], "suspected_cause": "bearing wear", "recommended_actions": ["inspect bearings", "schedule replacement if confirmed worn"] }, "model_used": "phi-3.5-mini", "latency_ms": 187, "timestamp": "2026-01-30T14:35:22Z" } This classification automatically routes the log to the mechanical maintenance team, marks it high priority for same-day attention, and extracts actionable details, all without human intervention. Building the Natural Language Operations Assistant The AI Assistant interface enables operations staff to query equipment status, ask diagnostic questions, and get contextual guidance using natural language. This interface bridges the gap between complex SCADA systems and operators who need quick answers without navigating multiple screens. 
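A contextual assistant needs to keep prior turns for each conversation. The chat endpoint in this section calls `loadConversationHistory`, `saveConversationTurn`, and `generateConversationId` without showing them, so the following is only a minimal in-memory sketch of what such helpers might look like. The `Map`-based store and the `maxTurns` trimming parameter are assumptions for illustration; the sample repository may implement these differently, and a real deployment would likely persist conversations.

```javascript
// Hypothetical in-memory conversation store (illustrative only).
const conversations = new Map();

function generateConversationId() {
  // Timestamp plus a short random suffix; adequate for a single-process demo.
  return `conv-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
}

function loadConversationHistory(conversationId) {
  // Unknown IDs yield an empty history rather than an error.
  return conversations.get(conversationId) ?? [];
}

function saveConversationTurn(conversationId, userMessage, assistantReply, maxTurns = 10) {
  const history = conversations.get(conversationId) ?? [];
  history.push(
    { role: 'user', content: userMessage },
    { role: 'assistant', content: assistantReply }
  );
  // Keep only the most recent turns (two messages per turn) so the prompt
  // stays within a small model's context window.
  conversations.set(conversationId, history.slice(-maxTurns * 2));
}

const demoId = generateConversationId();
saveConversationTurn(demoId, "What's wrong with Line 2?", 'CNC-L2-M03 has an active thermal alert.');
console.log(loadConversationHistory(demoId).length);
```

These helpers are synchronous here, so calling them with `await`, as the endpoint does, still works. Trimming old turns trades long-range memory for predictable prompt size, which matters for the smaller models discussed in this article.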
The chat endpoint implements contextual conversation: // backend/routes/chat.js router.post('/api/chat', async (req, res) => { try { const { message, conversation_id } = req.body; if (!message || message.length < 3) { return res.status(400).json({ error: 'Message required (min 3 chars)' }); } // Load conversation history if exists const history = conversation_id ? await loadConversationHistory(conversation_id) : []; // Build context from current plant state const plantContext = await buildPlantContext(); // Construct system prompt with operational context const systemPrompt = ` You are an operations assistant for a manufacturing facility. Answer questions about equipment status, maintenance, and operational issues. Current Plant Status: ${plantContext} Guidelines: - Provide specific, actionable answers based on current data - Reference specific equipment IDs when relevant - Suggest appropriate next steps for issues - If information is unavailable, say so clearly - Use concise language suitable for busy operators Do not speculate about issues without data to support it. 
`.trim(); // Build message chain with history const messages = [ { role: 'system', content: systemPrompt }, ...history.map(h => ({ role: h.role, content: h.content })), { role: 'user', content: message } ]; // Generate response const response = await foundryService.generateCompletion( message, { messages } // Pass full conversation context ); // Save conversation const newConversationId = conversation_id || generateConversationId(); await saveConversationTurn(newConversationId, message, response.content); res.json({ response: response.content, conversation_id: newConversationId, model_used: response.model, latency_ms: response.latency_ms, timestamp: new Date().toISOString() }); } catch (error) { console.error('Chat error:', error); res.status(500).json({ error: 'Chat failed' }); } }); async function buildPlantContext() { const assets = await getAllAssets(); // Summarize plant-wide status const criticalAlerts = assets.flatMap(a => a.alerts.filter(alert => alert.severity === 'CRITICAL') ); const equipmentStatus = assets.map(a => `${a.id}: ${a.status} (Temp: ${a.telemetry.temperature}°C, Alerts: ${a.alerts.length})` ); return ` Total Equipment: ${assets.length} Critical Alerts: ${criticalAlerts.length} Equipment Status Summary: ${equipmentStatus.join('\n')} `.trim(); } Example conversational interactions: User: "What's wrong with Line 2?" AI: "Line 2 has elevated issues. CNC-L2-M03 is running 12°C above normal temperature at 92°C with a thermal alert active. This has caused 28% below-target throughput as the equipment throttles to prevent damage. The coolant system requires immediate inspection—flow rate dropped to 3.2 L/min from the normal 5.5 L/min, indicating a potential blockage or pump failure. Other Line 2 equipment (CONV-L2-B01, PLC-L2-MAIN) is operating normally." User: "How's the hydraulic pump performing?" AI: "PUMP-L1-H01 has a high-priority maintenance alert. 
Grinding noise was reported during startup with vibration at 5.2 mm/s, exceeding the 4.0 mm/s threshold. This pattern suggests bearing wear. Current operational status shows the pump is running but degraded. Recommend inspection today to prevent catastrophic failure. Bearing replacement typically requires 2-4 hours downtime." User: "What should I prioritize this shift?" AI: "Priority order for this shift: 1) CNC-L2-M03 coolant inspection (CRITICAL - impacting production, safety risk), 2) PUMP-L1-H01 bearing check (HIGH - prevent failure progression), 3) Routine inspections on CONV-L1-B02 (scheduled maintenance). Focus immediate attention on Line 2 thermal issues. Line 1 is stable." The assistant provides grounded, specific responses citing actual equipment IDs, current metrics, and actionable priorities. This data-grounded approach maintains operational credibility: operators quickly learn to trust responses that reference specific equipment conditions they can independently verify on SCADA systems. Real-World Industrial Scenario: Thermal Event Response To understand the system's practical value, consider how it supports operators during an equipment incident. This scenario simulates realistic manufacturing conditions where multiple issues develop simultaneously and operators need rapid situational awareness. Scenario Setup: Assembly Line Degradation During Shift B operations on a Tuesday evening at a precision manufacturing facility, multiple equipment anomalies develop on production lines. The plant overview dashboard begins displaying elevated alert counts. Telemetry monitoring detects concerning patterns across Assembly Line 2 equipment. CNC mill CNC-L2-M03 reports a sustained high temperature of 92°C compared to the normal operating range of 65-75°C. Coolant flow sensors indicate a reduction of roughly 42% from baseline levels, dropping from 5.5 L/min to 3.2 L/min. 
Equipment throughput has decreased 28% as thermal management systems automatically throttle spindle speeds to prevent damage. The mill has accumulated 12,847 runtime hours with next scheduled maintenance in 153 hours. Simultaneously, hydraulic pump PUMP-L1-H01 on Line 1 logged a maintenance note during startup: "Grinding noise observed during pump startup, vibration spike detected." Automated vibration sensors registered 5.2 mm/s compared to the 4.0 mm/s alarm threshold. The pump's condition monitoring system flagged bearing wear as probable cause based on frequency analysis of vibration signatures. The shift supervisor, monitoring the plant overview dashboard, sees both alerts flagged as high priority. Traditional response would require accessing SCADA screens, checking telemetry history, reviewing maintenance logs across multiple systems, correlating patterns, and consulting equipment experts, consuming 15-20 minutes of investigation time while equipment conditions potentially worsen. AI-Assisted Response Workflow Instead, the supervisor uses the AI Assistant interface to rapidly assess the situation. The workflow demonstrates how local AI transforms incident response: Step 1 - Initial Assessment Query: Supervisor types "What's the current plant status?" The AI assistant analyzes the complete equipment database and returns: "Plant status as of 22:14: 14 assets monitored, 2 high-priority alerts active. Critical issues: CNC-L2-M03 thermal event (92°C, coolant flow reduced 40%, throughput down 28%). High-priority: PUMP-L1-H01 mechanical degradation (bearing wear suspected, 5.2 mm/s vibration). All other equipment operating within normal parameters. Recommend immediate attention to Line 2 thermal issue—production impact and potential equipment damage risk." Step 2 - Detailed Equipment Analysis: Supervisor navigates to Asset Health tab and selects CNC-L2-M03 for detailed analysis. 
Within 1.8 seconds, the AI summary generates: "CNC-L2-M03 Health Summary: DEGRADED - IMMEDIATE ACTION REQUIRED Current Status: Elevated thermal condition with active coolant flow issue. Operating temperature 92°C exceeds normal range (65-75°C) by 22%. Coolant flow rate 3.2 L/min indicates 42% reduction from baseline 5.5 L/min. Impact Assessment: Throughput reduced 28% due to automatic thermal throttling. Spindle speed limited to 65% capacity to prevent thermal damage. Current production rate: 18 parts/hour vs. target 25 parts/hour. Root Cause Analysis: Coolant flow reduction suggests blockage in cooling circuit or coolant pump degradation. Temperature rise pattern correlates with gradual flow reduction observed over past 6 hours. Recommended Actions: 1. Immediate: Verify coolant reservoir level and pump operation 2. Inspect: Check cooling circuit for blockages or debris 3. Monitor: Temperature trend every 15 minutes until resolved 4. Consider: Production pause if temperature exceeds 95°C Maintenance Priority: HIGH - Production impact and equipment damage risk if unresolved. Estimated repair time: 1-2 hours for coolant system service." Step 3 - Maintenance Log Classification: When the pump maintenance note is entered, the AI classification system analyzes it and returns structured classification within 240ms: { "classification": { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise during startup", "vibration spike to 5.2 mm/s", "frequency pattern indicates bearing wear"], "suspected_cause": "bearing wear - inner race degradation probable", "recommended_actions": [ "schedule bearing inspection within 8 hours", "prepare replacement bearings (6208-2RS)", "plan maintenance window: 2-4 hours", "order backup pump if repair extended" ], "estimated_downtime": "2-4 hours for bearing replacement" } } Step 4 - Decision Support and Prioritization: Supervisor asks "What should I prioritize tonight?" 
The AI synthesizes all context: "Priority recommendations for Shift B: IMMEDIATE (next 30 minutes): - CNC-L2-M03: Dispatch maintenance to coolant system NOW. Production impacted, equipment damage risk if temperature continues rising. Impact: $2,400/hour revenue loss at current throughput reduction. HIGH (within 4 hours): - PUMP-L1-H01: Schedule bearing inspection for line changeover at 02:00. Pump operational but degraded; plan bearing replacement parts and backup pump. Risk: Catastrophic failure if bearing fully degrades. STANDARD (this week): - CONV-L1-B02: Scheduled inspection per maintenance calendar - Routine: Daily checklist completion for all lines Resource allocation: Focus maintenance team lead on CNC thermal issue. Assign technician to prepare pump bearing replacement kit during current production run." Outcome: Rapid Response Enabled by Local AI The AI-assisted workflow reduced incident assessment time from 15-20 minutes to under 3 minutes. The supervisor immediately dispatched maintenance to investigate CNC-L2-M03's coolant system, identified as the root cause. Technicians found debris blocking a cooling circuit junction; a 5-minute clearance procedure restored coolant flow to 5.3 L/min. Equipment temperature dropped to 73°C within 10 minutes, and full production throughput resumed. For the hydraulic pump, bearing inspection was scheduled during the planned line changeover at 02:00, preventing an emergency production stoppage. Bearings were replaced preemptively, avoiding the catastrophic pump failure that would have caused 6-8 hours of unplanned downtime. Total downtime avoided: 8+ hours. Revenue protection: ~$48,000 based on the facility's production value. All decisions were made with AI running entirely on a local edge device: no cloud dependency, no data exposure, no network latency impact. The complete incident response workflow operated on facility-controlled infrastructure with full data sovereignty. 
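The percentages quoted throughout the scenario follow directly from the raw readings. As a quick worked check, the sketch below recomputes them from the figures given above (5.5 vs. 3.2 L/min coolant flow, 92°C against the 75°C upper bound of the normal range, and 18 vs. 25 parts/hour). The helper name `percentChange` is an illustrative choice, not something from the sample code.

```javascript
// Percent change of an observed value relative to a reference value.
// Positive results are increases, negative results are reductions.
function percentChange(reference, observed) {
  return ((observed - reference) / reference) * 100;
}

const coolantChange = percentChange(5.5, 3.2);   // coolant flow vs. baseline (L/min)
const tempChange = percentChange(75, 92);        // temperature vs. upper bound of 65-75°C
const throughputChange = percentChange(25, 18);  // parts/hour vs. target

console.log(coolantChange.toFixed(1));    // -41.8
console.log(tempChange.toFixed(1));       // 22.7
console.log(throughputChange.toFixed(1)); // -28.0
```

The results line up with the 42% coolant reduction, the roughly 22% temperature excess, and the 28% throughput drop reported in the AI summary.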
Key Takeaways for Manufacturing AI Deployment Building production-ready AI systems for industrial environments requires architectural decisions that prioritize operational reliability, data sovereignty, and integration pragmatism over cutting-edge model sophistication. Several critical lessons emerge from implementing on-premises manufacturing intelligence: Data locality through architectural guarantee: On-premises AI ensures proprietary production data never leaves facility networks, not through configuration but through fundamental architecture. There are no cloud API calls to misconfigure, no data upload features to accidentally enable, no external endpoints to compromise. This physical data boundary satisfies security audits and competitive protection requirements with demonstrable certainty rather than contractual assurance. Model selection determines deployment feasibility: Smaller models (0.5B-2B parameters) enable deployment on commodity server hardware without specialized AI accelerators. These models provide sufficient accuracy for industrial classification, summarization, and conversational assistance while maintaining the sub-3-second response times essential for operator acceptance. Larger models improve nuance but require GPU infrastructure and longer inference times that may not justify marginal accuracy gains for operational decision-making. Graceful degradation preserves operations: AI capabilities enhance but never replace core monitoring functions. Equipment dashboards must display raw telemetry, alert states, and historical trends even when AI analysis is unavailable. This architectural separation ensures operations continue during AI service maintenance, model updates, or system failures. AI becomes value-add intelligence rather than a critical dependency. Context-rich prompts determine accuracy: Generic prompts produce generic responses unsuitable for operational decisions. 
Effective industrial prompts include equipment specifications, normal operating ranges, alert thresholds, maintenance history, and temporal context. This structured context enables models to provide grounded, specific recommendations citing actual equipment conditions rather than hallucinated speculation. Prompt engineering matters more than model size for operational accuracy. Structured outputs enable automation: JSON response formats with predefined fields allow automated systems to parse classifications, severity levels, and recommended actions without fragile natural language parsing. Maintenance management systems can automatically route work orders, trigger alerts, and update dashboards based on AI classification results. This structured integration scales AI beyond human-read summaries into automated workflow systems. Integration patterns bridge legacy and modern: OPC-UA clients and Modbus TCP gateways connect decades-old PLCs and SCADA systems to modern AI backends without replacing functional control infrastructure. This evolutionary approach enables AI adoption without massive capital equipment replacement. Manufacturing facilities can augment existing investments rather than ripping and replacing proven systems. Responsible AI through grounding and constraints: Industrial AI must acknowledge limits and avoid speculation beyond available data. System prompts should explicitly instruct models: "If you don't have information to answer, clearly state that" and "Do not speculate about equipment conditions beyond provided data." This reduces hallucination risk and maintains operator trust. Operators must still verify AI recommendations against their domain expertise; position AI as decision support that augments human judgment rather than replacing it. 
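To make the "structured outputs enable automation" takeaway concrete, here is a rough sketch of how a downstream system might turn a classification object into a routing decision. The team names and the `routeClassification` function are hypothetical; the response windows follow the priority definitions in the classification prompt earlier in the article (immediate, 24 hours, one week, routine). The sketch assumes the category and priority have already been validated against the allowed lists, as the classification endpoint does.

```javascript
// Hypothetical routing table: team names are illustrative assumptions.
const TEAM_BY_CATEGORY = {
  MECHANICAL: 'mechanical-maintenance',
  ELECTRICAL: 'electrical-maintenance',
  HYDRAULIC: 'fluid-systems',
  THERMAL: 'fluid-systems',
  SOFTWARE: 'controls-engineering',
  ROUTINE: 'maintenance-planning'
};

// Response windows in hours, based on the priority levels in the prompt.
// null means "no fixed deadline" (routine work).
const RESPONSE_HOURS = { CRITICAL: 0, HIGH: 24, MEDIUM: 168, LOW: null };

function routeClassification(classification) {
  return {
    // Fall back to planning if the category is missing from the table.
    team: TEAM_BY_CATEGORY[classification.category] ?? 'maintenance-planning',
    dueWithinHours: RESPONSE_HOURS[classification.priority] ?? null,
    notifySupervisor: classification.priority === 'CRITICAL',
    tasks: classification.recommended_actions ?? []
  };
}

const workOrder = routeClassification({
  category: 'MECHANICAL',
  priority: 'HIGH',
  symptoms: ['grinding noise during startup', 'vibration spike to 5.2 mm/s'],
  suspected_cause: 'bearing wear',
  recommended_actions: ['inspect bearings', 'schedule replacement if confirmed worn']
});
console.log(workOrder.team, workOrder.dueWithinHours);
```

Because the routing depends only on two enumerated fields, it keeps working even if the model's free-text fields vary from call to call.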
Getting Started: Installation and Deployment Implementing the manufacturing intelligence system requires Foundry Local installation, Node.js backend deployment, and frontend hosting, all achievable within a few hours for facilities with existing IT infrastructure and server hardware. Prerequisites and System Requirements Hardware requirements depend on the selected AI models. The minimum configuration supports the Phi-3.5-mini model (2.1GB): 8GB RAM; a 4-core CPU (Intel Core i5/AMD Ryzen 5 or better); 50GB of available storage for model files and application data; and Windows 11 or Windows Server 2025. Recommended production configuration: 16GB+ RAM (supports larger models and concurrent requests); an 8-core CPU or NVIDIA GPU (RTX 3060/4060 or better for 3-5x inference acceleration); 100GB SSD storage; and a gigabit network interface for intra-facility communication. Software prerequisites: Node.js 18 or newer (download from nodejs.org or install via a system package manager); Git for repository cloning; a modern web browser (Chrome, Edge, Firefox) for frontend access; and, on Windows, PowerShell 5.1+. 
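Before installing anything, it can help to check a machine against the minimums listed above. The sketch below is one possible preflight check, not part of the sample repository: `checkPrerequisites` and its parameter names are hypothetical. It takes the values as arguments so it stays easy to test; on a real machine they could come from `process.versions.node`, `os.totalmem()`, and `os.cpus().length` in Node's built-in `os` module.

```javascript
// Hypothetical preflight check against the minimum requirements listed above.
function checkPrerequisites({ nodeVersion, totalMemBytes, cpuCount }) {
  const problems = [];

  // Node.js 18 or newer is required.
  const major = Number.parseInt(nodeVersion.split('.')[0], 10);
  if (major < 18) {
    problems.push(`Node.js 18+ required, found ${nodeVersion}`);
  }

  // Machines sold as "8GB" usually report slightly less usable memory,
  // so this assumes a 7.5 GiB tolerance rather than a strict 8.
  const memGiB = totalMemBytes / 1024 ** 3;
  if (memGiB < 7.5) {
    problems.push(`8GB RAM required, found ${memGiB.toFixed(1)} GiB`);
  }

  // Minimum configuration calls for a 4-core CPU.
  if (cpuCount < 4) {
    problems.push(`4 CPU cores required, found ${cpuCount}`);
  }

  return problems;
}

// Example with literal values standing in for a small test machine.
const issues = checkPrerequisites({
  nodeVersion: '16.20.0',
  totalMemBytes: 4 * 1024 ** 3,
  cpuCount: 2
});
console.log(issues.length === 0 ? 'All checks passed' : issues.join('\n'));
```

An empty result means the machine meets the minimum configuration; storage and operating system checks are left out of this sketch.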
Foundry Local Installation and Model Setup Install Foundry Local using system-appropriate package manager: # Windows installation via winget winget install Microsoft.FoundryLocal # Verify installation foundry --version # macOS installation via Homebrew brew install microsoft/foundrylocal/foundrylocal Download AI models based on hardware capabilities and accuracy requirements: # Fast option: Qwen 0.5B (500MB, 100-200ms inference) foundry model download qwen2.5-0.5b # Balanced option: Phi-3.5 Mini (2.1GB, 1-3 second inference) foundry model download phi-3.5-mini # High quality option: Phi-4 Mini (3.6GB, 2-5 second inference) foundry model download phi-4-mini # Check downloaded models foundry model list Load a model into the Foundry Local service: # Load default recommended model foundry model run phi-3.5-mini # Verify service is running and model is loaded foundry service status The Foundry Local service will start automatically and expose a REST API on localhost:8008 (default port). The backend application connects to this endpoint for all AI inference operations. Backend Service Deployment Clone the repository and install dependencies: # Clone from GitHub git clone https://github.com/leestott/FoundryLocal-IndJSsample.git cd FoundryLocal-IndJSsample # Navigate to backend directory cd backend # Install Node.js dependencies npm install # Start the backend service npm start The backend server will initialize and display startup messages: Manufacturing AI Backend Starting... 
✓ Foundry Local client initialized: http://localhost:8008 ✓ Model detected: phi-3.5-mini ✓ Sample data loaded: 6 assets, 12 maintenance logs ✓ Server running on port 3000 ✓ Frontend accessible at: http://localhost:3000 Health check: http://localhost:3000/api/health Verify backend health: # Test backend API curl http://localhost:3000/api/health # Expected response: {"ok":true,"service":"manufacturing-ai-backend"} # Test Foundry Local integration curl http://localhost:3000/api/models/status # Expected response: {"serviceRunning":true,"model":"phi-3.5-mini"} Frontend Access and Validation Open the web interface by navigating to web/index.html in a browser or starting from the backend URL: # Windows: Open frontend directly start http://localhost:3000 # macOS/Linux: Open frontend open http://localhost:3000 # or xdg-open http://localhost:3000 The web interface displays a navigation bar with four main sections: Overview: Plant-wide dashboard showing all equipment with health status cards, alert counts, and "Load Scenario" button to populate sample data Asset Health: Equipment selector dropdown, telemetry display, active alerts list, and "Generate AI Summary" button for detailed analysis Maintenance: Text area for maintenance log entry, "Classify Log" button, and classification result display showing category, priority, and recommendations AI Assistant: Chat interface with message input, conversation history, and natural language query capabilities Running the Sample Scenario Test the complete system with included sample data: Load scenario data: Click "Load Scenario Inputs" button in Overview tab. This populates equipment database with CNC-L2-M03 thermal event, PUMP-L1-H01 vibration alert, and baseline telemetry for all assets. Generate asset summary: Navigate to Asset Health tab, select "CNC-L2-M03" from dropdown, click "Generate AI Analysis". 
Within 2-3 seconds, a detailed health summary appears explaining the thermal condition, coolant flow issue, impacts, and recommended actions. Classify maintenance note: Go to Maintenance tab, enter text: "Grinding noise on startup, vibration 5.2 mm/s, suspect bearing wear". Click "Classify Log". AI categorizes it as MECHANICAL/HIGH priority with specific repair recommendations. Ask operational questions: Open AI Assistant tab, type "What's wrong with Line 2?" or "Which equipment needs attention?" AI responds with specific equipment IDs, current conditions, and a prioritized action list. Production Deployment Considerations For actual manufacturing facility deployment, several additional configurations apply: Hardware placement: Deploy the backend service on a dedicated server within the manufacturing network zone. Avoid co-locating AI workloads with critical SCADA/MES systems due to resource contention. Use a physical server or VM with direct hardware access for GPU acceleration. Network configuration: The backend should reside behind the facility firewall with access restricted to internal networks. Do not expose the AI service directly to the internet; use a VPN for remote access if required. Implement authentication via Active Directory/LDAP integration. Configure firewall rules permitting connections from operator workstations and monitoring systems only. Data integration: Replace sample JSON data with connections to actual data sources. Implement an OPC-UA client for SCADA integration, connect to the MES database for production schedules, and integrate with the CMMS for maintenance history. The code includes placeholder functions for external data source integration; customize these for facility-specific systems. Model selection: Choose an appropriate model based on hardware and accuracy requirements. Start with phi-3.5-mini for production deployment. Upgrade to phi-4-mini if analysis quality needs improvement and hardware supports it. 
Use qwen2.5-0.5b for high-throughput scenarios where speed matters more than nuanced understanding. Test all models against validation scenarios before production promotion. Monitoring and maintenance: Implement health checks monitoring Foundry Local service status, backend API responsiveness, model inference latency, and error rates. Set up alerting for when inference latency exceeds thresholds or the service is unavailable. Establish procedures for model updates during planned maintenance windows. Keep audit logs of all AI invocations for compliance and troubleshooting. Resources and Further Learning The complete implementation with detailed comments, sample data, and documentation provides a foundation for building custom manufacturing intelligence systems. Additional resources support extension and adaptation to specific facility requirements. FoundryLocal-IndJSsample GitHub Repository – Complete source code with JavaScript backend, HTML/CSS/JS frontend, sample manufacturing data, and comprehensive README Installation and Configuration Guide – Detailed setup instructions, API documentation, troubleshooting procedures, and deployment guidance Microsoft Foundry Local Documentation – Official SDK reference, model catalog, hardware requirements, and performance tuning guidance Sample Manufacturing Data Format – JSON structure examples for equipment telemetry, maintenance logs, alert definitions, and operational events Backend Implementation Reference – Express server architecture, Foundry Local SDK integration patterns, API endpoint implementations, and error handling OPC Foundation – Industrial communication standards (OPC-UA, OPC DA) for SCADA system integration and PLC connectivity ISA Standards – International Society of Automation standards for industrial systems, SCADA architecture, and manufacturing execution systems EdgeAI for Beginners – Learn more about Edge AI using these course materials The manufacturing intelligence implementation demonstrates that sophisticated 
AI capabilities can run entirely on-premises without compromising operational requirements. Facilities gain predictive maintenance insights, natural language operational support, and automated equipment analysis while maintaining complete data sovereignty, zero network dependency, and deterministic performance characteristics essential for production environments.
Lee_Stott
Feb 24, 2026 Place Microsoft Developer Community Blog
Teaching AI Development Through Gamification:
Introduction

Learning AI development can feel overwhelming. Developers face abstract concepts like embeddings, prompt engineering, and workflow orchestration, topics that traditional tutorials struggle to make tangible. How do you teach someone what an embedding "feels like" or why prompt engineering matters beyond theoretical examples? The answer lies in experiential learning through gamification. Instead of reading about AI concepts, what if developers could play a game that teaches these ideas through progressively challenging levels, immediate feedback, and real AI interactions? This article explores exactly that: building an educational adventure game that transforms AI learning from abstract theory into hands-on exploration.

We'll dive into Foundry Local Learning Adventure, a JavaScript-based game that teaches AI fundamentals through five interactive levels. You'll learn how to create engaging educational experiences, integrate local AI models using Foundry Local, design progressive difficulty curves, and build cross-platform applications that run both in browsers and terminals. Whether you're an educator designing technical curriculum or a developer building learning tools, this architecture provides a practical blueprint for gamified technical education.

Why Gamification Works for Technical Learning

Traditional technical education follows a predictable pattern: read documentation, watch tutorials, attempt exercises, struggle with setup, eventually give up. The problem isn't content quality; it's engagement and friction. Gamification addresses both issues simultaneously.

By framing learning as progression through levels, you create intrinsic motivation. Each completed challenge feels like unlocking a new ability in a game, triggering the same dopamine response that keeps players engaged in entertainment experiences. Progress is visible, achievements are celebrated, and setbacks feel like natural parts of the journey rather than personal failures.
More importantly, gamification reduces friction. Instead of "install dependencies, configure API keys, read documentation, write code, debug errors," learners simply start the game and begin playing. The game handles setup, provides guardrails, and offers immediate feedback. When a concept clicks, the game celebrates it. When learners struggle, hints appear automatically.

For AI development specifically, gamification solves a unique challenge: making probabilistic, non-deterministic systems feel approachable. Traditional programming has clear right and wrong answers, but AI outputs vary. A game can frame this variability as exploration rather than failure, teaching developers to evaluate AI responses critically while maintaining confidence.

Architecture Overview: Dual-Platform Design for Maximum Reach

The Foundry Local Learning Adventure implements a dual-platform architecture with separate but consistent implementations for web browsers and command-line terminals. This design maximizes accessibility: learners can start playing instantly in a browser, then graduate to CLI mode for the full terminal experience when they're ready to go deeper.

The web version prioritizes zero-friction onboarding. It's deployed to GitHub Pages and can also be opened locally via a simple HTTP server, with no build step and no package managers. The game starts with simulated AI responses in demo mode, but crucially, it also supports real AI responses when Foundry Local is installed. The web version auto-discovers Foundry Local's dynamic port through a foundry-port.json file (written by the startup scripts) or by scanning common ports. Progress saves to localStorage, badges unlock as you complete challenges, and an AI-powered mentor named Sage guides you through a chat widget in the corner. This version is perfect for classrooms, conference demos, and learners who want to try before committing to a full CLI setup.

The CLI version provides the full terminal experience with real AI interactions.
Built on Node.js with ES modules, this version features a custom FoundryLocalClient class that connects to Foundry Local's OpenAI-compatible REST API. Instead of relying on an external SDK, the game implements its own API client with automatic port discovery, model selection, and graceful fallback to demo mode. The terminal interface includes a rich command system (play, hint, ask, explain, progress, badges), and the Sage mentor provides contextual guidance throughout.

Both versions implement the same five levels and learning objectives independently. The CLI uses game/src/game.js, levels.js, and mentor.js as ES modules, while the web version uses game/web/game-web.js and game-data.js. A key innovation is the automatic port discovery system, which eliminates manual configuration:

// 3-tier port discovery strategy (game/src/game.js)
class FoundryLocalClient {
  constructor() {
    this.commonPorts = [61341, 5272, 51319, 5000, 8080];
    this.mode = 'demo'; // 'local', 'azure', or 'demo'
  }

  async initialize() {
    // Tier 1: CLI discovery - parse 'foundry service status' output
    const cliPort = await this.discoverPortViaCLI();
    if (cliPort) {
      this.baseUrl = cliPort;
      this.mode = 'local';
      return;
    }

    // Tier 2: Try configured URL from config.json
    const configuredUrl = config.foundryLocal.baseUrl;
    if (await this.tryFoundryUrl(configuredUrl)) {
      this.baseUrl = configuredUrl; // record the URL that responded
      this.mode = 'local';
      return;
    }

    // Tier 3: Scan common ports
    for (const port of this.commonPorts) {
      const url = `http://127.0.0.1:${port}`;
      if (await this.tryFoundryUrl(url)) {
        this.baseUrl = url; // record the URL that responded
        this.mode = 'local';
        return;
      }
    }

    // Fallback: demo mode with simulated responses
    console.log('💡 Running in demo mode (no Foundry Local detected)');
    this.mode = 'demo';
  }

  async chat(messages, options = {}) {
    if (this.mode === 'demo') return this.getDemoResponse(messages);

    const response = await fetch(`${this.baseUrl}/v1/chat/completions`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: this.selectedModel,
        messages,
        // ?? keeps an explicit 0 (|| would silently replace it with the default)
        temperature: options.temperature ?? 0.7,
        max_tokens: options.max_tokens ?? 300
      })
    });
    if (!response.ok) {
      throw new Error(`Foundry Local request failed: HTTP ${response.status}`);
    }
    const data = await response.json();
    return data.choices[0].message.content;
  }
}

This architecture demonstrates several key principles for educational software:

Progressive disclosure: Start simple (web demo mode), add complexity optionally (real AI via Foundry Local or Azure)
Consistent learning outcomes: Both platforms teach the same five concepts through independently implemented but equivalent experiences
Zero barriers to entry: No installation is required for the web version, which removes a common reason learners abandon technical tutorials
Automatic service discovery: The 3-tier port discovery strategy means no manual configuration; just install Foundry Local and play
Graceful degradation: Three connection modes (local, Azure, demo) ensure the game always works regardless of setup

Level Design: Teaching AI Concepts Through Progressive Challenges

The game's five levels form a carefully designed curriculum that builds AI understanding incrementally. Each level introduces one core concept, provides hands-on practice, and validates learning before proceeding.

Level 1: Meet the Model teaches the fundamental request-response pattern. Learners send their first message to an AI and see it respond. The challenge is deliberately trivial (just say hello) because the goal is building confidence. The level succeeds when the learner realizes "I can talk to an AI and it understands me." This moment of agency sets the foundation for everything else. The implementation focuses on positive reinforcement.
In the CLI version, the Sage mentor celebrates each completion with contextual messages, while the web version displays inline celebration banners with badge animations:

// Level 1 execution (game/src/game.js)
async executeLevel1() {
  const level = this.levels.getLevel(1);
  this.displayLevelHeader(level);

  // Sage introduces the level
  const intro = await this.mentor.introduceLevel(1);
  console.log(`\n🧙 Sage: ${intro}`);

  const userPrompt = await this.askQuestion('\nYour prompt: ');
  console.log('\n🤖 AI is thinking...');

  const response = await this.client.chat([
    { role: 'system', content: 'You are Sage, a friendly AI mentor.' },
    { role: 'user', content: userPrompt }
  ]);
  console.log(`\n📨 AI Response:\n${response}`);

  if (response && response.length > 10) {
    // Sage celebrates
    const celebration = await this.mentor.celebrateLevelComplete(1);
    console.log(`\n🧙 Sage: ${celebration}`);
    console.log('\n🎯 You earned the Prompt Apprentice badge!');
    console.log('🏆 +100 points');
    this.progress.completeLevel(1, 100, '🎯 Prompt Apprentice');
  }
}

This celebration pattern repeats throughout: explicit acknowledgment of success via the Sage mentor, an explanation of what was learned, and a preview of what's next. The mentor system (game/src/mentor.js) provides contextual encouragement using AI-generated or pre-written fallback messages, transforming abstract concepts into concrete achievements.

Level 2: Prompt Mastery introduces prompt quality through comparison. The game presents a deliberately poor prompt: "tell me stuff about coding." Learners must rewrite it to be specific, contextual, and actionable. The game runs both prompts, displays results side-by-side, and asks learners to evaluate the difference.

// Level 2: Prompt Improvement (game/src/game.js)
async executeLevel2() {
  const level = this.levels.getLevel(2);
  this.displayLevelHeader(level);

  const intro = await this.mentor.introduceLevel(2);
  console.log(`\n🧙 Sage: ${intro}`);

  // Show the bad prompt
  const badPrompt = "tell me stuff about coding";
  console.log(`\n❌ Poor prompt: "${badPrompt}"`);
  console.log('\n🤖 Getting response to bad prompt...');
  const badResponse = await this.client.chat([
    { role: 'user', content: badPrompt }
  ]);
  console.log(`\n📊 Bad prompt result:\n${badResponse}`);

  // Get the learner's improved version
  console.log('\n✍️ Now write a BETTER prompt about the same topic:');
  const goodPrompt = await this.askQuestion('Your improved prompt: ');
  console.log('\n🤖 Getting response to your prompt...');
  const goodResponse = await this.client.chat([
    { role: 'user', content: goodPrompt }
  ]);
  console.log(`\n📊 Your prompt result:\n${goodResponse}`);

  // Evaluate with a simple heuristic: the improved prompt must be longer
  // than the poor one and must produce a non-empty response
  const isImproved = goodPrompt.length > badPrompt.length && goodResponse.length > 0;

  if (isImproved) {
    const celebration = await this.mentor.celebrateLevelComplete(2);
    console.log(`\n🧙 Sage: ${celebration}`);
    console.log('\n✨ You earned the Prompt Engineer badge!');
    console.log('🏆 +150 points');
    this.progress.completeLevel(2, 150, '✨ Prompt Engineer');
  } else {
    const hint = await this.mentor.provideHint(2);
    console.log(`\n💡 Sage: ${hint}`);
  }
}

This comparative approach is powerful: learners don't just read about prompt engineering, they experience its impact directly. The before/after comparison makes quality differences hard to miss.

Level 3: Embeddings Explorer demystifies semantic search through practical demonstration. Learners search a knowledge base about Foundry Local using natural language queries. The game shows how embedding similarity works by returning relevant content even when exact keywords don't match.

// Level 3: Embedding Search (game/src/game.js)
async executeLevel3() {
  const level = this.levels.getLevel(3);
  this.displayLevelHeader(level);

  // Knowledge base (the game loads it from game/data/knowledge-base.json; an excerpt is shown inline)
  const knowledgeBase = [
    { id: 1, content: "Foundry Local runs AI models entirely on your device" },
    { id: 2, content: "Embeddings convert text into numerical vectors" },
    { id: 3, content: "Cosine similarity measures how related two texts are" },
    // ... more entries about AI and Foundry Local
  ];

  const query = await this.askQuestion('\n🔍 Search query: ');

  // Get embedding for user's query
  const queryEmbedding = await this.client.getEmbedding(query);

  // Get embeddings for all knowledge base entries
  const results = [];
  for (const item of knowledgeBase) {
    const itemEmbedding = await this.client.getEmbedding(item.content);
    const similarity = this.cosineSimilarity(queryEmbedding, itemEmbedding);
    results.push({ ...item, similarity });
  }

  // Sort by similarity and show top matches
  results.sort((a, b) => b.similarity - a.similarity);
  console.log('\n📑 Top matches:');
  results.slice(0, 3).forEach((r, i) => {
    console.log(`  ${i + 1}. (${(r.similarity * 100).toFixed(1)}%) ${r.content}`);
  });
}

// Cosine similarity calculation (also in TaskHandler)
cosineSimilarity(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  if (magA === 0 || magB === 0) return 0; // avoid dividing by zero for empty text
  return dot / (magA * magB);
}

// Demo mode generates pseudo-embeddings when Foundry isn't available
getPseudoEmbedding(text) {
  // 128-dimension vector built from character codes, for offline demonstration
  const embedding = new Array(128).fill(0);
  for (let i = 0; i < text.length; i++) {
    embedding[i % 128] += text.charCodeAt(i) / 1000;
  }
  return embedding;
}

Learners query things like "How do I run AI offline?" and discover content about Foundry Local's offline capabilities, even though the word "offline" appears nowhere in the result.
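To make the ranking step concrete, here is a minimal sketch that scores a few hand-written three-dimensional vectors against a query vector. The vectors and the rankBySimilarity helper are invented for illustration and are not taken from the game's source; real embedding models return far more dimensions. The length check is an extra safeguard added for this sketch.

```javascript
// Cosine similarity with guards for mismatched lengths and zero-magnitude vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  if (magA === 0 || magB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Hypothetical helper: rank documents by similarity to a query vector.
function rankBySimilarity(queryVector, docs) {
  return docs
    .map((doc) => ({ ...doc, similarity: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.similarity - x.similarity);
}

// Toy 3-dimensional "embeddings", invented for illustration only.
const docs = [
  { id: 1, content: 'Runs AI models on your device', vector: [0.9, 0.1, 0.0] },
  { id: 2, content: 'Text becomes numerical vectors', vector: [0.1, 0.9, 0.2] },
  { id: 3, content: 'Empty entry', vector: [0, 0, 0] }
];

const ranked = rankBySimilarity([1, 0, 0], docs);
console.log(ranked.map((r) => `${r.id}: ${r.similarity.toFixed(3)}`));
```

The query vector points almost in the same direction as document 1, so that document ranks first, while the all-zero entry scores 0 instead of NaN.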
When Foundry Local is running, the game calls the /v1/embeddings endpoint for real vector representations. In demo mode, a pseudo-embedding function generates 128-dimension vectors that still demonstrate the mechanics of similarity search, though not real semantic matching. This concrete demonstration of semantic understanding is more convincing than a purely theoretical explanation.

Level 4: Workflow Wizard teaches AI pipeline composition. Learners build a three-step workflow: summarize text → extract keywords → generate questions. Each step uses the previous output as input, demonstrating how complex AI tasks decompose into chains of simpler operations.

// Level 4: Workflow Builder (game/src/game.js)
async executeLevel4() {
  const level = this.levels.getLevel(4);
  this.displayLevelHeader(level);

  const intro = await this.mentor.introduceLevel(4);
  console.log(`\n🧙 Sage: ${intro}`);

  console.log('\n📝 Enter text for the 3-step AI pipeline:');
  const inputText = await this.askQuestion('Input text: ');

  // Step 1: Summarize
  console.log('\n⚙️ Step 1: Summarizing...');
  const summary = await this.client.chat([
    { role: 'system', content: 'Summarize this in 2 sentences.' },
    { role: 'user', content: inputText }
  ]);
  console.log(` Result: ${summary}`);

  // Step 2: Extract keywords (chained from Step 1 output)
  console.log('\n🔑 Step 2: Extracting keywords...');
  const keywords = await this.client.chat([
    { role: 'system', content: 'Extract 5 important keywords.' },
    { role: 'user', content: summary }
  ]);
  console.log(` Keywords: ${keywords}`);

  // Step 3: Generate questions (chained from Step 2 output)
  console.log('\n❓ Step 3: Generating study questions...');
  const questions = await this.client.chat([
    { role: 'system', content: 'Create 3 quiz questions about these topics.' },
    { role: 'user', content: keywords }
  ]);
  console.log(` Questions:\n${questions}`);

  console.log('\n✅ Workflow complete!');
  const celebration = await this.mentor.celebrateLevelComplete(4);
  console.log(`\n🧙 Sage: ${celebration}`);
  console.log('\n⚡ You earned the Workflow Wizard badge!');
  console.log('🏆 +250 points');
  this.progress.completeLevel(4, 250, '⚡ Workflow Wizard');
}

This level bridges the gap between "toy examples" and real applications. Learners see firsthand how combining simple AI operations creates sophisticated functionality.

Level 5: Build Your Own Tool challenges learners to create a custom AI-powered tool by selecting from pre-built templates and configuring them. Rather than asking learners to write arbitrary code, the game provides four structured templates that demonstrate how AI tools work in practice:

// Level 5: Tool Builder templates (game/web/game-web.js)
const TOOL_TEMPLATES = [
  {
    id: 'summarizer',
    name: '📝 Text Summarizer',
    description: 'Summarizes long text into key points',
    systemPrompt: 'You are a text summarization tool. Provide concise summaries.',
    exampleInput: 'Paste any long article or document...'
  },
  {
    id: 'translator',
    name: '🌐 Code Translator',
    description: 'Translates code between programming languages',
    systemPrompt: 'You are a code translation tool. Convert code accurately.',
    exampleInput: 'function hello() { console.log("Hello!"); }'
  },
  {
    id: 'reviewer',
    name: '🔍 Code Reviewer',
    description: 'Reviews code for bugs, style, and improvements',
    systemPrompt: 'You are a code review tool. Identify issues and suggest fixes.',
    exampleInput: 'Paste code to review...'
  },
  {
    id: 'custom',
    name: '✨ Custom Tool',
    description: 'Design your own AI tool with a custom system prompt',
    systemPrompt: '', // Learner provides this
    exampleInput: ''
  }
];

// Tool testing sends the configured system prompt + user input to Foundry Local
async function testTool(template, userInput) {
  const response = await callFoundryAPI([
    { role: 'system', content: template.systemPrompt },
    { role: 'user', content: userInput }
  ]);
  console.log(`🔧 Tool output: ${response}`);
  return response;
}

This template-based approach is safer and more educational than arbitrary code execution. Learners select a template, customize its system prompt, test it with sample input, and see how the AI responds differently based on the tool's configuration. The "Custom Tool" option lets advanced learners design their own system prompts from scratch. Completing this level marks real understanding: learners aren't just using AI, they're shaping what it can do through prompt design and tool composition.

Building the Web Version: Zero-Install Educational Experience

The web version demonstrates how to create educational software that requires no setup. This is critical for workshops, classroom settings, and casual learners who won't commit to installation until they see value. The architecture is deliberately simple: vanilla JavaScript with ES6 modules, no build tools, no package managers.
The HTML includes a multi-screen layout with a welcome screen, level selection grid, game area, and modals for progress, badges, help, and game completion:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Foundry Local Learning Adventure</title>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <div id="welcome-screen" class="screen active">
    <h1>🎮 Foundry Local Learning Adventure</h1>
    <p>Master Microsoft Foundry AI - One Level at a Time!</p>
    <input type="text" id="player-name" placeholder="Enter your name">
    <button id="start-btn">Start Adventure</button>
    <div id="foundry-status"></div>
  </div>

  <div id="menu-screen" class="screen">
    <div class="level-grid">
    </div>
    <div class="stats-bar">
      <span id="points-display">0 points</span>
      <span id="badges-count">0/5 badges</span>
    </div>
  </div>

  <div id="level-screen" class="screen">
    <div id="level-header"></div>
    <div id="task-area"></div>
    <div id="response-area"></div>
    <div id="hint-area"></div>
  </div>

  <div id="mentor-chat" class="mentor-widget">
    <div class="mentor-header">🧙 Sage (AI Mentor)</div>
    <div id="mentor-messages"></div>
    <input type="text" id="mentor-input" placeholder="Ask Sage anything...">
  </div>

  <script type="module" src="game-data.js"></script>
  <script type="module" src="game-web.js"></script>
</body>
</html>

A critical feature of the web version is its ability to connect to a real Foundry Local instance.
On startup, the game checks for a foundry-port.json file (written by the cross-platform start scripts) and falls back to scanning common ports:

// game/web/game-web.js - Foundry Local auto-discovery
let foundryConnection = { connected: false, baseUrl: null };

async function checkFoundryConnection() {
  // Try reading port from discovery file (written by start scripts)
  const discoveredPort = await readDiscoveredPort();
  if (discoveredPort) {
    try {
      const resp = await fetch(`${discoveredPort}/v1/models`);
      if (resp.ok) {
        foundryConnection = { connected: true, baseUrl: discoveredPort };
        updateStatusBadge('🟢 Foundry Local Connected');
        return;
      }
    } catch (e) { /* continue to port scan */ }
  }

  // Scan common Foundry Local ports
  const ports = [61341, 5272, 51319, 5000, 8080];
  for (const port of ports) {
    try {
      const resp = await fetch(`http://127.0.0.1:${port}/v1/models`);
      if (resp.ok) {
        foundryConnection = { connected: true, baseUrl: `http://127.0.0.1:${port}` };
        updateStatusBadge('🟢 Foundry Local Connected');
        return;
      }
    } catch (e) { continue; }
  }

  // Demo mode - use simulated responses from DEMO_RESPONSES
  updateStatusBadge('🟡 Demo Mode (install Foundry Local for real AI)');
}

async function callFoundryAPI(messages) {
  if (!foundryConnection.connected) {
    return getDemoResponse(messages); // Simulated responses
  }
  const resp = await fetch(`${foundryConnection.baseUrl}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'auto', messages, temperature: 0.7 })
  });
  const data = await resp.json();
  return data.choices[0].message.content;
}

The web version also includes level-specific UIs: each level type has its own builder function that constructs the appropriate interface. For example, Level 2 (Prompt Improvement) shows a split-view with the bad prompt result on one side and the learner's improved prompt on the other. Level 3 (Embeddings) presents a search interface with similarity scores.
Level 5 (Tool Builder) offers a template selector with four options (Text Summarizer, Code Translator, Code Reviewer, and Custom).

This architecture teaches several patterns for web-based educational tools:

LocalStorage for persistence: Progress survives page refreshes without requiring accounts or databases
ES6 modules for organization: Clean separation between game data (game-data.js) and engine (game-web.js)
Hybrid AI mode: Real AI when Foundry Local is available, simulated responses when it's not, with the same code path for both
Multi-screen navigation: Welcome, menu, level, and completion screens provide clear progression
Always-available mentor: The Sage chat widget in the corner lets learners ask questions at any point

Implementing the CLI Version with Real AI Integration

The CLI version provides the authentic AI development experience. This version requires Node.js and Foundry Local, but rewards setup effort with genuine model interactions. Installation uses a startup script that handles prerequisites:

#!/bin/bash
# scripts/start-game.sh
echo "🎮 Starting Foundry Local Learning Adventure..."

# Check Node.js
if ! command -v node &> /dev/null; then
  echo "❌ Node.js not found. Install from https://nodejs.org/"
  exit 1
fi

# Check Foundry Local
if ! command -v foundry &> /dev/null; then
  echo "❌ Foundry Local not found."
  echo " Install: winget install Microsoft.FoundryLocal"
  exit 1
fi

# Start Foundry service
echo "🚀 Starting Foundry Local service..."
foundry service start

# Wait for service
sleep 2

# Load model
echo "📦 Loading Phi-4 model..."
foundry model load phi-4

# Install dependencies
echo "📥 Installing game dependencies..."
npm install

# Start game
echo "✅ Launching game..."
npm start

The game logic integrates with Foundry Local through the custom FoundryLocalClient class described earlier, rather than an external SDK:

// game/src/game.js (simplified game loop)
import readline from 'readline/promises';

// FoundryLocalClient is the game's own client with 3-tier port discovery
const client = new FoundryLocalClient();
await client.initialize(); // connects to Foundry Local or falls back to demo mode

async function getAIResponse(prompt, level) {
  try {
    const startTime = Date.now();
    const content = await client.chat(
      [
        { role: 'system', content: `You are Sage, a friendly AI mentor teaching ${LEVELS[level - 1].title}.` },
        { role: 'user', content: prompt }
      ],
      { temperature: 0.7, max_tokens: 300 }
    );
    const latency = Date.now() - startTime;
    console.log(`\n⏱️ AI responded in ${latency}ms`);
    return content;
  } catch (error) {
    console.error('❌ AI error:', error.message);
    console.log('💡 Falling back to demo mode...');
    return getDemoResponse(prompt, level);
  }
}

async function playLevel(levelNumber) {
  const level = LEVELS[levelNumber - 1];
  console.clear();
  console.log(`\n${'='.repeat(60)}`);
  console.log(` Level ${levelNumber}: ${level.title}`);
  console.log(`${'='.repeat(60)}\n`);
  console.log(`🎯 ${level.objective}\n`);
  console.log(`📚 ${level.description}\n`);

  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const userPrompt = await rl.question('Your prompt: ');
  rl.close();

  console.log('\n🤖 AI is thinking...');
  const response = await getAIResponse(userPrompt, levelNumber);
  console.log(`\n📨 AI Response:\n${response}\n`);

  // Evaluate success
  if (level.successCriteria(response, userPrompt)) {
    celebrateSuccess(level);
    updateProgress(levelNumber);
    if (levelNumber < 5) {
      const playNext = await askYesNo('Play next level?');
      if (playNext) {
        await playLevel(levelNumber + 1);
      }
    } else {
      showGameComplete();
    }
  } else {
    console.log(`\n💡 Hint: ${level.hints[0]}\n`);
    const retry = await askYesNo('Try again?');
    if (retry) {
      await playLevel(levelNumber);
    }
  }
}

The CLI version adds
several enhancements that deepen learning:

Latency visibility: Display response times so learners understand local vs cloud performance differences
Graceful fallback: If Foundry Local fails, switch to demo mode automatically rather than crashing
Interactive prompts: Use readline for natural command-line interaction patterns
Progress persistence: Save to JSON files so learners can pause and resume
Command history: Log all prompts and responses for learners to review their progression

Key Takeaways and Educational Design Principles

Building effective educational software for technical audiences requires balancing several competing concerns: accessibility vs authenticity, simplicity vs depth, guidance vs exploration. The Foundry Local Learning Adventure succeeds by making deliberate architectural choices that prioritize learner experience.

Key principles demonstrated:

Zero-friction starts win: The web version eliminates all setup barriers, maximizing the chance learners will actually begin
Automatic service discovery: The 3-tier port discovery strategy means no manual configuration; just install Foundry Local and play
Progressive challenge curves build confidence: Each level introduces exactly one new concept, building on previous knowledge
Immediate feedback accelerates learning: Learners know instantly if they succeeded, with Sage providing contextual explanations
Real tools create transferable skills: The CLI version uses professional developer patterns (OpenAI-compatible REST APIs, ES modules, readline) that apply beyond the game
Celebration creates emotional investment: Badges, points, and Sage's encouragement transform learning into achievement
Dual platforms expand reach: Web attracts casual learners, CLI converts them to serious practitioners, and both support real AI
Graceful degradation ensures reliability: Three connection modes (local, Azure, demo) mean the game always works regardless of setup

To extend this approach for your own educational projects, consider:

Domain-specific challenges: Adapt level structure to your technical domain (e.g., API design, database optimization, security practices)
Multiplayer competitions: Add leaderboards and time trials to introduce social motivation
Adaptive difficulty: Track learner performance and adjust challenge difficulty dynamically
Sandbox modes: After completing the curriculum, provide free-play areas for experimentation
Community sharing: Let learners share custom levels or challenges they've created

The complete implementation with all levels, both web and CLI versions, comprehensive tests, and deployment guides is available at github.com/leestott/FoundryLocal-LearningAdventure. You can play the web version immediately at leestott.github.io/FoundryLocal-LearningAdventure or clone the repository to experience the full CLI version with real AI.

Resources and Further Reading

Foundry Local Learning Adventure Repository - Complete source code for both web and CLI versions
Play Online Now - Try the web version instantly in your browser (supports real AI with Foundry Local installed)
Microsoft Foundry Local Documentation - Official SDK and CLI reference
Contributing Guide - How to contribute new levels or improvements
Lee_Stott
Feb 17, 2026 Place Microsoft Developer Community Blog
Benchmarking Local AI Models
Introduction

Selecting the right AI model for your application requires more than reading benchmark leaderboards. Published benchmarks measure academic capabilities (question answering, reasoning, coding), but your application has specific requirements: latency budgets, hardware constraints, quality thresholds. How do you know if Phi-4 provides acceptable quality for your document summarization use case? Will Qwen2.5-0.5B meet your 100ms response time requirement? Does your edge device have sufficient memory for Phi-3.5 Mini? The answer lies in empirical testing: running actual models on your hardware with your workload patterns.

This article demonstrates building a comprehensive model benchmarking platform using FLPerformance, Node.js, React, and Microsoft Foundry Local. You'll learn how to implement scientific performance measurement, design meaningful benchmark suites, visualize multi-dimensional comparisons, and make data-driven model selection decisions. Whether you're evaluating models for production deployment, optimizing inference costs, or validating hardware specifications, this platform provides the tools for rigorous performance analysis.

Why Model Benchmarking Requires Purpose-Built Tools

You cannot assess model performance by running a few manual tests and noting the results. Scientific benchmarking demands controlled conditions, statistically significant sample sizes, multi-dimensional metrics, and reproducible methodology. Several factors make purpose-built tooling essential.

Performance is multi-dimensional. A model might excel at throughput (tokens per second) but suffer at latency (time to first token). Another might generate high-quality outputs slowly. Your application might prioritize consistency over average performance: a model with variable response times (high p95/p99 latency) creates poor user experiences even if averages look good. Measuring all dimensions simultaneously enables informed tradeoffs.

Hardware matters enormously.
Benchmark results from NVIDIA A100 GPUs don't predict performance on consumer laptops. NPU acceleration changes the picture again. Memory constraints affect which models can even load. Test on your actual deployment hardware or comparable specifications to get actionable results.

Concurrency reveals bottlenecks. A model handling one request excellently might struggle with ten concurrent requests. Real applications experience variable load; measuring only single-request performance misses critical scalability constraints. Controlled concurrency testing reveals these limits.

Statistical rigor prevents false conclusions. Running a prompt once and noting the response time tells you nothing about performance distribution. Was this result typical? An outlier? You need dozens or hundreds of trials to establish p50/p95/p99 percentiles, understand variance, and detect stability issues.

Comparison requires controlled experiments. Different prompts, different times of day, and different system loads all introduce confounding variables. Scientific comparison runs identical workloads across models sequentially, controlling for external factors.

Architecture: Three-Layer Performance Testing Platform

FLPerformance implements a clean separation between orchestration, measurement, and presentation.

The frontend React application provides model management, benchmark configuration, test execution, and results visualization. Users add models from the Foundry Local catalog, configure benchmark parameters (iterations, concurrency, timeout values), launch test runs, and view real-time progress. The results dashboard displays comparison tables, latency distribution charts, throughput graphs, and "best model for..." recommendations.

The backend Node.js/Express server orchestrates tests and captures metrics.
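As a rough illustration of the p50/p95/p99 percentiles discussed above, the sketch below summarizes a set of latency samples. It is not taken from FLPerformance: the summarizeLatencies helper and the sample values are hypothetical, and the nearest-rank method used here is only one of several common percentile definitions.

```javascript
// Nearest-rank percentile over an ascending-sorted array
// (one common definition; other tools may interpolate instead).
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Hypothetical helper: summarize latency samples given in milliseconds.
function summarizeLatencies(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const total = sorted.reduce((sum, value) => sum + value, 0);
  return {
    count: sorted.length,
    mean: sorted.length > 0 ? total / sorted.length : null,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99)
  };
}

// Invented sample data: nine fast responses and one slow outlier.
const samples = [120, 95, 110, 105, 480, 100, 98, 102, 115, 108];
const summary = summarizeLatencies(samples);
console.log(summary);
```

With these samples the single 480 ms outlier pulls the mean well above the median, which is why tail percentiles are worth reporting alongside averages.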
It manages the single Foundry Local service instance, loads/unloads models as needed, executes benchmark suites with controlled concurrency, measures comprehensive metrics (TTFT, TPOT, total latency, throughput, error rates), and persists results to JSON storage. WebSocket connections provide real-time progress updates during long benchmark runs.

Foundry Local SDK integration uses the official foundry-local-sdk npm package. The SDK manages the service lifecycle (starting, stopping, health checking) and handles model operations (downloading, loading into memory, unloading). It provides OpenAI-compatible inference APIs for consistent request formatting across models.

The architecture supports comparing multiple models by loading them one at a time, running identical benchmarks, and aggregating results for comparison:

    User Initiates Benchmark Run
      ↓
    Backend receives {models: [...], suite: "default", iterations: 10}
      ↓
    For each model:
      1. Load model into Foundry Local
      2. Execute benchmark suite
         - For each prompt in suite:
           * Run N iterations
           * Measure TTFT, TPOT, total time
           * Track errors and timeouts
           * Calculate tokens/second
      3. Aggregate statistics (mean, p50, p95, p99)
      4. Unload model
      ↓
    Store results with metadata
      ↓
    Return comparison data to frontend
      ↓
    Visualize performance metrics

Implementing Scientific Measurement Infrastructure

Accurate performance measurement requires instrumentation that captures multiple dimensions without introducing measurement overhead:

    // src/server/benchmark.js
    import { performance } from 'perf_hooks';

    export class BenchmarkExecutor {
      constructor(foundryClient, options = {}) {
        this.client = foundryClient;
        // concurrency and timeout_ms are recorded in the config but not applied in this excerpt
        this.options = {
          iterations: options.iterations || 10,
          concurrency: options.concurrency || 1,
          timeout_ms: options.timeout_ms || 30000,
          warmup_iterations: options.warmup_iterations || 2
        };
      }

      async runBenchmarkSuite(modelId, prompts) {
        const results = [];

        // Warmup phase: run the first prompt and discard the measurements
        console.log(`Running ${this.options.warmup_iterations} warmup iterations...`);
        for (let i = 0; i < this.options.warmup_iterations; i++) {
          await this.executeMeasuredPrompt(modelId, prompts[0].text);
        }

        // Actual benchmark runs
        for (const prompt of prompts) {
          console.log(`Benchmarking prompt: ${prompt.id}`);
          const measurements = [];

          for (let i = 0; i < this.options.iterations; i++) {
            const measurement = await this.executeMeasuredPrompt(modelId, prompt.text);
            measurements.push(measurement);

            // Small delay between iterations to stabilize
            await sleep(100);
          }

          results.push({
            prompt_id: prompt.id,
            prompt_text: prompt.text,
            measurements,
            statistics: this.calculateStatistics(measurements)
          });
        }

        return {
          model_id: modelId,
          timestamp: new Date().toISOString(),
          config: this.options,
          results
        };
      }

      async executeMeasuredPrompt(modelId, promptText) {
        const measurement = {
          success: false,
          error: null,
          ttft_ms: null,   // Time to first token
          tpot_ms: null,   // Time per output token
          total_ms: null,
          tokens_generated: 0,
          tokens_per_second: 0
        };

        try {
          const startTime = performance.now();
          let firstTokenTime = null;
          let tokenCount = 0;

          // Streaming completion to measure TTFT
          const stream = await this.client.chat.completions.create({
            model: modelId,
            messages: [{ role: 'user', content: promptText }],
            max_tokens: 200,
            temperature: 0.7,
            stream: true
          });

          for await (const chunk of stream) {
            if (chunk.choices[0]?.delta?.content) {
              if (firstTokenTime === null) {
                firstTokenTime = performance.now();
                measurement.ttft_ms = firstTokenTime - startTime;
              }
              // Counts streamed content chunks, which approximates the token count
              tokenCount++;
            }
          }

          const endTime = performance.now();
          measurement.total_ms = endTime - startTime;
          measurement.tokens_generated = tokenCount;

          if (tokenCount > 1 && firstTokenTime) {
            // TPOT = time after first token / (tokens - 1)
            const timeAfterFirstToken = endTime - firstTokenTime;
            measurement.tpot_ms = timeAfterFirstToken / (tokenCount - 1);
            measurement.tokens_per_second = 1000 / measurement.tpot_ms;
          }

          measurement.success = true;
        } catch (error) {
          measurement.error = error.message;
          measurement.success = false;
        }

        return measurement;
      }

      calculateStatistics(measurements) {
        const successful = measurements.filter(m => m.success);
        const total = measurements.length;

        if (successful.length === 0) {
          return { success_rate: 0, error_rate: 1.0, sample_size: total };
        }

        // percentile() expects ascending input, so every series is sorted first
        const ttfts = successful.map(m => m.ttft_ms).filter(v => v !== null).sort((a, b) => a - b);
        const tpots = successful.map(m => m.tpot_ms).filter(v => v !== null).sort((a, b) => a - b);
        const totals = successful.map(m => m.total_ms).sort((a, b) => a - b);
        const throughputs = successful.map(m => m.tokens_per_second).filter(v => v > 0).sort((a, b) => a - b);

        return {
          success_rate: successful.length / total,
          error_rate: (total - successful.length) / total,
          sample_size: total,
          ttft: {
            mean: mean(ttfts),
            median: percentile(ttfts, 50),
            p95: percentile(ttfts, 95),
            p99: percentile(ttfts, 99),
            min: Math.min(...ttfts),
            max: Math.max(...ttfts)
          },
          tpot: tpots.length > 0 ?
            { mean: mean(tpots), median: percentile(tpots, 50), p95: percentile(tpots, 95) } : null,
          total_latency: {
            mean: mean(totals),
            median: percentile(totals, 50),
            p95: percentile(totals, 95),
            p99: percentile(totals, 99)
          },
          throughput: {
            mean_tps: mean(throughputs),
            median_tps: percentile(throughputs, 50)
          }
        };
      }
    }

    function mean(arr) {
      return arr.reduce((sum, val) => sum + val, 0) / arr.length;
    }

    // Nearest-rank percentile; sortedArr must be sorted ascending
    function percentile(sortedArr, p) {
      const index = Math.ceil((sortedArr.length * p) / 100) - 1;
      return sortedArr[Math.max(0, index)];
    }

    function sleep(ms) {
      return new Promise(resolve => setTimeout(resolve, ms));
    }

This measurement infrastructure captures:

- Time to First Token (TTFT): critical for perceived responsiveness; users notice delays before output begins
- Time Per Output Token (TPOT): determines generation speed after the first token; affects throughput
- Total latency: end-to-end time; matters for batch processing and high-volume scenarios
- Tokens per second: overall throughput metric; useful for capacity planning
- Statistical distributions: the mean alone masks variability; p95/p99 reveal tail latencies that impact user experience
- Success/error rates: stability metrics; some models time out or crash under load

Designing Meaningful Benchmark Suites

Benchmark quality depends on prompt selection. Generic prompts don't reflect real application behavior. Design suites that mirror actual use cases:

    // benchmarks/suites/default.json
    {
      "name": "default",
      "description": "General-purpose benchmark covering diverse scenarios",
      "prompts": [
        {
          "id": "short-factual",
          "text": "What is the capital of France?",
          "category": "factual",
          "expected_tokens": 5
        },
        {
          "id": "medium-explanation",
          "text": "Explain how photosynthesis works in 3-4 sentences.",
          "category": "explanation",
          "expected_tokens": 80
        },
        {
          "id": "long-reasoning",
          "text": "Analyze the economic factors that led to the 2008 financial crisis. Discuss at least 5 major causes with supporting details.",
          "category": "reasoning",
          "expected_tokens": 250
        },
        {
          "id": "code-generation",
          "text": "Write a Python function that finds the longest palindrome in a string. Include docstring and example usage.",
          "category": "coding",
          "expected_tokens": 150
        },
        {
          "id": "creative-writing",
          "text": "Write a short story (3 paragraphs) about a robot learning to paint.",
          "category": "creative",
          "expected_tokens": 200
        }
      ]
    }

This suite covers multiple dimensions:

- Length variation: short (5 tokens), medium (80), long (250); tests models across output ranges
- Task diversity: factual recall, explanation, reasoning, code, creative; reveals capability breadth
- Token predictability: expected token counts enable throughput calculations

For production applications, create custom suites matching your actual workload:

    {
      "name": "customer-support",
      "description": "Simulates actual customer support queries",
      "prompts": [
        { "id": "product-question", "text": "How do I reset my password for the customer portal?" },
        { "id": "troubleshooting", "text": "I'm getting error code 503 when trying to upload files. What should I do?" },
        { "id": "policy-inquiry", "text": "What is your refund policy for annual subscriptions?" }
      ]
    }

Visualizing Multi-Dimensional Performance Comparisons

Raw numbers alone rarely reveal insights; visualization makes patterns obvious. The frontend implements several comparison views:

Comparison Table shows side-by-side metrics:

    // frontend/src/components/ResultsTable.jsx
    // Table markup is an assumed layout; element and class names are illustrative
    export function ResultsTable({ results }) {
      return (
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>TTFT (ms)</th>
              <th>TPOT (ms)</th>
              <th>Throughput (tok/s)</th>
              <th>P95 Latency</th>
              <th>Error Rate</th>
            </tr>
          </thead>
          <tbody>
            {results.map(result => (
              <tr key={result.model_id}>
                <td>{result.model_id}</td>
                <td>{result.stats.ttft.median.toFixed(0)} (p95: {result.stats.ttft.p95.toFixed(0)})</td>
                <td>{result.stats.tpot?.median.toFixed(1) || 'N/A'}</td>
                <td>{result.stats.throughput.median_tps.toFixed(1)}</td>
                <td>{result.stats.total_latency.p95.toFixed(0)} ms</td>
                <td className={result.stats.error_rate > 0.05 ? 'error' : 'success'}>
                  {(result.stats.error_rate * 100).toFixed(1)}%
                </td>
              </tr>
            ))}
          </tbody>
        </table>
      );
    }

Latency Distribution Chart reveals performance consistency:

    // Using Chart.js for visualization
    // Assumes the react-chartjs-2 wrapper; the original render call is not shown
    import 'chart.js/auto';
    import { Bar } from 'react-chartjs-2';

    export function LatencyChart({ results }) {
      const data = {
        labels: results.map(r => r.model_id),
        datasets: [
          {
            label: 'Median (p50)',
            data: results.map(r => r.stats.total_latency.median),
            backgroundColor: 'rgba(75, 192, 192, 0.5)'
          },
          {
            label: 'p95',
            data: results.map(r => r.stats.total_latency.p95),
            backgroundColor: 'rgba(255, 206, 86, 0.5)'
          },
          {
            label: 'p99',
            data: results.map(r => r.stats.total_latency.p99),
            backgroundColor: 'rgba(255, 99, 132, 0.5)'
          }
        ]
      };

      return (
        <Bar data={data} />
      );
    }

Recommendations Engine synthesizes multi-dimensional comparison:

    export function generateRecommendations(results) {
      const recommendations = [];

      // Find fastest TTFT (best perceived responsiveness)
      const fastestTTFT = results.reduce((best, r) =>
        r.stats.ttft.median < best.stats.ttft.median ? r : best
      );
      recommendations.push({
        category: 'Fastest Response',
        model: fastestTTFT.model_id,
        reason: `Lowest median TTFT: ${fastestTTFT.stats.ttft.median.toFixed(0)}ms`
      });

      // Find highest throughput
      const highestThroughput = results.reduce((best, r) =>
        r.stats.throughput.median_tps > best.stats.throughput.median_tps ? r : best
      );
      recommendations.push({
        category: 'Best Throughput',
        model: highestThroughput.model_id,
        reason: `Highest tok/s: ${highestThroughput.stats.throughput.median_tps.toFixed(1)}`
      });

      // Find most consistent (lowest p95-p50 spread)
      const mostConsistent = results.reduce((best, r) => {
        const spread = r.stats.total_latency.p95 - r.stats.total_latency.median;
        const bestSpread = best.stats.total_latency.p95 - best.stats.total_latency.median;
        return spread < bestSpread ?
          r : best;
      });
      recommendations.push({
        category: 'Most Consistent',
        model: mostConsistent.model_id,
        reason: 'Lowest latency variance (p95-p50 spread)'
      });

      return recommendations;
    }

Key Takeaways and Benchmarking Best Practices

Effective model benchmarking requires scientific methodology, comprehensive metrics, and application-specific testing. FLPerformance demonstrates that rigorous performance measurement is accessible to any development team.

Critical principles for model evaluation:

- Test on target hardware: results from cloud GPUs don't predict laptop performance
- Measure multiple dimensions: TTFT, TPOT, throughput, and consistency all matter
- Use statistical rigor: single runs mislead; capture distributions with adequate sample sizes
- Design realistic workloads: generic benchmarks don't predict your application's behavior
- Include warmup iterations: model loading and JIT compilation affect early measurements
- Control concurrency: real applications handle multiple requests; test at realistic loads
- Document methodology: reproducible results require documented procedures and configurations

The complete benchmarking platform with model management, measurement infrastructure, visualization dashboards, and comprehensive documentation is available at github.com/leestott/FLPerformance. Clone the repository and run the startup script to begin evaluating models on your hardware.

Resources and Further Reading

- FLPerformance Repository - Complete benchmarking platform
- Quick Start Guide - Setup and first benchmark run
- Microsoft Foundry Local Documentation - SDK reference and model catalog
- Architecture Guide - System design and SDK integration
- Benchmarking Best Practices - Methodology and troubleshooting
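To make the "adequate sample sizes" principle concrete, here is a minimal sketch using the same nearest-rank percentile formula as the measurement code. The latency values and the `minSamplesFor` helper are hypothetical and only illustrate the arithmetic: with 10 iterations, p95 and p99 both land on the single slowest run, so they cannot be told apart.

```javascript
// Nearest-rank percentile, same formula as the article's percentile() helper.
function percentile(sortedArr, p) {
  const index = Math.ceil((sortedArr.length * p) / 100) - 1;
  return sortedArr[Math.max(0, index)];
}

// Hypothetical helper: smallest sample size at which the nearest-rank
// percentile p no longer falls on the maximum observation.
function minSamplesFor(p) {
  let n = 1;
  while (Math.ceil((n * p) / 100) >= n) n++;
  return n;
}

// Hypothetical latencies (ms) from a 10-iteration run with one slow outlier.
const tenRuns = [118, 120, 121, 123, 125, 127, 130, 134, 141, 910];

const summary = {
  p50: percentile(tenRuns, 50),
  p95: percentile(tenRuns, 95),
  p99: percentile(tenRuns, 99),
  max: tenRuns[tenRuns.length - 1],
};

console.log(summary);
console.log('samples needed for a distinct p95:', minSamplesFor(95));
console.log('samples needed for a distinct p99:', minSamplesFor(99));
```

Under these assumptions, a suite that reports p99 would need on the order of a hundred iterations per prompt before that figure says anything beyond "the worst run".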
Lee_Stott
Feb 02, 2026 | Microsoft Developer Community Blog
How to Integrate Playwright MCP for AI-Driven Test Automation
Test automation has come a long way, from scripted flows to self-healing and now AI-driven testing. With the introduction of Model Context Protocol (MCP), Playwright can now interact with AI models and external tools to make smarter testing decisions. This guide walks you through integrating MCP with Playwright in VS Code, starting from the basics, enabling you to build smarter, adaptive tests today.

What Is Playwright MCP?

- Playwright: An open-source framework for web testing and automation. It supports multiple browsers (Chromium, Firefox, and WebKit) and offers robust features like auto-wait and screenshot capture, along with tooling such as Codegen and Trace Viewer.
- MCP (Model Context Protocol): A protocol that enables external tools to communicate with AI models or services in a structured, secure way.

By combining Playwright with MCP, you unlock:

- AI-assisted test generation.
- Dynamic test data.
- Smarter debugging and adaptive workflows.

Why Integrate MCP with Playwright?

- AI-powered test generation: Reduce manual scripting.
- Dynamic context awareness: Tests adapt to real-time data.
- Improved debugging: AI can suggest fixes for failing tests.
- Smarter locator selection: AI helps pick stable, reliable selectors to reduce flaky tests.
- Natural language instructions: Write or trigger tests using plain English prompts.
Getting Started in VS Code

Prerequisites

- Node.js
  - Download: nodejs.org
  - Minimum version: v18.0.0 or higher (recommended: latest LTS)
  - Check version: node --version
- Playwright
  - Install Playwright: npm install @playwright/test

Step 1: Create Project Folder

    mkdir playwrightMCP-demo
    cd playwrightMCP-demo

Step 2: Initialize Project

    npm init playwright@latest

Step 3: Install MCP Server for VS Code

Navigate to GitHub - microsoft/playwright-mcp: Playwright MCP server and click the option to install the server for VS Code. Then search for 'MCP: Open user configuration' (type '>mcp' in the search box). This opens the mcp.json file created in your user app data folder, which contains the server details:

    {
      "servers": {
        "playwright": {
          "command": "npx",
          "args": [
            "@playwright/mcp@latest"
          ],
          "type": "stdio"
        }
      },
      "inputs": []
    }

Alternatively, install an MCP server directly from the GitHub MCP server registry using the Extensions view in VS Code.

Verify installation: Open Copilot Chat → select Agent Mode → click Configure Tools → confirm microsoft/playwright-mcp appears in the list.

Step 4: Create a Simple Test Using MCP

Once your project and MCP setup are ready in VS Code, you can create a simple test that demonstrates MCP's capabilities. MCP can help in multiple scenarios; below is an example of test generation using AI.

Scenario: AI-Assisted Test Generation. Use natural language prompts to generate Playwright tests automatically.

Test Scenario - Validate that a user can switch the Playwright documentation language dropdown to Python, search for "Frames," and navigate to the Frames documentation page.
Confirm that the page heading correctly displays "Frames."

Sample Prompt to Use in VS Code (Copilot Agent Mode):

Create a Playwright automated test in JavaScript that verifies navigation to the 'Frames' documentation page by following the steps below, and be specific about locators to avoid strict mode violation errors:

1. Navigate to the Playwright documentation.
2. Select "Python" from the dropdown options labelled "Node.js".
3. Type the keyword "Frames" into the search box.
4. Click the search result for the Frames documentation page.
5. Verify that the page header reads "Frames".
6. Log success or provide a failure message with details.

Copilot will generate the test automatically in your tests folder.

Step 5: Run Test

    npx playwright test

Conclusion

Integrating Playwright with MCP in VS Code helps you build smarter, adaptive tests without adding complexity. Start small, follow best practices, and scale as you grow.

Note - Installation steps may vary depending on your environment. Refer to MCP Registry · GitHub for the latest instructions.
LeenaShaw
Nov 20, 2025 | Microsoft Developer Community Blog
Real‑Time AI Streaming with Azure OpenAI and SignalR
TL;DR

We'll build a real-time AI app where Azure OpenAI streams responses and SignalR broadcasts them live to an Angular client. Users see answers appear incrementally, just like ChatGPT, while Azure SignalR Service handles scale. You'll learn the architecture, streaming code, Angular integration, and optional enhancements like typing indicators and multi-agent scenarios.

Why This Matters

Modern users expect instant feedback. Waiting for a full AI response feels slow and breaks engagement. Streaming responses:

- Reduces perceived latency: Users see content as it's generated.
- Improves UX: Mimics ChatGPT's typing effect.
- Keeps users engaged: Especially for long-form answers.
- Scales for enterprise: Azure SignalR Service handles thousands of concurrent connections.

What you'll build

- A SignalR Hub that calls Azure OpenAI with streaming enabled and forwards partial output to clients as it arrives.
- An Angular client that connects over WebSockets/SSE to the hub and renders partial content with a typing indicator.
- An optional Azure SignalR Service layer for scalable connection management (thousands to millions of long-lived connections).

References: SignalR hosting & scale; Azure SignalR Service concepts.

Architecture

The hub calls Azure OpenAI with streaming enabled (await foreach over updates) and broadcasts partials to clients. Azure SignalR Service (optional) offloads connection scale and removes sticky-session complexity in multi-node deployments.

References: Streaming code pattern; scale/ARR affinity; Azure SignalR integration.
Prerequisites

- Azure OpenAI resource with a deployed model (e.g., gpt-4o or gpt-4o-mini)
- .NET 8 API + ASP.NET Core SignalR backend
- Angular 16+ frontend (using @microsoft/signalr)

Step-by-Step Implementation

1) Backend: ASP.NET Core + SignalR

Install packages

    # On .NET 8 the SignalR server library ships with the ASP.NET Core shared framework,
    # so this first package reference is typically unnecessary
    dotnet add package Microsoft.AspNetCore.SignalR
    dotnet add package Azure.AI.OpenAI --prerelease
    dotnet add package Azure.Identity
    dotnet add package Microsoft.Extensions.AI
    dotnet add package Microsoft.Extensions.AI.OpenAI --prerelease
    # Optional (managed scale): Azure SignalR Service
    dotnet add package Microsoft.Azure.SignalR

Using DefaultAzureCredential (Entra ID) avoids storing raw keys in code and is the recommended auth model for Azure services.

Program.cs

    var builder = WebApplication.CreateBuilder(args);

    builder.Services.AddSignalR();
    // To offload connection management to Azure SignalR Service, uncomment:
    // builder.Services.AddSignalR().AddAzureSignalR();

    builder.Services.AddSingleton<AiStreamingService>();

    var app = builder.Build();
    app.MapHub<ChatHub>("/chat");
    app.Run();

AiStreamingService.cs - streams content from Azure OpenAI

    using Microsoft.Extensions.AI;
    using Azure.AI.OpenAI;
    using Azure.Identity;

    public class AiStreamingService
    {
        private readonly IChatClient _chatClient;

        public AiStreamingService(IConfiguration config)
        {
            var endpoint = new Uri(config["AZURE_OPENAI_ENDPOINT"]!);
            var deployment = config["AZURE_OPENAI_DEPLOYMENT"]!; // e.g., "gpt-4o-mini"
            var azureClient = new AzureOpenAIClient(endpoint, new DefaultAzureCredential());
            _chatClient = azureClient.GetChatClient(deployment).AsIChatClient();
        }

        public async IAsyncEnumerable<string> StreamReplyAsync(string userMessage)
        {
            // Microsoft.Extensions.AI message types (role + text)
            var messages = new List<ChatMessage>
            {
                new(ChatRole.System, "You are a helpful assistant."),
                new(ChatRole.User, userMessage)
            };

            await foreach (var update in _chatClient.GetStreamingResponseAsync(messages))
            {
                // update.Text contains only the text parts; tool calls/annotations are ignored
                var chunk = update.Text;
                if (!string.IsNullOrEmpty(chunk))
                    yield return chunk;
            }
        }
    }

Modern .NET AI extensions (Microsoft.Extensions.AI) expose a unified streaming pattern via GetStreamingResponseAsync on IChatClient; CompleteChatStreamingAsync is the equivalent method on the OpenAI SDK's own ChatClient.

ChatHub.cs - pushes partials to the caller

    using Microsoft.AspNetCore.SignalR;

    public class ChatHub : Hub
    {
        private readonly AiStreamingService _ai;
        public ChatHub(AiStreamingService ai) => _ai = ai;

        // Client calls: connection.invoke("AskAi", prompt)
        public async Task AskAi(string prompt)
        {
            var messageId = Guid.NewGuid().ToString("N");
            await Clients.Caller.SendAsync("typing", messageId, true);

            try
            {
                await foreach (var partial in _ai.StreamReplyAsync(prompt))
                {
                    await Clients.Caller.SendAsync("partial", messageId, partial);
                }
            }
            finally
            {
                // Always clear the typing indicator, even if streaming fails
                await Clients.Caller.SendAsync("typing", messageId, false);
            }

            await Clients.Caller.SendAsync("completed", messageId);
        }
    }

2) Frontend: Angular client with @microsoft/signalr

Install the SignalR client

    npm i @microsoft/signalr

Create a SignalR service (Angular)

    // src/app/services/ai-stream.service.ts
    import { Injectable } from '@angular/core';
    import * as signalR from '@microsoft/signalr';
    import { BehaviorSubject, Observable } from 'rxjs';

    @Injectable({ providedIn: 'root' })
    export class AiStreamService {
      private connection?: signalR.HubConnection;
      private typing$ = new BehaviorSubject<boolean>(false);
      private partial$ = new BehaviorSubject<string>('');
      private completed$ = new BehaviorSubject<boolean>(false);

      get typing(): Observable<boolean> { return this.typing$.asObservable(); }
      get partial(): Observable<string> { return this.partial$.asObservable(); }
      get completed(): Observable<boolean> { return this.completed$.asObservable(); }

      async start(): Promise<void> {
        this.connection = new signalR.HubConnectionBuilder()
          .withUrl('/chat') // same origin; use absolute URL if CORS
          .withAutomaticReconnect()
          .configureLogging(signalR.LogLevel.Information)
          .build();

        this.connection.on('typing',
          (_id: string, on: boolean) => this.typing$.next(on));

        this.connection.on('partial', (_id: string, text: string) => {
          // Append incremental content
          this.partial$.next((this.partial$.value || '') + text);
        });

        this.connection.on('completed', (_id: string) => this.completed$.next(true));

        await this.connection.start();
      }

      async ask(prompt: string): Promise<void> {
        // Reset state per request
        this.partial$.next('');
        this.completed$.next(false);
        await this.connection?.invoke('AskAi', prompt);
      }
    }

Angular component

    // src/app/components/ai-chat/ai-chat.component.ts
    import { Component, OnInit } from '@angular/core';
    import { AiStreamService } from '../../services/ai-stream.service';

    @Component({
      selector: 'app-ai-chat',
      templateUrl: './ai-chat.component.html',
      styleUrls: ['./ai-chat.component.css']
    })
    export class AiChatComponent implements OnInit {
      prompt = '';
      output = '';
      typing = false;
      done = false;

      constructor(private ai: AiStreamService) {}

      async ngOnInit() {
        await this.ai.start();
        this.ai.typing.subscribe(on => this.typing = on);
        this.ai.partial.subscribe(text => this.output = text);
        this.ai.completed.subscribe(done => this.done = done);
      }

      async send() {
        this.output = '';
        this.done = false;
        await this.ai.ask(this.prompt);
      }
    }

HTML Template

    <!-- [(ngModel)] requires FormsModule to be imported by the owning module -->
    <div class="chat">
      <div class="prompt">
        <input [(ngModel)]="prompt" placeholder="Ask me anything…" />
        <button (click)="send()">Send</button>
      </div>
      <div class="response">
        <pre>{{ output }}</pre>
        <div class="typing" *ngIf="typing">Assistant is typing…</div>
        <div class="done" *ngIf="done">✓ Completed</div>
      </div>
    </div>

Streaming modes, content filters, and UX

Azure OpenAI streaming interacts with content filtering in two ways:

- Default streaming: The service buffers output into content chunks and runs content filters before each chunk is emitted; you still stream, but not necessarily token-by-token.
- Asynchronous Filter (optional): The service returns token-level updates immediately and runs filters asynchronously.
  You get ultra-smooth streaming but must handle delayed moderation signals (e.g., redaction or halting the stream).

Best practices

- Append partials in small batches client-side to avoid DOM thrash; finalize formatting on "completed".
- Log full messages server-side only after completion to keep histories consistent (mirrors agent frameworks).

Security & compliance

- Auth: Prefer Microsoft Entra ID (DefaultAzureCredential) to avoid key sprawl; use RBAC and Managed Identities where possible.
- Secrets: Store Azure SignalR connection strings in Key Vault and rotate periodically; never hardcode.
- CORS & cross-domain: When hosting frontend and hub on different origins, configure CORS and use absolute URLs in withUrl(...).

Connection management & scaling tips

- Persistent connection load: SignalR consumes TCP resources; separate heavy real-time workloads or use Azure SignalR to protect other apps.
- Sticky sessions (self-hosted): Required in most multi-server scenarios unless WebSockets-only + SkipNegotiation applies; Azure SignalR removes this requirement.

Learn more

- AI-Powered Group Chat sample (ASP.NET Core)
- Azure OpenAI .NET client (auth & streaming)
- SignalR JavaScript Client
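The "append partials in small batches" practice above can be sketched in a few lines of plain JavaScript. `PartialBatcher` is a hypothetical helper, not part of SignalR or Angular; it only shows the buffering idea, with a counter standing in for the actual DOM update.

```javascript
// Hypothetical helper: collects streamed text chunks and hands them to
// onFlush in batches, so the UI updates once per batch instead of once per chunk.
class PartialBatcher {
  constructor(onFlush, maxChunks = 8) {
    this.onFlush = onFlush;
    this.maxChunks = maxChunks;
    this.pending = [];
  }

  // Call from the 'partial' handler.
  push(text) {
    this.pending.push(text);
    if (this.pending.length >= this.maxChunks) this.flush();
  }

  // Call from the 'completed' handler (or a timer) to emit whatever is left.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending.join('');
    this.pending = [];
    this.onFlush(batch);
  }
}

// Usage with a stand-in for the DOM update.
let rendered = '';
let renderCount = 0;
const batcher = new PartialBatcher(text => {
  rendered += text;
  renderCount++;
}, 2);

for (const chunk of ['Str', 'eam', 'ing ', 'wor', 'ks']) batcher.push(chunk);
batcher.flush(); // what the 'completed' handler would do

console.log(rendered, renderCount);
```

A count-based threshold keeps the sketch deterministic; in a real client you would probably also flush on a short timer (or an animation frame) so that a slow stream does not leave text waiting in the buffer.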
pranav_pratik
Nov 12, 2025 | Microsoft Developer Community Blog
Serverless MCP Agent with LangChain.js v1 — Burgers, Tools, and Traces 🍔
AI agents that can actually do stuff (not just chat) are the fun part nowadays, but wiring them cleanly into real APIs, keeping things observable, and shipping them to the cloud can get... messy. So we built a fresh end-to-end sample to show how to do it right with the brand new LangChain.js v1 and Model Context Protocol (MCP). In case you missed it, MCP is a recent open standard that makes it easy for LLM agents to consume tools and APIs, and LangChain.js, a great framework for building GenAI apps and agents, has first-class support for it. You can quickly get up to speed with the MCP for Beginners course and AI Agents for Beginners course.

This new sample gives you:

- A LangChain.js v1 agent that streams its result, along with reasoning + tool steps
- An MCP server exposing real tools (burger menu + ordering) from a business API
- A web interface with authentication, session history, and a debug panel (for developers)
- A production-ready multi-service architecture
- Serverless deployment on Azure in one command (azd up)

Yes, it's a burger ordering system. Who doesn't like burgers? Grab your favorite beverage ☕, and let's dive in for a quick tour!

TL;DR key takeaways

- New sample: full-stack Node.js AI agent using LangChain.js v1 + MCP tools
- Architecture: web app → agent API → MCP server → burger API
- Runs locally with a single npm start, deploys with azd up
- Uses streaming (NDJSON) with intermediate tool + LLM steps surfaced to the UI
- Ready to fork, extend, and plug into your own domain / tools

What will you learn here?
- What this sample is about and its high-level architecture
- What LangChain.js v1 brings to the table for agents
- How to deploy and run the sample
- How MCP tools can expose real-world APIs

Reference links for everything we use

- GitHub repo
- LangChain.js docs
- Model Context Protocol
- Azure Developer CLI
- MCP Inspector

Use case

You want an AI assistant that can take a natural language request like "Order two spicy burgers and show me my pending orders" and:

- Understand intent (query menu, then place order)
- Call the right MCP tools in sequence, calling in turn the necessary APIs
- Stream progress (LLM tokens + tool steps)
- Return a clean final answer

Swap "burgers" for "inventory", "bookings", "support tickets", or "IoT devices" and you've got a reusable pattern!

Sample overview

Before we play a bit with the sample, let's have a look at the main services implemented here:

    Service | Role | Tech
    Agent Web App (agent-webapp) | Chat UI + streaming + session history | Azure Static Web Apps, Lit web components
    Agent API (agent-api) | LangChain.js v1 agent orchestration + auth + history | Azure Functions, Node.js
    Burger MCP Server (burger-mcp) | Exposes burger API as tools over MCP (Streamable HTTP + SSE) | Azure Functions, Express, MCP SDK
    Burger API (burger-api) | Business logic: burgers, toppings, orders lifecycle | Azure Functions, Cosmos DB

Here's a simplified view of how they interact: web app → agent API → MCP server → burger API.

There are also other supporting components like databases and storage not shown here for clarity. For this quickstart we'll only interact with the Agent Web App and the Burger MCP Server, as they are the main stars of the show here.

LangChain.js v1 agent features

The recent release of LangChain.js v1 is a huge milestone for the JavaScript AI community! It marks a significant shift from experimental tools to a production-ready framework. The new version doubles down on what's needed to build robust AI applications, with a strong focus on agents.
This includes first-class support for streaming not just the final output, but also intermediate steps like tool calls and agent reasoning. This makes building transparent and interactive agent experiences (like the one in this sample) much more straightforward.

Quickstart

Requirements

- GitHub account
- Azure account (free signup, or if you're a student, get free credits here)
- Azure Developer CLI

Deploy and run the sample

We'll use GitHub Codespaces for a quick zero-install setup here, but if you prefer to run it locally, check the README.

Click on the following link or open it in a new tab to launch a Codespace: Create Codespace

This will open a VS Code environment in your browser with the repo already cloned and all the tools installed and ready to go.

Provision and deploy to Azure

Open a terminal and run these commands:

    # Install dependencies
    npm install

    # Login to Azure
    azd auth login

    # Provision and deploy all resources
    azd up

Follow the prompts to select your Azure subscription and region. If you're unsure of which one to pick, choose East US 2. The first deployment takes about 15 minutes while it creates all the necessary resources (Functions, Static Web Apps, Cosmos DB, AI Models).

If you're curious about what happens under the hood, you can take a look at the main.bicep file in the infra folder, which defines the infrastructure as code for this sample.

Test the MCP server

While the deployment is running, you can run the MCP server and API locally (even in Codespaces) to see how it works. Open another terminal and run:

    npm start

This will start all services locally, including the Burger API and the MCP server, which will be available at http://localhost:3000/mcp.
This may take a few seconds; wait until you see this message in the terminal:

    🚀 All services ready 🚀

When these services are running without Azure resources provisioned, they will use in-memory data instead of Cosmos DB so you can experiment freely with the API and MCP server, though the agent won't be functional as it requires an LLM resource.

MCP tools

The MCP server exposes the following tools, which the agent can use to interact with the burger ordering system:

    Tool Name | Description
    get_burgers | Get a list of all burgers in the menu
    get_burger_by_id | Get a specific burger by its ID
    get_toppings | Get a list of all toppings in the menu
    get_topping_by_id | Get a specific topping by its ID
    get_topping_categories | Get a list of all topping categories
    get_orders | Get a list of all orders in the system
    get_order_by_id | Get a specific order by its ID
    place_order | Place a new order with burgers (requires userId, optional nickname)
    delete_order_by_id | Cancel an order if it has not yet been started (status must be pending, requires userId)

You can test these tools using the MCP Inspector. Open another terminal and run:

    npx -y @modelcontextprotocol/inspector

Then open the URL printed in the terminal in your browser and connect using these settings:

- Transport: Streamable HTTP
- URL: http://localhost:3000/mcp
- Connection Type: Via Proxy (should be default)

Click on Connect, then try listing the tools first, and run the get_burgers tool to get the menu info.

Test the Agent Web App

After the deployment is completed, you can run the command npm run env to print the URLs of the deployed services. Open the Agent Web App URL in your browser (it should look like https://<your-web-app>.azurestaticapps.net).

You'll first be greeted by an authentication page. Sign in with either your GitHub or Microsoft account, and you should then be able to access the chat interface.
From there, you can start asking any question or use one of the suggested prompts. For example, try asking: Recommend me an extra spicy burger. As the agent processes your request, you'll see the response streaming in real time, along with the intermediate steps and tool calls. Once the response is complete, you can also unfold the debug panel to see the full reasoning chain and the tools that were invoked.

Tip: Our agent service also sends detailed tracing data using OpenTelemetry. You can explore these traces either in Azure Monitor for the deployed service, or locally using an OpenTelemetry collector. We'll cover this in more detail in a future post.

Wrap it up

Congratulations, you just finished spinning up a full-stack serverless AI agent using LangChain.js v1, MCP tools, and Azure’s serverless platform. Now it's your turn to dive into the code and extend it for your use cases! 😎 And don't forget to run azd down once you're done to avoid any unwanted costs.

Going further

This was just a quick introduction to this sample, and you can expect more in-depth posts and tutorials soon. Since we're in the era of AI agents, we've also made sure that this sample can be explored and extended easily with code agents like GitHub Copilot. We even built a custom chat mode to help you discover and understand the codebase faster! Check out the Copilot setup guide in the repo to get started.

You can quickly get up to speed with the MCP for Beginners course and AI Agents for Beginners course. If you like this sample, don't forget to star the repo ⭐️! You can also join us in the Azure AI community Discord to chat and ask any questions. Happy coding and burger ordering! 🍔
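As a companion to the MCP Inspector walkthrough above, here is a minimal sketch of the JSON-RPC 2.0 messages an MCP client sends to list and call tools. The methods tools/list and tools/call come from the MCP specification, and the tool names come from the tools table. The helper function names are hypothetical, and the `id` argument passed to get_burger_by_id is an assumption, so check the schema returned by tools/list before relying on it.

```javascript
// Hypothetical helpers: build the JSON-RPC 2.0 messages an MCP client sends.
// Only the tool names come from the sample; everything else is illustrative.

function buildRequest(id, method, params = {}) {
  return { jsonrpc: "2.0", id, method, params };
}

// Ask the server which tools it exposes.
function listToolsRequest(id) {
  return buildRequest(id, "tools/list");
}

// Invoke one tool by name with its arguments.
function callToolRequest(id, name, args = {}) {
  return buildRequest(id, "tools/call", { name, arguments: args });
}

const messages = [
  listToolsRequest(1),
  callToolRequest(2, "get_burgers"),
  // Assumption: the argument is named `id`; the real schema is in the tools/list response.
  callToolRequest(3, "get_burger_by_id", { id: "1" }),
];

// Print each message as it would appear on the wire.
for (const message of messages) {
  console.log(JSON.stringify(message));
}
```

These messages are not enough on their own: a real client first performs the MCP initialize handshake over the Streamable HTTP transport, which is what the Inspector does for you when you click Connect.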
sinedied
Oct 27, 2025 Place Microsoft Developer Community Blog
AI Career Navigator — Empowering Job Seekers with Azure OpenAI
AI Career Navigator is more than just a project — it’s a mission to make career growth accessible, intelligent, and human. Powered by Azure OpenAI, it transforms uncertainty into direction and effort into achievement. Author: Aryan Jaiswal — Gold Microsoft Learn Student Ambassador Reviewer: Julia Muiruri (Microsoft)
aryanjstar
Oct 09, 2025 Place Educator Developer Blog