learning
108 TopicsOn-Premises Manufacturing Intelligence
Manufacturing facilities face a fundamental dilemma in the AI era: how to harness artificial intelligence for predictive maintenance, equipment diagnostics, and operational insights while keeping sensitive production data entirely on-premises. Industrial environments generate proprietary information, CNC machining parameters, quality control thresholds, equipment performance signatures, maintenance histories, that represents competitive advantage accumulated over decades of process optimization. Sending this data to cloud APIs risks intellectual property exposure, regulatory non-compliance, and operational dependencies that manufacturing operations cannot accept. Traditional cloud-based AI introduces unacceptable vulnerabilities. Network latency of 100-500ms makes real-time decision support impossible for time-sensitive manufacturing processes. Internet dependency creates single points of failure in environments where connectivity is unreliable or deliberately restricted for security. API pricing models become prohibitively expensive when analyzing thousands of sensor readings and maintenance logs continuously. Most critically, data residency requirements for aerospace, defense, pharmaceutical, and automotive industries make cloud AI architectures non-compliant by design ITAR, FDA 21 CFR Part 11, and customer-specific mandates require data never leaves facility boundaries. This article demonstrates a sample solution for manufacturing asset intelligence that runs entirely on-premises using Microsoft Foundry Local, Node.js, and JavaScript. The FoundryLocal-IndJSsample repository provides production-ready implementation with Express backend, HTML/JavaScript frontend, and comprehensive Foundry Local SDK integration. Facilities can deploy sophisticated AI-powered monitoring without external dependencies, cloud costs, data exposure risks, or network requirements. Every inference happens locally on facility hardware with predictable performance and zero data egress. Why On-Premises AI Matters for Industrial Operations The case for local AI inference in manufacturing extends beyond simple preference, it addresses fundamental operational, security, and compliance requirements that cloud solutions cannot satisfy. Understanding these constraints shapes architectural decisions that prioritize reliability, data sovereignty, and cost predictability. Data Sovereignty and Intellectual Property Protection Manufacturing processes represent years of proprietary research, optimization, and competitive advantage. Equipment configurations, cycle times, quality thresholds, and maintenance schedules contain intelligence that competitors would value highly. Sending this data to third-party cloud services, even with contractual protections, introduces risks that manufacturing operations cannot accept. On-premises AI ensures that production data never leaves the facility network perimeter. Telemetry from CNC machines, hydraulic systems, conveyor networks, and control systems remains within air-gapped environments where physical access controls and network isolation provide demonstrable data protection. This architectural guarantee of data locality satisfies both internal security policies and external audit requirements without relying on contractual assurances or encryption alone. Operational Resilience and Network Independence Factory floors frequently operate in environments with limited, unreliable, or intentionally restricted internet connectivity. Remote facilities, secure manufacturing zones, and legacy industrial networks cannot depend on continuous cloud access for critical monitoring functions. When network failures occur, whether from ISP outages, DDoS attacks, or infrastructure damage, AI capabilities must continue operating to prevent production losses. Local inference provides true operational independence. Equipment health monitoring, anomaly detection, and maintenance prioritization continue functioning during network disruptions. This resilience is essential for 24/7 manufacturing operations where downtime costs can exceed tens of thousands of dollars per hour. By eliminating external dependencies, on-premises AI becomes as reliable as the local power supply and computing infrastructure. Latency Requirements for Real-Time Decision Making Manufacturing processes involve precise timing where milliseconds determine quality outcomes. Automated inspection systems must classify defects before products leave the production line. Safety interlocks must respond to hazardous conditions before injuries occur. Predictive maintenance alerts must trigger before catastrophic equipment failures cascade through production lines. Cloud-based AI introduces latency that incompatible with these requirements. Network round-trips to cloud endpoints typically require 100-500 milliseconds, in some case latency is unacceptable for real-time applications. Local inference with Foundry Local delivers sub-50ms response times by eliminating network hops, enabling true real-time AI integration with SCADA systems, PLCs, and manufacturing execution systems. Cost Predictability at Industrial Scale Manufacturing facilities generate enormous volumes of time-series data from thousands of sensors, producing millions of data points daily. Cloud AI services charge per API call or per token processed, creating unpredictable costs that scale linearly with data volume. High-throughput industrial applications can quickly accumulate tens of thousands of dollars in monthly API fees. On-premises AI transforms this variable operational expense into fixed capital infrastructure costs. After initial hardware investment, inference costs remain constant regardless of query volume. For facilities analyzing equipment telemetry, maintenance logs, and operator notes continuously, this economic model provides cost certainty and eliminates budget surprises. Regulatory Compliance and Audit Requirements Regulated industries face strict data handling requirements. Aerospace manufacturers must comply with ITAR controls on technical data. Pharmaceutical facilities must satisfy FDA 21 CFR Part 11 requirements for electronic records. Automotive suppliers must meet customer-specific data residency mandates. Cloud AI services complicate compliance by introducing third-party data processors, cross-border data transfers, and shared infrastructure concerns. Local AI simplifies regulatory compliance by eliminating external data flows. Audit trails remain within the facility. Data handling procedures avoid third-party agreements. Compliance demonstrations become straightforward when AI infrastructure resides entirely within auditable physical and network boundaries. Architecture: Manufacturing Intelligence Without Cloud Dependencies The manufacturing asset intelligence system demonstrates a practical architecture for deploying AI capabilities entirely on-premises. The design prioritizes operational reliability, straightforward integration patterns, and maintainable code structure that facilities can adapt to their specific requirements. System Components and Technology Stack The implementation consists of three primary layers that separate concerns and enable independent scaling: Foundry Local Layer: Provides the local AI inference runtime. Foundry Local manages model loading, execution, and resource allocation. It supports multiple model families (Phi-3.5, Phi-4, Qwen2.5) with automatic hardware acceleration detection for NVIDIA GPUs (CUDA), Intel GPUs (OpenVINO), ARM Qualcomm (QNN) and optimized CPU inference. The service exposes a REST API on localhost that the backend layer consumes for completions. Backend Service Layer: An Express Node.js application that serves as the integration point between the AI runtime and the manufacturing data systems. This layer implements business logic for equipment monitoring, maintenance log classification, and conversational interfaces. It formats prompts with equipment context, calls Foundry Local for inference, and structures responses for the frontend. The backend persists chat history and provides RESTful endpoints for all AI operations. Frontend Interface Layer: A standalone HTML/JavaScript application that provides operator interfaces for equipment monitoring, maintenance management, and AI assistant interactions. The UI fetches data from the backend service and renders dashboards, equipment status views, and chat interfaces. No framework dependencies or build steps are required, the frontend operates as static files that any web server or file system can serve. Data Flow for Equipment Analysis Understanding how data moves through the system clarifies integration points and extension opportunities. When an operator requests AI analysis of equipment status, the following sequence occurs: The frontend collects equipment context including asset ID, current telemetry values, alert status, and recent maintenance history. It constructs an HTTP request to the backend's equipment summary endpoint, passing this context as query parameters or request body. The backend retrieves additional context from the equipment database, including specifications, normal operating ranges, and historical performance patterns. The backend constructs a detailed prompt that provides the AI model with comprehensive context: equipment specifications, current telemetry with alarming conditions highlighted, recent maintenance notes, and specific questions about operational status. This prompt engineering is critical, the model's accuracy depends entirely on the context provided. Generic prompts produce generic responses; detailed, structured prompts yield actionable insights. The backend calls Foundry Local's completion API with the formatted prompt, specifying temperature, max tokens, and other generation parameters. Foundry Local loads the configured model (if not already in memory) and generates a response analyzing the equipment's condition. The inference occurs locally with no network traffic leaving the facility. Response time typically ranges from 500ms to 3 seconds depending on prompt complexity and model size. Foundry Local returns the generated text to the backend, which parses the response for structured information if required (equipment health classifications, priority levels, recommended actions). The backend formats this analysis as JSON and returns it to the frontend. The frontend renders the AI-generated summary in the equipment health dashboard, highlighting critical findings and recommended operator actions. Prompt Engineering for Maintenance Log Classification The maintenance log classification feature demonstrates effective prompt engineering for extracting structured decisions from language models. Manufacturing facilities accumulate thousands of maintenance notes, operator observations, technician reports, and automated system logs. Automatically classifying these entries by severity enables priority-based work scheduling without manual review of every log entry. The classification prompt provides the model with clear instructions, classification categories with definitions, and the maintenance note text to analyze: const classificationPrompt = `You are a manufacturing maintenance expert analyzing equipment log entries. Classify the following maintenance note into one of these categories: CRITICAL: Immediate safety hazard, equipment failure, or production stoppage HIGH: Degraded performance, abnormal readings requiring same-shift attention MEDIUM: Scheduled maintenance items or routine inspections LOW: Informational notes, normal operations logs Provide your response in JSON format: { "classification": "CRITICAL|HIGH|MEDIUM|LOW", "reasoning": "Brief explanation of classification decision", "recommended_action": "Specific next steps for maintenance team" } Maintenance Note: ${maintenanceNote} Classification:`; const response = await foundryClient.chat.completions.create({ model: currentModelAlias, messages: [{ role: 'user', content: classificationPrompt }], temperature: 0.1, // Low temperature for consistent classification max_tokens: 300 }); Key aspects of this prompt design: Role definition: Establishing the model as a "manufacturing maintenance expert" activates relevant knowledge and reasoning patterns in the model's training data. Clear categories: Explicit classification options with definitions prevent ambiguous outputs and enable consistent decision-making across thousands of logs. Structured output format: Requesting JSON responses with specific fields enables automated parsing and integration with maintenance management systems without fragile text parsing. Temperature control: Setting temperature to 0.1 reduces randomness in classifications, ensuring consistent severity assessments for similar maintenance conditions. Context isolation: Separating the maintenance note text from the instructions with clear delimiters prevents prompt injection attacks where malicious log entries might attempt to manipulate classification logic. This classification runs locally for every maintenance log entry without API costs or network delays. Facilities processing hundreds of maintenance notes daily benefit from immediate, consistent classification that routes critical issues to technicians automatically while filtering routine informational logs. Model Selection and Performance Trade-offs Foundry Local supports multiple model families with different memory requirements, inference speeds, and accuracy characteristics. Choosing appropriate models for manufacturing environments requires balancing these trade-offs against hardware constraints and operational requirements: Qwen2.5-0.5b (500MB memory): The smallest available model provides extremely fast inference (100-200ms responses) on limited hardware. Suitable for simple classification tasks, keyword extraction, and high-throughput scenarios where response speed matters more than nuanced understanding. Works well on older servers or edge devices with constrained resources. Phi-3.5-mini (2.1GB memory): The recommended default model balances accuracy with reasonable memory requirements. Provides strong reasoning capabilities for equipment analysis, maintenance prioritization, and conversational assistance. Response times of 1-3 seconds on modern CPUs are acceptable for interactive dashboards. This model handles complex prompts with detailed equipment context effectively. Phi-4-mini (3.6GB memory): Increased model capacity improves understanding of technical terminology and complex equipment relationships. Best choice when analyzing detailed maintenance histories, interpreting sensor correlation patterns, or providing nuanced operational recommendations. Requires more memory but delivers noticeably improved analysis quality for complex scenarios. Qwen2.5-7b (4.7GB memory): The largest supported model provides maximum accuracy and sophisticated reasoning. Ideal for facilities with modern server hardware where best-possible analysis quality justifies longer inference times (3-5 seconds). Consider this model for critical applications where operator decisions depend heavily on AI recommendations. Facilities can download all models during initial setup and switch between them based on specific use cases. Use faster models for real-time dashboard updates and automated classification. Deploy larger models for detailed equipment analysis and maintenance planning where operators can wait several seconds for comprehensive insights. Implementation: Equipment Monitoring and AI Analysis The practical implementation reveals how straightforward on-premises AI integration can be with modern JavaScript tooling and proper architectural separation. The backend service manages all AI interactions, shielding the frontend from inference complexity and providing clean REST interfaces. Backend Service Architecture with Express The Node.js backend initializes the Foundry Local SDK client and exposes endpoints for equipment operations: const express = require('express'); const { FoundryLocalClient } = require('foundry-local-sdk'); const cors = require('cors'); const app = express(); const PORT = process.env.PORT || 3000; // Initialize Foundry Local client const foundryClient = new FoundryLocalClient({ baseURL: 'http://localhost:8008', // Default Foundry Local endpoint timeout: 30000 }); // Middleware configuration app.use(cors()); // Enable cross-origin requests from frontend app.use(express.json()); // Parse JSON request bodies // Health check endpoint for monitoring app.get('/api/health', (req, res) => { res.json({ ok: true, service: 'manufacturing-ai-backend' }); }); // Start server app.listen(PORT, () => { console.log(`Manufacturing AI backend running on port ${PORT}`); console.log(`Foundry Local endpoint: http://localhost:8008`); }); This foundational structure establishes the Express application with CORS support for browser-based frontends and JSON request handling. The Foundry Local client connects to the local inference service running on port 8008, no external network configuration required. Equipment Summary Generation with Context-Rich Prompts The equipment summary endpoint demonstrates effective context injection for accurate AI analysis: app.get('/api/assets/:id/summary', async (req, res) => { try { const assetId = req.params.id; const asset = equipmentDatabase.find(a => a.id === assetId); if (!asset) { return res.status(404).json({ error: 'Asset not found' }); } // Construct detailed equipment context const contextPrompt = buildEquipmentContext(asset); // Generate AI analysis const completion = await foundryClient.chat.completions.create({ model: 'phi-3.5-mini', messages: [{ role: 'user', content: contextPrompt }], temperature: 0.3, max_tokens: 500 }); const analysis = completion.choices[0].message.content; res.json({ assetId: asset.id, assetName: asset.name, analysis: analysis, generatedAt: new Date().toISOString() }); } catch (error) { console.error('Equipment summary error:', error); res.status(500).json({ error: 'AI analysis failed', details: error.message }); } }); The equipment context builder assembles comprehensive information for accurate analysis: function buildEquipmentContext(asset) { const alerts = asset.alerts.filter(a => a.severity !== 'INFO'); const telemetry = asset.currentTelemetry; return `Analyze the following manufacturing equipment status: Equipment: ${asset.name} (${asset.id}) Type: ${asset.type} Location: ${asset.location} Current Telemetry: - Temperature: ${telemetry.temperature}°C (Normal range: ${asset.specs.tempRange}) - Vibration: ${telemetry.vibration} mm/s (Threshold: ${asset.specs.vibrationThreshold}) - Pressure: ${telemetry.pressure} PSI (Normal: ${asset.specs.pressureRange}) - Runtime: ${telemetry.runHours} hours (Next maintenance due: ${asset.nextMaintenance}) Active Alerts: ${alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Recent Maintenance History: ${asset.recentMaintenance.slice(0, 3).map(m => `- ${m.date}: ${m.description}`).join('\n')} Provide a concise operational summary focusing on: 1. Current equipment health status 2. Any concerning trends or anomalies 3. Recommended operator actions if applicable 4. Maintenance priority level Summary:`; } This context-rich approach produces accurate, actionable analysis because the model receives equipment specifications, current telemetry with context, alert history, maintenance patterns, and structured output guidance. The model can identify abnormal conditions accurately rather than guessing what values seem unusual. Conversational AI Assistant with Manufacturing Context The chat endpoint enables natural language queries about equipment status and operational questions: app.post('/api/chat', async (req, res) => { try { const { message, conversationId } = req.body; // Retrieve conversation history for context const history = conversationStore.get(conversationId) || []; // Build plant-wide context for the query const plantContext = buildPlantOperationsContext(); // Construct system message with domain knowledge const systemMessage = { role: 'system', content: `You are an AI assistant for a manufacturing facility's operations team. You have access to real-time equipment data and maintenance records. Current Plant Status: ${plantContext} Provide specific, actionable responses based on actual equipment data. If you don't have information to answer a query, clearly state that. Never speculate about equipment conditions beyond available data.` }; // Include conversation history for multi-turn context const messages = [ systemMessage, ...history, { role: 'user', content: message } ]; const completion = await foundryClient.chat.completions.create({ model: 'phi-3.5-mini', messages: messages, temperature: 0.4, max_tokens: 600 }); const assistantResponse = completion.choices[0].message.content; // Update conversation history history.push( { role: 'user', content: message }, { role: 'assistant', content: assistantResponse } ); conversationStore.set(conversationId, history); res.json({ response: assistantResponse, conversationId: conversationId, timestamp: new Date().toISOString() }); } catch (error) { console.error('Chat error:', error); res.status(500).json({ error: 'Chat request failed', details: error.message }); } }); The conversational interface enables operators to ask natural language questions and receive grounded responses based on actual equipment data, citing specific asset IDs, current metric values, and alert statuses rather than speculating. Deployment and Production Operations Deploying on-premises AI in industrial settings requires consideration of hardware placement, network architecture, integration patterns, and operational procedures that differ from typical web application deployments. Hardware and Infrastructure Requirements The system runs on standard server hardware without specialized AI accelerators, though GPU availability improves performance significantly. Minimum requirements include 8GB RAM for the Phi-3.5-mini model, 4-core CPU, and 50GB storage for model files and application data. Production deployments benefit from 16GB+ RAM to support larger models and concurrent analysis requests. For facilities with NVIDIA GPUs, Foundry Local automatically utilizes CUDA acceleration, reducing inference times by 3-5x compared to CPU-only execution. Deploy the backend service on dedicated server hardware within the factory network. Avoid running AI workloads on the same systems that host critical SCADA or MES applications due to resource contention concerns. Network Architecture and SCADA Integration The AI backend should reside on the manufacturing operations network with firewall rules permitting connections from operator workstations and monitoring systems. Do not expose the backend service directly to the internet, all access should occur through the facility's internal network with authentication via existing directory services. Integrate with SCADA systems through standard industrial protocols. Configure OPC-UA clients to subscribe to equipment telemetry topics and forward readings to the AI backend via REST API calls. Modbus TCP gateways can bridge legacy PLCs to modern APIs by polling register values and POSTing updates to the backend's telemetry ingestion endpoints. Security and Compliance Considerations Many manufacturing facilities operate air-gapped networks where physical separation prevents internet connectivity entirely. Deploy Foundry Local and the AI application in these environments by transferring model files and application packages via removable media during controlled maintenance windows. Implement role-based access control (RBAC) using Active Directory integration. Configure the backend to validate user credentials against LDAP before serving AI analysis requests. Maintain detailed audit logs of all AI invocations including user identity, timestamp, equipment queried, and model version used. Store these logs in immutable append-only databases for compliance audits. Key Takeaways Building production-ready AI systems for industrial environments requires architectural decisions that prioritize operational reliability, data sovereignty, and integration simplicity: Data locality by architectural design: On-premises AI ensures proprietary production data never leaves facility networks through fundamental architectural guarantees rather than configuration options Model selection impacts deployment feasibility: Smaller models (0.5B-2B parameters) enable deployment on commodity hardware without specialized accelerators while maintaining acceptable accuracy Fallback logic preserves operational continuity: AI capabilities enhance but don't replace core monitoring functions, ensuring equipment dashboards display raw telemetry even when AI analysis is unavailable Context-rich prompts determine accuracy: Effective prompts include equipment specifications, normal operating ranges, alert thresholds, and maintenance history to enable grounded recommendations Structured outputs enable automation: JSON response formats allow automated systems to parse classifications and route work orders without fragile text parsing Integration patterns bridge legacy systems: OPC-UA and Modbus TCP gateways connect decades-old PLCs and SCADA systems to modern AI without replacing functional control infrastructure Resources and Further Exploration The complete implementation with extensive comments and documentation is available in the GitHub repository. Additional resources help facilities customize and extend the system for their specific requirements. FoundryLocal-IndJSsample GitHub Repository – Full source code with JavaScript backend, HTML frontend, and sample data files Quick Start Guide and Documentation – Installation instructions, API documentation, and troubleshooting guidance Microsoft Foundry Local Documentation – Official SDK reference, model catalog, and deployment guidance Sample Manufacturing Data – Example equipment telemetry, maintenance logs, and alert structures Backend Implementation Reference – Express server code with Foundry Local SDK integration patterns OPC Foundation – Industrial communication standards for SCADA and PLC integration Edge AI for Beginners - Online FREE course and resources for learning more about using AI on Edge Devices Why On-Premises AI Cloud AI services offer convenience, but they fundamentally conflict with manufacturing operational requirements. Understanding these conflicts explains why local AI isn't just preferable, it's mandatory for production environments. Data privacy and intellectual property protection stand paramount. A CNC machining program represents years of optimization, feed rates, tool paths, thermal compensation algorithms. Quality control measurements reveal product specifications competitors would pay millions to access. Sending this data to external APIs, even with encryption, creates unacceptable exposure risk. Every API call generates logs on third-party servers, potentially subject to subpoenas, data breaches, or regulatory compliance failures. Latency requirements eliminate cloud viability for real-time decisions. When a thermal sensor detects bearing temperature exceeding safe thresholds, the control system needs AI analysis in under 50 milliseconds to prevent catastrophic failure. Cloud APIs introduce 100-500ms baseline latency from network round-trips alone, before queue times and processing. For safety systems, quality inspection, and process control, this latency is operationally unacceptable. Network dependency creates operational fragility. Factory floors frequently have limited connectivity, legacy equipment, RF interference, isolated production cells. Critical AI capabilities cannot fail because internet service drops. Moreover, many defense, aerospace, and pharmaceutical facilities operate air-gapped networks for security compliance. Cloud AI is simply non-operational in these environments. Regulatory requirements mandate data residency. ITAR (International Traffic in Arms Regulations) prohibits certain manufacturing data from leaving approved facilities. FDA 21 CFR Part 11 requires strict data handling controls for pharmaceutical manufacturing. GDPR demands data residency in approved jurisdictions. On-premises AI simplifies compliance by eliminating cross-border data transfers. Cost predictability at scale favors local deployment. A high-volume facility generating 10,000 equipment events per day, each requiring AI analysis, would incur significant cloud API costs. Local models have fixed infrastructure costs that scale economically with usage, making AI economically viable for continuous monitoring. Application Architecture: Web UI + Local AI Backend The FoundryLocal-IndJSsample implements a clean separation between data presentation and AI inference. This architecture ensures the UI remains responsive while AI operations run independently, enabling real-time dashboard updates without blocking user interactions. The web frontend serves a single-page application with vanilla HTML, CSS, and JavaScript, no frameworks, no build tools. This simplicity is intentional: factory IT teams need to audit code, customize interfaces, and deploy on legacy systems. The UI presents four main interfaces: Plant Asset Overview (real-time health cards for all equipment), Asset Health (AI-generated summaries and trend analysis), Maintenance Logs (classification and priority routing), and AI Assistant (natural language interface for operations queries). The Node.js backend runs Express as the HTTP server, handling static file serving, API routing, and WebSocket connections for real-time updates. It loads sample manufacturing data from JSON files, equipment telemetry, maintenance logs, historical events, simulating the data streams that would come from SCADA systems, PLCs, and MES platforms in production. Foundry Local provides the AI inference layer. The backend uses foundry-local-sdk to communicate with the locally running service. All model loading, prompt processing, and response generation happens on-device. The application detects Foundry Local automatically and falls back to rule-based analysis if unavailable, ensuring core functionality persists even when AI is offline. Here's the architectural flow for asset health analysis: User Request (Web UI) ↓ Express API Route (/api/assets/:id/summary) ↓ Load Equipment Data (from JSON/database) ↓ Build Analysis Prompt (Equipment ID, telemetry, alerts) ↓ Foundry Local SDK Call (local AI inference) ↓ Parse AI Response (structured insights) ↓ Return JSON Result (with metadata: model, latency, confidence) ↓ Display in UI (formatted health summary) This architecture demonstrates several industrial system design principles: Offline-first operation: Core functionality works without internet connectivity, with AI as an enhancement rather than dependency Graceful degradation: If AI fails, fall back to rule-based logic rather than crashing operations Minimal external dependencies: Simple stack reduces attack surface and simplifies air-gapped deployment Data locality: All processing happens on-premises, no external API calls Real-time updates: WebSocket connections enable push-based event streaming for dashboard updates Setting Up Foundry Local for Industrial Applications Industrial deployments require careful model selection that balances accuracy, speed, and hardware constraints. Factory edge devices often run on limited hardware—industrial PCs with modest GPUs or CPU-only configurations. Model choice significantly impacts deployment feasibility. Install Foundry Local on the industrial edge device: # Windows (most common for industrial PCs) winget install Microsoft.FoundryLocal # Verify installation foundry --version For manufacturing asset intelligence, model selection trades off speed versus quality: # Fast option: Qwen 0.5B (500MB, <100ms inference) foundry model load qwen2.5-0.5b # Balanced option: Phi-3.5 Mini (2.1GB, ~200ms inference) foundry model load phi-3.5-mini # High quality option: Phi-4 Mini (3.6GB, ~500ms inference) foundry model load phi-4 # Check which model is currently loaded foundry model list For real-time monitoring dashboards where hundreds of assets update continuously, qwen2.5-0.5b provides sufficient quality at speeds that don't bottleneck refresh cycles. For detailed root cause analysis or maintenance report generation where quality matters most, phi-4 justifies the slightly longer inference time. Industrial systems benefit from proactive model caching during downtime: # During maintenance windows, pre-download models foundry model download phi-3.5-mini foundry model download qwen2.5-0.5b # Models cache locally, eliminating runtime downloads The backend automatically detects Foundry Local and selects the loaded model: // backend/services/foundry-service.js import { FoundryLocalClient } from 'foundry-local-sdk'; class FoundryService { constructor() { this.client = null; this.modelAlias = null; this.initializeClient(); } async initializeClient() { try { // Detect Foundry Local endpoint const endpoint = process.env.FOUNDRY_LOCAL_ENDPOINT || 'http://127.0.0.1:5272'; this.client = new FoundryLocalClient({ endpoint }); // Query which model is currently loaded const models = await this.client.models.list(); this.modelAlias = models.data[0]?.id || 'phi-3.5-mini'; console.log(`✅ Foundry Local connected: ${this.modelAlias}`); } catch (error) { console.warn('⚠️ Foundry Local not available, using rule-based fallback'); this.client = null; } } async generateCompletion(prompt, options = {}) { if (!this.client) { // Fallback to rule-based analysis return this.ruleBasedAnalysis(prompt); } try { const startTime = Date.now(); const completion = await this.client.chat.completions.create({ model: this.modelAlias, messages: [ { role: 'system', content: 'You are an industrial asset intelligence assistant analyzing manufacturing equipment.' }, { role: 'user', content: prompt } ], temperature: 0.3, // Low temperature for factual analysis max_tokens: 400, ...options }); const latency = Date.now() - startTime; return { content: completion.choices[0].message.content, model: this.modelAlias, latency_ms: latency, tokens: completion.usage?.total_tokens }; } catch (error) { console.error('Foundry inference error:', error); return this.ruleBasedAnalysis(prompt); } } ruleBasedAnalysis(prompt) { // Fallback logic for when AI is unavailable // Pattern matching and heuristics return { content: '(Rule-based analysis) Equipment status: Monitoring...', model: 'rule-based-fallback', latency_ms: 5, tokens: 0 }; } } export default new FoundryService(); This service layer demonstrates critical production patterns: Automatic endpoint detection: Tries environment variable first, falls back to default Model auto-discovery: Queries Foundry Local for currently loaded model rather than hardcoding Robust error handling: Every API call wrapped in try-catch with fallback logic Performance tracking: Latency measurement enables monitoring and capacity planning Conservative temperature: 0.3 temperature reduces hallucination for factual equipment analysis Implementing AI-Powered Asset Health Analysis Equipment health monitoring forms the core use case, synthesizing telemetry from multiple sources into actionable insights. Traditional monitoring systems show raw metrics (temperature, vibration, pressure) but require expert interpretation. AI transforms this into natural language summaries that any operator can understand and act upon. Here's the API endpoint that generates asset health summaries: // backend/routes/assets.js import express from 'express'; import foundryService from '../services/foundry-service.js'; import { getAssetData } from '../data/asset-loader.js'; const router = express.Router(); router.get('/api/assets/:id/summary', async (req, res) => { try { const assetId = req.params.id; // Load equipment data const asset = await getAssetData(assetId); if (!asset) { return res.status(404).json({ error: 'Asset not found' }); } // Build analysis prompt with context const prompt = buildHealthAnalysisPrompt(asset); // Generate AI summary const analysis = await foundryService.generateCompletion(prompt); // Structure response res.json({ asset_id: assetId, asset_name: asset.name, summary: analysis.content, model_used: analysis.model, latency_ms: analysis.latency_ms, timestamp: new Date().toISOString(), telemetry_snapshot: { temperature: asset.telemetry.temperature, vibration: asset.telemetry.vibration, runtime_hours: asset.telemetry.runtime_hours }, active_alerts: asset.alerts.filter(a => a.active).length }); } catch (error) { console.error('Asset summary error:', error); res.status(500).json({ error: 'Analysis failed' }); } }); function buildHealthAnalysisPrompt(asset) { return ` Analyze the health of this manufacturing equipment and provide a concise summary: Equipment: ${asset.name} (${asset.id}) Type: ${asset.type} Location: ${asset.location} Current Telemetry: - Temperature: ${asset.telemetry.temperature}°C (Normal: ${asset.specs.normal_temp_range}) - Vibration: ${asset.telemetry.vibration} mm/s (Threshold: ${asset.specs.vibration_threshold}) - Operating Pressure: ${asset.telemetry.pressure} PSI - Runtime: ${asset.telemetry.runtime_hours} hours - Last Maintenance: ${asset.maintenance.last_service_date} Active Alerts: ${asset.alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Recent Events: ${asset.recent_events.slice(0, 3).map(e => `- ${e.timestamp}: ${e.description}`).join('\n')} Provide a 3-4 sentence summary covering: 1. Overall equipment health status 2. Any concerning trends or anomalies 3. Recommended actions or monitoring focus Be factual and specific. Do not speculate beyond the provided data. `.trim(); } export default router; This prompt construction demonstrates several best practices for industrial AI: Structured data presentation: Organize telemetry, specs, and alerts in clear sections with labels Context enrichment: Include normal operating ranges so AI can assess abnormality Explicit constraints: Instruction to avoid speculation reduces hallucination risk Output formatting guidance: Request specific structure (3-4 sentences, covering key points) Temporal context: Include recent events so AI understands trend direction Example AI-generated asset summary: { "asset_id": "CNC-L2-M03", "asset_name": "CNC Mill #3", "summary": "Equipment is operating outside normal parameters with elevated temperature at 92°C, significantly above the 75-80°C normal range. Thermal Alert indicates possible coolant flow issue. Vibration levels remain acceptable at 2.8 mm/s. Recommend immediate inspection of coolant system and thermal throttling may impact throughput until resolved.", "model_used": "phi-3.5-mini", "latency_ms": 243, "timestamp": "2026-01-30T14:23:18Z", "telemetry_snapshot": { "temperature": 92, "vibration": 2.8, "runtime_hours": 12847 }, "active_alerts": 2 } This summary transforms raw telemetry into actionable intelligence—operations staff immediately understand the problem, its severity, and the appropriate response, without requiring deep equipment expertise. Maintenance Log Classification with AI Maintenance departments generate hundreds of logs daily, technician notes, operator observations, inspection reports. Manually categorizing and prioritizing these logs consumes significant time. AI classification automatically routes logs to appropriate teams, identifies urgent issues, and extracts key information. The classification endpoint processes maintenance notes: // backend/routes/maintenance.js router.post('/api/logs/classify', async (req, res) => { try { const { log_text, equipment_id } = req.body; if (!log_text || log_text.length < 10) { return res.status(400).json({ error: 'Log text required (min 10 chars)' }); } const classificationPrompt = ` Classify this maintenance log entry into appropriate categories and priority: Equipment: ${equipment_id || 'Unknown'} Log Text: "${log_text}" Classify into EXACTLY ONE primary category: - MECHANICAL: Physical components, bearings, belts, motors - ELECTRICAL: Power systems, sensors, controllers, wiring - HYDRAULIC: Pumps, fluid systems, pressure issues - THERMAL: Cooling, heating, temperature control - SOFTWARE: PLC programming, HMI issues, control logic - ROUTINE: Scheduled maintenance, inspections, calibration Assign priority level: - CRITICAL: Immediate action required, safety or production impact - HIGH: Resolve within 24 hours, performance degradation - MEDIUM: Schedule within 1 week, minor issues - LOW: Routine maintenance, cosmetic issues Extract key details: - Symptoms described - Suspected root cause (if mentioned) - Recommended actions Return ONLY a JSON object with this exact structure: { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise", "vibration above 5mm/s"], "suspected_cause": "bearing wear", "recommended_actions": ["inspect bearings", "order replacement parts"] } `.trim(); const analysis = await foundryService.generateCompletion(classificationPrompt); // Parse AI response as JSON let classification; try { // Extract JSON from response (AI might add explanation text) const jsonMatch = analysis.content.match(/\{[\s\S]*\}/); classification = JSON.parse(jsonMatch[0]); } catch (parseError) { // Fallback parsing if JSON extraction fails classification = parseClassificationText(analysis.content); } // Validate classification const validCategories = ['MECHANICAL', 'ELECTRICAL', 'HYDRAULIC', 'THERMAL', 'SOFTWARE', 'ROUTINE']; const validPriorities = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']; if (!validCategories.includes(classification.category)) { classification.category = 'ROUTINE'; } if (!validPriorities.includes(classification.priority)) { classification.priority = 'MEDIUM'; } res.json({ original_log: log_text, classification, model_used: analysis.model, latency_ms: analysis.latency_ms, timestamp: new Date().toISOString() }); } catch (error) { console.error('Classification error:', error); res.status(500).json({ error: 'Classification failed' }); } }); function parseClassificationText(text) { // Fallback parser for when AI doesn't return valid JSON // Extract category, priority, and details using regex patterns const categoryMatch = text.match(/category[":]\s*(MECHANICAL|ELECTRICAL|HYDRAULIC|THERMAL|SOFTWARE|ROUTINE)/i); const priorityMatch = text.match(/priority[":]\s*(CRITICAL|HIGH|MEDIUM|LOW)/i); return { category: categoryMatch ? categoryMatch[1].toUpperCase() : 'ROUTINE', priority: priorityMatch ? priorityMatch[1].toUpperCase() : 'MEDIUM', symptoms: [], suspected_cause: 'Unknown', recommended_actions: [] }; } This implementation demonstrates several critical patterns for structured AI outputs: Explicit output format requirements: Prompt specifies exact JSON structure to encourage parseable responses Defensive parsing: Try JSON extraction first, fall back to text parsing if that fails Validation with sensible defaults: Validate categories and priorities against allowed values, default to safe values on mismatch Constrained classification vocabulary: Limit categories to predefined set rather than open-ended categories Priority inference rules: Guide AI to assess urgency based on safety, production impact, and timeline Example classification output: POST /api/logs/classify { "log_text": "Hydraulic pump PUMP-L1-H01 making grinding noise during startup. Vibration readings spiked to 5.2 mm/s this morning. Possible bearing wear. Recommend inspection.", "equipment_id": "PUMP-L1-H01" } Response: { "original_log": "Hydraulic pump PUMP-L1-H01 making grinding noise...", "classification": { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise during startup", "vibration spike to 5.2 mm/s"], "suspected_cause": "bearing wear", "recommended_actions": ["inspect bearings", "schedule replacement if confirmed worn"] }, "model_used": "phi-3.5-mini", "latency_ms": 187, "timestamp": "2026-01-30T14:35:22Z" } This classification automatically routes the log to the mechanical maintenance team, marks it high priority for same-day attention, and extracts actionable details, all without human intervention. Building the Natural Language Operations Assistant The AI Assistant interface enables operations staff to query equipment status, ask diagnostic questions, and get contextual guidance using natural language. This interface bridges the gap between complex SCADA systems and operators who need quick answers without navigating multiple screens. The chat endpoint implements contextual conversation: // backend/routes/chat.js router.post('/api/chat', async (req, res) => { try { const { message, conversation_id } = req.body; if (!message || message.length < 3) { return res.status(400).json({ error: 'Message required (min 3 chars)' }); } // Load conversation history if exists const history = conversation_id ? await loadConversationHistory(conversation_id) : []; // Build context from current plant state const plantContext = await buildPlantContext(); // Construct system prompt with operational context const systemPrompt = ` You are an operations assistant for a manufacturing facility. Answer questions about equipment status, maintenance, and operational issues. Current Plant Status: ${plantContext} Guidelines: - Provide specific, actionable answers based on current data - Reference specific equipment IDs when relevant - Suggest appropriate next steps for issues - If information is unavailable, say so clearly - Use concise language suitable for busy operators Do not speculate about issues without data to support it. `.trim(); // Build message chain with history const messages = [ { role: 'system', content: systemPrompt }, ...history.map(h => ({ role: h.role, content: h.content })), { role: 'user', content: message } ]; // Generate response const response = await foundryService.generateCompletion( message, { messages } // Pass full conversation context ); // Save conversation const newConversationId = conversation_id || generateConversationId(); await saveConversationTurn(newConversationId, message, response.content); res.json({ response: response.content, conversation_id: newConversationId, model_used: response.model, latency_ms: response.latency_ms, timestamp: new Date().toISOString() }); } catch (error) { console.error('Chat error:', error); res.status(500).json({ error: 'Chat failed' }); } }); async function buildPlantContext() { const assets = await getAllAssets(); // Summarize plant-wide status const criticalAlerts = assets.flatMap(a => a.alerts.filter(alert => alert.severity === 'CRITICAL') ); const equipmentStatus = assets.map(a => `${a.id}: ${a.status} (Temp: ${a.telemetry.temperature}°C, Alerts: ${a.alerts.length})` ); return ` Total Equipment: ${assets.length} Critical Alerts: ${criticalAlerts.length} Equipment Status Summary: ${equipmentStatus.join('\n')} `.trim(); } Example conversational interactions: User: "What's wrong with Line 2?" AI: "Line 2 has elevated issues. CNC-L2-M03 is running 12°C above normal temperature at 92°C with a thermal alert active. This has caused 28% below-target throughput as the equipment throttles to prevent damage. The coolant system requires immediate inspection—flow rate dropped to 3.2 L/min from the normal 5.5 L/min, indicating a potential blockage or pump failure. Other Line 2 equipment (CONV-L2-B01, PLC-L2-MAIN) is operating normally." User: "How's the hydraulic pump performing?" AI: "PUMP-L1-H01 has a high-priority maintenance alert. Grinding noise was reported during startup with vibration at 5.2 mm/s, exceeding the 4.0 mm/s threshold. This pattern suggests bearing wear. Current operational status shows the pump is running but degraded. Recommend inspection today to prevent catastrophic failure. Bearing replacement typically requires 2-4 hours downtime." User: "What should I prioritize this shift?" AI: "Priority order for this shift: 1) CNC-L2-M03 coolant inspection (CRITICAL - impacting production, safety risk), 2) PUMP-L1-H01 bearing check (HIGH - prevent failure progression), 3) Routine inspections on CONV-L1-B02 (scheduled maintenance). Focus immediate attention on Line 2 thermal issues. Line 1 is stable." The assistant provides grounded, specific responses citing actual equipment IDs, current metrics, and actionable priorities. This data-grounded approach maintains operational credibility, operators quickly learn to trust responses that reference specific equipment conditions they can independently verify on SCADA systems. Real-World Industrial Scenario: Thermal Event Response To understand the system's practical value, consider how it supports operators during an actual equipment incident. This scenario simulates realistic manufacturing conditions where multiple issues develop simultaneously and operators need rapid situational awareness. Scenario Setup: Assembly Line Degradation During Shift B operations on a Tuesday evening at a precision manufacturing facility, multiple equipment anomalies develop on production lines. The plant overview dashboard begins displaying elevated alert counts. Telemetry monitoring detects concerning patterns across Assembly Line 2 equipment. CNC mill CNC-L2-M03 reports sustained high temperature at 92°C compared to normal operating range of 65-75°C. Coolant flow sensors indicate 40% reduction from baseline levels—dropping from 5.5 L/min to 3.2 L/min. Equipment throughput has decreased 28% as thermal management systems automatically throttle spindle speeds to prevent damage. The mill has accumulated 12,847 runtime hours with next scheduled maintenance in 153 hours. Simultaneously, hydraulic pump PUMP-L1-H01 on Line 1 logged a maintenance note during startup: "Grinding noise observed during pump startup, vibration spike detected." Automated vibration sensors registered 5.2 mm/s compared to the 4.0 mm/s alarm threshold. The pump's condition monitoring system flagged bearing wear as probable cause based on frequency analysis of vibration signatures. The shift supervisor, monitoring the plant overview dashboard, sees both alerts flagged as high priority. Traditional response would require accessing SCADA screens, checking telemetry history, reviewing maintenance logs across multiple systems, correlating patterns, and consulting equipment experts, consuming 15-20 minutes of investigation time while equipment conditions potentially worsen. AI-Assisted Response Workflow Instead, the supervisor uses the AI Assistant interface to rapidly assess the situation. The workflow demonstrates how local AI transforms incident response: Step 1 - Initial Assessment Query: Supervisor types "What's the current plant status?" The AI assistant analyzes the complete equipment database and returns: "Plant status as of 22:14: 14 assets monitored, 2 high-priority alerts active. Critical issues: CNC-L2-M03 thermal event (92°C, coolant flow reduced 40%, throughput down 28%). High-priority: PUMP-L1-H01 mechanical degradation (bearing wear suspected, 5.2 mm/s vibration). All other equipment operating within normal parameters. Recommend immediate attention to Line 2 thermal issue—production impact and potential equipment damage risk." Step 2 - Detailed Equipment Analysis: Supervisor navigates to Asset Health tab and selects CNC-L2-M03 for detailed analysis. Within 1.8 seconds, the AI summary generates: "CNC-L2-M03 Health Summary: DEGRADED - IMMEDIATE ACTION REQUIRED Current Status: Elevated thermal condition with active coolant flow issue. Operating temperature 92°C exceeds normal range (65-75°C) by 22%. Coolant flow rate 3.2 L/min indicates 42% reduction from baseline 5.5 L/min. Impact Assessment: Throughput reduced 28% due to automatic thermal throttling. Spindle speed limited to 65% capacity to prevent thermal damage. Current production rate: 18 parts/hour vs. target 25 parts/hour. Root Cause Analysis: Coolant flow reduction suggests blockage in cooling circuit or coolant pump degradation. Temperature rise pattern correlates with gradual flow reduction observed over past 6 hours. Recommended Actions: 1. Immediate: Verify coolant reservoir level and pump operation 2. Inspect: Check cooling circuit for blockages or debris 3. Monitor: Temperature trend every 15 minutes until resolved 4. Consider: Production pause if temperature exceeds 95°C Maintenance Priority: HIGH - Production impact and equipment damage risk if unresolved. Estimated repair time: 1-2 hours for coolant system service." Step 3 - Maintenance Log Classification: When the pump maintenance note is entered, the AI classification system analyzes it and returns structured classification within 240ms: { "classification": { "category": "MECHANICAL", "priority": "HIGH", "symptoms": ["grinding noise during startup", "vibration spike to 5.2 mm/s", "frequency pattern indicates bearing wear"], "suspected_cause": "bearing wear - inner race degradation probable", "recommended_actions": [ "schedule bearing inspection within 8 hours", "prepare replacement bearings (6208-2RS)", "plan maintenance window: 2-4 hours", "order backup pump if repair extended" ], "estimated_downtime": "2-4 hours for bearing replacement" } } Step 4 - Decision Support and Prioritization: Supervisor asks "What should I prioritize tonight?" The AI synthesizes all context: "Priority recommendations for Shift B: IMMEDIATE (next 30 minutes): - CNC-L2-M03: Dispatch maintenance to coolant system NOW. Production impacted, equipment damage risk if temperature continues rising. Impact: $2,400/hour revenue loss at current throughput reduction. HIGH (within 4 hours): - PUMP-L1-H01: Schedule bearing inspection for line changeover at 02:00. Pump operational but degraded—plan bearing replacement parts and backup pump. Risk: Catastrophic failure if bearing fully degrades. STANDARD (this week): - CONV-L1-B02: Scheduled inspection per maintenance calendar - Routine: Daily checklist completion for all lines Resource allocation: Focus maintenance team lead on CNC thermal issue. Assign technician to prepare pump bearing replacement kit during current production run." Outcome: Rapid Response Enabled by Local AI The AI-assisted workflow reduced incident assessment time from 15-20 minutes to under 3 minutes. The supervisor immediately dispatched maintenance to investigate CNC-L2-M03's coolant system, identified as root cause. Technicians found debris blocking a cooling circuit junction, 5-minute clearance procedure restored coolant flow to 5.3 L/min. Equipment temperature dropped to 73°C within 10 minutes, and full production throughput resumed. For the hydraulic pump, bearing inspection was scheduled during planned line changeover at 02:00, preventing emergency production stoppage. Bearings were replaced preemptively, avoiding the catastrophic pump failure that would have caused 6-8 hours of unplanned downtime. Total downtime avoided: 8+ hours. Revenue protection: ~$48,000 based on facility's production value. All decisions made with AI running entirely on local edge device, no cloud dependency, no data exposure, no network latency impact. The complete incident response workflow operated on facility-controlled infrastructure with full data sovereignty. Key Takeaways for Manufacturing AI Deployment Building production-ready AI systems for industrial environments requires architectural decisions that prioritize operational reliability, data sovereignty, and integration pragmatism over cutting-edge model sophistication. Several critical lessons emerge from implementing on-premises manufacturing intelligence: Data locality through architectural guarantee: On-premises AI ensures proprietary production data never leaves facility networks not through configuration but through fundamental architecture. There are no cloud API calls to misconfigure, no data upload features to accidentally enable, no external endpoints to compromise. This physical data boundary satisfies security audits and competitive protection requirements with demonstrable certainty rather than contractual assurance. Model selection determines deployment feasibility: Smaller models (0.5B-2B parameters) enable deployment on commodity server hardware without specialized AI accelerators. These models provide sufficient accuracy for industrial classification, summarization, and conversational assistance while maintaining sub-3-second response times essential for operator acceptance. Larger models improve nuance but require GPU infrastructure and longer inference times that may not justify marginal accuracy gains for operational decision-making. Graceful degradation preserves operations: AI capabilities enhance but never replace core monitoring functions. Equipment dashboards must display raw telemetry, alert states, and historical trends even when AI analysis is unavailable. This architectural separation ensures operations continue during AI service maintenance, model updates, or system failures. AI becomes value-add intelligence rather than critical dependency. Context-rich prompts determine accuracy: Generic prompts produce generic responses unsuitable for operational decisions. Effective industrial prompts include equipment specifications, normal operating ranges, alert thresholds, maintenance history, and temporal context. This structured context enables models to provide grounded, specific recommendations citing actual equipment conditions rather than hallucinated speculation. Prompt engineering matters more than model size for operational accuracy. Structured outputs enable automation: JSON response formats with predefined fields allow automated systems to parse classifications, severity levels, and recommended actions without fragile natural language parsing. Maintenance management systems can automatically route work orders, trigger alerts, and update dashboards based on AI classification results. This structured integration scales AI beyond human-read summaries into automated workflow systems. Integration patterns bridge legacy and modern: OPC-UA clients and Modbus TCP gateways connect decades-old PLCs and SCADA systems to modern AI backends without replacing functional control infrastructure. This evolutionary approach enables AI adoption without massive capital equipment replacement. Manufacturing facilities can augment existing investments rather than ripping and replacing proven systems. Responsible AI through grounding and constraints: Industrial AI must acknowledge limits and avoid speculation beyond available data. System prompts should explicitly instruct models: "If you don't have information to answer, clearly state that" and "Do not speculate about equipment conditions beyond provided data." This reduces hallucination risk and maintains operator trust. Operators must verify AI recommendations against domain expertise, position AI as decision support augmenting human judgment, not replacing it. Getting Started: Installation and Deployment Implementing the manufacturing intelligence system requires Foundry Local installation, Node.js backend deployment, and frontend hosting, achievable within a few hours for facilities with existing IT infrastructure and server hardware. Prerequisites and System Requirements Hardware requirements depend on selected AI models. Minimum configuration supports Phi-3.5-mini model (2.1GB): 8GB RAM, 4-core CPU (Intel Core i5/AMD Ryzen 5 or better) 50GB available storage for model files and application data Windows 11/Server 2025 distribution. Recommended production configuration: 16GB+ RAM (supports larger models and concurrent requests), 8-core CPU or NVIDIA GPU (RTX 3060/4060 or better for 3-5x inference acceleration), 100GB SSD storage, gigabit network interface for intra-facility communication. Software prerequisites: Node.js 18 or newer (download from nodejs.org or install via system package manager), Git for repository cloning, modern web browser (Chrome, Edge, Firefox) for frontend access, Windows: PowerShell 5.1+. Foundry Local Installation and Model Setup Install Foundry Local using system-appropriate package manager: # Windows installation via winget winget install Microsoft.FoundryLocal # Verify installation foundry --version # macOS installation via Homebrew brew install microsoft/foundrylocal/foundrylocal Download AI models based on hardware capabilities and accuracy requirements: # Fast option: Qwen 0.5B (500MB, 100-200ms inference) foundry model download qwen2.5-0.5b # Balanced option: Phi-3.5 Mini (2.1GB, 1-3 second inference) foundry model download phi-3.5-mini # High quality option: Phi-4 Mini (3.6GB, 2-5 second inference) foundry model download phi-4-mini # Check downloaded models foundry model list Load a model into the Foundry Local service: # Load default recommended model foundry model run phi-3.5-mini # Verify service is running and model is loaded foundry service status The Foundry Local service will start automatically and expose a REST API on localhost:8008 (default port). The backend application connects to this endpoint for all AI inference operations. Backend Service Deployment Clone the repository and install dependencies: # Clone from GitHub git clone https://github.com/leestott/FoundryLocal-IndJSsample.git cd FoundryLocal-IndJSsample # Navigate to backend directory cd backend # Install Node.js dependencies npm install # Start the backend service npm start The backend server will initialize and display startup messages: Manufacturing AI Backend Starting... ✓ Foundry Local client initialized: http://localhost:8008 ✓ Model detected: phi-3.5-mini ✓ Sample data loaded: 6 assets, 12 maintenance logs ✓ Server running on port 3000 ✓ Frontend accessible at: http://localhost:3000 Health check: http://localhost:3000/api/health Verify backend health: # Test backend API curl http://localhost:3000/api/health # Expected response: {"ok":true,"service":"manufacturing-ai-backend"} # Test Foundry Local integration curl http://localhost:3000/api/models/status # Expected response: {"serviceRunning":true,"model":"phi-3.5-mini"} Frontend Access and Validation Open the web interface by navigating to web/index.html in a browser or starting from the backend URL: # Windows: Open frontend directly start http://localhost:3000 # macOS/Linux: Open frontend open http://localhost:3000 # or xdg-open http://localhost:3000 The web interface displays a navigation bar with four main sections: Overview: Plant-wide dashboard showing all equipment with health status cards, alert counts, and "Load Scenario" button to populate sample data Asset Health: Equipment selector dropdown, telemetry display, active alerts list, and "Generate AI Summary" button for detailed analysis Maintenance: Text area for maintenance log entry, "Classify Log" button, and classification result display showing category, priority, and recommendations AI Assistant: Chat interface with message input, conversation history, and natural language query capabilities Running the Sample Scenario Test the complete system with included sample data: Load scenario data: Click "Load Scenario Inputs" button in Overview tab. This populates equipment database with CNC-L2-M03 thermal event, PUMP-L1-H01 vibration alert, and baseline telemetry for all assets. Generate asset summary: Navigate to Asset Health tab, select "CNC-L2-M03" from dropdown, click "Generate AI Analysis". Within 2-3 seconds, detailed health summary appears explaining thermal condition, coolant flow issue, impacts, and recommended actions. Classify maintenance note: Go to Maintenance tab, enter text: "Grinding noise on startup, vibration 5.2 mm/s, suspect bearing wear". Click "Classify Log". AI categorizes as MECHANICAL/HIGH priority with specific repair recommendations. Ask operational questions: Open AI Assistant tab, type "What's wrong with Line 2?" or "Which equipment needs attention?" AI responds with specific equipment IDs, current conditions, and prioritized action list. Production Deployment Considerations For actual manufacturing facility deployment, several additional configurations apply: Hardware placement: Deploy backend service on dedicated server within manufacturing network zone. Avoid co-locating AI workloads with critical SCADA/MES systems due to resource contention. Use physical server or VM with direct hardware access for GPU acceleration. Network configuration: Backend should reside behind facility firewall with access restricted to internal networks. Do not expose AI service directly to internetm use VPN for remote access if required. Implement authentication via Active Directory/LDAP integration. Configure firewall rules permitting connections from operator workstations and monitoring systems only. Data integration: Replace sample JSON data with connections to actual data sources. Implement OPC-UA client for SCADA integration, connect to MES database for production schedules, integrate with CMMS for maintenance history. Code includes placeholder functions for external data source integration, customize for facility-specific systems. Model selection: Choose appropriate model based on hardware and accuracy requirements. Start with phi-3.5-mini for production deployment. Upgrade to phi-4-mini if analysis quality needs improvement and hardware supports it. Use qwen2.5-0.5b for high-throughput scenarios where speed matters more than nuanced understanding. Test all models against validation scenarios before production promotion. Monitoring and maintenance: Implement health checks monitoring Foundry Local service status, backend API responsiveness, model inference latency, and error rates. Set up alerting when inference latency exceeds thresholds or service unavailable. Establish procedures for model updates during planned maintenance windows. Keep audit logs of all AI invocations for compliance and troubleshooting. Resources and Further Learning The complete implementation with detailed comments, sample data, and documentation provides a foundation for building custom manufacturing intelligence systems. Additional resources support extension and adaptation to specific facility requirements. FoundryLocal-IndJSsample GitHub Repository – Complete source code with JavaScript backend, HTML/CSS/JS frontend, sample manufacturing data, and comprehensive README Installation and Configuration Guide – Detailed setup instructions, API documentation, troubleshooting procedures, and deployment guidance Microsoft Foundry Local Documentation – Official SDK reference, model catalog, hardware requirements, and performance tuning guidance Sample Manufacturing Data Format – JSON structure examples for equipment telemetry, maintenance logs, alert definitions, and operational events Backend Implementation Reference – Express server architecture, Foundry Local SDK integration patterns, API endpoint implementations, and error handling OPC Foundation – Industrial communication standards (OPC-UA, OPC DA) for SCADA system integration and PLC connectivity ISA Standards – International Society of Automation standards for industrial systems, SCADA architecture, and manufacturing execution systems EdgeAI for Beginner - Learn more about Edge AI using these course materials The manufacturing intelligence implementation demonstrates that sophisticated AI capabilities can run entirely on-premises without compromising operational requirements. Facilities gain predictive maintenance insights, natural language operational support, and automated equipment analysis while maintaining complete data sovereignty, zero network dependency, and deterministic performance characteristics essential for production environments.Building Interactive Agent UIs with AG-UI and Microsoft Agent Framework
Introduction Picture this: You've built an AI agent that analyzes financial data. A user uploads a quarterly report and asks: "What are the top three expense categories?" Behind the scenes, your agent parses the spreadsheet, aggregates thousands of rows, and generates visualizations. All in 20 seconds. But the user? They see a loading spinner. Nothing else. No "reading file" message, no "analyzing data" indicator, no hint that progress is being made. They start wondering: Is it frozen? Should I refresh? The problem isn't the agent's capabilities - it's the communication gap between the agent running on the backend and the user interface. When agents perform multi-step reasoning, call external APIs, or execute complex tool chains, users deserve to see what's happening. They need streaming updates, intermediate results, and transparent progress indicators. Yet most agent frameworks force developers to choose between simple request/response patterns or building custom solutions to stream updates to their UIs. This is where AG-UI comes in. AG-UI is a fairly new event-based protocol that standardizes how agents communicate with user interfaces. Instead of every framework and development team inventing their own streaming solution, AG-UI provides a shared vocabulary of structured events that work consistently across different agent implementations. When an agent starts processing, calls a tool, generates text, or encounters an error, the UI receives explicit, typed events in real time. The beauty of AG-UI is its framework-agnostic design. While this blog post demonstrates integration with Microsoft Agent Framework (MAF), the same AG-UI protocol works with LangGraph, CrewAI, or any other compliant framework. Write your UI code once, and it works with any AG-UI-compliant backend. (Note: MAF supports both Python and .NET - this blog post focuses on the Python implementation.) TL;DR The Problem: Users don't get real-time updates while AI agents work behind the scenes - no progress indicators, no transparency into tool calls, and no insight into what's happening. The Solution: AG-UI is an open, event-based protocol that standardizes real-time communication between AI agents and user interfaces. Instead of each development team and framework inventing custom streaming solutions, AG-UI provides a shared vocabulary of structured events (like TOOL_CALL_START, TEXT_MESSAGE_CONTENT, RUN_FINISHED) that work across any compliant framework. Key Benefits: Framework-agnostic - Write UI code once, works with LangGraph, Microsoft Agent Framework, CrewAI, and more Real-time observability - See exactly what your agent is doing as it happens Server-Sent Events - Built on standard HTTP for universal compatibility Protocol-managed state - No manual conversation history tracking In This Post: You'll learn why AG-UI exists, how it works, and build a complete working application using Microsoft Agent Framework with Python - from server setup to client implementation. What You'll Learn This blog post walks through: Why AG-UI exists - how agent-UI communication has evolved and what problems current approaches couldn't solve How the protocol works - the key design choices that make AG-UI simple, reliable, and framework-agnostic Protocol architecture - the generic components and how AG-UI integrates with agent frameworks Building an AG-UI application - a complete working example using Microsoft Agent Framework with server, client, and step-by-step setup Understanding events - what happens under the hood when your agent runs and how to observe it Thinking in events - how building with AG-UI differs from traditional APIs, and what benefits this brings Making the right choice - when AG-UI is the right fit for your project and when alternatives might be better Estimated reading time: 15 minutes Who this is for: Developers building AI agents who want to provide real-time feedback to users, and teams evaluating standardized approaches to agent-UI communication To appreciate why AG-UI matters, we need to understand the journey that led to its creation. Let's trace how agent-UI communication has evolved through three distinct phases. The Evolution of Agent-UI Communication AI agents have become more capable over time. As they evolved, the way they communicated with user interfaces had to evolve as well. Here's how this evolution unfolded. Phase 1: Simple Request/Response In the early days of AI agent development, the interaction model was straightforward: send a question, wait for an answer, display the result. This synchronous approach mirrored traditional API calls and worked fine for simple scenarios. # Simple, but limiting response = agent.run("What's the weather in Paris?") display(response) # User waits... and waits... Works for: Quick queries that complete in seconds, simple Q&A interactions where immediate feedback and interactivity aren't critical. Breaks down: When agents need to call multiple tools, perform multi-step reasoning, or process complex queries that take 30+ seconds. Users see nothing but a loading spinner, with no insight into what's happening or whether the agent is making progress. This creates a poor user experience and makes it impossible to show intermediate results or allow user intervention. Recognizing these limitations, development teams began experimenting with more sophisticated approaches. Phase 2: Custom Streaming Solutions As agents became more sophisticated, teams recognized the need for incremental feedback and interactivity. Rather than waiting for the complete response, they implemented custom streaming solutions to show partial results as they became available. # Every team invents their own format for chunk in agent.stream("What's the weather?"): display(chunk) # But what about tool calls? Errors? Progress? This was a step forward for building interactive agent UIs, but each team solved the problem differently. Also, different frameworks had incompatible approaches - some streamed only text tokens, others sent structured JSON, and most provided no visibility into critical events like tool calls or errors. The problem: No standardization across frameworks - client code that works with LangGraph won't work with Crew AI, requiring separate implementations for each agent backend Each implementation handles tool calls differently - some send nothing during tool execution, others send unstructured messages Complex state management - clients must track conversation history, manage reconnections, and handle edge cases manually The industry needed a better solution - a common protocol that could work across all frameworks while maintaining the benefits of streaming. Phase 3: Standardized Protocol (AG-UI) AG-UI emerged as a response to the fragmentation problem. Instead of each framework and development team inventing their own streaming solution, AG-UI provides a shared vocabulary of events that work consistently across different agent implementations. # Standardized events everyone understands async for event in agent.run_stream("What's the weather?"): if event.type == "TEXT_MESSAGE_CONTENT": display_text(event.delta) elif event.type == "TOOL_CALL_START": show_tool_indicator(event.tool_name) elif event.type == "TOOL_CALL_RESULT": show_tool_result(event.result) The key difference is structured observability. Rather than guessing what the agent is doing from unstructured text, clients receive explicit events for every stage of execution: when the agent starts, when it generates text, when it calls a tool, when that tool completes, and when the entire run finishes. What's different: A standardized vocabulary of event types, complete observability into agent execution, and framework-agnostic clients that work with any AG-UI-compliant backend. You write your UI code once, and it works whether the backend uses Microsoft Agent Framework, LangGraph, or any other framework that speaks AG-UI. Now that we've seen why AG-UI emerged and what problems it solves, let's examine the specific design decisions that make the protocol work. These choices weren't arbitrary - each one addresses concrete challenges in building reliable, observable agent-UI communication. The Design Decisions Behind AG-UI Why Server-Sent Events (SSE)? Aspect WebSockets SSE (AG-UI) Complexity Bidirectional Unidirectional (simpler) Firewall/Proxy Sometimes blocked Standard HTTP Reconnection Manual implementation Built-in browser support Use case Real-time games, chat Agent responses (one-way) For agent interactions, you typically only need server→client communication, making SSE a simpler choice. SSE solves the transport problem - how events travel from server to client. But once connected, how does the protocol handle conversation state across multiple interactions? Why Protocol-Managed Threads? # Without protocol threads (client manages): conversation_history = [] conversation_history.append({"role": "user", "content": message}) response = agent.complete(conversation_history) conversation_history.append({"role": "assistant", "content": response}) # Complex, error-prone, doesn't work with multiple clients # With AG-UI (protocol manages): thread = agent.get_new_thread() # Server creates and manages thread agent.run_stream(message, thread=thread) # Server maintains context # Simple, reliable, shareable across clients With transport and state management handled, the final piece is the actual messages flowing through the connection. What information should the protocol communicate, and how should it be structured? Why Standardized Event Types? Instead of parsing unstructured text, clients get typed events: RUN_STARTED - Agent begins (start loading UI) TEXT_MESSAGE_CONTENT - Text chunk (stream to user) TOOL_CALL_START - Tool invoked (show "searching...", "calculating...") TOOL_CALL_RESULT - Tool finished (show result, update UI) RUN_FINISHED - Complete (hide loading) This lets UIs react intelligently without custom parsing logic. Now that we understand the protocol's design choices, let's see how these pieces fit together in a complete system. Architecture Overview Here's how the components interact: The communication between these layers relies on a well-defined set of event types. Here are the core events that flow through the SSE connection: Core Event Types AG-UI provides a standardized set of event types to describe what's happening during an agent's execution: RUN_STARTED - agent begins execution TEXT_MESSAGE_START, TEXT_MESSAGE_CONTENT, TEXT_MESSAGE_END - streaming segments of text TOOL_CALL_START, TOOL_CALL_ARGS, TOOL_CALL_END, TOOL_CALL_RESULT - tool execution events RUN_FINISHED - agent has finished execution RUN_ERROR - error information This model lets the UI update as the agent runs, rather than waiting for the final response. The generic architecture above applies to any AG-UI implementation. Now let's see how this translates to Microsoft Agent Framework. AG-UI with Microsoft Agent Framework While AG-UI is framework-agnostic, this blog post demonstrates integration with Microsoft Agent Framework (MAF) using Python. MAF is available in both Python and .NET, giving you flexibility to build AG-UI applications in your preferred language. Understanding how MAF implements the protocol will help you build your own applications or work with other compliant frameworks. Integration Architecture The Microsoft Agent Framework integration involves several specialized layers that handle protocol translation and execution orchestration: Understanding each layer: FastAPI Endpoint - Handles HTTP requests and establishes SSE connections for streaming AgentFrameworkAgent - Protocol wrapper that translates between AG-UI events and Agent Framework operations Orchestrators - Manage execution flow, coordinate tool calling sequences, and handle state transitions ChatAgent - Your agent implementation with instructions, tools, and business logic ChatClient - Interface to the underlying language model (Azure OpenAI, OpenAI, or other providers) The good news? When you call add_agent_framework_fastapi_endpoint, all the middleware layers are configured automatically. You simply provide your ChatAgent, and the integration handles protocol translation, event streaming, and state management behind the scenes. Now that we understand both the protocol architecture and the Microsoft Agent Framework integration, let's build a working application. Hands-On: Building Your First AG-UI Application This section demonstrates how to build an AG-UI server and client using Microsoft Agent Framework and FastAPI. Prerequisites Before building your first AG-UI application, ensure you have: Python 3.10 or later installed Basic understanding of async/await patterns in Python Azure CLI installed and authenticated (az login) Azure OpenAI service endpoint and deployment configured (setup guide) Cognitive Services OpenAI Contributor role for your Azure OpenAI resource You'll also need to install the AG-UI integration package: pip install agent-framework-ag-ui --pre This automatically installs agent-framework-core, fastapi, and uvicorn as dependencies. With your environment configured, let's create the server that will host your agent and expose it via the AG-UI protocol. Building the Server Let's create a FastAPI server that hosts an AI agent and exposes it via AG-UI: # server.py import os from typing import Annotated from dotenv import load_dotenv from fastapi import FastAPI from pydantic import Field from agent_framework import ChatAgent, ai_function from agent_framework.azure import AzureOpenAIChatClient from agent_framework_ag_ui import add_agent_framework_fastapi_endpoint from azure.identity import DefaultAzureCredential # Load environment variables from .env file load_dotenv() # Validate environment configuration openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT") model_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME") if not openai_endpoint: raise RuntimeError("Missing required environment variable: AZURE_OPENAI_ENDPOINT") if not model_deployment: raise RuntimeError("Missing required environment variable: AZURE_OPENAI_DEPLOYMENT_NAME") # Define tools the agent can use @ai_function def get_order_status( order_id: Annotated[str, Field(description="The order ID to look up (e.g., ORD-001)")] ) -> dict: """Look up the status of a customer order. Returns order status, tracking number, and estimated delivery date. """ # Simulated order lookup orders = { "ORD-001": {"status": "shipped", "tracking": "1Z999AA1", "eta": "Jan 25, 2026"}, "ORD-002": {"status": "processing", "tracking": None, "eta": "Jan 23, 2026"}, "ORD-003": {"status": "delivered", "tracking": "1Z999AA3", "eta": "Delivered Jan 20"}, } return orders.get(order_id, {"status": "not_found", "message": "Order not found"}) # Initialize Azure OpenAI client chat_client = AzureOpenAIChatClient( credential=DefaultAzureCredential(), endpoint=openai_endpoint, deployment_name=model_deployment, ) # Configure the agent with custom instructions and tools agent = ChatAgent( name="CustomerSupportAgent", instructions="""You are a helpful customer support assistant. You have access to a get_order_status tool that can look up order information. IMPORTANT: When a user mentions an order ID (like ORD-001, ORD-002, etc.), you MUST call the get_order_status tool to retrieve the actual order details. Do NOT make up or guess order information. After calling get_order_status, provide the actual results to the user in a friendly format.""", chat_client=chat_client, tools=[get_order_status], ) # Initialize FastAPI application app = FastAPI( title="AG-UI Customer Support Server", description="Interactive AI agent server using AG-UI protocol with tool calling" ) # Mount the AG-UI endpoint add_agent_framework_fastapi_endpoint(app, agent, path="/chat") def main(): """Entry point for the AG-UI server.""" import uvicorn print("Starting AG-UI server on http://localhost:8000") uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info") # Run the application if __name__ == "__main__": main() What's happening here: We define a tool: get_order_status with the AI_function decorator Use Annotated and Field for parameter descriptions to help the agent understand when and how to use the tool We create an Azure OpenAI chat client with credential authentication The ChatAgent is configured with domain-specific instructions and the tools parameter add_agent_framework_fastapi_endpoint automatically handles SSE streaming and tool execution The server exposes the agent at the /chat endpoint Note: This example uses Azure OpenAI, but AG-UI works with any chat model. You can also integrate with Azure AI Foundry's model catalog or use other LLM providers. Tool calling is supported by most modern LLMs including GPT-4, GPT-4o, and Claude models. To run this server: # Set your Azure OpenAI credentials export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" export AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4o" # Start the server python server.py With your server running and exposing the AG-UI endpoint, the next step is building a client that can connect and consume the event stream. Streaming Results to Clients With the server running, clients can connect and stream events as the agent processes requests. Here's a Python client that demonstrates the streaming capabilities: # client.py import asyncio import os from dotenv import load_dotenv from agent_framework import ChatAgent, FunctionCallContent, FunctionResultContent from agent_framework_ag_ui import AGUIChatClient # Load environment variables from .env file load_dotenv() async def interactive_chat(): """Interactive chat session with streaming responses.""" # Connect to the AG-UI server base_url = os.getenv("AGUI_SERVER_URL", "http://localhost:8000/chat") print(f"Connecting to: {base_url}\n") # Initialize the AG-UI client client = AGUIChatClient(endpoint=base_url) # Create a local agent representation agent = ChatAgent(chat_client=client) # Start a new conversation thread conversation_thread = agent.get_new_thread() print("Chat started! Type 'exit' or 'quit' to end the session.\n") try: while True: # Collect user input user_message = input("You: ") # Handle empty input if not user_message.strip(): print("Please enter a message.\n") continue # Check for exit commands if user_message.lower() in ["exit", "quit", "bye"]: print("\nGoodbye!") break # Stream the agent's response print("Agent: ", end="", flush=True) # Track tool calls to avoid duplicate prints seen_tools = set() async for update in agent.run_stream(user_message, thread=conversation_thread): # Display text content if update.text: print(update.text, end="", flush=True) # Display tool calls and results for content in update.contents: if isinstance(content, FunctionCallContent): # Only print each tool call once if content.call_id not in seen_tools: seen_tools.add(content.call_id) print(f"\n[Calling tool: {content.name}]", flush=True) elif isinstance(content, FunctionResultContent): # Only print each result once result_id = f"result_{content.call_id}" if result_id not in seen_tools: seen_tools.add(result_id) result_text = content.result if isinstance(content.result, str) else str(content.result) print(f"[Tool result: {result_text}]", flush=True) print("\n") # New line after response completes except KeyboardInterrupt: print("\n\nChat interrupted by user.") except ConnectionError as e: print(f"\nConnection error: {e}") print("Make sure the server is running.") except Exception as e: print(f"\nUnexpected error: {e}") def main(): """Entry point for the AG-UI client.""" asyncio.run(interactive_chat()) if __name__ == "__main__": main() Key features: The client connects to the AG-UI endpoint using AGUIChatClient with the endpoint parameter run_stream() yields updates containing text and content as they arrive Tool calls are detected using FunctionCallContent and displayed with [Calling tool: ...] Tool results are detected using FunctionResultContent and displayed with [Tool result: ...] Deduplication logic (seen_tools set) prevents printing the same tool call multiple times as it streams Thread management maintains conversation context across messages Graceful error handling for connection issues To use the client: # Optional: specify custom server URL export AGUI_SERVER_URL="http://localhost:8000/chat" # Start the interactive chat python client.py Example Session: Connecting to: http://localhost:8000/chat Chat started! Type 'exit' or 'quit' to end the session. You: What's the status of order ORD-001? Agent: [Calling tool: get_order_status] [Tool result: {"status": "shipped", "tracking": "1Z999AA1", "eta": "Jan 25, 2026"}] Your order ORD-001 has been shipped! - Tracking Number: 1Z999AA1 - Estimated Delivery Date: January 25, 2026 You can use the tracking number to monitor the delivery progress. You: Can you check ORD-002? Agent: [Calling tool: get_order_status] [Tool result: {"status": "processing", "tracking": null, "eta": "Jan 23, 2026"}] Your order ORD-002 is currently being processed. - Status: Processing - Estimated Delivery: January 23, 2026 Your order should ship soon, and you'll receive a tracking number once it's on the way. You: exit Goodbye! The client we just built handles events at a high level, abstracting away the details. But what's actually flowing through that SSE connection? Let's peek under the hood. Event Types You'll See As the server streams back responses, clients receive a series of structured events. If you were to observe the raw SSE stream (e.g., using curl), you'd see events like: curl -N http://localhost:8000/chat \ -H "Content-Type: application/json" \ -H "Accept: text/event-stream" \ -d '{"messages": [{"role": "user", "content": "What'\''s the status of order ORD-001?"}]}' Sample event stream (with tool calling): data: {"type":"RUN_STARTED","threadId":"eb4d9850-14ef-446c-af4b-23037acda9e8","runId":"chatcmpl-xyz"} data: {"type":"TEXT_MESSAGE_START","messageId":"e8648880-a9ff-4178-a17d-4a6d3ec3d39c","role":"assistant"} data: {"type":"TOOL_CALL_START","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","toolCallName":"get_order_status","parentMessageId":"e8648880-a9ff-4178-a17d-4a6d3ec3d39c"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"{\""} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"order"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"_id"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"\":\""} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"ORD"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"-"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"001"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"\"}"} data: {"type":"TOOL_CALL_END","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y"} data: {"type":"TOOL_CALL_RESULT","messageId":"f048cb0a-a049-4a51-9403-a05e4820438a","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","content":"{\"status\": \"shipped\", \"tracking\": \"1Z999AA1\", \"eta\": \"Jan 25, 2026\"}","role":"tool"} data: {"type":"TEXT_MESSAGE_START","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","role":"assistant"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"Your"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" order"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" ORD"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"-"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"001"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" has"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" been"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" shipped"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"!"} ... (additional TEXT_MESSAGE_CONTENT events streaming the response) ... data: {"type":"TEXT_MESSAGE_END","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf"} data: {"type":"RUN_FINISHED","threadId":"eb4d9850-14ef-446c-af4b-23037acda9e8","runId":"chatcmpl-xyz"} Understanding the flow: RUN_STARTED - Agent begins processing the request TEXT_MESSAGE_START - First message starts (will contain tool calls) TOOL_CALL_START - Agent invokes the get_order_status tool Multiple TOOL_CALL_ARGS events - Arguments stream incrementally as JSON chunks ({"order_id":"ORD-001"}) TOOL_CALL_END - Tool invocation structure complete TOOL_CALL_RESULT - Tool execution finished with result data TEXT_MESSAGE_START - Second message starts (the final response) Multiple TEXT_MESSAGE_CONTENT events - Response text streams word-by-word TEXT_MESSAGE_END - Response message complete RUN_FINISHED - Entire run completed successfully This granular event model enables rich UI experiences - showing tool execution indicators ("Searching...", "Calculating..."), displaying intermediate results, and providing complete transparency into the agent's reasoning process. Seeing the raw events helps, but truly working with AG-UI requires a shift in how you think about agent interactions. Let's explore this conceptual change. The Mental Model Shift Traditional API Thinking # Imperative: Call and wait response = agent.run("What's 2+2?") print(response) # "The answer is 4" Mental model: Function call with return value AG-UI Thinking # Reactive: Subscribe to events async for event in agent.run_stream("What's 2+2?"): match event.type: case "RUN_STARTED": show_loading() case "TEXT_MESSAGE_CONTENT": display_chunk(event.delta) case "RUN_FINISHED": hide_loading() Mental model: Observable stream of events This shift feels similar to: Moving from synchronous to async code Moving from REST to event-driven architecture Moving from polling to pub/sub This mental shift isn't just philosophical - it unlocks concrete benefits that weren't possible with request/response patterns. What You Gain Observability # You can SEE what the agent is doing TOOL_CALL_START: "get_order_status" TOOL_CALL_ARGS: {"order_id": "ORD-001"} TOOL_CALL_RESULT: {"status": "shipped", "tracking": "1Z999AA1", "eta": "Jan 25, 2026"} TEXT_MESSAGE_START: "Your order ORD-001 has been shipped..." Interruptibility # Future: Cancel long-running operations async for event in agent.run_stream(query): if user_clicked_cancel: await agent.cancel(thread_id, run_id) break Transparency # Users see the reasoning process "Looking up order ORD-001..." "Order found: Status is 'shipped'" "Retrieving tracking information..." "Your order has been shipped with tracking number 1Z999AA1..." To put these benefits in context, here's how AG-UI compares to traditional approaches across key dimensions: AG-UI vs. Traditional Approaches Aspect Traditional REST Custom Streaming AG-UI Connection Model Request/Response Varies Server-Sent Events State Management Manual Manual Protocol-managed Tool Calling Invisible Custom format Standardized events Framework Varies Framework-locked Framework-agnostic Browser Support Universal Varies Universal Implementation Simple Complex Moderate Ecosystem N/A Isolated Growing You've now seen AG-UI's design principles, implementation details, and conceptual foundations. But the most important question remains: should you actually use it? Conclusion: Is AG-UI Right for Your Project? AG-UI represents a shift toward standardized, observable agent interactions. Before adopting it, understand where the protocol stands and whether it fits your needs. Protocol Maturity The protocol is stable enough for production use but still evolving: Ready now: Core specification stable, Microsoft Agent Framework integration available, FastAPI/Python implementation mature, basic streaming and threading work reliably. Choose AG-UI If You Building new agent projects - No legacy API to maintain, want future compatibility with emerging ecosystem Need streaming observability - Multi-step workflows where users benefit from seeing each stage of execution Want framework flexibility - Same client code works with any AG-UI-compliant backend Comfortable with evolving standards - Can adapt to protocol changes as it matures Stick with Alternatives If You Have working solutions - Custom streaming working well, migration cost not justified Need guaranteed stability - Mission-critical systems where breaking changes are unacceptable Build simple agents - Single-step request/response without tool calling or streaming needs Risk-averse environment - Large existing implementations where proven approaches are required Beyond individual project decisions, it's worth considering AG-UI's role in the broader ecosystem. The Bigger Picture While this blog post focused on Microsoft Agent Framework, AG-UI's true power lies in its broader mission: creating a common language for agent-UI communication across the entire ecosystem. As more frameworks adopt it, the real value emerges: write your UI once, work with any compliant agent framework. Think of it like GraphQL for APIs or OpenAPI for REST - a standardization layer that benefits the entire ecosystem. The protocol is young, but the problem it solves is real. Whether you adopt it now or wait for broader adoption, understanding AG-UI helps you make informed architectural decisions for your agent applications. Ready to dive deeper? Here are the official resources to continue your AG-UI journey. Resources AG-UI & Microsoft Agent Framework Getting Started with AG-UI (Microsoft Learn) - Official tutorial AG-UI Integration Overview - Architecture and concepts AG-UI Protocol Specification - Official protocol documentation Backend Tool Rendering - Adding function tools Security Considerations - Production security guidance Microsoft Agent Framework Documentation - Framework overview AG-UI Dojo Examples - Live demonstrations UI Components & Integration CopilotKit for Microsoft Agent Framework - React component library Community & Support Microsoft Q&A - Community support Agent Framework GitHub - Source code and issues Related Technologies Azure AI Foundry Documentation - Azure AI platform FastAPI Documentation - Web framework Server-Sent Events (SSE) Specification - Protocol standard This blog post introduces AG-UI with Microsoft Agent Framework, focusing on fundamental concepts and building your first interactive agent application.Agents League: Two Weeks, Three Tracks, One Challenge
We're inviting all developers to join Agents League, running February 16-27. It's a two-week challenge where you'll build AI agents using production-ready tools, learn from live coding sessions, and get feedback directly from Microsoft product teams. We've put together starter kits for each track to help you get up and running quickly that also includes requirements and guidelines. Whether you want to explore what GitHub Copilot can do beyond autocomplete, build reasoning agents on Microsoft Foundry, or create enterprise integrations for Microsoft 365 Copilot, we have a track for you. Important: Register first to be eligible for prizes and your digital badge. Without registration, you won't qualify for awards or receive a badge when you submit. What Is Agents League? It's a 2-week competition that combines learning with building: 📽️ Live coding battles – Watch Product teams, MVPs and community members tackle challenges in real-time on Microsoft Reactor 💻 Async challenges – Build at your own pace, on your schedule 💬 Discord community – Connect with other participants, join AMAs, and get help when you need it 🏆 Prizes – $500 per track winner, plus GitHub Copilot Pro subscriptions for top picks The Three Tracks 🎨 Creative Apps — Build with GitHub Copilot (Chat, CLI, or SDK) 🧠 Reasoning Agents — Build with Microsoft Foundry 💼 Enterprise Agents — Build with M365 Agents Toolkit (or Copilot Studio) More details on each track below, or jump straight to the starter kits. The Schedule Agents League starts on February 16th and runs through Feburary 27th. Within 2 weeks, we host live battles on Reactor and AMA sessions on Discord. Week 1: Live Battles (Feb 17-19) We're kicking off with live coding battles streamed on Microsoft Reactor. Watch experienced developers compete in real-time, explaining their approach and architectural decisions as they go. Tue Feb 17, 9 AM PT — 🎨 Creative Apps battle Wed Feb 18, 9 AM PT — 🧠 Reasoning Agents battle Thu Feb 19, 9 AM PT — 💼 Enterprise Agents battle All sessions are recorded, so you can watch on your own schedule. Week 2: Build + AMAs (Feb 24-26) This is your time to build and ask questions on Discord. The async format means you work when it suits you, evenings, weekends, whatever fits your schedule. We're also hosting AMAs on Discord where you can ask questions directly to Microsoft experts and product teams: Tue Feb 24, 9 AM PT — 🎨 Creative Apps AMA Wed Feb 25, 9 AM PT — 🧠 Reasoning Agents AMA Thu Feb 26, 9 AM PT — 💼 Enterprise Agents AMA Bring your questions, get help when you're stuck, and share what you're building with the community. Pick Your Track We've created a starter kit for each track with setup guides, project ideas, and example scenarios to help you get started quickly. 🎨 Creative Apps Tool: GitHub Copilot (Chat, CLI, or SDK) Build innovative, imaginative applications that showcase the potential of AI-assisted development. All application types are welcome, web apps, CLI tools, games, mobile apps, desktop applications, and more. The starter kit walks you through GitHub Copilot's different modes and provides prompting tips to get the best results. View the Creative Apps starter kit. 🧠 Reasoning Agents Tool: Microsoft Foundry (UI or SDK) and/or Microsoft Agent Framework Build a multi-agent system that leverages advanced reasoning capabilities to solve complex problems. This track focuses on agents that can plan, reason through multi-step problems, and collaborate. The starter kit includes architecture patterns, reasoning strategies (planner-executor, critic/verifier, self-reflection), and integration guides for tools and MCP servers. View the Reasoning Agents starter kit. 💼 Enterprise Agents Tool: M365 Agents Toolkit or Copilot Studio Create intelligent agents that extend Microsoft 365 Copilot to address real-world enterprise scenarios. Your agent must work on Microsoft 365 Copilot Chat. Bonus points for: MCP server integration, OAuth security, Adaptive Cards UI, connected agents (multi-agent architecture). View the Enterprise Agents starter kit. Prizes & Recognition To be eligible for prizes and your digital badge, you must register before submitting your project. Category Winners ($500 each): 🎨 Creative Apps winner 🧠 Reasoning Agents winner 💼 Enterprise Agents winner GitHub Copilot Pro subscriptions: Community Favorite (voted by participants on Discord) Product Team Picks (selected by Microsoft product teams) Everyone who registers and submits a project wins: A digital badge to showcase their participation. Beyond the prizes, every participant gets feedback from the teams who built these tools, a valuable opportunity to learn and improve your approach to AI agent development. How to Get Started Register first — This is required to be eligible for prizes and to receive your digital badge. Without registration, your submission won't qualify for awards or a badge. Pick a track — Choose one track. Explore the starter kits to help you decide. Watch the battles — See how experienced developers approach these challenges. Great for learning even if you're still deciding whether to compete. Build your project — You have until Feb 27. Work on your own schedule. Submit via GitHub — Open an issue using the project submission template. Join us on Discord — Get help, share your progress, and vote for your favorite projects on Discord. Links Register: https://aka.ms/agentsleague/register Starter Kits: https://github.com/microsoft/agentsleague/starter-kits Discord: https://aka.ms/agentsleague/discord Live Battles: https://aka.ms/agentsleague/battles Submit Project: Project submission templateBenchmarking Local AI Models
Introduction Selecting the right AI model for your application requires more than reading benchmark leaderboards. Published benchmarks measure academic capabilities, question answering, reasoning, coding, but your application has specific requirements: latency budgets, hardware constraints, quality thresholds. How do you know if Phi-4 provides acceptable quality for your document summarization use case? Will Qwen2.5-0.5B meet your 100ms response time requirement? Does your edge device have sufficient memory for Phi-3.5 Mini? The answer lies in empirical testing: running actual models on your hardware with your workload patterns. This article demonstrates building a comprehensive model benchmarking platform using FLPerformance, Node.js, React, and Microsoft Foundry Local. You'll learn how to implement scientific performance measurement, design meaningful benchmark suites, visualize multi-dimensional comparisons, and make data-driven model selection decisions. Whether you're evaluating models for production deployment, optimizing inference costs, or validating hardware specifications, this platform provides the tools for rigorous performance analysis. Why Model Benchmarking Requires Purpose-Built Tools You cannot assess model performance by running a few manual tests and noting the results. Scientific benchmarking demands controlled conditions, statistically significant sample sizes, multi-dimensional metrics, and reproducible methodology. Understand why purpose-built tooling is essential. Performance is multi-dimensional. A model might excel at throughput (tokens per second) but suffer at latency (time to first token). Another might generate high-quality outputs slowly. Your application might prioritize consistency over average performance, a model with variable response times (high p95/p99 latency) creates poor user experiences even if averages look good. Measuring all dimensions simultaneously enables informed tradeoffs. Hardware matters enormously. Benchmark results from NVIDIA A100 GPUs don't predict performance on consumer laptops. NPU acceleration changes the picture again. Memory constraints affect which models can even load. Test on your actual deployment hardware or comparable specifications to get actionable results. Concurrency reveals bottlenecks. A model handling one request excellently might struggle with ten concurrent requests. Real applications experience variable load, measuring only single-threaded performance misses critical scalability constraints. Controlled concurrency testing reveals these limits. Statistical rigor prevents false conclusions. Running a prompt once and noting the response time tells you nothing about performance distribution. Was this result typical? An outlier? You need dozens or hundreds of trials to establish p50/p95/p99 percentiles, understand variance, and detect stability issues. Comparison requires controlled experiments. Different prompts, different times of day, different system loads, all introduce confounding variables. Scientific comparison runs identical workloads across models sequentially, controlling for external factors. Architecture: Three-Layer Performance Testing Platform FLPerformance implements a clean separation between orchestration, measurement, and presentation: The frontend React application provides model management, benchmark configuration, test execution, and results visualization. Users add models from the Foundry Local catalog, configure benchmark parameters (iterations, concurrency, timeout values), launch test runs, and view real-time progress. The results dashboard displays comparison tables, latency distribution charts, throughput graphs, and "best model for..." recommendations. The backend Node.js/Express server orchestrates tests and captures metrics. It manages the single Foundry Local service instance, loads/unloads models as needed, executes benchmark suites with controlled concurrency, measures comprehensive metrics (TTFT, TPOT, total latency, throughput, error rates), and persists results to JSON storage. WebSocket connections provide real-time progress updates during long benchmark runs. Foundry Local SDK integration uses the official foundry-local-sdk npm package. The SDK manages service lifecycle, starting, stopping, health checkin, and handles model operations, downloading, loading into memory, unloading. It provides OpenAI-compatible inference APIs for consistent request formatting across models. The architecture supports simultaneous testing of multiple models by loading them one at a time, running identical benchmarks, and aggregating results for comparison: User Initiates Benchmark Run ↓ Backend receives {models: [...], suite: "default", iterations: 10} ↓ For each model: 1. Load model into Foundry Local 2. Execute benchmark suite - For each prompt in suite: * Run N iterations * Measure TTFT, TPOT, total time * Track errors and timeouts * Calculate tokens/second 3. Aggregate statistics (mean, p50, p95, p99) 4. Unload model ↓ Store results with metadata ↓ Return comparison data to frontend ↓ Visualize performance metrics Implementing Scientific Measurement Infrastructure Accurate performance measurement requires instrumentation that captures multiple dimensions without introducing measurement overhead: // src/server/benchmark.js import { performance } from 'perf_hooks'; export class BenchmarkExecutor { constructor(foundryClient, options = {}) { this.client = foundryClient; this.options = { iterations: options.iterations || 10, concurrency: options.concurrency || 1, timeout_ms: options.timeout_ms || 30000, warmup_iterations: options.warmup_iterations || 2 }; } async runBenchmarkSuite(modelId, prompts) { const results = []; // Warmup phase (exclude from results) console.log(`Running ${this.options.warmup_iterations} warmup iterations...`); for (let i = 0; i < this.options.warmup_iterations; i++) { await this.executePrompt(modelId, prompts[0].text); } // Actual benchmark runs for (const prompt of prompts) { console.log(`Benchmarking prompt: ${prompt.id}`); const measurements = []; for (let i = 0; i < this.options.iterations; i++) { const measurement = await this.executeMeasuredPrompt( modelId, prompt.text ); measurements.push(measurement); // Small delay between iterations to stabilize await sleep(100); } results.push({ prompt_id: prompt.id, prompt_text: prompt.text, measurements, statistics: this.calculateStatistics(measurements) }); } return { model_id: modelId, timestamp: new Date().toISOString(), config: this.options, results }; } async executeMeasuredPrompt(modelId, promptText) { const measurement = { success: false, error: null, ttft_ms: null, // Time to first token tpot_ms: null, // Time per output token total_ms: null, tokens_generated: 0, tokens_per_second: 0 }; try { const startTime = performance.now(); let firstTokenTime = null; let tokenCount = 0; // Streaming completion to measure TTFT const stream = await this.client.chat.completions.create({ model: modelId, messages: [{ role: 'user', content: promptText }], max_tokens: 200, temperature: 0.7, stream: true }); for await (const chunk of stream) { if (chunk.choices[0]?.delta?.content) { if (firstTokenTime === null) { firstTokenTime = performance.now(); measurement.ttft_ms = firstTokenTime - startTime; } tokenCount++; } } const endTime = performance.now(); measurement.total_ms = endTime - startTime; measurement.tokens_generated = tokenCount; if (tokenCount > 1 && firstTokenTime) { // TPOT = time after first token / (tokens - 1) const timeAfterFirstToken = endTime - firstTokenTime; measurement.tpot_ms = timeAfterFirstToken / (tokenCount - 1); measurement.tokens_per_second = 1000 / measurement.tpot_ms; } measurement.success = true; } catch (error) { measurement.error = error.message; measurement.success = false; } return measurement; } calculateStatistics(measurements) { const successful = measurements.filter(m => m.success); const total = measurements.length; if (successful.length === 0) { return { success_rate: 0, error_rate: 1.0, sample_size: total }; } const ttfts = successful.map(m => m.ttft_ms).sort((a, b) => a - b); const tpots = successful.map(m => m.tpot_ms).filter(v => v !== null).sort((a, b) => a - b); const totals = successful.map(m => m.total_ms).sort((a, b) => a - b); const throughputs = successful.map(m => m.tokens_per_second).filter(v => v > 0); return { success_rate: successful.length / total, error_rate: (total - successful.length) / total, sample_size: total, ttft: { mean: mean(ttfts), median: percentile(ttfts, 50), p95: percentile(ttfts, 95), p99: percentile(ttfts, 99), min: Math.min(...ttfts), max: Math.max(...ttfts) }, tpot: tpots.length > 0 ? { mean: mean(tpots), median: percentile(tpots, 50), p95: percentile(tpots, 95) } : null, total_latency: { mean: mean(totals), median: percentile(totals, 50), p95: percentile(totals, 95), p99: percentile(totals, 99) }, throughput: { mean_tps: mean(throughputs), median_tps: percentile(throughputs, 50) } }; } } function mean(arr) { return arr.reduce((sum, val) => sum + val, 0) / arr.length; } function percentile(sortedArr, p) { const index = Math.ceil((sortedArr.length * p) / 100) - 1; return sortedArr[Math.max(0, index)]; } function sleep(ms) { return new Promise(resolve => setTimeout(resolve, ms)); } This measurement infrastructure captures: Time to First Token (TTFT): Critical for perceived responsiveness—users notice delays before output begins Time Per Output Token (TPOT): Determines generation speed after first token—affects throughput Total latency: End-to-end time—matters for batch processing and high-volume scenarios Tokens per second: Overall throughput metric—useful for capacity planning Statistical distributions: Mean alone masks variability—p95/p99 reveal tail latencies that impact user experience Success/error rates: Stability metrics—some models timeout or crash under load Designing Meaningful Benchmark Suites Benchmark quality depends on prompt selection. Generic prompts don't reflect real application behavior. Design suites that mirror actual use cases: // benchmarks/suites/default.json { "name": "default", "description": "General-purpose benchmark covering diverse scenarios", "prompts": [ { "id": "short-factual", "text": "What is the capital of France?", "category": "factual", "expected_tokens": 5 }, { "id": "medium-explanation", "text": "Explain how photosynthesis works in 3-4 sentences.", "category": "explanation", "expected_tokens": 80 }, { "id": "long-reasoning", "text": "Analyze the economic factors that led to the 2008 financial crisis. Discuss at least 5 major causes with supporting details.", "category": "reasoning", "expected_tokens": 250 }, { "id": "code-generation", "text": "Write a Python function that finds the longest palindrome in a string. Include docstring and example usage.", "category": "coding", "expected_tokens": 150 }, { "id": "creative-writing", "text": "Write a short story (3 paragraphs) about a robot learning to paint.", "category": "creative", "expected_tokens": 200 } ] } This suite covers multiple dimensions: Length variation: Short (5 tokens), medium (80), long (250)—tests models across output ranges Task diversity: Factual recall, explanation, reasoning, code, creative—reveals capability breadth Token predictability: Expected token counts enable throughput calculations For production applications, create custom suites matching your actual workload: { "name": "customer-support", "description": "Simulates actual customer support queries", "prompts": [ { "id": "product-question", "text": "How do I reset my password for the customer portal?" }, { "id": "troubleshooting", "text": "I'm getting error code 503 when trying to upload files. What should I do?" }, { "id": "policy-inquiry", "text": "What is your refund policy for annual subscriptions?" } ] } Visualizing Multi-Dimensional Performance Comparisons Raw numbers don't reveal insights—visualization makes patterns obvious. The frontend implements several comparison views: Comparison Table shows side-by-side metrics: // frontend/src/components/ResultsTable.jsx export function ResultsTable({ results }) { return ( {results.map(result => ( ))} Model TTFT (ms) TPOT (ms) Throughput (tok/s) P95 Latency Error Rate {result.model_id} {result.stats.ttft.median.toFixed(0)} (p95: {result.stats.ttft.p95.toFixed(0)}) {result.stats.tpot?.median.toFixed(1) || 'N/A'} {result.stats.throughput.median_tps.toFixed(1)} {result.stats.total_latency.p95.toFixed(0)} ms 0.05 ? 'error' : 'success'}> {(result.stats.error_rate * 100).toFixed(1)}% ); } Latency Distribution Chart reveals performance consistency: // Using Chart.js for visualization export function LatencyChart({ results }) { const data = { labels: results.map(r => r.model_id), datasets: [ { label: 'Median (p50)', data: results.map(r => r.stats.total_latency.median), backgroundColor: 'rgba(75, 192, 192, 0.5)' }, { label: 'p95', data: results.map(r => r.stats.total_latency.p95), backgroundColor: 'rgba(255, 206, 86, 0.5)' }, { label: 'p99', data: results.map(r => r.stats.total_latency.p99), backgroundColor: 'rgba(255, 99, 132, 0.5)' } ] }; return ( ); } Recommendations Engine synthesizes multi-dimensional comparison: export function generateRecommendations(results) { const recommendations = []; // Find fastest TTFT (best perceived responsiveness) const fastestTTFT = results.reduce((best, r) => r.stats.ttft.median < best.stats.ttft.median ? r : best ); recommendations.push({ category: 'Fastest Response', model: fastestTTFT.model_id, reason: `Lowest median TTFT: ${fastestTTFT.stats.ttft.median.toFixed(0)}ms` }); // Find highest throughput const highestThroughput = results.reduce((best, r) => r.stats.throughput.median_tps > best.stats.throughput.median_tps ? r : best ); recommendations.push({ category: 'Best Throughput', model: highestThroughput.model_id, reason: `Highest tok/s: ${highestThroughput.stats.throughput.median_tps.toFixed(1)}` }); // Find most consistent (lowest p95-p50 spread) const mostConsistent = results.reduce((best, r) => { const spread = r.stats.total_latency.p95 - r.stats.total_latency.median; const bestSpread = best.stats.total_latency.p95 - best.stats.total_latency.median; return spread < bestSpread ? r : best; }); recommendations.push({ category: 'Most Consistent', model: mostConsistent.model_id, reason: 'Lowest latency variance (p95-p50 spread)' }); return recommendations; } Key Takeaways and Benchmarking Best Practices Effective model benchmarking requires scientific methodology, comprehensive metrics, and application-specific testing. FLPerformance demonstrates that rigorous performance measurement is accessible to any development team. Critical principles for model evaluation: Test on target hardware: Results from cloud GPUs don't predict laptop performance Measure multiple dimensions: TTFT, TPOT, throughput, consistency all matter Use statistical rigor: Single runs mislead—capture distributions with adequate sample sizes Design realistic workloads: Generic benchmarks don't predict your application's behavior Include warmup iterations: Model loading and JIT compilation affect early measurements Control concurrency: Real applications handle multiple requests—test at realistic loads Document methodology: Reproducible results require documented procedures and configurations The complete benchmarking platform with model management, measurement infrastructure, visualization dashboards, and comprehensive documentation is available at github.com/leestott/FLPerformance. Clone the repository and run the startup script to begin evaluating models on your hardware. Resources and Further Reading FLPerformance Repository - Complete benchmarking platform Quick Start Guide - Setup and first benchmark run Microsoft Foundry Local Documentation - SDK reference and model catalog Architecture Guide - System design and SDK integration Benchmarking Best Practices - Methodology and troubleshootingLeverage AI for faster, more productive coding with GitHub Copilot
GitHub Copilot serves as an invaluable learning tool for developers, especially those who are still learning a particular programming language or framework. By providing context-aware suggestions, it helps learners understand the syntax, structure, and logic behind different code snippets. While fundamental knowledge of programming concepts and syntax services should remain a prerequisite, a developer exploring a new framework would be able to describe the functionality they want to implement, and GitHub Copilot will generate code that demonstrates how to achieve it. This accelerates the learning process and empowers developers to gain proficiency in new technologies more efficiently.AI Upskilling Framework Level 3 Building
The Global AI Community is excited to bring you the latest updates on AI Upskilling Framework Level 3 Building, straight from Microsoft Ignite! This session dives deep into advanced concepts for building agentic workflows and showcases new announcements that will help developers accelerate their Agentic AI journey.Azure Skilling at Microsoft Ignite 2025
The energy at Microsoft Ignite was unmistakable. Developers, architects, and technical decision-makers converged in San Francisco to explore the latest innovations in cloud technology, AI applications, and data platforms. Beyond the keynotes and product announcements was something even more valuable: an integrated skilling ecosystem designed to transform how you build with Azure. This year Azure Skilling at Microsoft Ignite 2025 brought together distinct learning experiences, over 150+ hands-on labs, and multiple pathways to industry-recognized credentials—all designed to help you master skills that matter most in today's AI-driven cloud landscape. Just Launched at Ignite Microsoft Ignite 2025 offered an exceptional array of learning opportunities, each designed to meet developers anywhere on the skilling journey. Whether you joined us in-person or on-demand in the virtual experience, multiple touchpoints are available to deepen your Azure expertise. Ignite 2025 is in the books, but you can still engage with the latest Microsoft skilling opportunities, including: The Azure Skills Challenge provides a gamified learning experience that lets you compete while completing task-based achievements across Azure's most critical technologies. These challenges aren't just about badges and bragging rights—they're carefully designed to help you advance technical skills and prepare for Microsoft role-based certifications. The competitive element adds urgency and motivation, turning learning into an engaging race against the clock and your peers. For those seeking structured guidance, Plans on Learn offer curated sets of content designed to help you achieve specific learning outcomes. These carefully assembled learning journeys include built-in milestones, progress tracking, and optional email reminders to keep you on track. Each plan represents 12-15 hours of focused learning, taking you from concept to capability in areas like AI application development, data platform modernization, or infrastructure optimization. The Microsoft Reactor Azure Skilling Series, running December 3-11, brings skilling to life through engaging video content, mixing regular programming with special Ignite-specific episodes. This series will deliver technical readiness and programming guidance in a livestream presentation that's more digestible than traditional documentation. Whether you're catching episodes live with interactive Q&A or watching on-demand later, you’ll get world-class instruction that makes complex topics approachable. Beyond Ignite: Your Continuous Learning Journey Here's the critical insight that separates Ignite attendees who transform their careers from those who simply collect swag: the real learning begins after the event ends. Microsoft Ignite is your launchpad, not your destination. Every module you start, every lab you complete, and every challenge you tackle connects to a comprehensive learning ecosystem on Microsoft Learn that's available 24/7, 365 days a year. Think of Ignite as your intensive immersion experience—the moment when you gain context, build momentum, and identify the skills that will have the biggest impact on your work. What you do in the weeks and months following determines whether that momentum compounds into career-defining expertise or dissipates into business as usual. For those targeting career advancement through formal credentials, Microsoft Certifications, Applied Skills and AI Skills Navigator, provide globally recognized validation of your expertise. Applied Skills focus on scenario-based competencies, demonstrating that you can build and deploy solutions, not simply answer theoretical questions. Certifications cover role-based scenarios for developers, data engineers, AI engineers, and solution architects. The assessment experiences include performance-based testing in dedicated Azure tenants where you complete real configuration and development tasks. And finally, the NEW AI Skills Navigator is an agentic learning space, bringing together AI-powered skilling experiences and credentials in a single, unified experience with Microsoft, LinkedIn Learning and GitHub – all in one spot Why This Matters: The Competitive Context The cloud skills race is intensifying. While our competitors offer robust training and content, Microsoft's differentiation comes not from having more content—though our 1.4 million module completions last fiscal year and 35,000+ certifications awarded speak to scale—but from integration of services to orchestrate workflows. Only Microsoft offers a truly unified ecosystem where GitHub Copilot accelerates your development, Azure AI services power your applications, and Azure platform services deploy and scale your solutions—all backed by integrated skilling content that teaches you to maximize this connected experience. When you continue your learning journey after Ignite, you're not just accumulating technical knowledge. You're developing fluency in an integrated development environment that no competitor can replicate. You're learning to leverage AI-powered development tools, cloud-native architectures, and enterprise-grade security in ways that compound each other's value. This unified expertise is what transforms individual developers into force-multipliers for their organizations. Start Now, Build Momentum, Never Stop Microsoft Ignite 2025 offered the chance to compress months of learning into days of intensive, hands-on experience, but you can still take part through the on-demand videos, the Global Ignite Skills Challenge, visiting the GitHub repos for the /Ignite25 labs, the Reactor Azure Skilling Series, and the curated Plans on Learn provide multiple entry points regardless of your current skill level or preferred learning style. But remember: the developers who extract the most value from Ignite are those who treat the event as the beginning, not the culmination, of their learning journey. They join hackathons, contribute to GitHub repositories, and engage with the Azure community on Discord and technical forums. The question isn't whether you'll learn something valuable from Microsoft Ignite 2025-that's guaranteed. The question is whether you'll convert that learning into sustained momentum that compounds over months and years into career-defining expertise. The ecosystem is here. The content is ready. Your skilling journey doesn't end when Ignite does—it accelerates.3.7KViews0likes0CommentsAnnouncing Public Preview: AI Toolkit for GitHub Copilot Prompt-First Agent Development
This week at GitHub Universe, we’re announcing the Public Preview of the GitHub Copilot prompt-first agent development in the AI Toolkit for Visual Studio Code. With this release, building powerful AI agents is now simpler and faster - no need to wrestle with complex frameworks or orchestrators. Just start with natural language prompts and let GitHub Copilot guide you from concept to working agent code. Accelerate Agent Development in VS Code The AI Toolkit embeds agent development workflows directly into Visual Studio Code and GitHub Copilot, enabling you to transform ideas into production-ready agents within minutes. This unified experience empowers developers and product teams to: Select the best model for your agent scenario Build and orchestrate agents using Microsoft Agent Framework Trace agent behaviors Evaluate agent response quality Select the best model for your scenario Models are the foundation for building powerful agents. Using the AI Toolkit, you can already explore and experiment with a wide range of local and remote models. Copilot now recommends models tailored to your agent’s needs, helping you make informed choices quickly. Build and orchestrate agents Whether you’re creating a single agent or designing a multi-agent workflow, Copilot leverages the latest Microsoft Agent Framework to generate robust agent code. You can initiate agent creation with simple prompts and visualize workflows for greater clarity and control. Create a single agent using Copilot Create a multi-agent workflow using Copilot and visualize workflow execution Trace agent behaviors As agents become more sophisticated, understanding their actions is crucial. The AI Toolkit enables tracing via Copilot, collecting local traces and displaying detailed agent calls, all within VS Code. Evaluate agent response quality Copilot guides you through structured evaluation, recommending metrics and generating test datasets. Integrate evaluations into your CI/CD pipeline for continuous quality assurance and confident deployments. Get started and share feedback This release marks a significant step toward making AI agent development easier and more accessible in Visual Studio Code. Try out the AI Toolkit for Visual Studio Code, share your thoughts, and file issues and suggest features on our GitHub repo. Thank you for being a part of this journey with us!Study Buddy: Learning Data Science and Machine Learning with an AI Sidekick
If you've ever wished for a friendly companion to guide you through the world of data science and machine learning, you're not alone. As part of the "For Beginners" curriculum, I recently built a Study Buddy Agent, an AI-powered assistant designed to help learners explore data science interactively, intuitively, and joyfully. Why a Study Buddy? Learning something new can be overwhelming, especially when you're navigating complex topics like machine learning, statistics, or Python programming. The Study Buddy Agent is here to change that. It brings the curriculum to life by answering questions, offering explanations, and nudging learners toward deeper understanding, all in a conversational format. Think of it as your AI-powered lab partner: always available, never judgmental, and endlessly curious. Built with chatmodes, Powered by Purpose The agent lives inside a .chatmodes file in the https://github.com/microsoft/Data-Science-For-Beginners/blob/main/.github/chatmodes/study-mode.chatmode.md. This file defines how the agent behaves, what tone it uses, and how it interacts with learners. I designed it to be friendly, encouraging, and beginner-first—just like the curriculum itself. It’s not just about answering questions. The Study Buddy is trained to: Reinforce key concepts from the curriculum Offer hints and nudges when learners get stuck Encourage exploration and experimentation Celebrate progress and milestones What’s Under the Hood? The agent uses GitHub Copilot's chatmode, which allows developers to define custom behaviors for AI agents. By aligning the agent’s responses with the curriculum’s learning objectives, we ensure that learners stay on track while enjoying the flexibility of conversational learning. How You Can Use It YouTube Video here: Study Buddy - Data Science AI Sidekick Clone the repo: Head to the https://github.com/microsoft/Data-Science-For-Beginners and clone it locally or use Codespaces. Open the GitHub Copilot Chat, and select Study Buddy: This will activate the Study Buddy. Start chatting: Ask questions, explore topics, and let the agent guide you. What’s Next? This is just the beginning. I’m exploring ways to: Expand the agent to other beginner curriculums (Web Dev, AI, IoT) Integrate feedback loops so learners can shape the agent’s evolution Final Thoughts In my role, I believe learning should be inclusive, empowering, and fun. The Study Buddy Agent is a small step toward that vision, a way to make data science feel less like a mountain and more like a hike with a good friend. Try it out, share your feedback, and let’s keep building tools that make learning magical. Join us on Discord to share your feedback.Level up your Python Gen AI Skills from our free nine-part YouTube series!
Want to learn how to use generative AI models in your Python applications? We're putting on a series of nine live streams, in both English and Spanish, all about generative AI. We'll cover large language models, embedding models, vision models, introduce techniques like RAG, function calling, and structured outputs, and show you how to build Agents and MCP servers. Plus we'll talk about AI safety and evaluations, to make sure all your models and applications are producing safe outputs. 🔗 Register for the entire series. In addition to the live streams, you can also join a weekly office hours in our AI Discord to ask any questions that don't get answered in the chat. You can also scroll down to learn about each live stream and register for individual sessions. See you in the streams! 👋🏻 Large Language Models 7 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Join us for the first session in our Python + AI series! In this session, we'll talk about Large Language Models (LLMs), the models that power ChatGPT and GitHub Copilot. We'll use Python to interact with LLMs using popular packages like the OpenAI SDK and Langchain. We'll experiment with prompt engineering and few-shot examples to improve our outputs. We'll also show how to build a full stack app powered by LLMs, and explain the importance of concurrency and streaming for user-facing AI apps. Vector embeddings 8 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our second session of the Python + AI series, we'll dive into a different kind of model: the vector embedding model. A vector embedding is a way to encode a text or image as an array of floating point numbers. Vector embeddings make it possible to perform similarity search on many kinds of content. In this session, we'll explore different vector embedding models, like the OpenAI text-embedding-3 series, with both visualizations and Python code. We'll compare distance metrics, use quantization to reduce vector size, and try out multimodal embedding models. Retrieval Augmented Generation 9 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our fourth Python + AI session, we'll explore one of the most popular techniques used with LLMs: Retrieval Augmented Generation. RAG is an approach that sends context to the LLM so that it can provide well-grounded answers for a particular domain. The RAG approach can be used with many kinds of data sources like CSVs, webpages, documents, databases. In this session, we'll walk through RAG flows in Python, starting with a simple flow and culminating in a full-stack RAG application based on Azure AI Search. Vision models 14 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Our third stream in the Python + AI series is all about vision models! Vision models are LLMs that can accept both text and images, like GPT 4o and 4o-mini. You can use those models for image captioning, data extraction, question-answering, classification, and more! We'll use Python to send images to vision models, build a basic chat-on-images app, and build a multimodal search engine. Structured outputs 15 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our fifth stream of the Python + AI series, we'll discover how to get LLMs to output structured responses that adhere to a schema. In Python, all we need to do is define a @dataclass or a Pydantic BaseModel, and we get validated output that meets our needs perfectly. We'll focus on the structured outputs mode available in OpenAI models, but you can use similar techniques with other model providers. Our examples will demonstrate the many ways you can use structured responses, like entity extraction, classification, and agentic workflows. Quality and safety 16 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Now that we're more than halfway through our Python + AI series, we're covering a crucial topic: how to use AI safely, and how to evaluate the quality of AI outputs. There are multiple mitigation layers when working with LLMs: the model itself, a safety system on top, the prompting and context, and the application user experience. Our focus will be on Azure tools that make it easier to put safe AI systems into production. We'll show how to configure the Azure AI Content Safety system when working with Azure AI models, and how to handle those errors in Python code. Then we'll use the Azure AI Evaluation SDK to evaluate the safety and quality of the output from our LLM. Tool calling 21 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor Now that we're more than halfway through our Python + AI series, we're covering a crucial topic: how to use AI safely, and how to evaluate the quality of AI outputs. There are multiple mitigation layers when working with LLMs: the model itself, a safety system on top, the prompting and context, and the application user experience. Our focus will be on Azure tools that make it easier to put safe AI systems into production. We'll show how to configure the Azure AI Content Safety system when working with Azure AI models, and how to handle those errors in Python code. Then we'll use the Azure AI Evaluation SDK to evaluate the safety and quality of the output from our LLM. AI agents 22 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor For the penultimate session of our Python + AI series, we're building AI agents! We'll use many of the most popular Python AI agent frameworks: Langgraph, Semantic Kernel, Autogen, Pydantic AI, and more. Our agents will start simple and then ramp up in complexity, demonstrating different architectures like hand-offs, round-robin, supervisor, graphs, and ReAct. Model Context Protocol 23 October, 2025 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In the final session of our Python + AI series, we're diving into the hottest technology of 2025: MCP, Model Context Protocol. This open protocol makes it easy to extend AI agents and chatbots with custom functionality, to make them more powerful and flexible. We'll show how to use the official Python FastMCP SDK to build an MCP server running locally and consume that server from chatbots like GitHub Copilot. Then we'll build our own MCP client to consume the server. Finally, we'll discover how easy it is to point popular AI agent frameworks like Langgraph, Pydantic AI, and Semantic Kernel at MCP servers. With great power comes great responsibility, so we will briefly discuss the many security risks that come with MCP, both as a user and developer.