Building a Smart Building HVAC Digital Twin with AI Copilot Using Foundry Local
Introduction Building operations teams face a constant challenge: optimizing HVAC systems for energy efficiency while maintaining occupant comfort and air quality. Traditional building management systems display raw sensor data, temperatures, pressures, CO₂ levels—but translating this into actionable insights requires deep HVAC expertise. What if operators could simply ask "Why is the third floor so warm?" and get an intelligent answer grounded in real building state? This article demonstrates building a sample smart building digital twin with an AI-powered operations copilot, implemented using DigitalTwin, React, Three.js, and Microsoft Foundry Local. You'll learn how to architect physics-based simulators that model thermal dynamics, implement 3D visualizations of building systems, integrate natural language AI control, and design fault injection systems for testing and training. Whether you're building IoT platforms for commercial real estate, designing energy management systems, or implementing predictive maintenance for building automation, this sample provides proven patterns for intelligent facility operations. Why Digital Twins Matter for Building Operations Physical buildings generate enormous operational data but lack intelligent interpretation layers. A 50,000 square foot office building might have 500+ sensors streaming metrics every minute, zone temperatures, humidity levels, equipment runtimes, energy consumption. Traditional BMS (Building Management Systems) visualize this data as charts and gauges, but operators must manually correlate patterns, diagnose issues, and predict failures. Digital twins solve this through physics-based simulation coupled with AI interpretation. Instead of just displaying current temperature readings, a digital twin models thermal dynamics, heat transfer rates, HVAC response characteristics, occupancy impacts. 
When conditions deviate from expectations, the twin compares observed versus predicted states, identifying root causes. Layer AI on top, and operators get natural language explanations: "The conference room is 3 degrees too warm because the VAV damper is stuck at 40% open, reducing airflow by 60%." This application focuses on HVAC, the largest building energy consumer, typically 40-50% of total usage. Optimizing HVAC by just 10% through better controls can save thousands of dollars monthly while improving occupant satisfaction. The digital twin enables "what-if" scenarios before making changes: "What happens to energy consumption and comfort if we raise the cooling setpoint by 2 degrees during peak demand response events?" Architecture: Three-Tier Digital Twin System The application implements a clean three-tier architecture separating visualization, simulation, and state management: The frontend uses React with Three.js for 3D visualization. Users see an interactive 3D model of the three-floor building with color-coded zones indicating temperature and CO₂ levels. Click any equipment, AHUs, VAVs, chillers, to see detailed telemetry. The control panel enables adjusting setpoints, running simulation steps, and activating demand response scenarios. Real-time charts display KPIs: energy consumption, comfort compliance, air quality levels. The backend Node.js/Express server orchestrates simulation and state management. It maintains the digital twin state as JSON, the single source of truth for all equipment, zones, and telemetry. REST API endpoints handle control requests, simulation steps, and AI copilot queries. WebSocket connections push real-time updates to the frontend for live monitoring. The HVAC simulator implements physics-based models: 1R1C thermal models for zones, affinity laws for fan power, chiller COP calculations, CO₂ mass balance equations. Foundry Local provides AI copilot capabilities. 
The backend uses foundry-local-sdk to query locally running models. Natural language queries ("How's the lobby temperature?") are answered with the current building state as context. The copilot can explain anomalies, suggest optimizations, and even execute commands when explicitly requested.

Implementing Physics-Based HVAC Simulation

Accurate simulation requires modeling actual HVAC physics. The simulator implements several established building energy models:

    // backend/src/simulator/thermal-model.js
    class ZoneThermalModel {
      // 1R1C (one resistance, one capacitance) thermal model.
      // The balance below is in SI units (W, J/K, m³); it assumes temperatures
      // in °C, so convert first if the twin stores °F.
      static calculateTemperatureChange(zone, delta_t_seconds) {
        const RHO_AIR = 1.2; // Air density (kg/m³)
        const C_thermal = zone.volume * RHO_AIR * 1000; // Heat capacity of zone air (J/K)
        // Envelope resistance (K/W): area-specific R-value (m²·K/W) divided by area (m²)
        const R_thermal = zone.r_value / zone.envelope_area;

        // Internal heat gains (occupancy, equipment, lighting)
        const Q_internal = zone.occupancy * 100 + // 100W per person
          zone.equipment_load +
          zone.lighting_load;

        // Cooling/heating from HVAC
        const airflow_kg_s = zone.vav.airflow_cfm * 0.000471947 * RHO_AIR; // CFM to m³/s, then to kg/s
        const c_p_air = 1006; // Specific heat of air (J/kg·K)
        const Q_hvac = airflow_kg_s * c_p_air * (zone.vav.supply_temp - zone.temperature);

        // Envelope gains or losses (positive when outdoors is warmer)
        const Q_envelope = (zone.outdoor_temp - zone.temperature) / R_thermal;

        // Net energy balance
        const Q_net = Q_internal + Q_hvac + Q_envelope;

        // Temperature change: Q = C * dT/dt
        const dT = (Q_net / C_thermal) * delta_t_seconds;
        return zone.temperature + dT;
      }
    }

This model captures essential thermal dynamics while remaining computationally fast enough for real-time simulation. It accounts for internal heat generation from occupants and equipment, HVAC cooling/heating contributions, and heat exchange through the building envelope.
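To make the energy balance concrete, here is a minimal Python sketch of one explicit-Euler step of the same 1R1C balance, written entirely in SI units. The function name, parameter names, and example values are illustrative assumptions and are not taken from the sample's code.

```python
RHO_AIR = 1.2      # air density, kg/m^3
CP_AIR = 1006.0    # specific heat of air, J/(kg*K)


def zone_temperature_step(temp_c, outdoor_c, supply_c, airflow_m3_s,
                          volume_m3, resistance_k_per_w, internal_gain_w, dt_s):
    """Advance one zone temperature by a single explicit-Euler step (hypothetical helper)."""
    capacitance = volume_m3 * RHO_AIR * CP_AIR                        # J/K
    q_hvac = airflow_m3_s * RHO_AIR * CP_AIR * (supply_c - temp_c)    # W
    q_envelope = (outdoor_c - temp_c) / resistance_k_per_w            # W
    q_net = internal_gain_w + q_hvac + q_envelope                     # W
    return temp_c + (q_net / capacitance) * dt_s


# Illustrative values only: a 300 m^3 zone at 24 C, 13 C supply air, 32 C outdoors.
next_temp = zone_temperature_step(
    temp_c=24.0, outdoor_c=32.0, supply_c=13.0, airflow_m3_s=0.3,
    volume_m3=300.0, resistance_k_per_w=0.005, internal_gain_w=1500.0, dt_s=60.0,
)
print(round(next_temp, 3))
```

Because the capacitance here covers only the zone air, the time constant is short. An explicit step like this stays well behaved only while dt_s is small relative to that time constant, so coarse timesteps may need sub-stepping.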
The CO₂ model uses mass balance equations:

    class AirQualityModel {
      static calculateCO2Change(zone, delta_t_seconds) {
        // CO₂ generation from occupants: 0.0052 L/s per person at rest,
        // converted to m³/s (×1e-3) and expressed in ppm·m³/s (×1e6)
        const G_co2 = zone.occupancy * 0.0052 * 1e-3 * 1e6;

        // Outdoor air ventilation rate
        const V_oa = zone.vav.outdoor_air_cfm * 0.000471947; // CFM to m³/s

        // CO₂ concentration difference (indoor - outdoor), in ppm
        const delta_CO2 = zone.co2_ppm - 400; // Outdoor ~400ppm

        // Mass balance: dC/dt = (G - V*ΔC) / Volume, in ppm/s with zone.volume in m³
        const dCO2_dt = (G_co2 - V_oa * delta_CO2) / zone.volume;

        return zone.co2_ppm + (dCO2_dt * delta_t_seconds);
      }
    }

These models execute every simulation step, updating the entire building state:

    async function simulateStep(twin, timestep_minutes) {
      const delta_t = timestep_minutes * 60; // Convert to seconds

      // Update each zone
      for (const zone of twin.zones) {
        zone.temperature = ZoneThermalModel.calculateTemperatureChange(zone, delta_t);
        zone.co2_ppm = AirQualityModel.calculateCO2Change(zone, delta_t);
      }

      // Update equipment based on zone demands
      for (const vav of twin.vavs) {
        updateVAVOperation(vav, twin.zones);
      }
      for (const ahu of twin.ahus) {
        updateAHUOperation(ahu, twin.vavs);
      }
      updateChillerOperation(twin.chiller, twin.ahus);
      updateBoilerOperation(twin.boiler, twin.ahus);

      // Calculate system KPIs
      twin.kpis = calculateSystemKPIs(twin);

      // Detect alerts
      twin.alerts = detectAnomalies(twin);

      // Persist updated state
      await saveTwinState(twin);

      return twin;
    }

3D Visualization with React and Three.js

The frontend renders an interactive 3D building view that updates in real time as conditions change.
Using React Three Fiber simplifies Three.js integration with React's component model: // frontend/src/components/BuildingView3D.jsx import { Canvas } from '@react-three/fiber'; import { OrbitControls } from '@react-three/drei'; export function BuildingView3D({ twinState }) { return ( {/* Render building floors */} {twinState.zones.map(zone => ( selectZone(zone.id)} /> ))} {/* Render equipment */} {twinState.ahus.map(ahu => ( ))} ); } function ZoneMesh({ zone, onClick }) { const color = getTemperatureColor(zone.temperature, zone.setpoint); return ( ); } function getTemperatureColor(current, setpoint) { const deviation = current - setpoint; if (Math.abs(deviation) < 1) return '#00ff00'; // Green: comfortable if (Math.abs(deviation) < 3) return '#ffff00'; // Yellow: acceptable return '#ff0000'; // Red: uncomfortable } This visualization immediately shows building state at a glance, operators see "hot spots" in red, comfortable zones in green, and can click any area for detailed metrics. Integrating AI Copilot for Natural Language Control The AI copilot transforms building data into conversational insights. Instead of navigating multiple screens, operators simply ask questions: // backend/src/routes/copilot.js import { FoundryLocalClient } from 'foundry-local-sdk'; const foundry = new FoundryLocalClient({ endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT }); router.post('/api/copilot/chat', async (req, res) => { const { message } = req.body; // Load current building state const twin = await loadTwinState(); // Build context for AI const context = buildBuildingContext(twin); const completion = await foundry.chat.completions.create({ model: 'phi-4', messages: [ { role: 'system', content: `You are an HVAC operations assistant for a 3-floor office building. Current Building State: ${context} Answer questions about equipment status, comfort conditions, and energy usage. Provide specific, actionable information based on the current data. 
Do not speculate beyond provided information.` }, { role: 'user', content: message } ], temperature: 0.3, max_tokens: 300 }); res.json({ response: completion.choices[0].message.content, model: 'phi-4', timestamp: new Date().toISOString() }); }); function buildBuildingContext(twin) { const alerts = twin.alerts.filter(a => a.active); const zoneStatus = twin.zones.map(z => `${z.name}: ${z.temperature.toFixed(1)}°F (setpoint: ${z.setpoint}°F), CO₂: ${z.co2_ppm}ppm, ${z.occupancy} people` ); return ` Zones: ${zoneStatus.join('\n')} Active Alerts: ${alerts.length} ${alerts.map(a => `- ${a.severity}: ${a.message}`).join('\n')} Energy: ${twin.kpis.total_power_kw.toFixed(1)} kW Chiller: ${twin.chiller.percent_load}% load, ${twin.chiller.cop.toFixed(2)} COP `.trim(); } Example interactions: User: "Why is the third floor conference room warm?" AI: "The Executive Suite on Floor 3 is currently 75.2°F, 3 degrees above the 72°F setpoint. This is due to high occupancy (8 people) and VAV-F3-EXEC operating at reduced airflow (285 CFM vs 450 CFM capacity). AHU-2 is supplying adequate cool air at 55°F. Recommend checking VAV damper position—it may be partially stuck." User: "What's our current energy usage?" AI: "Total building power consumption is 47.3 kW. Breakdown: Chiller (24.1 kW, 51%), AHU fans (12.8 kW, 27%), Pumps (7.2 kW, 15%), Other (3.2 kW, 7%). Chiller is operating at 62% load with a COP of 4.8, which is efficient for current conditions." User: "Activate demand response level 2" AI: "I cannot directly control building systems for safety reasons. To activate DR Level 2, use the Control Panel and click 'Demand Response' → 'Level 2'. This will raise cooling setpoints by 3°F and reduce auxiliary loads, targeting 15% energy reduction." The AI provides grounded, specific answers citing actual equipment IDs and metrics. It refuses to directly execute control commands, instead guiding operators to explicit control interfaces, a critical safety pattern for building systems. 
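A refusal like the one in the last exchange can also be enforced outside the model. The Python sketch below shows one possible backstop: a deterministic check that routes control-style requests to guidance before any model call. This is an assumption for illustration, not something the sample is known to implement, and the verb list and function name are hypothetical.

```python
import re

# Hypothetical verb list; a real deployment would tune this to its own command vocabulary.
CONTROL_INTENT = re.compile(
    r"\b(activate|deactivate|set|raise|lower|override|enable|disable|"
    r"shut\s?down|turn\s+(on|off))\b",
    re.IGNORECASE,
)


def route_message(message: str) -> str:
    """Return 'guidance' for control-style requests, 'answer' for read-only questions."""
    return "guidance" if CONTROL_INTENT.search(message) else "answer"


print(route_message("Activate demand response level 2"))
print(route_message("Why is the third floor conference room warm?"))
```

A keyword check misses paraphrases, so it works only as a backstop. The stronger guarantee is structural: the copilot route reads twin state and has no code path that writes to it.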
Fault Injection for Testing and Training Real building operations experience equipment failures, stuck dampers, sensor drift, communication losses. The digital twin includes comprehensive fault injection capabilities to train operators and test control logic: // backend/src/simulator/fault-injector.js const FAULT_CATALOG = { chillerFailure: { description: 'Chiller compressor failure', apply: (twin) => { twin.chiller.status = 'FAULT'; twin.chiller.cooling_output = 0; twin.alerts.push({ id: 'chiller-fault', severity: 'CRITICAL', message: 'Chiller compressor failure - no cooling available', equipment: 'CHILLER-01' }); } }, stuckVAVDamper: { description: 'VAV damper stuck at current position', apply: (twin, vavId) => { const vav = twin.vavs.find(v => v.id === vavId); vav.damper_stuck = true; vav.damper_position_fixed = vav.damper_position; twin.alerts.push({ id: `vav-stuck-${vavId}`, severity: 'HIGH', message: `VAV ${vavId} damper stuck at ${vav.damper_position}%`, equipment: vavId }); } }, sensorDrift: { description: 'Temperature sensor reading 5°F high', apply: (twin, zoneId) => { const zone = twin.zones.find(z => z.id === zoneId); zone.sensor_drift = 5.0; zone.temperature_measured = zone.temperature_actual + 5.0; } }, communicationLoss: { description: 'Equipment communication timeout', apply: (twin, equipmentId) => { const equipment = findEquipmentById(twin, equipmentId); equipment.comm_status = 'OFFLINE'; equipment.stale_data = true; twin.alerts.push({ id: `comm-loss-${equipmentId}`, severity: 'MEDIUM', message: `Lost communication with ${equipmentId}`, equipment: equipmentId }); } } }; router.post('/api/twin/fault', async (req, res) => { const { faultType, targetEquipment } = req.body; const twin = await loadTwinState(); const fault = FAULT_CATALOG[faultType]; if (!fault) { return res.status(400).json({ error: 'Unknown fault type' }); } fault.apply(twin, targetEquipment); await saveTwinState(twin); res.json({ message: `Applied fault: ${fault.description}`, 
affectedEquipment: targetEquipment, timestamp: new Date().toISOString() }); }); Operators can inject faults to practice diagnosis and response. Training scenarios might include: "The chiller just failed during a heat wave, how do you maintain comfort?" or "Multiple VAV dampers are stuck, which zones need immediate attention?" Key Takeaways and Production Deployment Building a physics-based digital twin with AI capabilities requires balancing simulation accuracy with computational performance, providing intuitive visualization while maintaining technical depth, and enabling AI assistance without compromising safety. Key architectural lessons: Physics models enable prediction: Comparing predicted vs observed behavior identifies anomalies that simple thresholds miss 3D visualization improves spatial understanding: Operators immediately see which floors or zones need attention AI copilots accelerate diagnosis: Natural language queries get answers in seconds vs. minutes of manual data examination Fault injection validates readiness: Testing failure scenarios prepares operators for real incidents JSON state enables integration: Simple file-based state makes connecting to real BMS systems straightforward For production deployment, connect the twin to actual building systems via BACnet, Modbus, or MQTT integrations. Replace simulated telemetry with real sensor streams. Calibrate model parameters against historical building performance. Implement continuous learning where the twin's predictions improve as it observes actual building behavior. The complete implementation with simulation engine, 3D visualization, AI copilot, and fault injection system is available at github.com/leestott/DigitalTwin. Clone the repository and run the startup scripts to explore the digital twin, no building hardware required. 
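The first takeaway above, comparing predicted against observed behavior, can be sketched in a few lines of Python. The function name, the zone identifiers, and the 1.5-degree tolerance are illustrative assumptions; a calibrated twin would derive the tolerance from its own model error.

```python
def residual_alerts(predicted, observed, tolerance=1.5):
    """Flag zones whose observed temperature departs from the model prediction.

    predicted and observed map zone IDs to temperatures in the same unit.
    Returns (zone_id, residual) pairs where |observed - predicted| > tolerance.
    """
    alerts = []
    for zone_id, predicted_temp in predicted.items():
        observed_temp = observed.get(zone_id)
        if observed_temp is None:
            continue  # no reading for this zone; staleness is a separate check
        residual = observed_temp - predicted_temp
        if abs(residual) > tolerance:
            alerts.append((zone_id, round(residual, 2)))
    return alerts


# Hypothetical readings: the model expects both zones near setpoint.
predicted = {"F3-EXEC": 72.0, "LOBBY": 71.5}
observed = {"F3-EXEC": 75.2, "LOBBY": 71.9}
alerts = residual_alerts(predicted, observed)
print(alerts)
```

A fixed threshold on the raw reading would need a separate limit for every zone and season. A residual check needs one tolerance, because the model already accounts for occupancy, weather, and setpoints.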
Resources and Further Reading

- Smart Building HVAC Digital Twin Repository - Complete source code and simulation engine
- Setup and Quick Start Guide - Installation instructions and usage examples
- Microsoft Foundry Local Documentation - AI integration reference
- HVAC Simulation Documentation - Physics model details and calibration
- Three.js Documentation - 3D visualization framework
- ASHRAE Standards - Building energy modeling standards

Building Interactive Agent UIs with AG-UI and Microsoft Agent Framework
Introduction Picture this: You've built an AI agent that analyzes financial data. A user uploads a quarterly report and asks: "What are the top three expense categories?" Behind the scenes, your agent parses the spreadsheet, aggregates thousands of rows, and generates visualizations. All in 20 seconds. But the user? They see a loading spinner. Nothing else. No "reading file" message, no "analyzing data" indicator, no hint that progress is being made. They start wondering: Is it frozen? Should I refresh? The problem isn't the agent's capabilities - it's the communication gap between the agent running on the backend and the user interface. When agents perform multi-step reasoning, call external APIs, or execute complex tool chains, users deserve to see what's happening. They need streaming updates, intermediate results, and transparent progress indicators. Yet most agent frameworks force developers to choose between simple request/response patterns or building custom solutions to stream updates to their UIs. This is where AG-UI comes in. AG-UI is a fairly new event-based protocol that standardizes how agents communicate with user interfaces. Instead of every framework and development team inventing their own streaming solution, AG-UI provides a shared vocabulary of structured events that work consistently across different agent implementations. When an agent starts processing, calls a tool, generates text, or encounters an error, the UI receives explicit, typed events in real time. The beauty of AG-UI is its framework-agnostic design. While this blog post demonstrates integration with Microsoft Agent Framework (MAF), the same AG-UI protocol works with LangGraph, CrewAI, or any other compliant framework. Write your UI code once, and it works with any AG-UI-compliant backend. (Note: MAF supports both Python and .NET - this blog post focuses on the Python implementation.) 
TL;DR The Problem: Users don't get real-time updates while AI agents work behind the scenes - no progress indicators, no transparency into tool calls, and no insight into what's happening. The Solution: AG-UI is an open, event-based protocol that standardizes real-time communication between AI agents and user interfaces. Instead of each development team and framework inventing custom streaming solutions, AG-UI provides a shared vocabulary of structured events (like TOOL_CALL_START, TEXT_MESSAGE_CONTENT, RUN_FINISHED) that work across any compliant framework. Key Benefits: Framework-agnostic - Write UI code once, works with LangGraph, Microsoft Agent Framework, CrewAI, and more Real-time observability - See exactly what your agent is doing as it happens Server-Sent Events - Built on standard HTTP for universal compatibility Protocol-managed state - No manual conversation history tracking In This Post: You'll learn why AG-UI exists, how it works, and build a complete working application using Microsoft Agent Framework with Python - from server setup to client implementation. 
What You'll Learn This blog post walks through: Why AG-UI exists - how agent-UI communication has evolved and what problems current approaches couldn't solve How the protocol works - the key design choices that make AG-UI simple, reliable, and framework-agnostic Protocol architecture - the generic components and how AG-UI integrates with agent frameworks Building an AG-UI application - a complete working example using Microsoft Agent Framework with server, client, and step-by-step setup Understanding events - what happens under the hood when your agent runs and how to observe it Thinking in events - how building with AG-UI differs from traditional APIs, and what benefits this brings Making the right choice - when AG-UI is the right fit for your project and when alternatives might be better Estimated reading time: 15 minutes Who this is for: Developers building AI agents who want to provide real-time feedback to users, and teams evaluating standardized approaches to agent-UI communication To appreciate why AG-UI matters, we need to understand the journey that led to its creation. Let's trace how agent-UI communication has evolved through three distinct phases. The Evolution of Agent-UI Communication AI agents have become more capable over time. As they evolved, the way they communicated with user interfaces had to evolve as well. Here's how this evolution unfolded. Phase 1: Simple Request/Response In the early days of AI agent development, the interaction model was straightforward: send a question, wait for an answer, display the result. This synchronous approach mirrored traditional API calls and worked fine for simple scenarios. # Simple, but limiting response = agent.run("What's the weather in Paris?") display(response) # User waits... and waits... Works for: Quick queries that complete in seconds, simple Q&A interactions where immediate feedback and interactivity aren't critical. 
Breaks down: When agents need to call multiple tools, perform multi-step reasoning, or process complex queries that take 30+ seconds. Users see nothing but a loading spinner, with no insight into what's happening or whether the agent is making progress. This creates a poor user experience and makes it impossible to show intermediate results or allow user intervention. Recognizing these limitations, development teams began experimenting with more sophisticated approaches. Phase 2: Custom Streaming Solutions As agents became more sophisticated, teams recognized the need for incremental feedback and interactivity. Rather than waiting for the complete response, they implemented custom streaming solutions to show partial results as they became available. # Every team invents their own format for chunk in agent.stream("What's the weather?"): display(chunk) # But what about tool calls? Errors? Progress? This was a step forward for building interactive agent UIs, but each team solved the problem differently. Also, different frameworks had incompatible approaches - some streamed only text tokens, others sent structured JSON, and most provided no visibility into critical events like tool calls or errors. The problem: No standardization across frameworks - client code that works with LangGraph won't work with Crew AI, requiring separate implementations for each agent backend Each implementation handles tool calls differently - some send nothing during tool execution, others send unstructured messages Complex state management - clients must track conversation history, manage reconnections, and handle edge cases manually The industry needed a better solution - a common protocol that could work across all frameworks while maintaining the benefits of streaming. Phase 3: Standardized Protocol (AG-UI) AG-UI emerged as a response to the fragmentation problem. 
Instead of each framework and development team inventing their own streaming solution, AG-UI provides a shared vocabulary of events that work consistently across different agent implementations.

    # Standardized events everyone understands
    async for event in agent.run_stream("What's the weather?"):
        if event.type == "TEXT_MESSAGE_CONTENT":
            display_text(event.delta)
        elif event.type == "TOOL_CALL_START":
            show_tool_indicator(event.tool_name)
        elif event.type == "TOOL_CALL_RESULT":
            show_tool_result(event.result)

The key difference is structured observability. Rather than guessing what the agent is doing from unstructured text, clients receive explicit events for every stage of execution: when the agent starts, when it generates text, when it calls a tool, when that tool completes, and when the entire run finishes.

What's different: A standardized vocabulary of event types, complete observability into agent execution, and framework-agnostic clients that work with any AG-UI-compliant backend. You write your UI code once, and it works whether the backend uses Microsoft Agent Framework, LangGraph, or any other framework that speaks AG-UI.

Now that we've seen why AG-UI emerged and what problems it solves, let's examine the specific design decisions that make the protocol work. These choices weren't arbitrary - each one addresses concrete challenges in building reliable, observable agent-UI communication.

The Design Decisions Behind AG-UI

Why Server-Sent Events (SSE)?

| Aspect | WebSockets | SSE (AG-UI) |
| --- | --- | --- |
| Complexity | Bidirectional | Unidirectional (simpler) |
| Firewall/Proxy | Sometimes blocked | Standard HTTP |
| Reconnection | Manual implementation | Built-in browser support |
| Use case | Real-time games, chat | Agent responses (one-way) |

For agent interactions, you typically only need server→client communication, making SSE a simpler choice. SSE solves the transport problem - how events travel from server to client.
But once connected, how does the protocol handle conversation state across multiple interactions? Why Protocol-Managed Threads? # Without protocol threads (client manages): conversation_history = [] conversation_history.append({"role": "user", "content": message}) response = agent.complete(conversation_history) conversation_history.append({"role": "assistant", "content": response}) # Complex, error-prone, doesn't work with multiple clients # With AG-UI (protocol manages): thread = agent.get_new_thread() # Server creates and manages thread agent.run_stream(message, thread=thread) # Server maintains context # Simple, reliable, shareable across clients With transport and state management handled, the final piece is the actual messages flowing through the connection. What information should the protocol communicate, and how should it be structured? Why Standardized Event Types? Instead of parsing unstructured text, clients get typed events: RUN_STARTED - Agent begins (start loading UI) TEXT_MESSAGE_CONTENT - Text chunk (stream to user) TOOL_CALL_START - Tool invoked (show "searching...", "calculating...") TOOL_CALL_RESULT - Tool finished (show result, update UI) RUN_FINISHED - Complete (hide loading) This lets UIs react intelligently without custom parsing logic. Now that we understand the protocol's design choices, let's see how these pieces fit together in a complete system. Architecture Overview Here's how the components interact: The communication between these layers relies on a well-defined set of event types. 
Here are the core events that flow through the SSE connection: Core Event Types AG-UI provides a standardized set of event types to describe what's happening during an agent's execution: RUN_STARTED - agent begins execution TEXT_MESSAGE_START, TEXT_MESSAGE_CONTENT, TEXT_MESSAGE_END - streaming segments of text TOOL_CALL_START, TOOL_CALL_ARGS, TOOL_CALL_END, TOOL_CALL_RESULT - tool execution events RUN_FINISHED - agent has finished execution RUN_ERROR - error information This model lets the UI update as the agent runs, rather than waiting for the final response. The generic architecture above applies to any AG-UI implementation. Now let's see how this translates to Microsoft Agent Framework. AG-UI with Microsoft Agent Framework While AG-UI is framework-agnostic, this blog post demonstrates integration with Microsoft Agent Framework (MAF) using Python. MAF is available in both Python and .NET, giving you flexibility to build AG-UI applications in your preferred language. Understanding how MAF implements the protocol will help you build your own applications or work with other compliant frameworks. Integration Architecture The Microsoft Agent Framework integration involves several specialized layers that handle protocol translation and execution orchestration: Understanding each layer: FastAPI Endpoint - Handles HTTP requests and establishes SSE connections for streaming AgentFrameworkAgent - Protocol wrapper that translates between AG-UI events and Agent Framework operations Orchestrators - Manage execution flow, coordinate tool calling sequences, and handle state transitions ChatAgent - Your agent implementation with instructions, tools, and business logic ChatClient - Interface to the underlying language model (Azure OpenAI, OpenAI, or other providers) The good news? When you call add_agent_framework_fastapi_endpoint, all the middleware layers are configured automatically. 
You simply provide your ChatAgent, and the integration handles protocol translation, event streaming, and state management behind the scenes. Now that we understand both the protocol architecture and the Microsoft Agent Framework integration, let's build a working application. Hands-On: Building Your First AG-UI Application This section demonstrates how to build an AG-UI server and client using Microsoft Agent Framework and FastAPI. Prerequisites Before building your first AG-UI application, ensure you have: Python 3.10 or later installed Basic understanding of async/await patterns in Python Azure CLI installed and authenticated (az login) Azure OpenAI service endpoint and deployment configured (setup guide) Cognitive Services OpenAI Contributor role for your Azure OpenAI resource You'll also need to install the AG-UI integration package: pip install agent-framework-ag-ui --pre This automatically installs agent-framework-core, fastapi, and uvicorn as dependencies. With your environment configured, let's create the server that will host your agent and expose it via the AG-UI protocol. 
Building the Server Let's create a FastAPI server that hosts an AI agent and exposes it via AG-UI: # server.py import os from typing import Annotated from dotenv import load_dotenv from fastapi import FastAPI from pydantic import Field from agent_framework import ChatAgent, ai_function from agent_framework.azure import AzureOpenAIChatClient from agent_framework_ag_ui import add_agent_framework_fastapi_endpoint from azure.identity import DefaultAzureCredential # Load environment variables from .env file load_dotenv() # Validate environment configuration openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT") model_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME") if not openai_endpoint: raise RuntimeError("Missing required environment variable: AZURE_OPENAI_ENDPOINT") if not model_deployment: raise RuntimeError("Missing required environment variable: AZURE_OPENAI_DEPLOYMENT_NAME") # Define tools the agent can use @ai_function def get_order_status( order_id: Annotated[str, Field(description="The order ID to look up (e.g., ORD-001)")] ) -> dict: """Look up the status of a customer order. Returns order status, tracking number, and estimated delivery date. """ # Simulated order lookup orders = { "ORD-001": {"status": "shipped", "tracking": "1Z999AA1", "eta": "Jan 25, 2026"}, "ORD-002": {"status": "processing", "tracking": None, "eta": "Jan 23, 2026"}, "ORD-003": {"status": "delivered", "tracking": "1Z999AA3", "eta": "Delivered Jan 20"}, } return orders.get(order_id, {"status": "not_found", "message": "Order not found"}) # Initialize Azure OpenAI client chat_client = AzureOpenAIChatClient( credential=DefaultAzureCredential(), endpoint=openai_endpoint, deployment_name=model_deployment, ) # Configure the agent with custom instructions and tools agent = ChatAgent( name="CustomerSupportAgent", instructions="""You are a helpful customer support assistant. You have access to a get_order_status tool that can look up order information. 
    IMPORTANT: When a user mentions an order ID (like ORD-001, ORD-002, etc.), you MUST call the get_order_status tool to retrieve the actual order details. Do NOT make up or guess order information. After calling get_order_status, provide the actual results to the user in a friendly format.""",
        chat_client=chat_client,
        tools=[get_order_status],
    )

    # Initialize FastAPI application
    app = FastAPI(
        title="AG-UI Customer Support Server",
        description="Interactive AI agent server using AG-UI protocol with tool calling"
    )

    # Mount the AG-UI endpoint
    add_agent_framework_fastapi_endpoint(app, agent, path="/chat")

    def main():
        """Entry point for the AG-UI server."""
        import uvicorn
        print("Starting AG-UI server on http://localhost:8000")
        uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info")

    # Run the application
    if __name__ == "__main__":
        main()

What's happening here:

- We define a tool, get_order_status, with the ai_function decorator
- We use Annotated and Field for parameter descriptions to help the agent understand when and how to use the tool
- We create an Azure OpenAI chat client with credential authentication
- The ChatAgent is configured with domain-specific instructions and the tools parameter
- add_agent_framework_fastapi_endpoint automatically handles SSE streaming and tool execution
- The server exposes the agent at the /chat endpoint

Note: This example uses Azure OpenAI, but AG-UI works with any chat model. You can also integrate with Azure AI Foundry's model catalog or use other LLM providers. Tool calling is supported by most modern LLMs including GPT-4, GPT-4o, and Claude models.

To run this server:

    # Set your Azure OpenAI credentials
    export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
    export AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4o"

    # Start the server
    python server.py

With your server running and exposing the AG-UI endpoint, the next step is building a client that can connect and consume the event stream.
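Before moving to the framework client, it can help to see roughly what travels over the wire. The Python sketch below parses a hand-written Server-Sent Events payload into event dictionaries and renders it using the event types listed earlier. The payload is invented for illustration, and the JSON field names (delta, toolCallName) are assumptions about the wire format that should be checked against the AG-UI specification.

```python
import json


def parse_sse_events(raw: str) -> list[dict]:
    """Parse SSE text into JSON event dicts, one per blank-line-separated block."""
    events = []
    for block in raw.replace("\r\n", "\n").split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[len("data:"):]
                # SSE allows a single optional space after the colon
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if data_lines:
            events.append(json.loads("\n".join(data_lines)))
    return events


def render(events: list[dict]) -> str:
    """Build the text a simple UI would show, reacting only to two event types."""
    parts = []
    for event in events:
        kind = event.get("type")
        if kind == "TEXT_MESSAGE_CONTENT":
            parts.append(event.get("delta", ""))
        elif kind == "TOOL_CALL_START":
            parts.append(f"[tool: {event.get('toolCallName')}] ")
    return "".join(parts)


# Invented sample stream; field names are assumptions, not a captured trace.
sample = (
    'data: {"type": "RUN_STARTED"}\n\n'
    'data: {"type": "TOOL_CALL_START", "toolCallName": "get_order_status"}\n\n'
    'data: {"type": "TEXT_MESSAGE_CONTENT", "delta": "Order ORD-001 "}\n\n'
    'data: {"type": "TEXT_MESSAGE_CONTENT", "delta": "has shipped."}\n\n'
    'data: {"type": "RUN_FINISHED"}\n\n'
)
events = parse_sse_events(sample)
rendered = render(events)
print(rendered)
```

In practice the client library handles this parsing, so application code works with typed updates and never touches raw SSE text.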
Streaming Results to Clients With the server running, clients can connect and stream events as the agent processes requests. Here's a Python client that demonstrates the streaming capabilities: # client.py import asyncio import os from dotenv import load_dotenv from agent_framework import ChatAgent, FunctionCallContent, FunctionResultContent from agent_framework_ag_ui import AGUIChatClient # Load environment variables from .env file load_dotenv() async def interactive_chat(): """Interactive chat session with streaming responses.""" # Connect to the AG-UI server base_url = os.getenv("AGUI_SERVER_URL", "http://localhost:8000/chat") print(f"Connecting to: {base_url}\n") # Initialize the AG-UI client client = AGUIChatClient(endpoint=base_url) # Create a local agent representation agent = ChatAgent(chat_client=client) # Start a new conversation thread conversation_thread = agent.get_new_thread() print("Chat started! Type 'exit' or 'quit' to end the session.\n") try: while True: # Collect user input user_message = input("You: ") # Handle empty input if not user_message.strip(): print("Please enter a message.\n") continue # Check for exit commands if user_message.lower() in ["exit", "quit", "bye"]: print("\nGoodbye!") break # Stream the agent's response print("Agent: ", end="", flush=True) # Track tool calls to avoid duplicate prints seen_tools = set() async for update in agent.run_stream(user_message, thread=conversation_thread): # Display text content if update.text: print(update.text, end="", flush=True) # Display tool calls and results for content in update.contents: if isinstance(content, FunctionCallContent): # Only print each tool call once if content.call_id not in seen_tools: seen_tools.add(content.call_id) print(f"\n[Calling tool: {content.name}]", flush=True) elif isinstance(content, FunctionResultContent): # Only print each result once result_id = f"result_{content.call_id}" if result_id not in seen_tools: seen_tools.add(result_id) result_text = content.result 
if isinstance(content.result, str) else str(content.result) print(f"[Tool result: {result_text}]", flush=True) print("\n") # New line after response completes except KeyboardInterrupt: print("\n\nChat interrupted by user.") except ConnectionError as e: print(f"\nConnection error: {e}") print("Make sure the server is running.") except Exception as e: print(f"\nUnexpected error: {e}") def main(): """Entry point for the AG-UI client.""" asyncio.run(interactive_chat()) if __name__ == "__main__": main() Key features: The client connects to the AG-UI endpoint using AGUIChatClient with the endpoint parameter run_stream() yields updates containing text and content as they arrive Tool calls are detected using FunctionCallContent and displayed with [Calling tool: ...] Tool results are detected using FunctionResultContent and displayed with [Tool result: ...] Deduplication logic (seen_tools set) prevents printing the same tool call multiple times as it streams Thread management maintains conversation context across messages Graceful error handling for connection issues To use the client: # Optional: specify custom server URL export AGUI_SERVER_URL="http://localhost:8000/chat" # Start the interactive chat python client.py Example Session: Connecting to: http://localhost:8000/chat Chat started! Type 'exit' or 'quit' to end the session. You: What's the status of order ORD-001? Agent: [Calling tool: get_order_status] [Tool result: {"status": "shipped", "tracking": "1Z999AA1", "eta": "Jan 25, 2026"}] Your order ORD-001 has been shipped! - Tracking Number: 1Z999AA1 - Estimated Delivery Date: January 25, 2026 You can use the tracking number to monitor the delivery progress. You: Can you check ORD-002? Agent: [Calling tool: get_order_status] [Tool result: {"status": "processing", "tracking": null, "eta": "Jan 23, 2026"}] Your order ORD-002 is currently being processed. 
- Status: Processing - Estimated Delivery: January 23, 2026 Your order should ship soon, and you'll receive a tracking number once it's on the way. You: exit Goodbye! The client we just built handles events at a high level, abstracting away the details. But what's actually flowing through that SSE connection? Let's peek under the hood. Event Types You'll See As the server streams back responses, clients receive a series of structured events. If you were to observe the raw SSE stream (e.g., using curl), you'd see events like: curl -N http://localhost:8000/chat \ -H "Content-Type: application/json" \ -H "Accept: text/event-stream" \ -d '{"messages": [{"role": "user", "content": "What'\''s the status of order ORD-001?"}]}' Sample event stream (with tool calling): data: {"type":"RUN_STARTED","threadId":"eb4d9850-14ef-446c-af4b-23037acda9e8","runId":"chatcmpl-xyz"} data: {"type":"TEXT_MESSAGE_START","messageId":"e8648880-a9ff-4178-a17d-4a6d3ec3d39c","role":"assistant"} data: {"type":"TOOL_CALL_START","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","toolCallName":"get_order_status","parentMessageId":"e8648880-a9ff-4178-a17d-4a6d3ec3d39c"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"{\""} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"order"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"_id"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"\":\""} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"ORD"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"-"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"001"} data: {"type":"TOOL_CALL_ARGS","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","delta":"\"}"} data: {"type":"TOOL_CALL_END","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y"} data: 
{"type":"TOOL_CALL_RESULT","messageId":"f048cb0a-a049-4a51-9403-a05e4820438a","toolCallId":"call_GTWj2N3ZyYiiQIjg3fwmiQ8y","content":"{\"status\": \"shipped\", \"tracking\": \"1Z999AA1\", \"eta\": \"Jan 25, 2026\"}","role":"tool"} data: {"type":"TEXT_MESSAGE_START","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","role":"assistant"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"Your"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" order"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" ORD"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"-"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"001"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" has"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" been"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":" shipped"} data: {"type":"TEXT_MESSAGE_CONTENT","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf","delta":"!"} ... (additional TEXT_MESSAGE_CONTENT events streaming the response) ... 
data: {"type":"TEXT_MESSAGE_END","messageId":"8215fc88-8cb6-4ce4-8bdb-a8715dcd26cf"} data: {"type":"RUN_FINISHED","threadId":"eb4d9850-14ef-446c-af4b-23037acda9e8","runId":"chatcmpl-xyz"} Understanding the flow: RUN_STARTED - Agent begins processing the request TEXT_MESSAGE_START - First message starts (will contain tool calls) TOOL_CALL_START - Agent invokes the get_order_status tool Multiple TOOL_CALL_ARGS events - Arguments stream incrementally as JSON chunks ({"order_id":"ORD-001"}) TOOL_CALL_END - Tool invocation structure complete TOOL_CALL_RESULT - Tool execution finished with result data TEXT_MESSAGE_START - Second message starts (the final response) Multiple TEXT_MESSAGE_CONTENT events - Response text streams word-by-word TEXT_MESSAGE_END - Response message complete RUN_FINISHED - Entire run completed successfully This granular event model enables rich UI experiences - showing tool execution indicators ("Searching...", "Calculating..."), displaying intermediate results, and providing complete transparency into the agent's reasoning process. Seeing the raw events helps, but truly working with AG-UI requires a shift in how you think about agent interactions. Let's explore this conceptual change. The Mental Model Shift Traditional API Thinking # Imperative: Call and wait response = agent.run("What's 2+2?") print(response) # "The answer is 4" Mental model: Function call with return value AG-UI Thinking # Reactive: Subscribe to events async for event in agent.run_stream("What's 2+2?"): match event.type: case "RUN_STARTED": show_loading() case "TEXT_MESSAGE_CONTENT": display_chunk(event.delta) case "RUN_FINISHED": hide_loading() Mental model: Observable stream of events This shift feels similar to: Moving from synchronous to async code Moving from REST to event-driven architecture Moving from polling to pub/sub This mental shift isn't just philosophical - it unlocks concrete benefits that weren't possible with request/response patterns. 
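To make the event flow above concrete, here is a minimal sketch of how a UI might fold the raw stream into displayable state. It assumes each SSE line carries one JSON event shaped like the sample stream shown earlier; `foldAgUiEvents` and the shape of the summary object are hypothetical names for illustration, not part of any AG-UI SDK.

```javascript
// Hypothetical helper: fold raw AG-UI SSE lines into a small summary object.
function foldAgUiEvents(sseLines) {
  const summary = { toolCalls: {}, messages: {}, finished: false };

  // Create the tool-call record on first sight so out-of-order deltas do not crash.
  const toolCall = (id) =>
    summary.toolCalls[id] ?? (summary.toolCalls[id] = { name: null, args: '', result: null });

  for (const line of sseLines) {
    // Only "data:" lines carry event payloads; skip blank lines and other SSE fields.
    if (!line.startsWith('data:')) continue;
    const event = JSON.parse(line.slice('data:'.length).trim());

    switch (event.type) {
      case 'TOOL_CALL_START':
        toolCall(event.toolCallId).name = event.toolCallName;
        break;
      case 'TOOL_CALL_ARGS':
        // Arguments arrive as JSON fragments; concatenate them and parse once complete.
        toolCall(event.toolCallId).args += event.delta;
        break;
      case 'TOOL_CALL_RESULT':
        toolCall(event.toolCallId).result = event.content;
        break;
      case 'TEXT_MESSAGE_START':
        summary.messages[event.messageId] = summary.messages[event.messageId] ?? '';
        break;
      case 'TEXT_MESSAGE_CONTENT':
        summary.messages[event.messageId] = (summary.messages[event.messageId] ?? '') + event.delta;
        break;
      case 'RUN_FINISHED':
        summary.finished = true;
        break;
      default:
        // RUN_STARTED, TOOL_CALL_END, TEXT_MESSAGE_END need no state change in this sketch.
        break;
    }
  }
  return summary;
}

// Usage with a shortened, hand-written stream (IDs are made up).
const sse = (event) => `data: ${JSON.stringify(event)}`;
const summary = foldAgUiEvents([
  sse({ type: 'RUN_STARTED', threadId: 't1', runId: 'r1' }),
  sse({ type: 'TOOL_CALL_START', toolCallId: 'c1', toolCallName: 'get_order_status' }),
  sse({ type: 'TOOL_CALL_ARGS', toolCallId: 'c1', delta: '{"order_id":' }),
  sse({ type: 'TOOL_CALL_ARGS', toolCallId: 'c1', delta: '"ORD-001"}' }),
  sse({ type: 'TOOL_CALL_END', toolCallId: 'c1' }),
  sse({ type: 'TEXT_MESSAGE_START', messageId: 'm1', role: 'assistant' }),
  sse({ type: 'TEXT_MESSAGE_CONTENT', messageId: 'm1', delta: 'Your' }),
  sse({ type: 'TEXT_MESSAGE_CONTENT', messageId: 'm1', delta: ' order' }),
  sse({ type: 'RUN_FINISHED', threadId: 't1', runId: 'r1' }),
]);
console.log(JSON.parse(summary.toolCalls.c1.args), summary.messages.m1, summary.finished);
```

A real client would process events one at a time as they arrive and update the UI after each, rather than folding a completed array; the reducer logic stays the same.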
What You Gain
Observability
# You can SEE what the agent is doing
TOOL_CALL_START: "get_order_status"
TOOL_CALL_ARGS: {"order_id": "ORD-001"}
TOOL_CALL_RESULT: {"status": "shipped", "tracking": "1Z999AA1", "eta": "Jan 25, 2026"}
TEXT_MESSAGE_CONTENT: "Your order ORD-001 has been shipped..."
Interruptibility
# Future: Cancel long-running operations
async for event in agent.run_stream(query):
    if user_clicked_cancel:
        await agent.cancel(thread_id, run_id)
        break
Transparency
# Users see the reasoning process
"Looking up order ORD-001..."
"Order found: Status is 'shipped'"
"Retrieving tracking information..."
"Your order has been shipped with tracking number 1Z999AA1..."
To put these benefits in context, here's how AG-UI compares to traditional approaches across key dimensions:
AG-UI vs. Traditional Approaches
| Aspect | Traditional REST | Custom Streaming | AG-UI |
| --- | --- | --- | --- |
| Connection Model | Request/Response | Varies | Server-Sent Events |
| State Management | Manual | Manual | Protocol-managed |
| Tool Calling | Invisible | Custom format | Standardized events |
| Framework | Varies | Framework-locked | Framework-agnostic |
| Browser Support | Universal | Varies | Universal |
| Implementation | Simple | Complex | Moderate |
| Ecosystem | N/A | Isolated | Growing |
You've now seen AG-UI's design principles, implementation details, and conceptual foundations. But the most important question remains: should you actually use it?
Conclusion: Is AG-UI Right for Your Project?
AG-UI represents a shift toward standardized, observable agent interactions. Before adopting it, understand where the protocol stands and whether it fits your needs.
Protocol Maturity
The protocol is stable enough for production use but still evolving:
Ready now: Core specification stable, Microsoft Agent Framework integration available, FastAPI/Python implementation mature, basic streaming and threading work reliably.
Choose AG-UI If You
Are building new agent projects - No legacy API to maintain, and you want future compatibility with the emerging ecosystem
Need streaming observability - Multi-step workflows where users benefit from seeing each stage of execution
Want framework flexibility - Same client code works with any AG-UI-compliant backend
Are comfortable with evolving standards - You can adapt to protocol changes as it matures
Stick with Alternatives If You
Have working solutions - Custom streaming works well and the migration cost is not justified
Need guaranteed stability - Mission-critical systems where breaking changes are unacceptable
Build simple agents - Single-step request/response without tool calling or streaming needs
Work in a risk-averse environment - Large existing implementations where proven approaches are required
Beyond individual project decisions, it's worth considering AG-UI's role in the broader ecosystem.
The Bigger Picture
While this blog post focused on Microsoft Agent Framework, AG-UI's true power lies in its broader mission: creating a common language for agent-UI communication across the entire ecosystem. As more frameworks adopt it, the real value emerges: write your UI once, work with any compliant agent framework. Think of it like GraphQL for APIs or OpenAPI for REST - a standardization layer that benefits the entire ecosystem. The protocol is young, but the problem it solves is real. Whether you adopt it now or wait for broader adoption, understanding AG-UI helps you make informed architectural decisions for your agent applications.
Ready to dive deeper? Here are the official resources to continue your AG-UI journey.
Resources
AG-UI & Microsoft Agent Framework
Getting Started with AG-UI (Microsoft Learn) - Official tutorial
AG-UI Integration Overview - Architecture and concepts
AG-UI Protocol Specification - Official protocol documentation
Backend Tool Rendering - Adding function tools
Security Considerations - Production security guidance
Microsoft Agent Framework Documentation - Framework overview
AG-UI Dojo Examples - Live demonstrations
UI Components & Integration
CopilotKit for Microsoft Agent Framework - React component library
Community & Support
Microsoft Q&A - Community support
Agent Framework GitHub - Source code and issues
Related Technologies
Azure AI Foundry Documentation - Azure AI platform
FastAPI Documentation - Web framework
Server-Sent Events (SSE) Specification - Protocol standard
This blog post introduces AG-UI with Microsoft Agent Framework, focusing on fundamental concepts and building your first interactive agent application.
Welcome! Let's get started.
We're gathering the early adopters of Microsoft Agent 365 to connect, share, and answer questions about deploying agents in your organization with observability, security, and governance. Welcome! So... how many agents are in your Registry? -- Nichole
Agents League: Two Weeks, Three Tracks, One Challenge
We're inviting all developers to join Agents League, running February 16-27. It's a two-week challenge where you'll build AI agents using production-ready tools, learn from live coding sessions, and get feedback directly from Microsoft product teams. We've put together a starter kit for each track, with requirements and guidelines, to help you get up and running quickly. Whether you want to explore what GitHub Copilot can do beyond autocomplete, build reasoning agents on Microsoft Foundry, or create enterprise integrations for Microsoft 365 Copilot, we have a track for you.
Important: Register first to be eligible for prizes and your digital badge. Without registration, you won't qualify for awards or receive a badge when you submit.
What Is Agents League?
It's a two-week competition that combines learning with building:
📽️ Live coding battles – Watch product teams, MVPs, and community members tackle challenges in real time on Microsoft Reactor
💻 Async challenges – Build at your own pace, on your schedule
💬 Discord community – Connect with other participants, join AMAs, and get help when you need it
🏆 Prizes – $500 per track winner, plus GitHub Copilot Pro subscriptions for top picks
The Three Tracks
🎨 Creative Apps – Build with GitHub Copilot (Chat, CLI, or SDK)
🧠 Reasoning Agents – Build with Microsoft Foundry
💼 Enterprise Agents – Build with M365 Agents Toolkit (or Copilot Studio)
More details on each track below, or jump straight to the starter kits.
The Schedule
Agents League starts on February 16 and runs through February 27. Over these two weeks, we host live battles on Reactor and AMA sessions on Discord.
Week 1: Live Battles (Feb 17-19)
We're kicking off with live coding battles streamed on Microsoft Reactor. Watch experienced developers compete in real time, explaining their approach and architectural decisions as they go.
Tue Feb 17, 9 AM PT — 🎨 Creative Apps battle Wed Feb 18, 9 AM PT — 🧠 Reasoning Agents battle Thu Feb 19, 9 AM PT — 💼 Enterprise Agents battle All sessions are recorded, so you can watch on your own schedule. Week 2: Build + AMAs (Feb 24-26) This is your time to build and ask questions on Discord. The async format means you work when it suits you, evenings, weekends, whatever fits your schedule. We're also hosting AMAs on Discord where you can ask questions directly to Microsoft experts and product teams: Tue Feb 24, 9 AM PT — 🎨 Creative Apps AMA Wed Feb 25, 9 AM PT — 🧠 Reasoning Agents AMA Thu Feb 26, 9 AM PT — 💼 Enterprise Agents AMA Bring your questions, get help when you're stuck, and share what you're building with the community. Pick Your Track We've created a starter kit for each track with setup guides, project ideas, and example scenarios to help you get started quickly. 🎨 Creative Apps Tool: GitHub Copilot (Chat, CLI, or SDK) Build innovative, imaginative applications that showcase the potential of AI-assisted development. All application types are welcome, web apps, CLI tools, games, mobile apps, desktop applications, and more. The starter kit walks you through GitHub Copilot's different modes and provides prompting tips to get the best results. View the Creative Apps starter kit. 🧠 Reasoning Agents Tool: Microsoft Foundry (UI or SDK) and/or Microsoft Agent Framework Build a multi-agent system that leverages advanced reasoning capabilities to solve complex problems. This track focuses on agents that can plan, reason through multi-step problems, and collaborate. The starter kit includes architecture patterns, reasoning strategies (planner-executor, critic/verifier, self-reflection), and integration guides for tools and MCP servers. View the Reasoning Agents starter kit. 💼 Enterprise Agents Tool: M365 Agents Toolkit or Copilot Studio Create intelligent agents that extend Microsoft 365 Copilot to address real-world enterprise scenarios. 
Your agent must work on Microsoft 365 Copilot Chat. Bonus points for: MCP server integration, OAuth security, Adaptive Cards UI, connected agents (multi-agent architecture). View the Enterprise Agents starter kit.
Prizes & Recognition
To be eligible for prizes and your digital badge, you must register before submitting your project.
Category Winners ($500 each):
🎨 Creative Apps winner
🧠 Reasoning Agents winner
💼 Enterprise Agents winner
GitHub Copilot Pro subscriptions:
Community Favorite (voted by participants on Discord)
Product Team Picks (selected by Microsoft product teams)
Everyone who registers and submits a project wins: A digital badge to showcase their participation.
Beyond the prizes, every participant gets feedback from the teams who built these tools, a valuable opportunity to learn and improve your approach to AI agent development.
How to Get Started
Register first – This is required to be eligible for prizes and to receive your digital badge. Without registration, your submission won't qualify for awards or a badge.
Pick a track – Choose one track. Explore the starter kits to help you decide.
Watch the battles – See how experienced developers approach these challenges. Great for learning even if you're still deciding whether to compete.
Build your project – You have until Feb 27. Work on your own schedule.
Submit via GitHub – Open an issue using the project submission template.
Join us on Discord – Get help, share your progress, and vote for your favorite projects on Discord.
Links
Register: https://aka.ms/agentsleague/register
Starter Kits: https://github.com/microsoft/agentsleague/starter-kits
Discord: https://aka.ms/agentsleague/discord
Live Battles: https://aka.ms/agentsleague/battles
Submit Project: Project submission template
Adding AI Personality to Browser Games
Introduction Browser games traditionally follow predictable patterns, fixed text messages, static tutorials, scripted NPC responses. Players see the same "Game Over" message whether they nearly won or failed spectacularly. Tutorial text remains identical regardless of player skill level. The game experience, while fun, lacks the dynamic reactivity of human-moderated gameplay. What if your Space Invaders game could comment on gameplay in real-time? Taunt players when they miss easy shots? Celebrate close victories with personalized messages? Adjust difficulty suggestions based on actual performance metrics? This article demonstrates exactly that: integrating AI-powered dynamic commentary into a browser game using Spaceinvaders-FoundryLocal, vanilla JavaScript, and Microsoft Foundry Local. You'll learn how to integrate local AI into client-side games, design AI personality systems that enhance rather than distract, implement context-aware commentary generation, and architect optional AI features that don't break core gameplay when unavailable. Whether you're building educational games, interactive training simulations, or simply adding personality to entertainment projects, this approach provides a blueprint for AI-enhanced gaming experiences. Why Local AI Transforms Browser Gaming Adding AI to games sounds expensive, cloud API costs scale with player counts, introducing per-gameplay pricing that makes free-to-play models challenging. Privacy concerns emerge when gameplay data leaves user devices. Latency affects real-time experiences, waiting 2 seconds for commentary after an action breaks immersion. Network requirements exclude offline play. Local AI solves all these challenges simultaneously. Foundry Local runs Small Language Models (SLMs) entirely on player devices, no API costs, no data leaving the machine, no network dependency. Inference happens in milliseconds, enabling truly real-time responses. 
Games work offline after initial load, perfect for mobile or low-connectivity scenarios. SLMs excel at personality-driven tasks like game commentary. They don't need perfect factual recall or complex reasoning, they generate entertaining, contextually relevant text based on game state. A 1.5B parameter model produces engaging taunts and celebration messages indistinguishable from hand-written content, while running easily on mid-range laptops. Integrating AI as an optional enhancement demonstrates good architecture. Core gameplay must function perfectly without AI, commentary enhances the experience but failure doesn't break the game. This graceful degradation pattern ensures maximum compatibility while offering AI features to capable devices. Architecture: Progressive Enhancement with AI The Spaceinvaders-FoundryLocal implementation uses progressive enhancement, the game fully works without AI, but adds dynamic personality when available: The base game implements classic Space Invaders mechanics entirely in vanilla JavaScript. Player ship movement, bullet physics, enemy patterns, collision detection, scoring, and power-up systems all operate independently of AI. This ensures universal compatibility across browsers, devices, and network conditions. The AI layer adds dynamic commentary through a backend Node.js proxy. The proxy runs locally, communicates with Foundry Local, and provides game context to the AI for generating personalized messages. The game polls the proxy periodically, sending current game state (score, accuracy, wave number, power-up usage) and receiving commentary responses. 
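As a rough illustration of the polling step just described, the sketch below shows one way the game could snapshot its state into the payload sent to the proxy, and decide whether a poll is due. The function names, the `state` field names, and the default 5000 ms interval are assumptions made for this example; the payload fields mirror the score, accuracy, and wave metrics mentioned above.

```javascript
// Hypothetical helper: snapshot game state into the JSON payload for the local proxy.
function buildGameContext(eventType, state) {
  // Guard against division by zero before the player has fired a shot.
  const accuracy = state.shotsFired === 0
    ? 0
    : Math.round((state.shotsHit / state.shotsFired) * 100);

  return {
    event: eventType,
    score: state.score,
    accuracy,
    wave: state.wave,
    lives: state.lives,
  };
}

// Hypothetical guard: poll only when the interval has elapsed and no request is in flight,
// so the game loop never queues up overlapping AI requests.
function shouldPoll(nowMs, lastPollMs, requestInFlight, intervalMs = 5000) {
  return !requestInFlight && nowMs - lastPollMs >= intervalMs;
}

// Usage: a wave was just completed with 39 hits out of 50 shots.
const context = buildGameContext('wave_complete', {
  score: 2500, shotsFired: 50, shotsHit: 39, wave: 3, lives: 2,
});
console.log(JSON.stringify(context), shouldPoll(5000, 0, false));
```

Keeping both helpers free of network calls means they can run inside the frame loop without blocking; only the actual request to the proxy needs to be asynchronous.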
The architecture flow for AI-enhanced gameplay: Player Action (e.g., destroys enemy) ↓ Game Updates State (score += 100, accuracy tracked) ↓ Game Checks AI Status (polling every 5 seconds) ↓ If AI Available: Send Game Context to Backend → { event: 'wave_complete', score: 2500, accuracy: 78%, wave: 3 } ↓ Backend builds prompt with context ↓ Foundry Local generates comment ↓ Return commentary to game → "Wave 3 conquered! Your 78% accuracy shows improving skills." ↓ Display in game UI (animated text bubble) This design demonstrates several key patterns: Zero-dependency core: Game playable immediately, AI adds value incrementally Graceful degradation: If AI unavailable, game shows generic messages Asynchronous enhancement: AI runs in background, never blocks gameplay Context-aware generation: Commentary reflects actual player performance Local-first architecture: Everything runs on player's machine—no servers, no tracking Implementing Context-Aware AI Commentary Effective game commentary requires understanding current gameplay context. 
The AI needs to know what just happened, how the player is performing, and what makes this moment interesting: // llm.js - AI integration module export class GameAI { constructor() { this.baseURL = 'http://localhost:3001'; // Local proxy server this.available = false; this.checkAvailability(); } async checkAvailability() { try { // fetch() has no "timeout" option, so abort the request via a signal after 2 seconds const response = await fetch(`${this.baseURL}/health`, { method: 'GET', signal: AbortSignal.timeout(2000) }); this.available = response.ok; return this.available; } catch (error) { console.log('AI server not available (optional feature)'); this.available = false; return false; } } async generateComment(gameContext) { if (!this.available) { return this.getFallbackComment(gameContext.event); } try { const response = await fetch(`${this.baseURL}/api/comment`, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(gameContext) }); if (!response.ok) { throw new Error('AI request failed'); } const data = await response.json(); return data.comment; } catch (error) { console.error('AI comment generation failed:', error); return this.getFallbackComment(gameContext.event); } } getFallbackComment(event) { // Static messages when AI unavailable const fallbacks = { 'wave_complete': 'Wave cleared!', 'player_hit': 'Shields damaged!', 'game_over': 'Game Over. Try again!', 'high_score': 'New high score!', 'power_up': 'Power-up collected!' 
}; return fallbacks[event] || 'Good job!'; } } The backend processes game context and generates contextually relevant commentary: // server.js - Node.js backend proxy import express from 'express'; import cors from 'cors'; import { FoundryLocalClient } from 'foundry-local-sdk'; const app = express(); const foundry = new FoundryLocalClient({ endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT || 'http://127.0.0.1:5272' }); app.use(express.json()); // Allow browser game to connect (CORS middleware comes from the separate cors package, not from express) app.use(cors()); app.get('/health', (req, res) => { res.json({ status: 'AI available', model: 'phi-3.5-mini' }); }); app.post('/api/comment', async (req, res) => { const { event, score, accuracy, wave, lives, combo, power_up_type } = req.body; // Build context-rich prompt (power_up_type is only used by the power_up_used event) const prompt = buildCommentPrompt(event, { score, accuracy, wave, lives, combo, power_up_type }); try { const completion = await foundry.chat.completions.create({ model: 'phi-3.5-mini', messages: [ { role: 'system', content: `You are an AI commander providing brief, encouraging commentary for a Space Invaders game. Be energetic, supportive, and sometimes humorous. Keep responses to 1-2 sentences maximum. Reference specific game metrics when relevant.` }, { role: 'user', content: prompt } ], temperature: 0.9, // High temperature for creative variety max_tokens: 50 }); const comment = completion.choices[0].message.content.trim(); res.json({ comment, model: 'phi-3.5-mini', timestamp: new Date().toISOString() }); } catch (error) { console.error('AI generation error:', error); res.status(500).json({ error: 'Commentary generation failed' }); } }); function buildCommentPrompt(event, context) { switch(event) { case 'wave_complete': return `The player just completed wave ${context.wave} with score ${context.score}. Their shooting accuracy is ${context.accuracy}%. ${context.lives} lives remaining. 
Generate an encouraging comment about their progress.`; case 'player_hit': return `The player got hit by an enemy! They now have ${context.lives} lives left. Score: ${context.score}. 
Provide a brief motivational comment to keep them engaged.`; case 'game_over': if (context.accuracy > 70) { return `Game over at wave ${context.wave}, score ${context.score}. The player had ${context.accuracy}% accuracy - pretty good! Generate an encouraging comment acknowledging their skill.`; } else { return `Game over at wave ${context.wave}, score ${context.score}. Accuracy was ${context.accuracy}%. Provide a supportive comment with a tip for improvement.`; } case 'combo_streak': return `Player achieved a ${context.combo}x combo streak! Score: ${context.score}. Generate an excited celebration comment.`; case 'power_up_used': return `Player activated a ${context.power_up_type} power-up. Generate a brief tactical comment about using it effectively.`; default: return `General gameplay comment. Score: ${context.score}, Wave: ${context.wave}.`; } } const PORT = 3001; app.listen(PORT, () => { console.log(`✓ Game AI server running on http://localhost:${PORT}`); console.log(`✓ Foundry Local endpoint: ${process.env.FOUNDRY_LOCAL_ENDPOINT || 'http://127.0.0.1:5272'}`); }); This backend demonstrates several best practices: Context-sensitive prompting: Different events get different prompt templates with relevant metrics Personality consistency: System message establishes tone and style guidelines Brevity constraints: max_tokens: 50 ensures comments don't overwhelm UI Creative variety: High temperature (0.9) produces diverse commentary on repeated events Performance-aware feedback: Comments adapt based on accuracy, lives remaining, combo streaks Integrating AI into Game Loop Without Performance Impact Games require 60 FPS to feel smooth, any blocking operation creates stutter. AI integration must be completely asynchronous and non-blocking: // game.js - Main game loop class SpaceInvadersGame { constructor() { this.ai = new GameAI(); this.lastAIUpdate = 0; this.aiUpdateInterval = 5000; // Poll AI every 5 seconds this.pendingAIRequest = false; // ... 
other game state } update(deltaTime) { // Core game logic (always runs) this.updatePlayer(deltaTime); this.updateEnemies(deltaTime); this.updateBullets(deltaTime); this.checkCollisions(); this.updatePowerUps(deltaTime); // AI commentary (optional, async) this.updateAI(deltaTime); } updateAI(deltaTime) { this.lastAIUpdate += deltaTime; // Only check AI periodically, never block gameplay if (this.lastAIUpdate >= this.aiUpdateInterval && !this.pendingAIRequest) { this.requestAICommentary(); } } async requestAICommentary() { // Check if there's an interesting event to comment on const event = this.getSignificantEvent(); if (!event) return; this.pendingAIRequest = true; // Fire-and-forget async request this.ai.generateComment({ event: event.type, score: this.score, accuracy: this.calculateAccuracy(), wave: this.currentWave, lives: this.lives, combo: this.comboMultiplier }) .then(comment => { this.displayAIComment(comment); this.lastAIUpdate = 0; }) .catch(error => { console.log('AI comment failed (non-critical):', error); }) .finally(() => { this.pendingAIRequest = false; }); } getSignificantEvent() { // Determine what's worth commenting on if (this.justCompletedWave) { this.justCompletedWave = false; return { type: 'wave_complete' }; } if (this.justGotHit) { this.justGotHit = false; return { type: 'player_hit' }; } if (this.comboMultiplier >= 5) { return { type: 'combo_streak' }; } return null; // Nothing interesting right now } displayAIComment(comment) { // Show comment in animated text bubble const bubble = document.createElement('div'); bubble.className = 'ai-comment-bubble'; bubble.textContent = comment; document.getElementById('game-container').appendChild(bubble); // Animate in setTimeout(() => bubble.classList.add('show'), 50); // Remove after 4 seconds setTimeout(() => { bubble.classList.remove('show'); setTimeout(() => bubble.remove(), 500); }, 4000); } calculateAccuracy() { if (this.shotsFired === 0) return 0; return Math.round((this.shotsHit / 
this.shotsFired) * 100); } } This integration pattern ensures: Zero gameplay impact: AI runs completely asynchronously—game never waits for AI Periodic updates only: Check AI every 5 seconds, not every frame (60 FPS → minimal CPU overhead) Event-driven commentary: Only request comments for significant moments, not continuous chatter Non-blocking display: Comments appear as animated overlays that don't interrupt gameplay Graceful failure: AI errors logged but never shown to players—game continues normally Designing AI Personality Systems Effective game AI has consistent personality that enhances rather than distracts. The system message establishes tone, response templates ensure variety, and context awareness makes commentary relevant: // Enhanced system message for consistent personality const AI_COMMANDER_PERSONALITY = ` You are AEGIS, an AI defense commander providing tactical commentary for a Space Invaders-style game. Your personality traits: - Enthusiastic but professional military commander tone - Celebrate victories with tactical language ("Excellent flanking maneuver!") - Acknowledge defeats with constructive feedback ("Regroup and maintain formation!") - Reference specific metrics to show you're paying attention - Keep responses to 1-2 sentences maximum - Use occasional humor but stay in character - Be encouraging even when player struggles Examples of your style: - "Wave neutralized! Your 85% accuracy shows precision targeting." - "Shield integrity compromised! Fall back and reassess the battlefield." - "Impressive combo multiplier! Sustained fire superiority achieved." - "That power-up spread pattern cleared the sector perfectly." `; // Context-aware response variety const RESPONSE_TEMPLATES = { wave_complete: { high_performance: [ "Your {accuracy}% accuracy led to decisive victory, Commander!", "Wave {wave} eliminated with tactical excellence!", "Strategic brilliance! {accuracy}% hit rate maintained." ], medium_performance: [ "Wave {wave} cleared. 
Solid tactics, Commander.", "Sector secured. Your {accuracy}% accuracy shows improvement potential.", "Objective achieved. Recommend tightening shot discipline." ], low_performance: [ "Wave {wave} cleared, but {accuracy}% accuracy needs work.", "Victory secured. Focus on accuracy in next engagement.", "Mission accomplished, though your hit rate needs improvement." ] }, player_hit: { lives_critical: [ "Critical damage! Only {lives} lives remain - exercise extreme caution!", "Shields failing! {lives} backup systems active.", "Red alert! Hull integrity at {lives} units." ], lives_okay: [ "Shields damaged. {lives} lives remaining. Stay focused!", "Hit sustained. {lives} backup systems online.", "Damage taken. Maintain defensive posture." ] } }; function selectResponseTemplate(event, context) { const templates = RESPONSE_TEMPLATES[event]; if (!templates) return; // Choose template category based on context let category; if (event === 'wave_complete') { if (context.accuracy >= 75) category = templates.high_performance; else if (context.accuracy >= 50) category = templates.medium_performance; else category = templates.low_performance; } else if (event === 'player_hit') { category = context.lives <= 2 ? 
templates.lives_critical : templates.lives_okay; } // Randomly select from category for variety const template = category[Math.floor(Math.random() * category.length)]; // Fill in context variables return template .replace('{accuracy}', context.accuracy) .replace('{wave}', context.wave) .replace('{lives}', context.lives); } This personality system creates: Consistent character: AEGIS always sounds like a military commander, never breaks character Context-appropriate responses: Different situations trigger different tones (celebration vs concern) Natural variety: Template randomization prevents repetitive commentary Metric awareness: Specific references to accuracy, lives, waves show AI is "watching" Encouraging feedback: Even in failure scenarios, provides constructive guidance Key Takeaways and Game AI Design Patterns Integrating AI into browser games demonstrates that advanced features don't require cloud services or complex infrastructure. Local AI enables personality-driven enhancements that run entirely on player devices, cost nothing at scale, and work offline. 
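As a rough illustration of how the template system and the graceful-failure idea can fit together, here is a minimal sketch. The helper names `fillTemplate` and `commentWithFallback`, the `generate` callback, and the 2000 ms default timeout are all hypothetical and not taken from the repository; the sketch only assumes that the AI call returns a promise.

```javascript
// Hypothetical helper: replace every {placeholder} in a template with the
// matching value from context. Unknown placeholders are left untouched.
function fillTemplate(template, context) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(context, key)
      ? String(context[key])
      : match
  );
}

// Hypothetical helper: resolve with the AI comment, or with a static
// template-based comment if the AI call fails or takes longer than timeoutMs.
function commentWithFallback(generate, fallbackTemplate, context, timeoutMs = 2000) {
  const fallback = fillTemplate(fallbackTemplate, context);
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  // Wrapping in Promise.resolve().then(...) also catches synchronous throws.
  const attempt = Promise.resolve()
    .then(() => generate(context))
    .catch(() => fallback);
  return Promise.race([attempt, timeout]).finally(() => clearTimeout(timer));
}
```

A caller could pass something like `ctx => ai.generateComment(ctx)` as `generate` and one of the static templates as `fallbackTemplate`, so the player always sees some comment even when the model is unavailable. Using a regular expression with the `g` flag fills repeated placeholders, which a string-pattern `replace` would not.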
Essential principles for game AI integration:
- Progressive enhancement architecture: Core gameplay must work perfectly without AI; commentary enhances but isn't required
- Asynchronous-only integration: Never block the game loop for AI; 60 FPS gameplay is non-negotiable
- Context-aware generation: Commentary reflecting actual game state feels intelligent, while generic messages feel robotic
- Personality consistency: Well-defined character voice creates memorable experiences
- Graceful failure handling: AI errors should be invisible to players; fall back to static messages
- Performance-conscious polling: Check AI every few seconds, not every frame
- Event-driven commentary: Only generate responses for significant moments
This pattern extends beyond games: any interactive application benefits from context-aware AI personality, such as educational software providing personalized encouragement, fitness apps offering adaptive coaching, or productivity tools giving motivational feedback. The complete implementation with game engine, AI integration, backend proxy, and deployment instructions is available at github.com/leestott/Spaceinvaders-FoundryLocal. Clone the repository to experience AI-enhanced gaming: just open index.html and start playing immediately, then optionally enable AI features for dynamic commentary.
Resources and Further Reading
- Space Invaders with AI Repository - Complete game with AI integration
- Quick Start Guide - Play immediately or enable AI features
- Microsoft Foundry Local Documentation - SDK and model reference
- MDN Game Development - Browser game development patterns
- HTML5 Game Devs Forum - Community discussions and techniques
Benchmarking Local AI Models
Introduction
Selecting the right AI model for your application requires more than reading benchmark leaderboards. Published benchmarks measure academic capabilities (question answering, reasoning, coding), but your application has specific requirements: latency budgets, hardware constraints, quality thresholds. How do you know if Phi-4 provides acceptable quality for your document summarization use case? Will Qwen2.5-0.5B meet your 100ms response time requirement? Does your edge device have sufficient memory for Phi-3.5 Mini? The answer lies in empirical testing: running actual models on your hardware with your workload patterns. This article demonstrates building a comprehensive model benchmarking platform using FLPerformance, Node.js, React, and Microsoft Foundry Local. You'll learn how to implement scientific performance measurement, design meaningful benchmark suites, visualize multi-dimensional comparisons, and make data-driven model selection decisions. Whether you're evaluating models for production deployment, optimizing inference costs, or validating hardware specifications, this platform provides the tools for rigorous performance analysis.
Why Model Benchmarking Requires Purpose-Built Tools
You cannot assess model performance by running a few manual tests and noting the results. Scientific benchmarking demands controlled conditions, statistically significant sample sizes, multi-dimensional metrics, and reproducible methodology. Several factors make purpose-built tooling essential.
Performance is multi-dimensional. A model might excel at throughput (tokens per second) but suffer at latency (time to first token). Another might generate high-quality outputs slowly. Your application might prioritize consistency over average performance: a model with variable response times (high p95/p99 latency) creates poor user experiences even if averages look good. Measuring all dimensions simultaneously enables informed tradeoffs.
Hardware matters enormously. 
Benchmark results from NVIDIA A100 GPUs don't predict performance on consumer laptops. NPU acceleration changes the picture again. Memory constraints affect which models can even load. Test on your actual deployment hardware or comparable specifications to get actionable results.
Concurrency reveals bottlenecks. A model handling one request excellently might struggle with ten concurrent requests. Real applications experience variable load, so measuring only single-request performance misses critical scalability constraints. Controlled concurrency testing reveals these limits.
Statistical rigor prevents false conclusions. Running a prompt once and noting the response time tells you nothing about performance distribution. Was this result typical? An outlier? You need dozens or hundreds of trials to establish p50/p95/p99 percentiles, understand variance, and detect stability issues.
Comparison requires controlled experiments. Different prompts, times of day, and system loads all introduce confounding variables. Scientific comparison runs identical workloads across models sequentially, controlling for external factors.
Architecture: Three-Layer Performance Testing Platform
FLPerformance implements a clean separation between orchestration, measurement, and presentation:
The frontend React application provides model management, benchmark configuration, test execution, and results visualization. Users add models from the Foundry Local catalog, configure benchmark parameters (iterations, concurrency, timeout values), launch test runs, and view real-time progress. The results dashboard displays comparison tables, latency distribution charts, throughput graphs, and "best model for..." recommendations.
The backend Node.js/Express server orchestrates tests and captures metrics. 
It manages the single Foundry Local service instance, loads/unloads models as needed, executes benchmark suites with controlled concurrency, measures comprehensive metrics (TTFT, TPOT, total latency, throughput, error rates), and persists results to JSON storage. WebSocket connections provide real-time progress updates during long benchmark runs. Foundry Local SDK integration uses the official foundry-local-sdk npm package. The SDK manages the service lifecycle (starting, stopping, health checking) and handles model operations (downloading, loading into memory, unloading). It provides OpenAI-compatible inference APIs for consistent request formatting across models. The architecture compares multiple models by loading them one at a time, running identical benchmarks against each, and aggregating the results: User Initiates Benchmark Run ↓ Backend receives {models: [...], suite: "default", iterations: 10} ↓ For each model: 1. Load model into Foundry Local 2. Execute benchmark suite - For each prompt in suite: * Run N iterations * Measure TTFT, TPOT, total time * Track errors and timeouts * Calculate tokens/second 3. Aggregate statistics (mean, p50, p95, p99) 4. 
Unload model ↓ Store results with metadata ↓ Return comparison data to frontend ↓ Visualize performance metrics Implementing Scientific Measurement Infrastructure Accurate performance measurement requires instrumentation that captures multiple dimensions without introducing measurement overhead: // src/server/benchmark.js import { performance } from 'perf_hooks'; export class BenchmarkExecutor { constructor(foundryClient, options = {}) { this.client = foundryClient; this.options = { iterations: options.iterations || 10, concurrency: options.concurrency || 1, timeout_ms: options.timeout_ms || 30000, warmup_iterations: options.warmup_iterations || 2 }; } async runBenchmarkSuite(modelId, prompts) { const results = []; // Warmup phase (exclude from results) console.log(`Running ${this.options.warmup_iterations} warmup iterations...`); for (let i = 0; i < this.options.warmup_iterations; i++) { await this.executePrompt(modelId, prompts[0].text); } // Actual benchmark runs for (const prompt of prompts) { console.log(`Benchmarking prompt: ${prompt.id}`); const measurements = []; for (let i = 0; i < this.options.iterations; i++) { const measurement = await this.executeMeasuredPrompt( modelId, prompt.text ); measurements.push(measurement); // Small delay between iterations to stabilize await sleep(100); } results.push({ prompt_id: prompt.id, prompt_text: prompt.text, measurements, statistics: this.calculateStatistics(measurements) }); } return { model_id: modelId, timestamp: new Date().toISOString(), config: this.options, results }; } async executeMeasuredPrompt(modelId, promptText) { const measurement = { success: false, error: null, ttft_ms: null, // Time to first token tpot_ms: null, // Time per output token total_ms: null, tokens_generated: 0, tokens_per_second: 0 }; try { const startTime = performance.now(); let firstTokenTime = null; let tokenCount = 0; // Streaming completion to measure TTFT const stream = await this.client.chat.completions.create({ model: modelId, 
messages: [{ role: 'user', content: promptText }], max_tokens: 200, temperature: 0.7, stream: true }); for await (const chunk of stream) { if (chunk.choices[0]?.delta?.content) { if (firstTokenTime === null) { firstTokenTime = performance.now(); measurement.ttft_ms = firstTokenTime - startTime; } /* Counts streamed content chunks, which approximates the token count */ tokenCount++; } } const endTime = performance.now(); measurement.total_ms = endTime - startTime; measurement.tokens_generated = tokenCount; if (tokenCount > 1 && firstTokenTime) { // TPOT = time after first token / (tokens - 1) const timeAfterFirstToken = endTime - firstTokenTime; measurement.tpot_ms = timeAfterFirstToken / (tokenCount - 1); measurement.tokens_per_second = 1000 / measurement.tpot_ms; } measurement.success = true; } catch (error) { measurement.error = error.message; measurement.success = false; } return measurement; } calculateStatistics(measurements) { const successful = measurements.filter(m => m.success); const total = measurements.length; if (successful.length === 0) { return { success_rate: 0, error_rate: 1.0, sample_size: total }; } /* percentile() expects ascending input, so every series is sorted here */ const ttfts = successful.map(m => m.ttft_ms).filter(v => v !== null).sort((a, b) => a - b); const tpots = successful.map(m => m.tpot_ms).filter(v => v !== null).sort((a, b) => a - b); const totals = successful.map(m => m.total_ms).sort((a, b) => a - b); const throughputs = successful.map(m => m.tokens_per_second).filter(v => v > 0).sort((a, b) => a - b); return { success_rate: successful.length / total, error_rate: (total - successful.length) / total, sample_size: total, ttft: { mean: mean(ttfts), median: percentile(ttfts, 50), p95: percentile(ttfts, 95), p99: percentile(ttfts, 99), min: Math.min(...ttfts), max: Math.max(...ttfts) }, tpot: tpots.length > 0 ? 
{ mean: mean(tpots), median: percentile(tpots, 50), p95: percentile(tpots, 95) } : null, total_latency: { mean: mean(totals), median: percentile(totals, 50), p95: percentile(totals, 95), p99: percentile(totals, 99) }, throughput: { mean_tps: mean(throughputs), median_tps: percentile(throughputs, 50) } }; } } function mean(arr) { return arr.reduce((sum, val) => sum + val, 0) / arr.length; } function percentile(sortedArr, p) { const index = Math.ceil((sortedArr.length * p) / 100) - 1; return sortedArr[Math.max(0, index)]; } function sleep(ms) { return new Promise(resolve => setTimeout(resolve, ms)); } This measurement infrastructure captures: Time to First Token (TTFT): Critical for perceived responsiveness—users notice delays before output begins Time Per Output Token (TPOT): Determines generation speed after first token—affects throughput Total latency: End-to-end time—matters for batch processing and high-volume scenarios Tokens per second: Overall throughput metric—useful for capacity planning Statistical distributions: Mean alone masks variability—p95/p99 reveal tail latencies that impact user experience Success/error rates: Stability metrics—some models timeout or crash under load Designing Meaningful Benchmark Suites Benchmark quality depends on prompt selection. Generic prompts don't reflect real application behavior. Design suites that mirror actual use cases: // benchmarks/suites/default.json { "name": "default", "description": "General-purpose benchmark covering diverse scenarios", "prompts": [ { "id": "short-factual", "text": "What is the capital of France?", "category": "factual", "expected_tokens": 5 }, { "id": "medium-explanation", "text": "Explain how photosynthesis works in 3-4 sentences.", "category": "explanation", "expected_tokens": 80 }, { "id": "long-reasoning", "text": "Analyze the economic factors that led to the 2008 financial crisis. 
Discuss at least 5 major causes with supporting details.", "category": "reasoning", "expected_tokens": 250 }, { "id": "code-generation", "text": "Write a Python function that finds the longest palindrome in a string. Include docstring and example usage.", "category": "coding", "expected_tokens": 150 }, { "id": "creative-writing", "text": "Write a short story (3 paragraphs) about a robot learning to paint.", "category": "creative", "expected_tokens": 200 } ] }
This suite covers multiple dimensions:
- Length variation: Short (5 tokens), medium (80), long (250), which tests models across output ranges
- Task diversity: Factual recall, explanation, reasoning, code, creative, which reveals capability breadth
- Token predictability: Expected token counts enable throughput calculations
For production applications, create custom suites matching your actual workload:
{ "name": "customer-support", "description": "Simulates actual customer support queries", "prompts": [ { "id": "product-question", "text": "How do I reset my password for the customer portal?" }, { "id": "troubleshooting", "text": "I'm getting error code 503 when trying to upload files. What should I do?" }, { "id": "policy-inquiry", "text": "What is your refund policy for annual subscriptions?" } ] }
Visualizing Multi-Dimensional Performance Comparisons
Raw numbers don't reveal insights; visualization makes patterns obvious. The frontend implements several comparison views:
Comparison Table shows side-by-side metrics:
/* frontend/src/components/ResultsTable.jsx */ export function ResultsTable({ results }) { return ( <table> <thead> <tr> <th>Model</th> <th>TTFT (ms)</th> <th>TPOT (ms)</th> <th>Throughput (tok/s)</th> <th>P95 Latency</th> <th>Error Rate</th> </tr> </thead> <tbody> {results.map(result => ( <tr key={result.model_id}> <td>{result.model_id}</td> <td>{result.stats.ttft.median.toFixed(0)} (p95: {result.stats.ttft.p95.toFixed(0)})</td> <td>{result.stats.tpot?.median.toFixed(1) || 'N/A'}</td> <td>{result.stats.throughput.median_tps.toFixed(1)}</td> <td>{result.stats.total_latency.p95.toFixed(0)} ms</td> <td className={result.stats.error_rate > 0.05 ? 'error' : 'success'}> {(result.stats.error_rate * 100).toFixed(1)}% </td> </tr> ))} </tbody> </table> ); }
Latency Distribution Chart reveals performance consistency:
/* Using Chart.js for visualization. Assumption: the chart element is the Bar component from react-chartjs-2; the original element is not preserved in this text. */ export function LatencyChart({ results }) { const data = { labels: results.map(r => r.model_id), datasets: [ { label: 'Median (p50)', data: results.map(r => r.stats.total_latency.median), backgroundColor: 'rgba(75, 192, 192, 0.5)' }, { label: 'p95', data: results.map(r => r.stats.total_latency.p95), backgroundColor: 'rgba(255, 206, 86, 0.5)' }, { label: 'p99', data: results.map(r => r.stats.total_latency.p99), backgroundColor: 'rgba(255, 99, 132, 0.5)' } ] }; return ( <Bar data={data} /> ); }
Recommendations Engine synthesizes multi-dimensional comparison:
export function generateRecommendations(results) { const recommendations = []; // Find fastest TTFT (best perceived responsiveness) const fastestTTFT = results.reduce((best, r) => r.stats.ttft.median < best.stats.ttft.median ? r : best ); recommendations.push({ category: 'Fastest Response', model: fastestTTFT.model_id, reason: `Lowest median TTFT: ${fastestTTFT.stats.ttft.median.toFixed(0)}ms` }); // Find highest throughput const highestThroughput = results.reduce((best, r) => r.stats.throughput.median_tps > best.stats.throughput.median_tps ? r : best ); recommendations.push({ category: 'Best Throughput', model: highestThroughput.model_id, reason: `Highest tok/s: ${highestThroughput.stats.throughput.median_tps.toFixed(1)}` }); // Find most consistent (lowest p95-p50 spread) const mostConsistent = results.reduce((best, r) => { const spread = r.stats.total_latency.p95 - r.stats.total_latency.median; const bestSpread = best.stats.total_latency.p95 - best.stats.total_latency.median; return spread < bestSpread ? 
r : best; }); recommendations.push({ category: 'Most Consistent', model: mostConsistent.model_id, reason: 'Lowest latency variance (p95-p50 spread)' }); return recommendations; }
Key Takeaways and Benchmarking Best Practices
Effective model benchmarking requires scientific methodology, comprehensive metrics, and application-specific testing. FLPerformance demonstrates that rigorous performance measurement is accessible to any development team. Critical principles for model evaluation:
- Test on target hardware: Results from cloud GPUs don't predict laptop performance
- Measure multiple dimensions: TTFT, TPOT, throughput, consistency all matter
- Use statistical rigor: Single runs mislead, so capture distributions with adequate sample sizes
- Design realistic workloads: Generic benchmarks don't predict your application's behavior
- Include warmup iterations: Model loading and JIT compilation affect early measurements
- Control concurrency: Real applications handle multiple requests, so test at realistic loads
- Document methodology: Reproducible results require documented procedures and configurations
The complete benchmarking platform with model management, measurement infrastructure, visualization dashboards, and comprehensive documentation is available at github.com/leestott/FLPerformance. Clone the repository and run the startup script to begin evaluating models on your hardware.
Resources and Further Reading
- FLPerformance Repository - Complete benchmarking platform
- Quick Start Guide - Setup and first benchmark run
- Microsoft Foundry Local Documentation - SDK reference and model catalog
- Architecture Guide - System design and SDK integration
- Benchmarking Best Practices - Methodology and troubleshooting
Azure Landing Zones Accelerators for Bicep and Terraform. Announcing General Availability!
Azure Landing Zones Accelerators are designed to simplify the process of onboarding your Infrastructure as Code into a robust CI / CD pipeline with Azure DevOps or GitHub. Learn more about what the Accelerator can do for you and why you should be using it.
From Cloud to Chip: Building Smarter AI at the Edge with Windows AI PCs
As AI engineers, we’ve spent years optimizing models for the cloud: scaling inference, wrangling latency, and chasing compute across clusters. But the frontier is shifting. With the rise of Windows AI PCs and powerful local accelerators, the edge is no longer a constraint; it’s now a canvas. Whether you're deploying vision models to industrial cameras, optimizing speech interfaces for offline assistants, or building privacy-preserving apps for healthcare, Edge AI is where real-world intelligence meets real-time performance.
Why Edge AI, Why Now?
Edge AI isn’t just about running models locally; it’s about rethinking the entire lifecycle:
- Latency: Decisions in milliseconds, not round-trips to the cloud.
- Privacy: Sensitive data stays on-device, supporting HIPAA/GDPR compliance.
- Resilience: Offline-first apps that don’t break when the network does.
- Cost: Reduced cloud compute and bandwidth overhead.
With Windows AI PCs powered by Intel and Qualcomm NPUs and tools like ONNX Runtime, DirectML, and Olive, developers can now optimize and deploy models with unprecedented efficiency.
What You’ll Learn in Edge AI for Beginners
The Edge AI for Beginners curriculum is a hands-on, open-source guide designed for engineers ready to move from theory to deployment.
Multi-Language Support
This content is available in over 48 languages, so you can read and study in your native language. 
What You'll Master This course takes you from fundamental concepts to production-ready implementations, covering: Small Language Models (SLMs) optimized for edge deployment Hardware-aware optimization across diverse platforms Real-time inference with privacy-preserving capabilities Production deployment strategies for enterprise applications Why EdgeAI Matters Edge AI represents a paradigm shift that addresses critical modern challenges: Privacy & Security: Process sensitive data locally without cloud exposure Real-time Performance: Eliminate network latency for time-critical applications Cost Efficiency: Reduce bandwidth and cloud computing expenses Resilient Operations: Maintain functionality during network outages Regulatory Compliance: Meet data sovereignty requirements Edge AI Edge AI refers to running AI algorithms and language models locally on hardware, close to where data is generated without relying on cloud resources for inference. It reduces latency, enhances privacy, and enables real-time decision-making. 
Core Principles: On-device inference: AI models run on edge devices (phones, routers, microcontrollers, industrial PCs) Offline capability: Functions without persistent internet connectivity Low latency: Immediate responses suited for real-time systems Data sovereignty: Keeps sensitive data local, improving security and compliance Small Language Models (SLMs) SLMs like Phi-4, Mistral-7B, Qwen and Gemma are optimized versions of larger LLMs, trained or distilled for: Reduced memory footprint: Efficient use of limited edge device memory Lower compute demand: Optimized for CPU and edge GPU performance Faster startup times: Quick initialization for responsive applications They unlock powerful NLP capabilities while meeting the constraints of: Embedded systems: IoT devices and industrial controllers Mobile devices: Smartphones and tablets with offline capabilities IoT Devices: Sensors and smart devices with limited resources Edge servers: Local processing units with limited GPU resources Personal Computers: Desktop and laptop deployment scenarios Course Modules & Navigation Course duration. 
10 hours of content
Module | Topic | Focus Area | Key Content | Level | Duration
📖 00 | Introduction to EdgeAI | Foundation & Context | EdgeAI Overview • Industry Applications • SLM Introduction • Learning Objectives | Beginner | 1-2 hrs
📚 01 | EdgeAI Fundamentals | Cloud vs Edge AI comparison | EdgeAI Fundamentals • Real World Case Studies • Implementation Guide • Edge Deployment | Beginner | 3-4 hrs
🧠 02 | SLM Model Foundations | Model families & architecture | Phi Family • Qwen Family • Gemma Family • BitNET • μModel • Phi-Silica | Beginner | 4-5 hrs
🚀 03 | SLM Deployment Practice | Local & cloud deployment | Advanced Learning • Local Environment • Cloud Deployment | Intermediate | 4-5 hrs
⚙️ 04 | Model Optimization Toolkit | Cross-platform optimization | Introduction • Llama.cpp • Microsoft Olive • OpenVINO • Apple MLX • Workflow Synthesis | Intermediate | 5-6 hrs
🔧 05 | SLMOps Production | Production operations | SLMOps Introduction • Model Distillation • Fine-tuning • Production Deployment | Advanced | 5-6 hrs
🤖 06 | AI Agents & Function Calling | Agent frameworks & MCP | Agent Introduction • Function Calling • Model Context Protocol | Advanced | 4-5 hrs
💻 07 | Platform Implementation | Cross-platform samples | AI Toolkit • Foundry Local • Windows Development | Advanced | 3-4 hrs
🏭 08 | Foundry Local Toolkit | Production-ready samples | Sample applications (see details below) | Expert | 8-10 hrs
Each module includes Jupyter notebooks, code samples, and deployment walkthroughs, perfect for engineers who learn by doing.
Developer Highlights
- 🔧 Olive: Microsoft's optimization toolchain for quantization, pruning, and acceleration.
- 🧩 ONNX Runtime: Cross-platform inference engine with support for CPU, GPU, and NPU.
- 🎮 DirectML: GPU-accelerated ML API for Windows, ideal for gaming and real-time apps.
- 🖥️ Windows AI PCs: Devices with built-in NPUs for low-power, high-performance inference.
Local AI: Beyond the Edge
Local AI isn’t just about inference; it’s about autonomy. 
Imagine agents that:
- Learn from local context
- Adapt to user behavior
- Respect privacy by design
With tools like Agent Framework, Azure AI Foundry, Windows Copilot Studio, and Foundry Local, developers can orchestrate local agents that blend LLMs, sensors, and user preferences, all without cloud dependency.
Try It Yourself
Ready to get started? Clone the Edge AI for Beginners GitHub repo, run the notebooks, and deploy your first model to a Windows AI PC or IoT device. Whether you're building smart kiosks, offline assistants, or industrial monitors, this curriculum gives you the scaffolding to go from prototype to production.