Surface IT Pro Blog

Vibe Coding for the NPU

FrankBuchholz
Microsoft
Mar 04, 2026

How to Build On-Device AI Apps on Copilot+ PCs

If you just bought a Copilot+ PC and want to know what that NPU can actually do, this is for you. If you manage a fleet of them and need workload placement guidance, this is for you too.

Cloud doesn’t have to be the default anymore. The NPU (Neural Processing Unit) is now a practical and accessible development target, making it possible to run advanced AI workloads directly on your device. This means you can take advantage of your PC’s local hardware to get faster results, lower latency, and even work offline or in airplane mode. Getting started with NPU development is easier than you might expect.

Foundry Local serves models through an OpenAI-compatible endpoint on localhost, so if you’ve used the OpenAI SDK, you already know how to build for the NPU. Combine that with AI-augmented coding and you don’t need to be a pro developer... you just need to be specific about what you want.

I’m in Surface Marketing. I haven’t done daily development work since the Petzold “Programming Windows” days... back when writing a hello world app meant 90 lines of C and a WndProc callback. Early in my career I quickly figured out I wasn’t the most talented dev in the room, but I was good at integrating hardware solutions and telling the story of what they could do. Now, with vibe coding, the thing that wasn’t my vibe has become a superpower... “wait, could I vibe code that?” And the answer keeps being yes.

I just built a working on-device AI application with four tabs and five AI tasks by describing features to an AI coding assistant. This post is everything I learned along the way: the platform, the tools, the workflow, and the real considerations that only surface when you’re actually building on the hardware.

Why does this matter? Three reasons: availability, economics, and data sovereignty. Details below.


Why Build for the NPU?

For years, the cloud has been powering our most demanding AI workloads. That makes sense for frontier reasoning, complex agent chains, the hard stuff. But the models that run locally now are good enough for the majority of what your fleet actually does. And by “locally” I mean the device in your user’s bag... their Surface on the train, in a customer lobby, on a job site. Not a server. Not a VM. The actual endpoint, with its own dedicated AI silicon.

Availability. An NPU-powered workflow runs in airplane mode, in a clean room, on a factory floor, in a field inspection truck in a dead zone. No connectivity required. The AI is on the device, and the device is wherever your user is.

Economics. NPU inference comes at no additional per-inference cost after the hardware purchase. No per-token API fees, no egress charges, no metered compute. I think of it as an 80/20 planning model: the majority of routine AI tasks can run locally at no incremental cost, reserving premium cloud inference for the tasks that genuinely need frontier reasoning.

Data sovereignty. When operating without cloud escalation, data is processed locally on the NPU, reducing your reliance on cross-border data transfers. As with any deployment, customers should assess their specific regulatory, legal, and operational requirements, including endpoint management, device access controls, and applicable local laws. For regulated environments, on-device AI can be a powerful part of a broader compliance and data governance strategy, but it does not replace the need for appropriate legal, contractual, and security controls.


What Is Vibe Coding?

The industry calls it AI-augmented development. The internet calls it vibe coding. Same thing: you describe what you want, an AI coding assistant writes the code, you run it, you tell it what broke, it fixes it. Repeat. You’re the architect and the QA engineer. The AI handles the implementation.

Examples include GitHub Copilot CLI, Cursor, and similar AI coding assistants; the specific tool matters less than the workflow. The AI handles the boilerplate... Flask routes, CSS layout, regex, JavaScript event handlers. You handle the architecture decisions, the hardware testing, and the product judgment.

Why does this matter for NPU development specifically? Because AI coding assistants already know the OpenAI SDK patterns inside and out, and Foundry Local speaks a compatible API. “Compatible” means the standard SDK works for chat completions and most common inference patterns, though you may encounter minor differences in model IDs or streaming behavior. Microsoft’s documentation frames this as compatibility with “OpenAI-compatible SDKs and HTTP clients.” In practice, for the workloads covered in this post, the SDK works as-is. The barrier is significantly lower than it used to be... if you can describe a workflow, you can start building for the NPU.


How It Started: From Playground Curiosity to Working App

This is where the IT pro lesson starts: validate the model first, then write code. This didn’t start with a plan. It started with a click.

I opened VS Code, installed the AI Toolkit extension, and browsed the Model Catalog. There was a model already on the device: Phi Silica. I loaded it in the Playground, typed a message, and... it responded. On-device. No API key. No cloud endpoint. Just the NPU doing inference right there in VS Code.

Wait. I remembered from using LM Studio that local models get served on a port. If Foundry Local works the same way... could I write a web app that talks to it?

So I opened GitHub Copilot CLI and just started describing what I wanted:

“I have a local AI model running through Foundry Local on my Surface. It exposes a compatible API on a local port. Build me a Flask web app that connects to it and serves a chat interface.”

And we were off.

Copilot generated a working Flask app. I ran it. It connected to the local model. I could chat with an AI running entirely on the NPU through a web browser. No cloud, no API key, no subscription. Wild.

From there, the vibe coding loop took over. Each session I’d just describe the next thing I wanted: a sidebar with tabs, a daily briefing from local calendar data, a two-brain router for local-vs-cloud workload decisions, and finally a full field inspection workflow with voice, camera, pen annotation, and translation.

The real acceleration came when I moved from copy-paste iteration to GitHub Copilot CLI, which reads and edits the full codebase directly on the device. That’s when it went from a cool demo to a genuine multi-tab application with real architecture. Kevin Roose and Casey Newton on the Hard Fork podcast recently compared the leap in AI coding tools to the original ChatGPT moment... and having lived it, that tracks. (If you’re not listening to Hard Fork, you should be.)

Python, HTML, CSS, and JavaScript. Built primarily through conversation with AI coding assistants. By a marketing person who hadn’t done daily dev work since writing WndProc callbacks in C.


The Starting Point: AI Toolkit Model Catalog

Before writing any code, start where I started: the AI Toolkit for VS Code. Two things matter here: the Model Catalog and the Playground.

Open the Model Catalog and filter for models optimized for local NPU execution:

| Model | Parameters | Strengths | NPU Support |
|---|---|---|---|
| Phi-4 Mini | 3.8B | General text: summarization, extraction, generation, translation | Intel (OpenVINO), Qualcomm (QNN) |
| Phi Silica | N/A | On-device language and multimodal scenarios on Copilot+ PCs | Intel & Qualcomm (Windows AI APIs) |
| Qwen 2.5 | 7B | General text: larger context, stronger reasoning | Qualcomm (QNN), GPU fallback |
| Additional models | Varies | The catalog is growing; check for new NPU-optimized variants regularly | Varies by silicon |

For the full local model landscape, see Ready-to-use local LLMs in Microsoft Foundry on Windows and the Windows AI overview decision tree.

The Playground is your validation sandbox. Select a model, load it, chat with it directly in VS Code. Test your prompts. Feel the latency. Hit the context limits. All before writing a line of application code. It’s also the most reliable method we’ve found to trigger the initial model download and ready-state provisioning for Phi Silica.


Platform Requirements

Hardware

Any Copilot+ PC with an NPU. Surface consumer Copilot+ PCs are currently Snapdragon X. Surface commercial (for Business) Copilot+ PCs are available with both Snapdragon X and Intel Core Ultra.

| Silicon | NPU | Example Surface Devices |
|---|---|---|
| Intel Core Ultra (Lunar Lake) | Intel AI Boost, OpenVINO runtime | Surface Laptop for Business, Surface Pro for Business (Commercial) |
| Qualcomm Snapdragon X (Elite/Plus) | Hexagon NPU, QNN runtime | Surface Laptop, Surface Pro (Consumer); Surface Laptop for Business, Surface Pro for Business (Commercial) |

Software Stack

Scope: Everything in this post is Windows-only. Foundry Local, AI Toolkit, and the NPU runtimes covered here require Windows 11 on Copilot+ PC hardware (Intel Core Ultra or Qualcomm Snapdragon X).

A note on Foundry Local: Foundry Local is currently in public preview. There is no SLA and no backward compatibility guarantee. Expect changes between releases and validate in staged rings before broad deployment. For the latest, see the Foundry Local documentation hub.

| Component | Install | Learn More |
|---|---|---|
| Windows 11 24H2+ | Windows Update | Windows AI APIs |
| Foundry Local (preview) | winget install Microsoft.FoundryLocal | Get started |
| Python 3.10+ | winget install Python.Python.3.11 | — |
| foundry-local-sdk | pip install foundry-local-sdk | SDK reference |
| OpenAI Python SDK | pip install openai | SDK integration guide |
| Flask | pip install flask | — |
| VS Code + AI Toolkit | VS Code Marketplace | AI Toolkit overview |

⚠️ Package name warning: Install foundry-local-sdk (the official Microsoft SDK). Don’t confuse it with similarly named packages on PyPI. At time of writing, foundry-local (v0.0.1) is not the official SDK and may fail (or behave unexpectedly). Verify against Microsoft’s SDK documentation and PyPI.


Your First NPU App in Minutes

For the official quickstart, see Get started with Foundry Local. What follows adds the vibe coding workflow on top.

Step 1: Install the Runtime

winget install Microsoft.FoundryLocal
pip install foundry-local-sdk openai flask

Step 2: Prompt Your AI Coding Assistant

“Build me a Flask app that serves a chat interface on localhost:5000. The backend should use the OpenAI Python SDK pointed at a local Foundry Local runtime. Use the foundry-local-sdk to get the endpoint URL dynamically (don’t hardcode the port). The model alias is ‘phi-4-mini’. Make it a single-file app with the HTML inline.”

The assistant will generate something close to this:

from flask import Flask, request, jsonify
from openai import OpenAI
from foundry_local import FoundryLocalManager

# Start Foundry Local and discover the endpoint dynamically
manager = FoundryLocalManager("phi-4-mini")
client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
model_id = manager.get_model_info("phi-4-mini").id

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    user_msg = request.json["message"]
    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg}
        ],
        max_tokens=512
    )
    return jsonify({"reply": response.choices[0].message.content})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000, debug=True)

The key line is base_url=manager.endpoint. Don’t hardcode localhost:5272 or any other port; Foundry Local assigns a dynamic port each time the service starts. Always let foundry-local-sdk resolve the active endpoint programmatically. For details, see the Foundry Local SDK reference.

Step 3: Run It

python app.py
# First run downloads the model (~3 GB). Subsequent starts are near-instant.
# Open http://localhost:5000

First model download requires network connectivity. After that, the model is cached locally and launches work fully offline. In practice, runtime or model updates and repair flows can occasionally re-trigger downloads. For cache management details, see the Foundry Local CLI reference.

Step 4: Iterate

Keep describing features. The AI coding assistant handles the implementation. You handle the testing on the actual device and the “this doesn’t work on the NPU” feedback that no AI assistant can discover on its own. Hardware-in-the-loop is the key. Don’t write a full spec. Write one feature at a time, test it, then describe the next.


Cross-Platform Development: Intel and Qualcomm

Foundry Local handles model selection across silicon families automatically through its alias system. For most developers and single-device use, you write one codebase and it works. If you’re deploying to a mixed enterprise fleet, here are the platform-specific optimizations we’ve found.

Architecture Detection Under Emulation

On Windows on ARM, x64 Python running under emulation reports AMD64 from platform.machine(), and the PROCESSOR_ARCHITECTURE environment variable is just as misleading. Query WMI for the real processor name instead:

import subprocess

def detect_silicon() -> str:
    """Query WMI for the real CPU name; reliable even under x64 emulation."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         "(Get-CimInstance Win32_Processor).Name"],
        capture_output=True, text=True, timeout=5,
    )
    cpu = result.stdout.strip().lower()
    if "qualcomm" in cpu or "snapdragon" in cpu:
        return "qualcomm"
    return "intel"

Rule: on Snapdragon devices, run native ARM64 Python wherever possible. You get better performance and fewer emulation quirks than under x64 emulation.

Model Compatibility Varies by Silicon

Observed behavior during testing on preview runtimes (February 2026). Results may change with future Foundry Local or driver updates.

| Model | Intel (OpenVINO NPU) | Qualcomm (QNN NPU) | GPU Fallback |
|---|---|---|---|
| Phi-4 Mini 3.8B | ✅ Stable | ⏳ NPU variant in development | ✅ Both |
| Phi-3.5 Mini | ✅ Stable | ✅ NPU (Foundry 0.8.119+) | ✅ Both |
| Qwen 2.5 7B | ✅ Stable | ✅ Stable | ✅ Both |
| Phi Silica (text + vision) | ✅ NPU | ✅ NPU | N/A (Windows AI) |

Foundry Local handles variant selection by alias. You request phi-4-mini, it pulls the best available build for your hardware. Model availability across NPU execution providers is expanding with each Foundry Local release. For tool-calling workloads on Qualcomm today, Qwen 2.5 7B is fully NPU-accelerated and production-ready. Your AI coding assistant handles the detection and model routing automatically.

Runtime Lifecycle Differs

if SILICON == "qualcomm":
    # Skip warmup - QNN is unstable with rapid reconnection attempts
    print("Skipping warmup on Qualcomm (first request will load model)...")
else:
    # Intel: warmup + keepalive for consistent latency
    warmup_model()
    start_keepalive_thread(interval=180)

Intel benefits from warmup and keepalive pings every 3 minutes. Qualcomm is the opposite: warmup destabilizes the QNN runtime. Let the first real request trigger model load. Use aggressive auto-reconnection on both platforms. These are observations from preview runtimes and may evolve.
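The warmup_model and start_keepalive_thread helpers referenced above aren’t part of any SDK. Here’s one way they might look, assuming the client and model_id objects from the quickstart; this is a sketch, not the demo repo’s exact implementation:

```python
import threading
import time

def warmup_model(client, model_id):
    """One tiny request to force the model to load before real traffic arrives."""
    client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )

def start_keepalive_thread(client, model_id, interval=180):
    """Ping the model every `interval` seconds so it stays resident (Intel only)."""
    def loop():
        while True:
            time.sleep(interval)
            try:
                warmup_model(client, model_id)
            except Exception:
                pass  # let the app's reconnection logic handle outages
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

The keepalive runs as a daemon thread, so it dies with the process instead of keeping a shutdown hanging.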

NPU-to-GPU Fallback Chain

from openai import OpenAI
from foundry_local import FoundryLocalManager

try:
    manager = FoundryLocalManager("phi-4-mini")  # NPU first
    client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
except Exception:
    try:
        from foundry_local.api import DeviceType
        manager = FoundryLocalManager("phi-4-mini", device=DeviceType.GPU)
        client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
    except Exception:
        # ⚠️ NOT RECOMMENDED FOR PRODUCTION - last resort only.
        # The port is dynamic; this may break if the service restarts.
        client = OpenAI(
            base_url="http://localhost:5272/v1",
            api_key="not-needed"
        )

For additional troubleshooting, see Foundry Local best practices.


Designing for the Token Budget

This is the constraint that shapes everything. Context limits vary by model. Phi Silica has a ~4K window; Phi-4 Mini supports much larger contexts, but practical token budgets still matter for latency and cost. Observed ranges: structured field extraction ~1,884 tokens, document classification ~500 tokens, morning briefing ~2,200 tokens.
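One way to stay inside that window is a crude pre-flight budget check. The ~4-characters-per-token heuristic below is an approximation, not the model’s real tokenizer, but it’s enough to keep a request from overflowing a ~4K context:

```python
# Rough token budgeting for a ~4K-context model like Phi Silica.
# Heuristic: ~4 characters per token for English text (an approximation).

CONTEXT_WINDOW = 4096
RESERVED_FOR_OUTPUT = 512  # leave room for the model's reply

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_to_budget(system_prompt: str, user_text: str) -> str:
    """Truncate user_text so prompt + reply fit inside the context window."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT - estimate_tokens(system_prompt)
    max_chars = budget * 4
    return user_text[:max_chars]
```

For production you’d swap in the model’s actual tokenizer, but a conservative heuristic like this already prevents the silent truncation that otherwise shows up as garbled output.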

Design principle: build single-shot endpoints that do one thing well. Instead of “one chatbot that does everything,” build endpoints like /extract-fields, /classify-doc, /summarize. One call, one job, one clean response.
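A sketch of that principle: one generic single-shot helper plus a small library of task prompts, with the inference call injected as a callable so each endpoint stays one call, one job. The prompt texts and names here are illustrative, not from the demo repo:

```python
def single_shot(complete, system_prompt, user_text, max_tokens=256):
    """One call, one job: no conversation history, bounded output."""
    return complete(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        max_tokens=max_tokens,
    )

# Illustrative task prompts -- one per endpoint, each does exactly one thing.
EXTRACT_FIELDS = "Extract location, issue type, and severity as JSON. JSON only."
CLASSIFY_DOC = "Classify as invoice, contract, or report. One word only."
SUMMARIZE = "Summarize in three bullet points."
```

In a Flask app, each of /extract-fields, /classify-doc, and /summarize becomes a thin route that calls single_shot with its own prompt and a tight max_tokens.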

And honestly? This constraint produces better architecture. Not every AI task needs a frontier model. Most of them are casual... summarization, extraction, classification, one-shot and done. It maps directly to the 80/20 planning model: the NPU handles the 80% that doesn’t need GPT-4.


Phi Silica: On-Device Vision

If you don’t need vision capabilities for your first app, skip this section. Start with Foundry Local + Phi-4 Mini. Come back when you need on-device image classification.

Phi Silica requires a few more setup steps than Foundry Local, but it unlocks on-device vision capabilities that no cloud-dependent workflow can match. Phi Silica is Microsoft’s on-device model accessed through the Windows AI APIs, supporting on-device language and multimodal scenarios available on Copilot+ PCs. For the full walkthrough, see Get started with Phi Silica and the Phi Silica tutorial.

Unlike Foundry Local (standard REST API), Phi Silica is a Windows API requiring: MSIX (Microsoft’s modern app packaging format) packaging with the systemAIModels restricted capability, a LAF (Limited Access Feature) token tied to your Package Family Name, and model provisioning on-device. Development builds may relax LAF enforcement; production deployments should always assume a valid token is required. See Windows AI API troubleshooting for LAF details.

Minimum Viable Checklist

  • ☐ MSIX-packaged app with systemAIModels capability
  • ☐ Phi Silica model readied on-device (AI Toolkit Playground or AI Dev Gallery)
  • ☐ App checks GetReadyState() before calling vision APIs
  • ☐ LAF token from Microsoft
  • ☐ Health endpoint for fallback detection

The Sidecar Pattern

Encapsulate the Windows API complexity in a small C# ASP.NET Core service rather than MSIX-packaging your entire app:

Flask app (localhost:5000)  →  Vision Service (localhost:5100)  →  Phi Silica on NPU
                                      ↑
                               MSIX packaged, LAF token

The Vision Service exposes /health, /classify, /describe, /extract-text. Your primary app stays a standard Python web application.
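On the Flask side, a health probe like this (assuming the sidecar’s /health route described above) lets the app decide whether to route vision work to the sidecar or drop to a fallback tier. A minimal sketch using only the standard library:

```python
from urllib.request import urlopen
from urllib.error import URLError

VISION_SERVICE = "http://localhost:5100"  # sidecar address from the diagram

def vision_available(base_url: str = VISION_SERVICE, timeout: float = 2.0) -> bool:
    """Probe the sidecar's /health endpoint before routing vision work to it."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```

Check this once per request (or cache it briefly), and treat a False result as the trigger to fall back to a text-only tier.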


The Three-Tier Fallback Pattern

If there’s one architecture principle to take from this entire post, it’s this: never let the app fail silently. Models will hang. Drivers will update. The NPU will occasionally just... not cooperate.

| Tier | Strategy | Example: Photo Classification |
|---|---|---|
| Tier 1 | Full AI pipeline (preferred) | Phi Silica Vision analyzes the actual image |
| Tier 2 | Simpler AI approach (degraded but functional) | Phi-4 Mini infers from the filename |
| Tier 3 | Hardcoded safe default (always works) | Pre-baked classification for known scenarios |

Build these tiers from the start. They cost almost nothing to implement and they save everything when it matters.
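The tiers generalize into a small helper: try each strategy in order, swallow the failure, and return the hardcoded default when everything else is down. The function name and shape below are mine, not the demo repo’s:

```python
def with_fallback(tiers, default):
    """Try each tier callable in order; fall back to a hardcoded safe default.

    tiers: list of zero-argument callables, most-preferred first.
    Returns a dict recording which tier actually answered.
    """
    for tier_num, fn in enumerate(tiers, start=1):
        try:
            return {"tier": tier_num, "result": fn()}
        except Exception:
            continue  # never fail silently upward -- drop to the next tier
    return {"tier": len(tiers) + 1, "result": default}
```

For the photo-classification example, tier 1 would wrap the Phi Silica vision call, tier 2 a filename-based Phi-4 Mini call, and the default a “needs manual review” label. Logging the returned tier number also gives you the fallback telemetry recommended in the operations section.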


Operationalizing NPU Apps in Enterprise

Foundry Local is in preview, so treat everything here as a framework that will evolve. That said, if you’re thinking about fleet deployment, these are the things that will bite you if you don’t plan for them.

Model acquisition and offline readiness. Models download on first use and cache locally (typically %LOCALAPPDATA%\.foundry\cache\models). Cache is per-user context, so plan pre-provisioning accordingly.

Update cadence. No SLA, no backward compatibility guarantee. Test in rings (dev → pilot → broad). Pin runtime versions. Revalidate after every Foundry Local or Windows update.

Logging. Track inference latency, token counts, fallback tier used, model load times. Do not log raw prompts or responses containing user data or PII (personally identifiable information).

Cloud escalation policy. Make cloud routing an explicit, auditable decision: local-by-default, cloud-only-when-invoked, with a log entry every time data leaves the device. Provide an admin toggle to disable cloud escalation entirely for sensitive environments.
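A sketch of what that decision point might look like in code. The toggle name and audit-entry shape are illustrative, and note the log records metadata only, never prompt content, in line with the logging guidance above:

```python
import time

CLOUD_ESCALATION_ENABLED = True  # admin toggle; set False for locked-down fleets

def route_request(task: str, needs_frontier: bool, audit_log: list) -> str:
    """Local-by-default routing; every cloud escalation leaves an audit entry."""
    if needs_frontier and CLOUD_ESCALATION_ENABLED:
        audit_log.append({
            "ts": time.time(),
            "task": task,
            "decision": "cloud",
            "reason": "frontier reasoning required",
        })
        return "cloud"
    audit_log.append({"ts": time.time(), "task": task, "decision": "local"})
    return "local"
```

In a real deployment the audit_log would be a persistent, tamper-evident sink rather than an in-memory list, but the shape of the decision is the point: explicit, logged, and disable-able.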

Security. Foundry Local binds to localhost. Only local processes can reach the model endpoint. Don’t proxy to external interfaces without explicit security controls.

Packaging. The runtime installs via winget install Microsoft.FoundryLocal (use --scope machine for machine-wide deployment; Intune-compatible). Stage three components: the runtime, pre-cached model(s), and your application. The demo repo’s setup.ps1 validates each.

Fleet heterogeneity. If your fleet spans Intel and Qualcomm, the WMI detection pattern and three-tier fallback architecture are required, not optional. Test on both silicon families before broad deployment.


What We Built: The Surface NPU Demo App

Four tabs. Five AI tasks. Zero cloud calls in our demo. This is what a vibe-coded application looks like when it grows up.

I’ve demoed this to partner sellers, enterprise customers, and internal leadership. The moment that changes the conversation every time? Turning on airplane mode and watching it keep running. (That, and showing the tokenomics dashboard: $0.00 in cloud costs.)

Feature Details

AI Agent: governed local chat assistant powered by Phi-4 Mini through Foundry Local. Natural language queries, tool-calling, structured responses. This is where most people start.

My Day: takes local calendar, email, and task data and generates a structured morning briefing entirely on-device.

Two-Brain Router: evaluates each request and decides whether the local NPU can handle it or if it needs to escalate to a frontier cloud model. Shows the decision logic in real time, asks for explicit user consent before any data leaves the device.

Field Inspection Copilot: the multimodal showcase. Five NPU capabilities in a single workflow:

  • Voice: speak inspection findings, NPU transcribes and extracts structured fields (location, issue type, severity)
  • Camera: photograph the issue, Phi Silica classifies it on-device (water damage, structural crack, mold, equipment fault)
  • Pen: annotate photos with the Surface Pen, with local handwriting recognition
  • Report generation: NPU synthesizes voice, photos, and annotations into a formatted inspection report
  • Translation: one tap to translate the full report into Spanish entirely on-device

Target industries: construction, insurance claims, utilities, manufacturing QA, property management, OSHA compliance. The offline capability is load-bearing because these environments often have poor or zero connectivity.

The dashboard closes every session with the demo numbers: 5 local AI tasks, 0 cloud calls, ~520 tokens consumed, $0.00 in cloud inference costs, 0 bytes transmitted off-device.


The Code: Fork It, Build On It

github.com/frankcx1/surface-npu-demo

Clone it. Run setup.ps1. Open http://localhost:5000. It auto-detects your silicon, selects the right model, and you’re running.

Fork it and build your own use case on top of it. Add a tab for your industry workflow. Swap in a different model from the catalog. If you build something cool, submit a PR... I’ll merge it. This is a living project.


Lessons Learned

After building this across both Intel and Qualcomm devices over several months, here’s what I wish someone had told me on day one.

For IT Pros

  • The NPU is a production-capable inference target for supported workloads. Sustained low-watt inference for tasks that can cost pennies per call in the cloud (based on published Azure OpenAI pricing). Think about what that means for your fleet at scale.
  • Start with the AI Toolkit Model Catalog. Browse, test in the Playground, understand limits before committing.
  • Model constraints shape your architecture. Phi Silica is ~4K; Phi-4 Mini supports much larger contexts but token budget still matters for latency and cost. Design focused, single-task endpoints, not chatbots.
  • Cross-platform isn’t free. Intel and Qualcomm NPUs behave differently. Test on both. Use WMI. Build fallback chains.
  • Foundry Local is preview. No SLA. Stage updates through rings.
  • Airplane mode is the proof point. Turn off Wi-Fi. Kill the 5G. Run the app. That’s the demo that changes the conversation every single time.

For Vibe Coders

  • Start with Foundry Local + the OpenAI SDK. Fastest path to NPU inference.
  • Always use SDK endpoint discovery. base_url=manager.endpoint, not a hardcoded port.
  • Test on hardware early and often. Model hangs, context overflows, and driver quirks only surface on the actual device.
  • Hardcode fallbacks for everything. Not laziness. Professionalism.

What On-Device NPU Apps Are Not For

Just as important as knowing what to build... knowing what not to build. I learned some of these the hard way.

  • Not for model training. NPUs are inference accelerators.
  • Not for long-running agent loops. Small models lose coherence. Design for single-shot endpoints.
  • Not for unbounded conversation history. Manage state explicitly within the ~4K context window.
  • Not a cloud replacement. On-device handles the routine majority so your cloud budget goes to the tasks that need it.
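Managing state explicitly within a small window usually means a rolling history: keep the system prompt, drop the oldest turns first. A sketch using the same ~4-characters-per-token approximation (not the model’s real tokenizer):

```python
def trim_history(messages, max_tokens=3000, chars_per_token=4):
    """Keep the system prompt plus the most recent turns that fit the budget.

    messages: OpenAI-style list; messages[0] is assumed to be the system prompt.
    """
    system, turns = messages[0], messages[1:]
    budget = max_tokens * chars_per_token - len(system["content"])
    kept, used = [], 0
    for msg in reversed(turns):  # walk newest-first
        cost = len(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))  # restore chronological order
```

Call this before every chat completion and the conversation degrades gracefully (old turns forgotten) instead of failing hard at the context limit.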

Get Started Today

# 1. Install the runtime
winget install Microsoft.FoundryLocal

# 2. Install Python dependencies
pip install foundry-local-sdk openai flask

# 3. Clone the demo
git clone https://github.com/frankcx1/surface-npu-demo.git
cd surface-npu-demo
.\setup.ps1

# 4. Run it
python npu_demo_flask.py
# Open http://localhost:5000

# 5. Or start from scratch with your AI coding assistant:
# "Build me a Flask app that uses Foundry Local to serve
#  Phi-4 Mini on the NPU with an OpenAI-compatible API.
#  Use the SDK for endpoint discovery. Single file, HTML inline,
#  chat interface on localhost:5000."

Pro tip for vibe coders: Copy this entire post into your AI coding assistant as context. It knows the SDK patterns, the gotchas, the fallback chains. That’s the whole point.

Pick one workload this week. PII scanning. Document classification. Intake form extraction. Contract triage. Run it on the NPU, in airplane mode. Measure latency. Measure what you’d have paid in cloud inference. Build the business case from real numbers on your hardware.
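A minimal way to capture those numbers: time one local call and estimate what the same tokens would have cost in the cloud. The inference call is injected as a callable, and the rate constant is purely illustrative; substitute your provider’s published pricing:

```python
import time

# Illustrative per-1K-token cloud rate for comparison -- NOT a real price.
CLOUD_RATE_PER_1K_TOKENS = 0.002  # USD

def benchmark(complete, prompt):
    """Time one local inference and estimate the avoided cloud cost.

    complete: callable taking the prompt, returning (reply_text, tokens_used).
    """
    start = time.perf_counter()
    reply, tokens_used = complete(prompt)
    latency = time.perf_counter() - start
    avoided = tokens_used / 1000 * CLOUD_RATE_PER_1K_TOKENS
    return {
        "latency_s": round(latency, 3),
        "tokens": tokens_used,
        "avoided_cloud_cost_usd": round(avoided, 6),
    }
```

Run it against your chosen workload in airplane mode and you have the latency and cost columns of the business case from your own hardware.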

I started at Silicon Graphics (SGI) and watched the introduction of the GPU up close... from rendering wireframes to reshaping entire industries. It’s wild to see how far that arc has come. Hardware has its own rise and fall and rise again, and dedicated AI silicon feels like the next big chapter. I’ve had the privilege of shipping things like Surface Hub along the way. The tools change. The builder instinct doesn’t. The difference now is that the tools meet you where you are... you don’t need to be an engineer to build something real.

Start building. Good luck out there.


Microsoft Learn References
| Topic | Link |
|---|---|
| Foundry Local overview | What is Foundry Local? |
| Foundry Local quickstart | Get started with Foundry Local |
| SDK integration (Python, C#, JS) | Integrate with inference SDKs |
| Foundry Local architecture | Architecture and components |
| SDK reference | Foundry Local SDK reference |
| CLI reference | Foundry Local CLI reference |
| Windows AI APIs | What are Windows AI APIs? |
| Phi Silica | Get started with Phi Silica |
| Phi Silica tutorial | Phi Silica walkthrough |
| Windows AI troubleshooting | API troubleshooting |
| AI Toolkit for VS Code | AI Toolkit overview |
| Windows AI decision tree | Use local AI on Windows |
| Ready-to-use local LLMs | Local LLMs on Windows |
| Foundry Local GitHub | microsoft/Foundry-Local |

Built on Surface Copilot+ PCs. Repo: github.com/frankcx1/surface-npu-demo

Updated Mar 04, 2026
Version 1.0