Running local language models on Windows: a working example with RAG and Phi Silica
As developers integrate AI into their applications, flexibility over where AI computation runs becomes increasingly valuable. With the introduction of Phi Silica, a small language model integrated directly into the Windows App SDK as part of the Windows AI Foundry platform, Microsoft has created a unified environment for building on-device AI experiences that are performant, private, and portable.
One recent effort—the Teknikos PDF Explorer demo—shows what’s now feasible. Built by Teknikos in collaboration with Microsoft Surface and running on Copilot+ PCs, the application leverages the dedicated Neural Processing Unit (NPU) to deliver local language understanding without relying on external infrastructure. This post examines the technical architecture, design process, and development implications of the project.
Teknikos PDF Explorer menu view
A Windows-native small language model
Phi Silica is a small language model optimized for NPU acceleration on Windows 11. It integrates through the Windows App SDK and supports inference directly on compatible hardware. Unlike general-purpose models that require manual installation, Phi Silica deploys via Windows Update and runs with minimal code through C# bindings.
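In practice, "minimal code" looks something like the sketch below. The identifiers follow the experimental Microsoft.Windows.AI.Generative API surface documented for early Windows App SDK releases; names and namespaces have shifted between releases, so treat them as illustrative and verify against the SDK version you target.

```csharp
using System.Threading.Tasks;
using Microsoft.Windows.AI.Generative; // experimental namespace; varies by SDK release

public static class PhiSilicaSample
{
    public static async Task<string> AskAsync(string prompt)
    {
        // The model ships via Windows Update; provision it on first use if needed.
        if (!LanguageModel.IsAvailable())
        {
            await LanguageModel.MakeAvailableAsync();
        }

        var model = await LanguageModel.CreateAsync();

        // Single-shot generation; the SDK also offers a progress-based overload
        // that surfaces partial results for token-by-token streaming.
        LanguageModelResponse response = await model.GenerateResponseAsync(prompt);
        return response.Response;
    }
}
```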
The Teknikos PDF Explorer project uses retrieval-augmented generation (RAG) to answer natural language questions about PDF documents. Input is parsed locally, and Phi Silica returns direct responses and suggested follow-ups—all computed on the NPU. The result is a self-contained document intelligence tool that does not require a network connection or cloud API.
Engineering context
Development began with earlier versions of the demo running inference on the GPU using Phi-3 Mini. This supported basic RAG workflows but lacked consistent performance. Once Phi Silica became accessible through the SDK, the team refactored the application to shift all inference to the NPU. The result was more predictable response quality and reduced latency, with an overall improvement in interaction flow and relevance. The updated app supports dynamic switching between GPU and NPU models, a flexibility that enables comparative testing and adaptation across system configurations.
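One straightforward way to support that kind of switching is to put both back ends behind a single interface and choose an implementation at startup. The sketch below is hypothetical (not Teknikos' actual code) and simply illustrates the pattern:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over the two local inference paths.
public interface ILocalModel
{
    Task<string> GenerateAsync(string prompt, CancellationToken ct = default);
}

public static class ModelSelector
{
    // One implementation would wrap Phi Silica on the NPU; the other,
    // Phi-3 Mini on the GPU (e.g., via ONNX Runtime with DirectML).
    public static ILocalModel Select(bool useNpu, ILocalModel phiSilicaNpu, ILocalModel phi3MiniGpu)
        => useNpu ? phiSilicaNpu : phi3MiniGpu;
}
```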
The Teknikos PDF Explorer RAG pipeline begins when a user selects a PDF file. The text is extracted and split by page, then further chunked and tokenized in preparation for vectorization. Using ONNX Runtime with the all-MiniLM-L6-v2 embedding model, each chunk is embedded into an in-memory vector store that represents the PDF's contents.
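Producing those embeddings amounts to running each chunk's token IDs through the MiniLM ONNX graph and mean-pooling the token vectors into a single 384-dimensional embedding. Here is a minimal sketch using ONNX Runtime's C# API, assuming token IDs come from a BERT-style WordPiece tokenizer and that the model uses the input and output names of the common Hugging Face export (input_ids, attention_mask, token_type_ids, last_hidden_state); verify these against the actual model file.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public static class MiniLmEmbedder
{
    public static float[] Embed(InferenceSession session, long[] tokenIds)
    {
        int seqLen = tokenIds.Length;
        var shape = new[] { 1, seqLen };

        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(tokenIds, shape)),
            NamedOnnxValue.CreateFromTensor("attention_mask",
                new DenseTensor<long>(Enumerable.Repeat(1L, seqLen).ToArray(), shape)),
            NamedOnnxValue.CreateFromTensor("token_type_ids",
                new DenseTensor<long>(new long[seqLen], shape)),
        };

        using var results = session.Run(inputs);

        // last_hidden_state has shape [1, seqLen, 384]; with no padding in the
        // sequence, a simple mean over tokens gives the sentence embedding.
        var hidden = results.First().AsTensor<float>();
        const int dim = 384;
        var pooled = new float[dim];
        for (int t = 0; t < seqLen; t++)
            for (int d = 0; d < dim; d++)
                pooled[d] += hidden[0, t, d];
        for (int d = 0; d < dim; d++)
            pooled[d] /= seqLen;
        return pooled;
    }
}
```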
User question submission and interaction demo
When the user submits a question, the application performs a semantic search using cosine similarity over the vector store, returning a ranked list of relevant chunks. The corresponding PDF pages are displayed as interactive thumbnails that can be enlarged. The user's question and the top-ranked chunks are then passed to a language model as part of the prompt.
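At single-document scale, that search can be a brute-force cosine-similarity scan over every stored chunk; no approximate index is needed. A minimal sketch (the Chunk record is a stand-in for whatever the app actually stores per chunk):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Chunk(int Page, string Text, float[] Embedding);

public static class VectorSearch
{
    public static float Cosine(float[] a, float[] b)
    {
        float dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb) + 1e-8f);
    }

    // Rank every chunk against the query embedding and keep the best k.
    public static IReadOnlyList<(Chunk Chunk, float Score)> TopK(
        float[] query, IReadOnlyList<Chunk> store, int k = 4)
        => store.Select(c => (c, Cosine(query, c.Embedding)))
                .OrderByDescending(p => p.Item2)
                .Take(k)
                .ToList();
}
```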
The language model streams its response in real time. When Phi Silica is the active model, its improved reliability over Phi-3 Mini allows the app to also instruct it to generate three follow-up questions based on the retrieved context. A prompt-driven structured response format ensures consistency, making it easy to extract the follow-up questions and render them as clickable buttons for continued exploration.
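The demo's exact prompt isn't published, but the technique is simple to reproduce: instruct the model to emit each follow-up under a fixed marker, then extract the marked lines once the stream completes. A hypothetical version:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FollowUpParser
{
    // Prompt skeleton: {0} = retrieved chunks, {1} = the user's question.
    public const string PromptTemplate =
        "Answer the question using only the context below. " +
        "Then write exactly three follow-up questions, each on its own line prefixed with 'Q:'.\n\n" +
        "Context:\n{0}\n\nQuestion: {1}";

    public static (string Answer, List<string> FollowUps) Parse(string raw)
    {
        // Pull out the 'Q:' lines as follow-up questions...
        var followUps = Regex.Matches(raw, @"^Q:\s*(.+)$", RegexOptions.Multiline)
                             .Select(m => m.Groups[1].Value.Trim())
                             .ToList();
        // ...and treat everything else as the answer body.
        var answer = Regex.Replace(raw, @"^Q:\s*.+$", "", RegexOptions.Multiline).Trim();
        return (answer, followUps);
    }
}
```

Each parsed question can then map directly to a clickable button that re-enters the pipeline as a new query.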
The result is an intuitive question-answering experience that guides users through complex documents with contextual relevance and minimal effort—all running on the local device.
Use case design: privacy, portability, performance
On-device inference delivers the most value in constrained environments. Applications that operate under regulatory, physical, or network limitations benefit from local execution in several ways:
- Data residency: Sensitive documents never leave the device.
- Connectivity tolerance: Remote or mobile use cases remain functional without requiring internet access.
- Cost stability: No per-call API charges.
- Power efficiency: NPUs can deliver higher performance per watt for inference workloads.
Field scenarios such as legal review, healthcare compliance, and mobile sales operations all map well to this architecture.
Hybrid AI: structured evaluation and routing
Jon Khoo, the developer behind the Teknikos PDF Explorer, also demonstrated a parallel approach in another demo application: evaluators. These are business rules that route inference between local and remote models based on conditions such as network status, privacy sensitivity, or battery constraints. The technique uses a common ONNX model format for both deployment targets, ensuring parity in response logic regardless of execution location. The local execution path combines ONNX Runtime with Phi Silica as the on-device language model, fully mirroring the cloud path, which pairs ONNX Runtime on Azure with an Azure OpenAI GPT-based LLM. This approach introduces a structured pathway for hybrid AI, where cloud and edge systems share a model backbone and switch based on runtime variables.
Route inference between local and remote models based on conditions
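In code, an evaluator can be as simple as a pure function from runtime conditions to a back-end choice; because both paths share the ONNX model format, the caller never needs to know which one answered. The names below are hypothetical, purely to illustrate the shape:

```csharp
using System.Threading.Tasks;

// Hypothetical runtime signals an evaluator might inspect.
public record InferenceContext(bool IsOnline, bool ContainsSensitiveData, bool OnBattery);

public interface IChatBackend
{
    Task<string> GenerateAsync(string prompt);
}

public static class Evaluator
{
    public static IChatBackend Route(InferenceContext ctx, IChatBackend local, IChatBackend cloud)
    {
        if (ctx.ContainsSensitiveData) return local; // privacy: data never leaves the device
        if (!ctx.IsOnline) return local;             // no connectivity: only the local path works
        if (ctx.OnBattery) return local;             // power: prefer the NPU's efficiency
        return cloud;                                // otherwise use the larger cloud model
    }
}
```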
Developer impact
For enterprise teams exploring Copilot+ PCs, Phi Silica offers a focused toolchain for incorporating intelligent document processing into existing Windows applications. Developers do not need to manage model conversion, tokenization workflows, or runtime dependencies. Instead, they can focus on user experience and business logic.
The integration model is prescriptive: Microsoft defines the runtime environment, deployment mechanism, and interface contract. This limits variance but simplifies development and support. For many enterprise teams, this has the potential to accelerate delivery timelines and improve maintainability.
Creating the future of on-device AI together
The Teknikos PDF Explorer project illustrates a practical approach to embedding local AI inference in enterprise applications. Phi Silica and the Windows AI Foundry offer a stable foundation for delivering secure, responsive, cost-effective AI functionality at the device level.
As the hardware ecosystem matures and the Windows AI Foundry becomes generally available, these capabilities are poised to become part of the standard development environment for Windows. Teams that invest in building now gain early insight into the architecture and operational patterns that will define the next stage of distributed application design.
Interested in exploring on-device AI solutions or collaborating on your next innovation? Connect with Teknikos to learn more about their work and how they can support your development goals. And to discover more about our latest Copilot+ PCs and business-ready Surface devices, visit Surface.com/Business.