Microsoft Developer Community Blog

On‑Device AI with Windows AI Foundry

Nandhini_Elango
Nov 04, 2025

Build AI that runs where the users are, on their devices. When every millisecond and every byte of data matter, on‑device AI helps you stay responsive and protect user data by keeping processing local.

From “waiting” to “instant” - without sending data away

AI is everywhere, but speed, privacy, and reliability are critical. Users expect instant answers without compromise. On-device AI makes that possible: fast, private, and available even when the network isn’t - empowering apps to deliver seamless experiences.

Imagine an intelligent assistant that responds in seconds without sending text to the cloud. This approach brings speed and data control to the places that need them most, while still letting you tap into cloud power when it makes sense.

Windows AI Foundry: A Local Home for Models

Windows AI Foundry is a developer toolkit that makes it simple to run AI models directly on Windows devices. It uses ONNX Runtime under the hood and can leverage CPU, GPU (via DirectML), or NPU acceleration, without requiring you to manage those details.

The principle is straightforward:

  • Keep the model and the data on the same device.
  • Inference becomes faster, and data stays local by default unless you explicitly choose to use the cloud.
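
To make the hardware story concrete, here is a minimal sketch of roughly what Foundry does on your behalf with ONNX Runtime: enumerate the available execution providers and prefer the DirectML (GPU) one. The model.onnx path is a placeholder; with Windows AI Foundry you typically don’t write this plumbing yourself.

import onnxruntime as ort

# Prefer DirectML (GPU) when available; otherwise fall back to CPU.
available = ort.get_available_providers()
providers = [p for p in ("DmlExecutionProvider", "CPUExecutionProvider") if p in available]

# "model.onnx" is a placeholder path standing in for any ONNX model file.
session = ort.InferenceSession("model.onnx", providers=providers)
print("Running on:", session.get_providers()[0])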

Foundry Local

Foundry Local is the engine that powers this experience. Think of it as a local AI runtime - fast, private, and easy to integrate into your app.
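
Integration can be small in practice. Foundry Local serves models behind an OpenAI-compatible endpoint on the local machine, so a standard client can talk to it. The sketch below follows the published quickstart pattern; the phi-4-mini alias and the foundry_local SDK names are assumptions to verify against the current docs.

from foundry_local import FoundryLocalManager
from openai import OpenAI

alias = "phi-4-mini"  # assumed model alias; Foundry Local picks the best variant for your hardware

# Starts the Foundry Local service if needed and downloads/loads the model.
manager = FoundryLocalManager(alias)

# Foundry Local exposes an OpenAI-compatible endpoint on localhost.
client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Summarize this document in two sentences."}],
)
print(response.choices[0].message.content)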

Why Adopt On‑Device AI?

  • Faster, more responsive apps: Local inference often reduces perceived latency and improves user experience.
  • Privacy‑first by design: Keep sensitive data on the device; avoid cloud round trips unless the user opts in.
  • Offline capability: An app can provide AI features even without a network connection.
  • Cost control: Reduce cloud compute and data costs for common, high‑volume tasks.

This approach is especially useful in regulated industries, field‑work tools, and any app where users expect quick, on‑device responses.

Hybrid Pattern for Real Apps

On-device AI doesn’t replace the cloud; it complements it. Here’s how:

  • Standalone On‑Device: Quick, private actions like document summarization, local search, and offline assistants.
  • Cloud‑Enhanced (Optional): Large-context models, up-to-date knowledge, or heavy multimodal workloads.

Design your app to keep data local by default and surface cloud options transparently, with user consent and clear disclosures (see the sketch below).
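
One way to enforce that default is a single opt-in flag that gates every cloud call: nothing leaves the device until the user flips it. A minimal sketch, with hypothetical local_answer and cloud_answer helpers standing in for the runtimes described above:

from dataclasses import dataclass

@dataclass
class AISettings:
    cloud_opt_in: bool = False  # local-only until the user explicitly opts in

def local_answer(question: str, context: str) -> str | None:
    """Hypothetical on-device path (e.g. Foundry Local); returns None on failure."""
    return None  # stubbed for the sketch

def cloud_answer(question: str, context: str) -> str:
    """Hypothetical cloud path, reached only after explicit consent."""
    return "(cloud answer)"  # stubbed for the sketch

def answer(question: str, context: str, settings: AISettings) -> str:
    result = local_answer(question, context)
    if result is None and settings.cloud_opt_in:
        # Only reached after explicit, recorded user consent.
        return cloud_answer(question, context)
    return result or "No answer available on-device; enable cloud assist in settings."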

Windows AI Foundry supports hybrid workflows:

  • Use Foundry Local for real-time inference.
  • Sync with Azure AI services for model updates, telemetry, and advanced analytics.
  • Implement fallback strategies for resource-intensive scenarios.

Application Workflow

Diagram: application workflow for on‑device AI with Windows AI Foundry, with cloud integration as the hybrid path.

Code Example

1. On-device only: tries Foundry Local first, then falls back to ONNX

import logging

logger = logging.getLogger(__name__)

# foundry_runtime and onnx_model are the demo project's wrapper objects
# around Foundry Local and ONNX Runtime, respectively.

def get_local_answer(question, context):
    if foundry_runtime.check_foundry_available():
        # Preferred path: on-device Foundry Local models
        try:
            answer = foundry_runtime.run_inference(question, context)
            return answer, "Foundry Local (On-Device)"
        except Exception as e:
            logger.warning(f"Foundry failed: {e}, trying ONNX...")

    if onnx_model.is_loaded():
        # Fallback: local BERT model served through ONNX Runtime
        try:
            answer = onnx_model.get_answer(question, context)
            return answer, "BERT ONNX (On-Device)"
        except Exception as e:
            logger.warning(f"ONNX failed: {e}")

    return "Error: No local AI available", "Failed"

2. Hybrid approach: on-device first, cloud as a last resort

import logging
import requests

logger = logging.getLogger(__name__)

def get_answer(question, context):
    """
    Priority order:
    1. Foundry Local (best: advanced + private)
    2. ONNX Runtime (good: fast + private)
    3. Cloud API (last resort: requires internet, less private)
    """
    if foundry_runtime.check_foundry_available():
        # Use on-device Foundry Local models
        try:
            answer = foundry_runtime.run_inference(question, context)
            return answer, "Foundry Local (On-Device)"
        except Exception as e:
            logger.warning(f"Foundry failed: {e}, trying ONNX...")

    if onnx_model.is_loaded():
        # Fall back to the local BERT ONNX model
        try:
            answer = onnx_model.get_answer(question, context)
            return answer, "BERT ONNX (On-Device)"
        except Exception as e:
            logger.warning(f"ONNX failed: {e}, trying cloud...")

    # Last resort: cloud API (requires internet and user opt-in)
    if network_available():  # network_available() is the demo project's connectivity check
        try:
            response = requests.post(
                BASE_URL_AI_CHAT_COMPLETION,  # your chat-completions endpoint
                headers={'Authorization': f'Bearer {API_KEY}'},
                json={
                    'model': MODEL_NAME,
                    'messages': [{
                        'role': 'user',
                        'content': f'Context: {context}\n\nQuestion: {question}'
                    }]
                },
                timeout=10
            )
            response.raise_for_status()
            answer = response.json()['choices'][0]['message']['content']
            return answer, "Cloud API (Online)"
        except Exception as e:
            logger.warning(f"Cloud API failed: {e}")
            return "Error: No AI runtime available", "Failed"
    else:
        return "Error: No internet and no local AI available", "Offline"

Demo Project Output: Foundry Local answering context-based questions offline

  • Answer found in the context: the Foundry Local engine ran the Phi-4-mini model offline and retrieved context-based data.
  • No answer found in the context: the Foundry Local engine ran the Phi-4-mini model offline and reported that there is no answer.

Practical Use Cases

  • Privacy-First Reading Assistant: Summarize documents locally without sending text to the cloud.
  • Healthcare Apps: Analyze medical data on-device for compliance.
  • Financial Tools: Risk scoring without exposing sensitive financial data.
  • IoT & Edge Devices: Real-time anomaly detection without network dependency.

Conclusion

On-device AI isn’t just a trend - it’s a shift toward smarter, faster, and more secure applications. With Windows AI Foundry and Foundry Local, developers can deliver experiences that respect user data, reduce latency, and work even when connectivity fails. By combining local inference with optional cloud enhancements, you get the best of both worlds: instant performance and scalable intelligence.

Whether you’re creating document summarizers, offline assistants, or compliance-ready solutions, this approach ensures your apps stay responsive, reliable, and user-centric.
