Microsoft Developer Community Blog

Building a Privacy-First Hybrid AI Briefing Tool with Foundry Local and Azure OpenAI

Lee_Stott
Feb 26, 2026
The Challenge: Balancing Speed, Privacy, and Quality in Client Work

Management consultants face a critical challenge: they need instant AI-powered insights from sensitive client documents, but traditional cloud-only AI solutions create unacceptable data privacy risks. Every document uploaded to a cloud API potentially exposes confidential client information, violates data residency requirements, and creates compliance headaches.

The solution lies in a hybrid architecture that combines the speed and privacy of on-device AI with the sophistication of cloud models—but only when explicitly requested. This article walks through building a production-ready briefing assistant that runs AI inference locally first, then optionally refines outputs using Azure OpenAI for executive-quality presentations.

We'll explore a sample implementation using FL-Client-Briefing-Assistant, built with Next.js 14, TypeScript, and Microsoft Foundry Local. You'll learn how to architect privacy-first AI applications, implement sub-second local inference, and design transparent hybrid workflows that give users complete control over their data.

Why Hybrid AI Architecture Matters for Enterprise Applications

Before diving into implementation details, let's understand why a hybrid approach is essential for enterprise AI applications, particularly in consulting and professional services.

Cloud-only AI services like OpenAI's GPT-4 offer remarkable capabilities, but they introduce several critical challenges. First, every API call sends your data to external servers, creating audit trails and potential exposure points. For consultants handling merger documents, financial reports, or strategic plans, this is often a non-starter. Second, cloud APIs introduce latency, typically 2-5 seconds per request due to network round-trips and queue times. Third, costs scale linearly with usage, making high-volume document analysis expensive at scale.

Local-only AI solves privacy and latency concerns but sacrifices quality. Small language models (SLMs) running on laptops produce quick summaries, but they lack the nuanced reasoning and polish needed for C-suite presentations. You get fast, private results that may require significant manual refinement.

The hybrid approach gives you the best of both worlds: instant, private local processing as the default, with optional cloud refinement only when quality matters most. This architecture respects data privacy by default while maintaining the flexibility to produce executive-grade outputs when needed.
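The routing decision at the heart of this approach can be sketched as a small pure function (names here are illustrative, not taken from the sample repo): local is the default, and cloud is reached only when the user explicitly asks for refinement and confidential mode is off.

```typescript
// Hypothetical routing helper illustrating the hybrid decision flow.
// Local inference is always permitted; cloud refinement requires both
// an explicit user request and confidential mode to be disabled.
type Route = "local" | "cloud";

function chooseRoute(confidentialMode: boolean, refineRequested: boolean): Route {
  if (!refineRequested) return "local"; // default path: on-device
  if (confidentialMode) return "local"; // fail safe: data never leaves the device
  return "cloud";                       // explicit opt-in only
}
```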

Architecture Overview: Three-Layer Design for Privacy and Performance

The FL-Client-Briefing-Assistant implements a clean three-layer architecture that separates concerns and ensures privacy at every level.

At the frontend, a Next.js 14 application provides the user interface with strong TypeScript typing throughout. Users interact with four quick-action templates: document summarization, talking points generation, risk analysis, and executive summaries. The UI clearly indicates which model (local or cloud) processed each request, ensuring transparency.

The middle tier consists of Next.js API routes that act as orchestration endpoints. These routes validate requests using Zod schemas, route to appropriate inference services, and enforce privacy settings. Critically, the API layer never persists user content unless explicitly opted in via privacy settings.

The inference layer contains two distinct services. The local service uses the Foundry Local SDK to communicate with a locally running Phi-4 model (or a similar SLM), providing sub-second inference (typically 500 ms to 1 s per response) completely offline. The cloud service connects to Azure OpenAI through the official JavaScript SDK, authenticated via Managed Identity or API keys, with proper timeout and retry logic.

Setting Up Foundry Local for On-Device Inference

Foundry Local is Microsoft's runtime for running AI models entirely on your device—no internet required, no data leaving your machine. Here's how to get it running for this application.

First, install Foundry Local on Windows using Windows Package Manager:

winget install Microsoft.FoundryLocal

After installation, verify the service is ready:

foundry service start
foundry service status

The status command will show you the service endpoint, typically running on a dynamic port like http://127.0.0.1:5272. This port changes between restarts, so your application must query it programmatically.

Next, load an appropriate model. For briefing tasks, Phi-4 Mini provides an excellent balance of quality and speed:

foundry model load phi-4

The model download is approximately 3.6GB and loading takes 2-5 minutes on first run, but the model persists between sessions. Once loaded, inference is nearly instant: most requests complete in under 1 second.

In your application, configure the connection in .env.local. Because Foundry Local assigns its port dynamically, make sure you use the port reported by foundry service status:

FOUNDRY_LOCAL_ENDPOINT=http://127.0.0.1:****

The application uses the Foundry Local SDK to query the running service:

import { FoundryLocalClient } from 'foundry-local-sdk';

const client = new FoundryLocalClient({
  endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT
});

const response = await client.chat.completions.create({
  model: 'phi-4',
  messages: [
    { role: 'system', content: 'You are a professional consultant assistant.' },
    { role: 'user', content: 'Summarize this document: ...' }
  ],
  max_tokens: 500,
  temperature: 0.3
});

This code demonstrates several best practices:

  • Explicit model specification: Always name the model to ensure consistency across environments
  • System message framing: Set the appropriate professional context for consulting use cases
  • Conservative temperature: Use 0.3 for factual summarization tasks to reduce hallucination
  • Token limits: Cap outputs to prevent excessive generation times and costs
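These practices can be encoded once in a small request builder so every call site inherits them. This helper is illustrative (not part of the sample repo), but the field values match the article's configuration:

```typescript
// Illustrative helper that encodes the practices above: explicit model
// name, professional system framing, a conservative temperature, and a
// hard token cap.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface CompletionRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
  temperature: number;
}

function buildBriefingRequest(userPrompt: string): CompletionRequest {
  return {
    model: "phi-4", // explicit model specification
    messages: [
      { role: "system", content: "You are a professional consultant assistant." },
      { role: "user", content: userPrompt },
    ],
    max_tokens: 500,  // cap output to bound generation time
    temperature: 0.3, // conservative setting for factual summarization
  };
}
```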

Implementing Privacy-First API Routes

The Next.js API routes form the security boundary of the application. Every request must be validated, sanitized, and routed according to privacy settings before reaching inference services.

Here's the core local inference route (app/api/briefing/local/route.ts):

import { NextRequest, NextResponse } from 'next/server';
import { z } from 'zod';
import { FoundryLocalClient } from 'foundry-local-sdk';

const RequestSchema = z.object({
  prompt: z.string().min(10).max(5000),
  template: z.enum(['summary', 'talking-points', 'risk-analysis', 'executive']),
  context: z.string().optional()
});

export async function POST(request: NextRequest) {
  try {
    // Validate and parse request body
    const body = await request.json();
    const validated = RequestSchema.parse(body);

    // Initialize Foundry Local client
    const client = new FoundryLocalClient({
      endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT!
    });

    // Build system prompt based on template
    const systemPrompts = {
      'summary': 'You are a consultant creating concise document summaries.',
      'talking-points': 'You are preparing structured talking points for meetings.',
      'risk-analysis': 'You are analyzing risks and opportunities systematically.',
      'executive': 'You are crafting executive-level briefing notes.'
    };

    // Execute local inference
    const startTime = Date.now();
    const completion = await client.chat.completions.create({
      model: 'phi-4',
      messages: [
        { role: 'system', content: systemPrompts[validated.template] },
        { role: 'user', content: validated.prompt }
      ],
      temperature: 0.3,
      max_tokens: 500
    });

    const latency = Date.now() - startTime;

    // Return structured response with metadata
    return NextResponse.json({
      content: completion.choices[0].message.content,
      model: 'phi-4 (local)',
      latency_ms: latency,
      tokens: completion.usage?.total_tokens,
      timestamp: new Date().toISOString()
    });

  } catch (error) {
    if (error instanceof z.ZodError) {
      return NextResponse.json(
        { error: 'Invalid request format', details: error.errors },
        { status: 400 }
      );
    }

    console.error('Local inference error:', error);
    return NextResponse.json(
      { error: 'Inference failed', message: error instanceof Error ? error.message : String(error) },
      { status: 500 }
    );
  }
}

This implementation demonstrates several critical security and quality patterns:

  • Request validation with Zod: Every field is type-checked and bounded before processing, preventing injection attacks and malformed inputs
  • Template-based system prompts: Different use cases get optimized prompts, improving output quality and consistency
  • Comprehensive error handling: Validation errors, inference failures, and network issues are caught and reported with appropriate HTTP status codes
  • Performance tracking: Latency measurement enables monitoring and helps users understand response times
  • Metadata enrichment: Responses include model attribution, token usage, and timestamps for auditing

The cloud refinement route follows a similar pattern but adds privacy checks:

export async function POST(request: NextRequest) {
  try {
    const body = await request.json();
    const validated = RequestSchema.parse(body);

    // Check privacy settings from cookie/header
    const confidentialMode = request.cookies.get('confidential-mode')?.value === 'true';
    
    if (confidentialMode) {
      return NextResponse.json(
        { error: 'Cloud refinement disabled in confidential mode' },
        { status: 403 }
      );
    }

    // Proceed with Azure OpenAI call only if privacy allows
    const client = new OpenAI({
      apiKey: process.env.AZURE_OPENAI_KEY,
      baseURL: process.env.AZURE_OPENAI_ENDPOINT,
      defaultHeaders: { 'api-key': process.env.AZURE_OPENAI_KEY }
    });

    const completion = await client.chat.completions.create({
      model: process.env.AZURE_OPENAI_DEPLOYMENT!,
      messages: [/* ... */],
      temperature: 0.5, // Slightly higher for creative refinement
      max_tokens: 800
    });

    return NextResponse.json({
      content: completion.choices[0].message.content,
      model: `${process.env.AZURE_OPENAI_DEPLOYMENT} (cloud)`,
      privacy_notice: 'Content processed by Azure OpenAI',
      // ... metadata
    });

  } catch (error) {
    // Error handling
  }
}

The confidential mode check is crucial—it ensures that even if a user accidentally clicks the refinement button, no data leaves the device when privacy mode is enabled. This fail-safe design prevents data leakage through UI mistakes or automated workflows.
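The fail-safe can be factored into a guard that every cloud route calls before touching Azure OpenAI. A minimal sketch, returning an HTTP-style status so the behavior is testable without a running server (the function name is illustrative):

```typescript
// Guard invoked at the top of every cloud route: in confidential mode,
// the request is rejected with 403 before any external call is made.
function guardCloudAccess(confidentialMode: boolean): {
  allowed: boolean;
  status: number;
  error?: string;
} {
  if (confidentialMode) {
    return {
      allowed: false,
      status: 403,
      error: "Cloud refinement disabled in confidential mode",
    };
  }
  return { allowed: true, status: 200 };
}
```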

Building the Frontend: Transparent Privacy Controls

The user interface must make privacy decisions explicit and visible. Users need to understand which AI service processed their content and make informed choices about cloud refinement.

The main briefing interface (app/page.tsx) implements this transparency through clear visual indicators:

'use client';
import { useState, useEffect } from 'react';
import { PrivacySettings } from '@/components/PrivacySettings';

export default function BriefingAssistant() {
  const [confidentialMode, setConfidentialMode] = useState(true); // Privacy by default
  const [content, setContent] = useState('');
  const [result, setResult] = useState(null);
  const [loading, setLoading] = useState(false);

  // Load privacy preference from localStorage
  useEffect(() => {
    const saved = localStorage.getItem('confidential-mode');
    if (saved !== null) {
      setConfidentialMode(saved === 'true');
    }
  }, []);

  async function generateBriefing(template: string, useCloud: boolean = false) {
    if (useCloud && confidentialMode) {
      alert('Cloud refinement is disabled in confidential mode. Adjust settings to enable.');
      return;
    }

    setLoading(true);
    const endpoint = useCloud ? '/api/briefing/cloud' : '/api/briefing/local';

    try {
      const response = await fetch(endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt: content, template })
      });

      const data = await response.json();
      setResult({
        ...data,
        template, // retained so the refine button can re-run the same template
        processedBy: useCloud ? 'cloud' : 'local'
      });
    } catch (error) {
      console.error('Briefing generation failed:', error);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="briefing-assistant">
      <header>
        <h1>Client Briefing Assistant</h1>
        <div className="status-bar">
          <span className={confidentialMode ? 'confidential' : 'standard'}>
            {confidentialMode ? '🔒 Confidential Mode' : '🌐 Standard Mode'}
          </span>
          <PrivacySettings 
            confidentialMode={confidentialMode}
            onChange={setConfidentialMode}
          />
        </div>
      </header>

      <div className="quick-actions">
        <button onClick={() => generateBriefing('summary')}>
          📄 Summarize Document
        </button>
        <button onClick={() => generateBriefing('talking-points')}>
          💬 Generate Talking Points
        </button>
        <button onClick={() => generateBriefing('risk-analysis')}>
          🎯 Risk Analysis
        </button>
        <button onClick={() => generateBriefing('executive')}>
          📊 Executive Summary
        </button>
      </div>

      <textarea 
        value={content}
        onChange={(e) => setContent(e.target.value)}
        placeholder="Paste client document or meeting notes here..."
      />

      {result && (
        <div className="result-card">
          <div className="result-header">
            <span className="model-badge">{result.model}</span>
            <span className="latency">{result.latency_ms}ms</span>
          </div>
          <div className="result-content">{result.content}</div>
          
          {result.processedBy === 'local' && !confidentialMode && (
            <button 
              onClick={() => generateBriefing(result.template, true)}
              className="refine-btn"
            >
              ✨ Refine for Executive Presentation
            </button>
          )}
        </div>
      )}
    </div>
  );
}

This interface design embodies several principles of responsible AI UX:

  • Privacy by default: Confidential mode is enabled unless explicitly changed, ensuring accidental cloud usage requires multiple intentional actions
  • Clear attribution: Every result shows which model generated it and how long it took, building user trust through transparency
  • Conditional refinement: The cloud refinement button only appears when privacy allows and local inference has completed, preventing premature cloud requests
  • Persistent settings: Privacy preferences save to localStorage, respecting user choices across sessions
  • Visual status indicators: The header always shows current privacy mode with recognizable icons (🔒 for confidential, 🌐 for standard)
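The load/save logic for the persistent setting is worth isolating from React so it can be unit-tested. A sketch with the storage backend injected, so the same functions work against window.localStorage in the browser and a plain Map in tests (the interface and names are illustrative):

```typescript
// Storage-agnostic persistence for the confidential-mode preference.
// Defaults to true (privacy by default) when no value has been saved.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const PRIVACY_KEY = "confidential-mode";

function loadConfidentialMode(store: KeyValueStore): boolean {
  const saved = store.getItem(PRIVACY_KEY);
  return saved === null ? true : saved === "true";
}

function saveConfidentialMode(store: KeyValueStore, enabled: boolean): void {
  store.setItem(PRIVACY_KEY, String(enabled));
}
```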

Testing Privacy and Performance Requirements

A privacy-first application demands rigorous testing to ensure data never leaks unintentionally. The project includes comprehensive test suites using Vitest for unit tests and Playwright for end-to-end scenarios.

Here's a critical privacy test (tests/privacy.test.ts):

import { describe, it, expect, beforeEach } from 'vitest';
import { TestUtils } from './utils/test-helpers';

describe('Privacy Controls', () => {
  let testUtils: TestUtils;

  beforeEach(() => {
    testUtils = new TestUtils();
    testUtils.enableConfidentialMode();
  });

  it('should prevent cloud API calls when confidential mode is enabled', async () => {
    const response = await testUtils.requestBriefing({
      template: 'summary',
      prompt: 'Confidential merger document...',
      cloud: true
    });

    expect(response.status).toBe(403);
    expect(response.error).toContain('disabled in confidential mode');
  });

  it('should allow local inference in confidential mode', async () => {
    const response = await testUtils.requestBriefing({
      template: 'summary',
      prompt: 'Confidential merger document...',
      cloud: false
    });

    expect(response.status).toBe(200);
    expect(response.model).toContain('local');
    expect(response.content).toBeTruthy();
  });

  it('should not persist sensitive content without opt-in', async () => {
    await testUtils.requestBriefing({
      template: 'executive',
      prompt: 'Strategic acquisition plan...',
      cloud: false
    });

    const history = await testUtils.getConversationHistory();
    expect(history).toHaveLength(0); // No storage by default
  });

  it('should support opt-in history with explicit consent', async () => {
    testUtils.enableHistorySaving();
    
    await testUtils.requestBriefing({
      template: 'executive',
      prompt: 'Strategic acquisition plan...',
      cloud: false
    });

    const history = await testUtils.getConversationHistory();
    expect(history).toHaveLength(1);
    expect(history[0].prompt).toContain('acquisition');
  });
});

Performance testing ensures local inference meets the sub-second requirement:

// Percentile helper used by the SLA assertions below
function calculatePercentile(values: number[], percentile: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((percentile / 100) * sorted.length) - 1);
  return sorted[index];
}

describe('Performance SLA', () => {
  it('should complete local inference in under 1 second', async () => {
    const samples = [];
    
    for (let i = 0; i < 10; i++) {
      const start = Date.now();
      await testUtils.requestBriefing({
        template: 'summary',
        prompt: 'Standard 500-word document...',
        cloud: false
      });
      samples.push(Date.now() - start);
    }

    const p95 = calculatePercentile(samples, 95);
    expect(p95).toBeLessThan(1000); // 95th percentile under 1s
  });

  it('should handle 5 concurrent requests without degradation', async () => {
    const requests = Array(5).fill(null).map(() => 
      testUtils.requestBriefing({
        template: 'talking-points',
        prompt: 'Meeting agenda...',
        cloud: false
      })
    );

    const results = await Promise.all(requests);
    
    expect(results.every(r => r.status === 200)).toBe(true);
    expect(results.every(r => r.latency_ms < 2000)).toBe(true);
  });
});

These tests validate the core promise: local inference is fast, private, and reliable under realistic loads.

Deployment Considerations and Production Readiness

Moving from development to production requires addressing several operational concerns: model distribution, environment configuration, monitoring, and incident response.

For Foundry Local deployment, ensure IT teams pre-install the runtime and required models on consultant laptops. Use MDM (Mobile Device Management) systems or Group Policy to automate model downloads during onboarding. Models can be cached in shared network locations to avoid redundant downloads across teams.

Environment configuration should separate local and cloud credentials cleanly:

# .env.local (local development)
FOUNDRY_LOCAL_ENDPOINT=http://127.0.0.1:5272
AZURE_OPENAI_ENDPOINT=https://your-org.openai.azure.com
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_KEY=your-key-here

# For production, use Azure Managed Identity instead of API keys
USE_MANAGED_IDENTITY=true

Managed Identity eliminates API key management—the application authenticates using Azure AD, with permissions controlled via IAM policies. This prevents key leakage and simplifies rotation.

Monitoring should track both local and cloud usage patterns. Implement structured logging with clear privacy labels:

logger.info('Briefing generated', {
  model: 'local',
  template: 'summary',
  latency_ms: 847,
  tokens: 312,
  privacy_mode: 'confidential',
  user_id: hash(userId), // Never log raw user IDs
  timestamp: new Date().toISOString()
});

This approach enables operational insights (average latency, most-used templates, error rates) without exposing sensitive content or user identities.
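The hash(userId) call in the log example above can be implemented with a one-way digest from Node's standard crypto module. A minimal sketch (truncating the digest is my own readability choice, not a requirement):

```typescript
import { createHash } from "node:crypto";

// One-way SHA-256 digest of a user ID, truncated for log readability.
// Logs remain correlatable per user without storing the raw identifier.
function hashUserId(userId: string): string {
  return createHash("sha256").update(userId).digest("hex").slice(0, 16);
}
```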

For incident response, establish clear escalation paths. If Foundry Local fails, the application should gracefully degrade—inform users that local inference is unavailable and offer cloud-only mode (with explicit consent). If cloud services fail, local inference continues uninterrupted, ensuring the application remains useful even during Azure outages.
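The degradation path described above can be sketched as a wrapper that tries local inference first and falls back to cloud only with explicit consent. The inference calls are injected so the logic is testable; all names here are illustrative:

```typescript
// Graceful degradation: prefer local inference; if it fails, use the
// cloud path only when the user has explicitly consented.
async function generateWithFallback(
  runLocal: () => Promise<string>,
  runCloud: () => Promise<string>,
  cloudConsent: boolean,
): Promise<{ content: string; source: "local" | "cloud" }> {
  try {
    return { content: await runLocal(), source: "local" };
  } catch {
    if (!cloudConsent) {
      throw new Error("Local inference unavailable and cloud fallback not consented");
    }
    return { content: await runCloud(), source: "cloud" };
  }
}
```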

Key Takeaways and Next Steps

Building a privacy-first hybrid AI application requires careful architectural decisions that prioritize user data protection while maintaining high-quality outputs. The FL-Client-Briefing-Assistant demonstrates that you can achieve sub-second local inference, transparent privacy controls, and optional cloud refinement in a production-ready package.

Key lessons from this implementation:

  • Privacy must be the default, not an opt-in feature—confidential mode should require explicit action to disable
  • Transparency builds trust—always show users which model processed their data and how long it took
  • Fallback strategies ensure reliability—graceful degradation when services fail keeps the application useful
  • Testing validates promises—comprehensive tests for privacy, performance, and functionality are non-negotiable
  • Operational visibility without privacy leaks—structured logging enables monitoring without exposing sensitive content

To extend this application, consider adding:

  • Document parsing: Integrate PDF, DOCX, and PPTX extractors to analyze file uploads directly
  • Multi-document synthesis: Combine insights from multiple client documents into unified briefings
  • Custom templates: Allow consultants to define their own briefing formats and save them for reuse
  • Offline mode indicators: Detect network connectivity and disable cloud features automatically
  • Audit logging: For regulated industries, implement immutable audit trails showing when cloud refinement was used

The full implementation, including all code, tests, and deployment guides, is available at github.com/leestott/FL-Client-Briefing-Assistant. Clone the repository, follow the setup guide, and experience privacy-first AI in action.

Updated Jan 30, 2026
Version 1.0