Privacy-First Voice-to-Text Using Foundry Local
Building HIPAA-Compliant Medical Transcription with Local AI
Introduction
Healthcare organizations generate vast amounts of spoken content: patient consultations, research interviews, clinical notes, medical conferences. Transcribing these recordings traditionally requires either manual typing (time-consuming and expensive) or cloud transcription services (creating immediate HIPAA compliance concerns). Every audio file sent to an external API exposes Protected Health Information (PHI), requires Business Associate Agreements, creates audit trails on third-party servers, and introduces potential breach vectors. This sample's solution is an on-premises voice-to-text system that processes audio entirely locally, never sending PHI beyond organizational boundaries.

This article demonstrates building FLWhisper, a sample medical transcription application, using ASP.NET Core, C#, and Microsoft Foundry Local with OpenAI Whisper models. You'll learn how to implement sample HIPAA-compliant audio processing, integrate Whisper models for medical terminology accuracy, design privacy-first API patterns, and build responsive web UIs for healthcare workflows.
Whether you're developing electronic health record (EHR) integrations, building clinical research platforms, or implementing dictation systems for medical practices, this sample could be a great starting point for privacy-first speech recognition.
Why Local Transcription Is Critical for Healthcare
Healthcare data handling is fundamentally different from general business data due to HIPAA regulations, state privacy laws, and professional ethics obligations. Understanding these requirements explains why cloud transcription services, despite their convenience, create unacceptable risks for medical applications.
HIPAA compliance mandates strict controls over PHI. Every system that touches patient data must implement administrative, physical, and technical safeguards. Cloud transcription APIs require Business Associate Agreements (BAAs), but even with the paperwork in place, you're entrusting PHI to external systems. Every API call creates logs on vendor servers, potentially in multiple jurisdictions, and data breaches at transcription vendors expose patient information, creating liability for healthcare organizations. On-premises processing eliminates these third-party risks entirely: PHI never leaves your controlled environment.

US state laws increasingly add requirements beyond HIPAA. California's CCPA, New York's SHIELD Act, and similar legislation create additional compliance obligations. International regulations like GDPR prohibit transferring health data outside approved jurisdictions. Local processing simplifies compliance by keeping data within organizational boundaries.
Research applications face even stricter requirements. Institutional Review Boards (IRBs) often require explicit consent for data sharing with external parties, so cloud transcription may violate study protocols that promise "no third-party data sharing." Clinical trials in pharmaceutical development handle proprietary information alongside PHI, creating double jeopardy for data exposure. Local transcription maintains research integrity while enabling audio analysis.
Cost considerations favor local deployment at scale. Medical organizations generate substantial audio: thousands of patient encounters monthly. Cloud APIs charge per minute of audio, creating significant recurring costs, while local models have fixed infrastructure costs that scale economically. A modest GPU server can process hundreds of hours monthly at predictable expense.
Latency matters for clinical workflows. Doctors and nurses need transcriptions available immediately after patient encounters so they can review and edit while details are fresh. Cloud APIs introduce network delays, which are especially problematic in rural health facilities with limited connectivity. Local inference returns results in well under a minute for typical consultation lengths.
Application Architecture: ASP.NET Core with Foundry Local
The sample FLWhisper application implements clean separation between audio handling, AI inference, and state management using modern .NET patterns:
The ASP.NET Core 10 minimal API provides HTTP endpoints for health checks, audio transcription, and sample file streaming. Minimal APIs reduce boilerplate while maintaining full middleware support for error handling, authentication, and CORS. The API design follows OpenAI's transcription endpoint specification, enabling drop-in replacement for existing integrations.
The service layer encapsulates business logic: FoundryModelService manages model loading and lifetime, TranscriptionService handles audio processing and AI inference, and SampleAudioService provides demonstration files for testing. This separation enables easy testing, dependency injection, and service swapping.
Foundry Local integration uses the Microsoft.AI.Foundry.Local.WinML SDK. Unlike cloud APIs requiring authentication and network calls, this SDK communicates directly with the local Foundry service via in-process calls. Models load once at startup, remaining resident in memory for sub-second inference on subsequent requests.
The static file frontend delivers vanilla HTML/CSS/JavaScript with no framework overhead. This simplicity aids healthcare IT security audits and enables deployment on locked-down hospital networks. The UI provides file upload, sample selection, audio preview, transcription requests, and result display with copy-to-clipboard functionality.
Here's the architectural flow for transcription requests:
Web UI (Upload Audio File)
↓
POST /v1/audio/transcriptions (Multipart Form Data)
↓
ASP.NET Core API Route
↓
TranscriptionService.TranscribeAudio(audioStream)
↓
Foundry Local Model (Whisper Medium locally)
↓
Text Result + Metadata (language, duration)
↓
Return JSON/Text Response
↓
Display in UI
This architecture embodies several healthcare system design principles:
- Data never leaves the device: All processing occurs on-premises, no external API calls
- No data persistence by default: Audio and transcripts are session-only, never saved unless explicitly configured
- Comprehensive health checks: System readiness verification before accepting PHI
- Audit logging support: Structured logging for compliance documentation
- Graceful degradation: Clear error messages when models are unavailable, rather than silent failures
Setting Up Foundry Local with Whisper Models
Foundry Local supports multiple Whisper model sizes, each with different accuracy/speed tradeoffs. For medical transcription, accuracy is paramount—misheard drug names or dosages create patient safety risks:
# Install Foundry Local (Windows)
winget install Microsoft.FoundryLocal

# Verify installation
foundry --version

# Download Whisper Medium model (optimal for medical accuracy)
foundry model add openai-whisper-medium-generic-cpu:1

# Check model availability
foundry model list
Whisper Medium (769M parameters) provides the best balance for medical use. Smaller models (Tiny, Base) frequently miss medical terminology. Larger models (Large) offer marginal accuracy gains at 3x inference time. Medium handles medical vocabulary well (drug names, anatomical terms, procedure names) while processing typical consultation audio (5-10 minutes) in under 30 seconds.
The application detects and loads the model automatically:
// Services/FoundryModelService.cs
using Microsoft.AI.Foundry.Local.WinML;
using Microsoft.Extensions.Options;

public class FoundryModelService {
    private readonly ILogger<FoundryModelService> _logger;
    private readonly FoundryOptions _options;
    private ILocalAIModel? _loadedModel;

    public FoundryModelService(
        ILogger<FoundryModelService> logger,
        IOptions<FoundryOptions> options) {
        _logger = logger;
        _options = options.Value;
    }

    public async Task<bool> InitializeModelAsync() {
        try {
            _logger.LogInformation(
                "Loading Foundry model: {ModelAlias}",
                _options.ModelAlias
            );

            // Load model from Foundry Local
            _loadedModel = await FoundryClient.LoadModelAsync(
                modelAlias: _options.ModelAlias,
                cancellationToken: CancellationToken.None
            );

            if (_loadedModel == null) {
                _logger.LogWarning("Model loaded but returned null instance");
                return false;
            }

            _logger.LogInformation(
                "Successfully loaded model: {ModelAlias}",
                _options.ModelAlias
            );
            return true;
        } catch (Exception ex) {
            _logger.LogError(
                ex,
                "Failed to load Foundry model: {ModelAlias}",
                _options.ModelAlias
            );
            return false;
        }
    }

    public ILocalAIModel? GetLoadedModel() => _loadedModel;

    public async Task UnloadModelAsync() {
        if (_loadedModel != null) {
            await FoundryClient.UnloadModelAsync(_loadedModel);
            _loadedModel = null;
            _logger.LogInformation("Model unloaded");
        }
    }
}
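The sample never calls UnloadModelAsync automatically. If you want the model released on graceful shutdown, one option is to hook the host lifetime; a minimal sketch (not part of the original listing) that reuses the modelService variable resolved in Program.cs later in this article:

// Program.cs, after building the app: release the model on graceful shutdown.
// ApplicationStopping fires before the host tears down; the blocking wait is
// acceptable here because lifetime callbacks cannot be awaited.
app.Lifetime.ApplicationStopping.Register(() =>
    modelService.UnloadModelAsync().GetAwaiter().GetResult());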
Configuration lives in appsettings.json, enabling easy customization without code changes:
{
"Foundry": {
"ModelAlias": "whisper-medium",
"LogLevel": "Information"
},
"Transcription": {
"MaxAudioDurationSeconds": 300,
"SupportedFormats": ["wav", "mp3", "m4a", "flac"],
"DefaultLanguage": "en"
}
}
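The classes these sections bind to are plain options POCOs. The following sketch infers their shape from the keys above; the TranscriptionSettings name is an assumption (the repository may organize this differently), and the property names must match the JSON keys for binding to work:

// Options/FoundryOptions.cs - bound to the "Foundry" section
public class FoundryOptions {
    public string ModelAlias { get; set; } = "whisper-medium";
    public string LogLevel { get; set; } = "Information";
}

// Options/TranscriptionSettings.cs - bound to the "Transcription" section (name assumed)
public class TranscriptionSettings {
    public int MaxAudioDurationSeconds { get; set; } = 300;
    public string[] SupportedFormats { get; set; } = ["wav", "mp3", "m4a", "flac"];
    public string DefaultLanguage { get; set; } = "en";
}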
Implementing Privacy-First Transcription Service
The transcription service handles audio processing while maintaining strict privacy controls. No audio or transcript persists beyond the HTTP request lifecycle unless explicitly configured:
// Services/TranscriptionService.cs
public class TranscriptionService {
    private readonly FoundryModelService _modelService;
    private readonly ILogger<TranscriptionService> _logger;

    public TranscriptionService(
        FoundryModelService modelService,
        ILogger<TranscriptionService> logger) {
        _modelService = modelService;
        _logger = logger;
    }

    public async Task<TranscriptionResult> TranscribeAudioAsync(
        Stream audioStream,
        string originalFileName,
        TranscriptionOptions? options = null) {
        options ??= new TranscriptionOptions();
        var startTime = DateTime.UtcNow;
        try {
            // Validate audio format
            ValidateAudioFormat(originalFileName);

            // Get loaded model
            var model = _modelService.GetLoadedModel();
            if (model == null) {
                throw new InvalidOperationException("Whisper model not loaded");
            }

            // Create temporary file (automatically deleted after transcription)
            using var tempFile = new TempAudioFile(audioStream);

            // Execute transcription
            _logger.LogInformation(
                "Starting transcription for file: {FileName}",
                originalFileName
            );
            var transcription = await model.TranscribeAsync(
                audioFilePath: tempFile.Path,
                language: options.Language,
                cancellationToken: CancellationToken.None
            );

            var duration = (DateTime.UtcNow - startTime).TotalSeconds;
            _logger.LogInformation(
                "Transcription completed in {Duration:F2}s",
                duration
            );

            return new TranscriptionResult {
                Text = transcription.Text,
                Language = transcription.Language ?? options.Language,
                Duration = transcription.AudioDuration,
                ProcessingTimeSeconds = duration,
                FileName = originalFileName,
                Timestamp = DateTime.UtcNow
            };
        } catch (Exception ex) {
            _logger.LogError(
                ex,
                "Transcription failed for file: {FileName}",
                originalFileName
            );
            throw;
        }
    }

    private void ValidateAudioFormat(string fileName) {
        var extension = Path.GetExtension(fileName).TrimStart('.');
        var supportedFormats = new[] { "wav", "mp3", "m4a", "flac", "ogg" };
        if (!supportedFormats.Contains(extension.ToLowerInvariant())) {
            throw new ArgumentException(
                $"Unsupported audio format: {extension}. " +
                $"Supported: {string.Join(", ", supportedFormats)}"
            );
        }
    }
}

// Temporary file wrapper that auto-deletes
internal class TempAudioFile : IDisposable {
    public string Path { get; }

    public TempAudioFile(Stream sourceStream) {
        Path = System.IO.Path.GetTempFileName();
        using var fileStream = File.OpenWrite(Path);
        sourceStream.CopyTo(fileStream);
    }

    public void Dispose() {
        try {
            if (File.Exists(Path)) {
                File.Delete(Path);
            }
        } catch {
            // Ignore deletion errors in temp folder
        }
    }
}
This service demonstrates several privacy-first patterns:
- Temporary file lifecycle management: Audio written to temp storage, automatically deleted after transcription
- No implicit persistence: Results returned to caller, not saved by service
- Format validation: Accept only supported audio formats to prevent processing errors
- Comprehensive logging: Audit trail for compliance without logging PHI content
- Error isolation: Exceptions contain diagnostic info but no patient data
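The TranscriptionOptions and TranscriptionResult types referenced in the service are simple DTOs. A sketch of their likely shape, inferred from how the service populates them (the actual repository definitions may differ):

// Models/TranscriptionOptions.cs - per-request options
public class TranscriptionOptions {
    public string Language { get; set; } = "en";
}

// Models/TranscriptionResult.cs - returned to the caller, never persisted by the service
public class TranscriptionResult {
    public string Text { get; set; } = string.Empty;
    public string Language { get; set; } = string.Empty;
    public double Duration { get; set; }              // audio length in seconds
    public double ProcessingTimeSeconds { get; set; } // wall-clock inference time
    public string FileName { get; set; } = string.Empty;
    public DateTime Timestamp { get; set; }
}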
Building the OpenAI-Compatible REST API
The API endpoint mirrors OpenAI's transcription API specification, enabling existing integrations to work without modifications:
// Program.cs
var builder = WebApplication.CreateBuilder(args);

// Configure services
builder.Services.Configure<FoundryOptions>(
    builder.Configuration.GetSection("Foundry")
);
builder.Services.AddSingleton<FoundryModelService>();
builder.Services.AddScoped<TranscriptionService>();
builder.Services.AddHealthChecks()
    .AddCheck<FoundryHealthCheck>("foundry-health");

var app = builder.Build();

// Load model at startup
var modelService = app.Services.GetRequiredService<FoundryModelService>();
await modelService.InitializeModelAsync();

app.UseHealthChecks("/health");
app.MapHealthChecks("/api/health/status");

// OpenAI-compatible transcription endpoint
app.MapPost("/v1/audio/transcriptions", async (
    HttpRequest request,
    TranscriptionService transcriptionService,
    ILogger<Program> logger) => {
    if (!request.HasFormContentType) {
        return Results.BadRequest(new {
            error = "Content-Type must be multipart/form-data"
        });
    }

    var form = await request.ReadFormAsync();

    // Extract audio file
    var audioFile = form.Files.GetFile("file");
    if (audioFile == null || audioFile.Length == 0) {
        return Results.BadRequest(new {
            error = "Audio file required in 'file' field"
        });
    }

    // Parse options (fall back to defaults when fields are absent)
    var format = form["format"].FirstOrDefault() ?? "text";
    var language = form["language"].FirstOrDefault() ?? "en";

    try {
        // Process transcription
        using var stream = audioFile.OpenReadStream();
        var result = await transcriptionService.TranscribeAudioAsync(
            audioStream: stream,
            originalFileName: audioFile.FileName,
            options: new TranscriptionOptions {
                Language = language
            }
        );

        // Return in requested format
        if (format == "json") {
            return Results.Json(new {
                text = result.Text,
                language = result.Language,
                duration = result.Duration
            });
        } else {
            // Default: plain text
            return Results.Text(result.Text);
        }
    } catch (Exception ex) {
        logger.LogError(ex, "Transcription request failed");
        return Results.StatusCode(500);
    }
})
.DisableAntiforgery() // File uploads need CSRF exemption
.WithName("TranscribeAudio")
.WithOpenApi();

app.Run();
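The FoundryHealthCheck registered above isn't shown in the listing. A minimal sketch of what it could look like, assuming it simply reports whether the model is resident, so the system never advertises readiness (and never accepts PHI) before the model loads:

// HealthChecks/FoundryHealthCheck.cs - a sketch; the repository's check may differ
using Microsoft.Extensions.Diagnostics.HealthChecks;

public class FoundryHealthCheck : IHealthCheck {
    private readonly FoundryModelService _modelService;

    public FoundryHealthCheck(FoundryModelService modelService) {
        _modelService = modelService;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) {
        // Healthy only when the Whisper model is loaded and resident in memory
        var result = _modelService.GetLoadedModel() != null
            ? HealthCheckResult.Healthy("Whisper model loaded and ready")
            : HealthCheckResult.Unhealthy("Whisper model not loaded");
        return Task.FromResult(result);
    }
}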
Example API usage:
# PowerShell
$audioFile = Get-Item "consultation-recording.wav"
$response = Invoke-RestMethod `
-Uri "http://localhost:5192/v1/audio/transcriptions" `
-Method Post `
-Form @{ file = $audioFile; format = "json" }
Write-Output $response.text
# cURL
curl -X POST http://localhost:5192/v1/audio/transcriptions \
-F "file=@consultation-recording.wav" \
-F "format=json"
Building the Interactive Web Frontend
The web UI provides a user-friendly interface for non-technical medical staff to transcribe recordings:
The page itself is plain HTML titled "SarahCare Medical Transcription", with the upload control, audio preview player, status badge, loading indicator, and result panel that the script below wires together; the full markup ships in the repository.
The JavaScript handles file uploads and API interactions:
// wwwroot/app.js
let selectedFile = null;

async function checkHealth() {
    try {
        const response = await fetch('/health');
        const statusEl = document.getElementById('status');
        if (response.ok) {
            statusEl.className = 'status-badge online';
            statusEl.textContent = '✓ System Ready';
        } else {
            statusEl.className = 'status-badge offline';
            statusEl.textContent = '✗ System Unavailable';
        }
    } catch (error) {
        console.error('Health check failed:', error);
    }
}

function handleFileSelect(event) {
    const file = event.target.files[0];
    if (!file) return;
    selectedFile = file;

    // Show file info
    const fileInfo = document.getElementById('fileInfo');
    fileInfo.textContent = `Selected: ${file.name} (${formatFileSize(file.size)})`;
    fileInfo.classList.remove('hidden');

    // Enable audio preview
    const preview = document.getElementById('audioPreview');
    preview.src = URL.createObjectURL(file);
    preview.classList.remove('hidden');

    // Enable transcribe button
    document.getElementById('transcribeBtn').disabled = false;
}

async function transcribeAudio() {
    if (!selectedFile) return;
    const loadingEl = document.getElementById('loadingIndicator');
    const resultEl = document.getElementById('resultSection');
    const transcribeBtn = document.getElementById('transcribeBtn');

    // Show loading state
    loadingEl.classList.remove('hidden');
    resultEl.classList.add('hidden');
    transcribeBtn.disabled = true;

    try {
        const formData = new FormData();
        formData.append('file', selectedFile);
        formData.append('format', 'json');

        const startTime = Date.now();
        const response = await fetch('/v1/audio/transcriptions', {
            method: 'POST',
            body: formData
        });
        if (!response.ok) {
            throw new Error(`HTTP ${response.status}: ${response.statusText}`);
        }

        const result = await response.json();
        const processingTime = ((Date.now() - startTime) / 1000).toFixed(1);

        // Display results
        document.getElementById('transcriptionText').value = result.text;
        document.getElementById('resultDuration').textContent =
            `Duration: ${result.duration.toFixed(1)}s`;
        document.getElementById('resultLanguage').textContent =
            `Language: ${result.language}`;
        resultEl.classList.remove('hidden');
        console.log(`Transcription completed in ${processingTime}s`);
    } catch (error) {
        console.error('Transcription failed:', error);
        alert(`Transcription failed: ${error.message}`);
    } finally {
        loadingEl.classList.add('hidden');
        transcribeBtn.disabled = false;
    }
}

function copyToClipboard() {
    const text = document.getElementById('transcriptionText').value;
    navigator.clipboard.writeText(text)
        .then(() => alert('Copied to clipboard'))
        .catch(err => console.error('Copy failed:', err));
}

// Initialize
window.addEventListener('load', () => {
    checkHealth();
    loadSamplesList();
});
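Two helpers referenced above, formatFileSize and loadSamplesList, aren't shown in the listing. A minimal sketch of each; the /api/samples endpoint path, the response shape, and the sampleSelect element ID are assumptions based on the SampleAudioService described earlier:

// Human-readable file sizes for the upload info line
function formatFileSize(bytes) {
    if (bytes < 1024) return `${bytes} B`;
    const units = ['KB', 'MB', 'GB'];
    let value = bytes;
    let unit = 'B';
    for (const u of units) {
        if (value < 1024) break;
        value /= 1024;
        unit = u;
    }
    return `${value.toFixed(1)} ${unit}`;
}

// Populate the sample-file picker from SampleAudioService
// (endpoint path, response shape, and element ID are assumptions)
async function loadSamplesList() {
    try {
        const response = await fetch('/api/samples');
        if (!response.ok) return;
        const samples = await response.json();
        const select = document.getElementById('sampleSelect');
        for (const sample of samples) {
            const option = document.createElement('option');
            option.value = sample.url;
            option.textContent = sample.name;
            select.appendChild(option);
        }
    } catch (error) {
        console.error('Failed to load samples:', error);
    }
}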
Key Takeaways and Production Considerations
Building HIPAA-compliant voice-to-text systems requires architectural decisions that prioritize data privacy over convenience. The FLWhisper application demonstrates that you can achieve accurate medical transcription, fast processing times, and intuitive user experiences entirely on-premises.
Critical lessons for healthcare AI:
- Privacy by architecture: Design systems where PHI never exists outside controlled environments, not as a configuration option
- No persistence by default: Audio and transcripts should be ephemeral unless explicitly saved with proper access controls
- Model selection matters: Whisper Medium provides medical terminology accuracy that smaller models miss
- Health checks enable reliability: Systems should verify model availability before accepting PHI
- Audit logging without content logging: Track operations for compliance without storing sensitive data in logs
For production deployment in clinical settings, integrate with EHR systems via HL7/FHIR interfaces. Implement role-based access control with Active Directory integration. Add digital signatures for transcript authentication. Configure automatic PHI redaction using clinical NLP models. Deploy on HIPAA-compliant infrastructure with proper physical security. Implement comprehensive audit logging meeting compliance requirements.
The complete implementation with ASP.NET Core API, Foundry Local integration, sample audio files, and comprehensive tests is available at github.com/leestott/FLWhisper. Clone the repository and follow the setup guide to experience privacy-first medical transcription.
Resources and Further Reading
- FLWhisper Repository - Complete C# implementation with .NET 10
- Quick Start Guide - Installation and usage instructions
- Microsoft Foundry Local Documentation - SDK reference and model catalog
- OpenAI Whisper Documentation - Model architecture and capabilities
- HIPAA Compliance Guidelines - HHS official guidance
- Testing Guide - Comprehensive test suite documentation