Scientific Performance Measurement with Foundry Local
Introduction
Selecting the right AI model for your application requires more than reading benchmark leaderboards. Published benchmarks measure academic capabilities such as question answering, reasoning, and coding, but your application has specific requirements: latency budgets, hardware constraints, and quality thresholds. How do you know whether Phi-4 provides acceptable quality for your document summarization use case? Will Qwen2.5-0.5B meet your 100ms response time requirement? Does your edge device have enough memory for Phi-3.5 Mini?
The answer lies in empirical testing: running actual models on your hardware with your workload patterns. This article demonstrates building FLPerformance, a comprehensive model benchmarking platform, with Node.js, React, and Microsoft Foundry Local. You'll learn how to implement scientific performance measurement, design meaningful benchmark suites, visualize multi-dimensional comparisons, and make data-driven model selection decisions.
Whether you're evaluating models for production deployment, optimizing inference costs, or validating hardware specifications, this platform provides the tools for rigorous performance analysis.
Why Model Benchmarking Requires Purpose-Built Tools
You cannot assess model performance by running a few manual tests and noting the results. Scientific benchmarking demands controlled conditions, statistically significant sample sizes, multi-dimensional metrics, and reproducible methodology. Here is why purpose-built tooling is essential.
Performance is multi-dimensional. A model might excel at throughput (tokens per second) but lag on latency (time to first token). Another might generate high-quality outputs slowly. Your application might prioritize consistency over average performance: a model with variable response times (high p95/p99 latency) creates a poor user experience even if its averages look good. Measuring all dimensions simultaneously enables informed tradeoffs.
Hardware matters enormously. Benchmark results from NVIDIA A100 GPUs don't predict performance on consumer laptops. NPU acceleration changes the picture again. Memory constraints affect which models can even load. Test on your actual deployment hardware or comparable specifications to get actionable results.
Concurrency reveals bottlenecks. A model that handles one request excellently might struggle with ten concurrent requests. Real applications experience variable load, so measuring only single-request performance misses critical scalability constraints. Controlled concurrency testing reveals these limits.
Statistical rigor prevents false conclusions. Running a prompt once and noting the response time tells you nothing about performance distribution. Was this result typical? An outlier? You need dozens or hundreds of trials to establish p50/p95/p99 percentiles, understand variance, and detect stability issues.
Comparison requires controlled experiments. Different prompts, different times of day, and different system loads all introduce confounding variables. Scientific comparison runs identical workloads across models sequentially, controlling for external factors.
Architecture: Three-Layer Performance Testing Platform
FLPerformance implements a clean separation between orchestration, measurement, and presentation:
The frontend React application provides model management, benchmark configuration, test execution, and results visualization. Users add models from the Foundry Local catalog, configure benchmark parameters (iterations, concurrency, timeout values), launch test runs, and view real-time progress. The results dashboard displays comparison tables, latency distribution charts, throughput graphs, and "best model for..." recommendations.
The backend Node.js/Express server orchestrates tests and captures metrics. It manages the single Foundry Local service instance, loads/unloads models as needed, executes benchmark suites with controlled concurrency, measures comprehensive metrics (TTFT, TPOT, total latency, throughput, error rates), and persists results to JSON storage. WebSocket connections provide real-time progress updates during long benchmark runs.
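The progress channel needs nothing exotic; a plain WebSocket broadcast is enough. The sketch below uses the ws package with an illustrative port and event shape, not the repository's actual implementation:
// Broadcast benchmark progress to all connected dashboard clients.
import { WebSocketServer, WebSocket } from 'ws';

const wss = new WebSocketServer({ port: 8081 }); // port is illustrative

export function broadcastProgress(update) {
  // update: e.g. { modelId, promptId, iteration, totalIterations } (hypothetical shape)
  const payload = JSON.stringify({ type: 'benchmark-progress', ...update });
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}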
Foundry Local SDK integration uses the official foundry-local-sdk npm package. The SDK manages the service lifecycle (starting, stopping, health checks) and handles model operations (downloading, loading into memory, unloading). It provides OpenAI-compatible inference APIs for consistent request formatting across models.
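Following the SDK's documented quickstart pattern, the wiring looks roughly like the sketch below; the model alias is an example and exact details may vary by SDK version:
// Sketch: start Foundry Local, load a model, and point an OpenAI-compatible client at it.
import { FoundryLocalManager } from 'foundry-local-sdk';
import OpenAI from 'openai';

const manager = new FoundryLocalManager();
const modelInfo = await manager.init('phi-3.5-mini'); // starts the service and loads the model
console.log('Loaded model info:', modelInfo);

const client = new OpenAI({
  baseURL: manager.endpoint, // local inference endpoint managed by the SDK
  apiKey: manager.apiKey     // placeholder key required by the OpenAI client
});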
The architecture supports comparing multiple models in a single run by loading them one at a time, executing identical benchmarks, and aggregating the results for comparison:
User Initiates Benchmark Run
↓
Backend receives {models: [...], suite: "default", iterations: 10}
↓
For each model:
1. Load model into Foundry Local
2. Execute benchmark suite
- For each prompt in suite:
* Run N iterations
* Measure TTFT, TPOT, total time
* Track errors and timeouts
* Calculate tokens/second
3. Aggregate statistics (mean, p50, p95, p99)
4. Unload model
↓
Store results with metadata
↓
Return comparison data to frontend
↓
Visualize performance metrics
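In code, that orchestration loop condenses to something like the sketch below. The loadModel, unloadModel, and saveResults helpers are illustrative placeholders; runBenchmarkSuite belongs to the BenchmarkExecutor shown in the next section:
// Sketch of the per-model orchestration loop (helper names are hypothetical).
async function runComparison(models, suite, executor) {
  const comparison = [];
  for (const modelId of models) {
    await loadModel(modelId);                                                // 1. load into Foundry Local
    const report = await executor.runBenchmarkSuite(modelId, suite.prompts); // 2-3. run suite, aggregate stats
    await unloadModel(modelId);                                              // 4. free memory for the next model
    comparison.push(report);
  }
  await saveResults(comparison); // persist results with metadata
  return comparison;             // frontend visualizes this comparison data
}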
Implementing Scientific Measurement Infrastructure
Accurate performance measurement requires instrumentation that captures multiple dimensions while adding as little overhead of its own as possible:
// src/server/benchmark.js
import { performance } from 'perf_hooks';
export class BenchmarkExecutor {
constructor(foundryClient, options = {}) {
this.client = foundryClient;
this.options = {
iterations: options.iterations || 10,
concurrency: options.concurrency || 1,
timeout_ms: options.timeout_ms || 30000,
warmup_iterations: options.warmup_iterations || 2
};
}
async runBenchmarkSuite(modelId, prompts) {
const results = [];
// Warmup phase (exclude from results)
console.log(`Running ${this.options.warmup_iterations} warmup iterations...`);
for (let i = 0; i < this.options.warmup_iterations; i++) {
await this.executePrompt(modelId, prompts[0].text);
}
// Actual benchmark runs
for (const prompt of prompts) {
console.log(`Benchmarking prompt: ${prompt.id}`);
const measurements = [];
for (let i = 0; i < this.options.iterations; i++) {
const measurement = await this.executeMeasuredPrompt(
modelId,
prompt.text
);
measurements.push(measurement);
// Small delay between iterations to stabilize
await sleep(100);
}
results.push({
prompt_id: prompt.id,
prompt_text: prompt.text,
measurements,
statistics: this.calculateStatistics(measurements)
});
}
return {
model_id: modelId,
timestamp: new Date().toISOString(),
config: this.options,
results
};
}
async executeMeasuredPrompt(modelId, promptText) {
const measurement = {
success: false,
error: null,
ttft_ms: null, // Time to first token
tpot_ms: null, // Time per output token
total_ms: null,
tokens_generated: 0,
tokens_per_second: 0
};
try {
const startTime = performance.now();
let firstTokenTime = null;
let tokenCount = 0;
// Streaming completion to measure TTFT
const stream = await this.client.chat.completions.create({
model: modelId,
messages: [{ role: 'user', content: promptText }],
max_tokens: 200,
temperature: 0.7,
stream: true
});
for await (const chunk of stream) {
if (chunk.choices[0]?.delta?.content) {
if (firstTokenTime === null) {
firstTokenTime = performance.now();
measurement.ttft_ms = firstTokenTime - startTime;
}
tokenCount++;
}
}
const endTime = performance.now();
measurement.total_ms = endTime - startTime;
measurement.tokens_generated = tokenCount;
if (tokenCount > 1 && firstTokenTime) {
// TPOT = time after first token / (tokens - 1)
const timeAfterFirstToken = endTime - firstTokenTime;
measurement.tpot_ms = timeAfterFirstToken / (tokenCount - 1);
measurement.tokens_per_second = 1000 / measurement.tpot_ms;
}
measurement.success = true;
} catch (error) {
measurement.error = error.message;
measurement.success = false;
}
return measurement;
}
calculateStatistics(measurements) {
const successful = measurements.filter(m => m.success);
const total = measurements.length;
if (successful.length === 0) {
return {
success_rate: 0,
error_rate: 1.0,
sample_size: total
};
}
const ttfts = successful.map(m => m.ttft_ms).filter(v => v !== null).sort((a, b) => a - b);
const tpots = successful.map(m => m.tpot_ms).filter(v => v !== null).sort((a, b) => a - b);
const totals = successful.map(m => m.total_ms).sort((a, b) => a - b);
const throughputs = successful.map(m => m.tokens_per_second).filter(v => v > 0);
return {
success_rate: successful.length / total,
error_rate: (total - successful.length) / total,
sample_size: total,
ttft: {
mean: mean(ttfts),
median: percentile(ttfts, 50),
p95: percentile(ttfts, 95),
p99: percentile(ttfts, 99),
min: Math.min(...ttfts),
max: Math.max(...ttfts)
},
tpot: tpots.length > 0 ? {
mean: mean(tpots),
median: percentile(tpots, 50),
p95: percentile(tpots, 95)
} : null,
total_latency: {
mean: mean(totals),
median: percentile(totals, 50),
p95: percentile(totals, 95),
p99: percentile(totals, 99)
},
throughput: {
mean_tps: mean(throughputs),
median_tps: percentile(throughputs, 50)
}
};
}
}
function mean(arr) {
return arr.reduce((sum, val) => sum + val, 0) / arr.length;
}
function percentile(sortedArr, p) {
const index = Math.ceil((sortedArr.length * p) / 100) - 1;
return sortedArr[Math.max(0, index)];
}
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
This measurement infrastructure captures:
- Time to First Token (TTFT): Critical for perceived responsiveness—users notice delays before output begins
- Time Per Output Token (TPOT): Determines generation speed after first token—affects throughput (a worked example follows this list)
- Total latency: End-to-end time—matters for batch processing and high-volume scenarios
- Tokens per second: Overall throughput metric—useful for capacity planning
- Statistical distributions: Mean alone masks variability—p95/p99 reveal tail latencies that impact user experience
- Success/error rates: Stability metrics—some models timeout or crash under load
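As a concrete check of these formulas: if a run streams 101 tokens in 2,100 ms total with a TTFT of 300 ms, then TPOT = (2,100 - 300) / (101 - 1) = 18 ms per token, which corresponds to 1,000 / 18 ≈ 55.6 tokens per second.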
Designing Meaningful Benchmark Suites
Benchmark quality depends on prompt selection. Generic prompts don't reflect real application behavior. Design suites that mirror actual use cases:
// benchmarks/suites/default.json
{
"name": "default",
"description": "General-purpose benchmark covering diverse scenarios",
"prompts": [
{
"id": "short-factual",
"text": "What is the capital of France?",
"category": "factual",
"expected_tokens": 5
},
{
"id": "medium-explanation",
"text": "Explain how photosynthesis works in 3-4 sentences.",
"category": "explanation",
"expected_tokens": 80
},
{
"id": "long-reasoning",
"text": "Analyze the economic factors that led to the 2008 financial crisis. Discuss at least 5 major causes with supporting details.",
"category": "reasoning",
"expected_tokens": 250
},
{
"id": "code-generation",
"text": "Write a Python function that finds the longest palindrome in a string. Include docstring and example usage.",
"category": "coding",
"expected_tokens": 150
},
{
"id": "creative-writing",
"text": "Write a short story (3 paragraphs) about a robot learning to paint.",
"category": "creative",
"expected_tokens": 200
}
]
}
This suite covers multiple dimensions:
- Length variation: Short (5 tokens), medium (80), long (250)—tests models across output ranges
- Task diversity: Factual recall, explanation, reasoning, code, creative—reveals capability breadth
- Token predictability: Expected token counts enable throughput calculations
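To connect a suite file to the executor from the previous section, a small driver along these lines works; client is the OpenAI-compatible client configured for Foundry Local, and the model alias is an example:
// Load a suite definition and run it through the BenchmarkExecutor shown earlier.
import { readFile } from 'fs/promises';

const suite = JSON.parse(await readFile('./benchmarks/suites/default.json', 'utf8'));
const executor = new BenchmarkExecutor(client, { iterations: 10, warmup_iterations: 2 });
const report = await executor.runBenchmarkSuite('phi-3.5-mini', suite.prompts);
console.log(report.results.map(r => ({ prompt: r.prompt_id, p95_ttft_ms: r.statistics.ttft?.p95 })));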
For production applications, create custom suites matching your actual workload:
{
"name": "customer-support",
"description": "Simulates actual customer support queries",
"prompts": [
{
"id": "product-question",
"text": "How do I reset my password for the customer portal?"
},
{
"id": "troubleshooting",
"text": "I'm getting error code 503 when trying to upload files. What should I do?"
},
{
"id": "policy-inquiry",
"text": "What is your refund policy for annual subscriptions?"
}
]
}
Visualizing Multi-Dimensional Performance Comparisons
Raw numbers don't reveal insights—visualization makes patterns obvious. The frontend implements several comparison views:
Comparison Table shows side-by-side metrics:
// frontend/src/components/ResultsTable.jsx
export function ResultsTable({ results }) {
  return (
    <table>
      <thead>
        <tr><th>Model</th><th>TTFT (ms)</th><th>TPOT (ms)</th><th>Throughput (tok/s)</th><th>P95 Latency</th><th>Error Rate</th></tr>
      </thead>
      <tbody>
        {results.map(result => (
          <tr key={result.model_id}>
            <td>{result.model_id}</td>
            <td>{result.stats.ttft.median.toFixed(0)} (p95: {result.stats.ttft.p95.toFixed(0)})</td>
            <td>{result.stats.tpot?.median.toFixed(1) || 'N/A'}</td>
            <td>{result.stats.throughput.median_tps.toFixed(1)}</td>
            <td>{result.stats.total_latency.p95.toFixed(0)} ms</td>
            <td className={result.stats.error_rate > 0.05 ? 'error' : 'success'}>{(result.stats.error_rate * 100).toFixed(1)}%</td>
          </tr>
        ))}
      </tbody>
    </table>
  );
}
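Highlighting the error-rate cell whenever more than 5% of runs fail (the 0.05 threshold above) surfaces unstable models at a glance before anyone starts comparing latency numbers.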
Latency Distribution Chart reveals performance consistency:
// Using Chart.js for visualization (via the react-chartjs-2 wrapper, assumed here)
import { Bar } from 'react-chartjs-2';

export function LatencyChart({ results }) {
const data = {
labels: results.map(r => r.model_id),
datasets: [
{
label: 'Median (p50)',
data: results.map(r => r.stats.total_latency.median),
backgroundColor: 'rgba(75, 192, 192, 0.5)'
},
{
label: 'p95',
data: results.map(r => r.stats.total_latency.p95),
backgroundColor: 'rgba(255, 206, 86, 0.5)'
},
{
label: 'p99',
data: results.map(r => r.stats.total_latency.p99),
backgroundColor: 'rgba(255, 99, 132, 0.5)'
}
]
};
return <Bar data={data} />;
}
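If the project uses Chart.js v3 or later, each chart's elements and scales must be registered once at startup. A minimal registration for the grouped bar chart above (an assumption about the charting setup, not code from the repository) looks like this:
// Register only what the grouped bar chart needs.
import { Chart, BarElement, CategoryScale, LinearScale, Legend, Tooltip } from 'chart.js';

Chart.register(BarElement, CategoryScale, LinearScale, Legend, Tooltip);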
Recommendations Engine synthesizes multi-dimensional comparison:
export function generateRecommendations(results) {
const recommendations = [];
// Find fastest TTFT (best perceived responsiveness)
const fastestTTFT = results.reduce((best, r) =>
r.stats.ttft.median < best.stats.ttft.median ? r : best
);
recommendations.push({
category: 'Fastest Response',
model: fastestTTFT.model_id,
reason: `Lowest median TTFT: ${fastestTTFT.stats.ttft.median.toFixed(0)}ms`
});
// Find highest throughput
const highestThroughput = results.reduce((best, r) =>
r.stats.throughput.median_tps > best.stats.throughput.median_tps ? r : best
);
recommendations.push({
category: 'Best Throughput',
model: highestThroughput.model_id,
reason: `Highest tok/s: ${highestThroughput.stats.throughput.median_tps.toFixed(1)}`
});
// Find most consistent (lowest p95-p50 spread)
const mostConsistent = results.reduce((best, r) => {
const spread = r.stats.total_latency.p95 - r.stats.total_latency.median;
const bestSpread = best.stats.total_latency.p95 - best.stats.total_latency.median;
return spread < bestSpread ? r : best;
});
recommendations.push({
category: 'Most Consistent',
model: mostConsistent.model_id,
reason: 'Lowest latency variance (p95-p50 spread)'
});
return recommendations;
}
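A minimal way to surface these recommendations in the dashboard, assuming a simple list component rather than the repository's actual UI, is:
// Hypothetical component that renders the recommendations computed above.
export function Recommendations({ results }) {
  return (
    <ul>
      {generateRecommendations(results).map(rec => (
        <li key={rec.category}>
          <strong>{rec.category}:</strong> {rec.model} ({rec.reason})
        </li>
      ))}
    </ul>
  );
}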
Key Takeaways and Benchmarking Best Practices
Effective model benchmarking requires scientific methodology, comprehensive metrics, and application-specific testing. FLPerformance demonstrates that rigorous performance measurement is accessible to any development team.
Critical principles for model evaluation:
- Test on target hardware: Results from cloud GPUs don't predict laptop performance
- Measure multiple dimensions: TTFT, TPOT, throughput, consistency all matter
- Use statistical rigor: Single runs mislead—capture distributions with adequate sample sizes
- Design realistic workloads: Generic benchmarks don't predict your application's behavior
- Include warmup iterations: Model loading and JIT compilation affect early measurements
- Control concurrency: Real applications handle multiple requests—test at realistic loads (see the sketch after this list)
- Document methodology: Reproducible results require documented procedures and configurations
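A minimal concurrency sketch, built on the executeMeasuredPrompt method shown earlier (the batching itself is not part of the code above):
// Fire `concurrency` identical requests at once and collect one measurement per request.
async function runConcurrentBatch(executor, modelId, promptText, concurrency) {
  const batch = Array.from({ length: concurrency }, () =>
    executor.executeMeasuredPrompt(modelId, promptText)
  );
  return Promise.all(batch); // per-request measurements captured under contention
}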
The complete benchmarking platform with model management, measurement infrastructure, visualization dashboards, and comprehensive documentation is available at github.com/leestott/FLPerformance. Clone the repository and run the startup script to begin evaluating models on your hardware.
Resources and Further Reading
- FLPerformance Repository - Complete benchmarking platform
- Quick Start Guide - Setup and first benchmark run
- Microsoft Foundry Local Documentation - SDK reference and model catalog
- Architecture Guide - System design and SDK integration
- Benchmarking Best Practices - Methodology and troubleshooting