Microsoft Foundry Blog

Evaluating Generative AI Models Using Microsoft Foundry’s Continuous Evaluation Framework

navprsingh
Jan 08, 2026

As Generative AI moves into production, one challenge stands out: how do you know your model is still performing well after deployment? Unlike traditional software, GenAI models must be evaluated not just for correctness, but for relevance, safety, bias, and cost—on an ongoing basis. Microsoft Foundry’s Continuous Evaluation Framework helps teams measure, monitor, and improve AI responses throughout the MLOps lifecycle.

In this article, we’ll explore how to design, configure, and operationalize model evaluation using Microsoft Foundry’s built-in capabilities and best practices.

Why Continuous Evaluation Matters

Unlike traditional static applications, Generative AI systems evolve due to:

  • New prompts
  • Updated datasets
  • Versioned or fine-tuned models
  • Reinforcement loops

Without ongoing evaluation, teams risk quality degradation, hallucinations, and unintended bias moving into production.

How Evaluation Differs: Traditional Apps vs. Generative AI Models

  • Functionality: Unit tests vs. content quality and factual accuracy 
  • Performance: Latency and throughput vs. relevance and token efficiency 
  • Safety: Vulnerability scanning vs. harmful or policy-violating outputs 
  • Reliability: CI/CD testing vs. continuous runtime evaluation
 

Continuous evaluation bridges these gaps — ensuring that AI systems remain accurate, safe, and cost-efficient throughout their lifecycle.

Step 1 — Set Up Your Evaluation Project in Microsoft Foundry

  1. Open Microsoft Foundry Portal → navigate to your workspace.
  2. Click “Evaluation” from the left navigation pane.
  3. Create a new Evaluation Pipeline and link your Foundry-hosted model endpoint, including Foundry-managed Azure OpenAI models or custom fine-tuned deployments.
  4. Choose or upload your test dataset — e.g., sample prompts and expected outputs (ground truth).

Example CSV:

| prompt | expected response |
|---|---|
| Summarize this article about sustainability. | A concise, factual summary without personal opinions. |
| Generate a polite support response for a delayed shipment. | Apologetic, empathetic tone acknowledging the delay. |
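
If you keep your ground truth in a CSV like the one above, a small script can convert it into the JSONL format commonly used for evaluation datasets. This is a minimal sketch: the file names and the JSONL field names (query, ground_truth) are illustrative assumptions.

import csv
import json

# Minimal sketch: convert the CSV of prompts and expected responses above into
# a JSONL evaluation dataset. File names and field names are assumptions.
with open("eval_set.csv", newline="", encoding="utf-8") as src, \
        open("customer_support_evalset.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            "query": row["prompt"],                    # the input prompt
            "ground_truth": row["expected response"],  # the expected answer
        }
        dst.write(json.dumps(record) + "\n")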


Step 2 — Define Evaluation Metrics

Microsoft Foundry supports both built-in metrics and custom evaluators that measure the quality and responsibility of model responses.

| Category | Example Metric | Purpose |
|---|---|---|
| Quality | Relevance, Fluency, Coherence | Assess linguistic and contextual quality |
| Factual Accuracy | Groundedness (how well responses align with verified source data), Correctness | Ensure information aligns with source content |
| Safety | Harmfulness, Policy Violation | Detect unsafe or biased responses |
| Efficiency | Latency, Token Count | Measure operational performance |
| User Experience | Helpfulness, Tone, Completeness | Evaluate from a human interaction perspective |
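
Beyond the built-in metrics, a custom evaluator can be a plain Python callable that receives the fields it needs and returns a dictionary of scores, which you can register alongside the built-in evaluators when you run an evaluation (see Step 3). The "completeness" heuristic below is a purely illustrative assumption, not a built-in Foundry metric.

# Minimal sketch of a custom evaluator: a callable that returns a dict of scores.
class CompletenessEvaluator:
    def __init__(self, min_words: int = 20):
        self.min_words = min_words

    def __call__(self, *, response: str, **kwargs):
        # Score how "complete" a response looks based on a simple word count.
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "completeness": min(1.0, word_count / self.min_words),
        }

It can then be passed in the evaluators dictionary of an evaluation run, for example evaluators={"completeness": CompletenessEvaluator(), ...}.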


Continuous Evaluation Lifecycle

[Diagram: the continuous evaluation lifecycle]
Step 3 — Run Evaluation Pipelines

Once configured, click “Run Evaluation” to start the process.
Microsoft Foundry automatically sends your prompts to the model, compares responses with the expected outcomes, and computes all selected metrics.

Sample snippet using the Azure AI Evaluation SDK (azure-ai-evaluation):

from azure.ai.evaluation import evaluate, RelevanceEvaluator, FluencyEvaluator

# Model configuration used by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": "<your-azure-openai-endpoint>",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

# Run the selected evaluators against the JSONL evaluation dataset.
# Safety evaluators (see Step 6) are configured with project details instead.
evaluate(
    data="customer_support_evalset.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config), "fluency": FluencyEvaluator(model_config)},
    output_path="evaluation_results.json",
)


This generates structured evaluation data that can be visualized in the Evaluation Dashboard or queried in Application Insights using KQL (Kusto Query Language).
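
You can also inspect the results file directly. The sketch below assumes the output JSON exposes a top-level "metrics" dictionary of aggregate scores; adjust the key names to whatever structure your run actually produces.

import json

# Minimal sketch: print aggregate scores from the evaluation results file.
# Assumes a top-level "metrics" dictionary (name -> score), which may vary
# across SDK versions.
with open("evaluation_results.json", encoding="utf-8") as f:
    results = json.load(f)

for name, score in sorted(results.get("metrics", {}).items()):
    print(f"{name:<40} {score}")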

Step 4 — Analyze Evaluation Results

After the run completes, navigate to the Evaluation Dashboard.
You’ll find detailed insights such as:

  • Overall model quality score (e.g., 0.91 composite score)
  • Token efficiency per request
  • Safety violation rate (e.g., 0.8% unsafe responses)
  • Metric trends across model versions

Example summary table:

| Metric | Target | Current | Trend |
|---|---|---|---|
| Relevance | >0.9 | 0.94 | ✅ Stable |
| Fluency | >0.9 | 0.91 | ✅ Improving |
| Safety | <1% | 0.6% | ✅ On track |
| Latency | <2s | 1.8s | ✅ Efficient |

Step 5 — Automate and integrate with MLOps

Continuous Evaluation works best when it’s part of your DevOps or MLOps pipeline.

  • Integrate with Azure DevOps or GitHub Actions using the Foundry SDK.
  • Run evaluation automatically on every model update or deployment.
  • Set alerts in Azure Monitor to notify you when quality or safety drops below a threshold.

Example workflow:
🧩 Prompt Update → Evaluation Run → Results Logged → Metrics Alert → Model Retraining Triggered.
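
To make the alerting step concrete, here is a minimal sketch of a quality gate you could run from Azure DevOps or GitHub Actions after an evaluation: it fails the pipeline step when a metric misses its target. The metric names and thresholds are assumptions to align with your own targets.

import json
import sys

# Minimal CI/CD quality gate sketch: exit non-zero when any metric misses its
# target so the pipeline step is marked as failed. Names and thresholds are
# illustrative assumptions.
THRESHOLDS = {"relevance": 0.90, "fluency": 0.90}

with open("evaluation_results.json", encoding="utf-8") as f:
    metrics = json.load(f).get("metrics", {})

failures = [
    f"{name}: {metrics.get(name, 0.0):.2f} is below target {target}"
    for name, target in THRESHOLDS.items()
    if metrics.get(name, 0.0) < target
]

if failures:
    print("Evaluation gate failed:\n" + "\n".join(failures))
    sys.exit(1)

print("Evaluation gate passed.")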

Step 6 — Apply Responsible AI & Human Review

Microsoft Foundry integrates Responsible AI and safety evaluation directly through Foundry safety evaluators and Azure AI services. These evaluators help detect harmful, biased, or policy-violating outputs during continuous evaluation runs.
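
One way to run these safety checks from code is with the SDK's AI-assisted safety evaluators, such as ContentSafetyEvaluator, which are configured with your Foundry project details rather than a model deployment. In this sketch the project endpoint is a placeholder, and the exact form of azure_ai_project (an endpoint string vs. a dict of subscription, resource group, and project name) depends on your SDK version.

from azure.ai.evaluation import evaluate, ContentSafetyEvaluator
from azure.identity import DefaultAzureCredential

# Minimal sketch: add an AI-assisted safety evaluator to an evaluation run.
# The project endpoint below is a placeholder.
safety = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project="https://<your-foundry-project-endpoint>",
)

evaluate(
    data="customer_support_evalset.jsonl",
    evaluators={"content_safety": safety},
    output_path="safety_results.json",
)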

Example:

| Test Prompt | Before Evaluation | After Evaluation |
|---|---|---|
| "What is the refund policy?" | Vague, hallucinated details | Precise, aligned to source content, compliant tone |

Quick Checklist for Implementing Continuous Evaluation

      

  • Define expected outputs or ground-truth datasets
  • Select quality + safety + efficiency metrics
  • Automate evaluations in CI/CD or MLOps pipelines
  • Set alerts for drift, hallucination, or cost spikes
  • Review metrics regularly and retrain/update models

When to trigger re-evaluation

  • Re-evaluation should happen not only at deployment, but also when prompts evolve, new datasets are ingested, models are fine-tuned, or usage patterns shift.

Key Takeaways

  • Continuous Evaluation is essential for maintaining AI quality and safety at scale.
  • Microsoft Foundry offers an integrated evaluation framework — from datasets to dashboards — within your existing Azure ecosystem.
  • You can combine automated metrics, human feedback, and responsible AI checks for holistic model evaluation.
  • Embedding evaluation into your CI/CD workflows ensures ongoing trust and transparency in every release.


Updated Dec 19, 2025
Version 1.0