Microsoft Foundry Blog

Evaluating Generative AI Models Using Microsoft Foundry’s Continuous Evaluation Framework

navprsingh
Jan 08, 2026

As Generative AI moves into production, one challenge stands out: how do you know your model is still performing well after deployment? Unlike traditional software, GenAI models must be evaluated not just for correctness, but for relevance, safety, bias, and cost—on an ongoing basis. Microsoft Foundry’s Continuous Evaluation Framework helps teams measure, monitor, and improve AI responses throughout the MLOps lifecycle.

In this article, we’ll explore how to design, configure, and operationalize model evaluation using Microsoft Foundry’s built-in capabilities and best practices.

Why Continuous Evaluation Matters

Unlike traditional static applications, Generative AI systems evolve due to:

  • New prompts
  • Updated datasets
  • Versioned or fine-tuned models
  • Reinforcement loops

Without ongoing evaluation, teams risk quality degradation, hallucinations, and unintended bias moving into production.

How Evaluation Differs: Traditional Apps vs. Generative AI Models

  • Functionality: Unit tests vs. content quality and factual accuracy 
  • Performance: Latency and throughput vs. relevance and token efficiency 
  • Safety: Vulnerability scanning vs. harmful or policy-violating outputs 
  • Reliability: CI/CD testing vs. continuous runtime evaluation
 

Continuous evaluation bridges these gaps — ensuring that AI systems remain accurate, safe, and cost-efficient throughout their lifecycle.

Step 1 — Set Up Your Evaluation Project in Microsoft Foundry

  1. Open Microsoft Foundry Portal → navigate to your workspace.
  2. Click “Evaluation” from the left navigation pane.
  3. Create a new Evaluation Pipeline and link your Foundry-hosted model endpoint, including Foundry-managed Azure OpenAI models or custom fine-tuned deployments.
  4. Choose or upload your test dataset — e.g., sample prompts and expected outputs (ground truth).

Example CSV:

| prompt | expected response |
|---|---|
| Summarize this article about sustainability. | A concise, factual summary without personal opinions. |
| Generate a polite support response for a delayed shipment. | Apologetic, empathetic tone acknowledging the delay. |
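
If you keep your ground truth in a CSV like the one above, a small script can convert it into the JSONL format commonly used for evaluation datasets. This is a minimal sketch: the file names and the JSONL field names (query, ground_truth) are illustrative assumptions.

import csv
import json

# Minimal sketch: convert the CSV of prompts and expected responses above into
# a JSONL evaluation dataset. File names and field names are assumptions.
with open("eval_set.csv", newline="", encoding="utf-8") as src, \
        open("customer_support_evalset.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        record = {
            "query": row["prompt"],                    # the input prompt
            "ground_truth": row["expected response"],  # the expected answer
        }
        dst.write(json.dumps(record) + "\n")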


Step 2 — Define Evaluation Metrics

Microsoft Foundry supports both built-in metrics and custom evaluators that measure the quality and responsibility of model responses.

| Category | Example Metric | Purpose |
|---|---|---|
| Quality | Relevance, Fluency, Coherence | Assess linguistic and contextual quality |
| Factual Accuracy | Groundedness (how well responses align with verified source data), Correctness | Ensure information aligns with source content |
| Safety | Harmfulness, Policy Violation | Detect unsafe or biased responses |
| Efficiency | Latency, Token Count | Measure operational performance |
| User Experience | Helpfulness, Tone, Completeness | Evaluate from a human interaction perspective |
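
Beyond the built-in metrics, a custom evaluator can be a plain Python callable that receives the fields it needs and returns a dictionary of scores, which you can register alongside the built-in evaluators when you run an evaluation (see Step 3). The "completeness" heuristic below is a purely illustrative assumption, not a built-in Foundry metric.

# Minimal sketch of a custom evaluator: a callable that returns a dict of scores.
class CompletenessEvaluator:
    def __init__(self, min_words: int = 20):
        self.min_words = min_words

    def __call__(self, *, response: str, **kwargs):
        # Score how "complete" a response looks based on a simple word count.
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "completeness": min(1.0, word_count / self.min_words),
        }

It can then be passed in the evaluators dictionary of an evaluation run, for example evaluators={"completeness": CompletenessEvaluator(), ...}.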


Continuous Evaluation Lifecycle

[Diagram: the continuous evaluation lifecycle]
Step 3 — Run Evaluation Pipelines

Once configured, click “Run Evaluation” to start the process.
Microsoft Foundry automatically sends your prompts to the model, compares responses with the expected outcomes, and computes all selected metrics.

Sample snippet using the Azure AI Evaluation SDK (azure-ai-evaluation):

from azure.ai.evaluation import evaluate, RelevanceEvaluator, FluencyEvaluator

# Model configuration used by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": "<your-azure-openai-endpoint>",
    "api_key": "<your-api-key>",
    "azure_deployment": "gpt-4o",
}

# Run the selected evaluators against the JSONL evaluation dataset.
# Safety evaluators (see Step 6) are configured with project details instead.
evaluate(
    data="customer_support_evalset.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config), "fluency": FluencyEvaluator(model_config)},
    output_path="evaluation_results.json",
)


This generates structured evaluation data that can be visualized in the Evaluation Dashboard or queried in Application Insights using KQL (Kusto Query Language).
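
You can also inspect the results file directly. The sketch below assumes the output JSON exposes a top-level "metrics" dictionary of aggregate scores; adjust the key names to whatever structure your run actually produces.

import json

# Minimal sketch: print aggregate scores from the evaluation results file.
# Assumes a top-level "metrics" dictionary (name -> score), which may vary
# across SDK versions.
with open("evaluation_results.json", encoding="utf-8") as f:
    results = json.load(f)

for name, score in sorted(results.get("metrics", {}).items()):
    print(f"{name:<40} {score}")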

Step 4 — Analyze Evaluation Results

After the run completes, navigate to the Evaluation Dashboard.
You’ll find detailed insights such as:

  • Overall model quality score (e.g., 0.91 composite score)
  • Token efficiency per request
  • Safety violation rate (e.g., 0.8% unsafe responses)
  • Metric trends across model versions

Example summary table:

| Metric | Target | Current | Trend |
|---|---|---|---|
| Relevance | >0.9 | 0.94 | ✅ Stable |
| Fluency | >0.9 | 0.91 | ✅ Improving |
| Safety | <1% | 0.6% | ✅ On track |
| Latency | <2s | 1.8s | ✅ Efficient |

Step 5 — Automate and integrate with MLOps

Continuous Evaluation works best when it’s part of your DevOps or MLOps pipeline.

  • Integrate with Azure DevOps or GitHub Actions using the Foundry SDK.
  • Run evaluation automatically on every model update or deployment.
  • Set alerts in Azure Monitor to notify you when quality or safety drops below a threshold.

Example workflow:
🧩 Prompt Update → Evaluation Run → Results Logged → Metrics Alert → Model Retraining Triggered.
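
To make the alerting step concrete, here is a minimal sketch of a quality gate you could run from Azure DevOps or GitHub Actions after an evaluation: it fails the pipeline step when a metric misses its target. The metric names and thresholds are assumptions to align with your own targets.

import json
import sys

# Minimal CI/CD quality gate sketch: exit non-zero when any metric misses its
# target so the pipeline step is marked as failed. Names and thresholds are
# illustrative assumptions.
THRESHOLDS = {"relevance": 0.90, "fluency": 0.90}

with open("evaluation_results.json", encoding="utf-8") as f:
    metrics = json.load(f).get("metrics", {})

failures = [
    f"{name}: {metrics.get(name, 0.0):.2f} is below target {target}"
    for name, target in THRESHOLDS.items()
    if metrics.get(name, 0.0) < target
]

if failures:
    print("Evaluation gate failed:\n" + "\n".join(failures))
    sys.exit(1)

print("Evaluation gate passed.")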

Step 6 — Apply Responsible AI & Human Review

Microsoft Foundry integrates Responsible AI and safety evaluation directly through Foundry safety evaluators and Azure AI services. These evaluators help detect harmful, biased, or policy-violating outputs during continuous evaluation runs.
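
One way to run these safety checks from code is with the SDK's AI-assisted safety evaluators, such as ContentSafetyEvaluator, which are configured with your Foundry project details rather than a model deployment. In this sketch the project endpoint is a placeholder, and the exact form of azure_ai_project (an endpoint string vs. a dict of subscription, resource group, and project name) depends on your SDK version.

from azure.ai.evaluation import evaluate, ContentSafetyEvaluator
from azure.identity import DefaultAzureCredential

# Minimal sketch: add an AI-assisted safety evaluator to an evaluation run.
# The project endpoint below is a placeholder.
safety = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project="https://<your-foundry-project-endpoint>",
)

evaluate(
    data="customer_support_evalset.jsonl",
    evaluators={"content_safety": safety},
    output_path="safety_results.json",
)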

Example:

| Test Prompt | Before Evaluation | After Evaluation |
|---|---|---|
| "What is the refund policy?" | Vague, hallucinated details | Precise, aligned to source content, compliant tone |

Quick Checklist for Implementing Continuous Evaluation

      

  • Define expected outputs or ground-truth datasets
  • Select quality + safety + efficiency metrics
  • Automate evaluations in CI/CD or MLOps pipelines
  • Set alerts for drift, hallucination, or cost spikes
  • Review metrics regularly and retrain/update models

When to trigger re-evaluation

  • Re-evaluation should happen not only at deployment, but also when prompts evolve, new datasets are ingested, models are fine-tuned, or usage patterns shift.

Key Takeaways

  • Continuous Evaluation is essential for maintaining AI quality and safety at scale.
  • Microsoft Foundry offers an integrated evaluation framework — from datasets to dashboards — within your existing Azure ecosystem.
  • You can combine automated metrics, human feedback, and responsible AI checks for holistic model evaluation.
  • Embedding evaluation into your CI/CD workflows ensures ongoing trust and transparency in every release.


Updated Dec 19, 2025
Version 1.0