Agent Evaluation in Microsoft Copilot Studio is now generally available

Efrat_Gilboa, Microsoft
Mar 31, 2026

As agents move into production, evaluations help turn each build from an experiment into a reliable system. And they help answer the question that matters most in production: Can we trust this agent to behave correctly, consistently, and safely — every time?

Manual testing simply can't scale to answer that question. Spot-checking responses one by one is slow, inconsistent, and not built for agents that handle hundreds or thousands of interactions. Agent Evaluation in Microsoft Copilot Studio helps fill that gap.

Today, we are giving every maker a better way to assess agent behavior at scale—before launch and over the agent's lifecycle. Agent Evaluation is now generally available.

Validate production readiness before launch and after every change

Agent Evaluation is built directly into Copilot Studio—there’s no separate tool to install and no integrations to configure. Within the agent, the evaluation experience provides an end-to-end workflow for creating test cases, running evals, and reviewing results, all without writing a single line of code.

Whether you're a maker validating readiness before publishing, a quality assurance (QA) team enforcing organizational standards, an agent owner preparing for rollout, or a compliance team that needs documented evidence of agent behavior, Agent Evaluation is designed to integrate into the workflows teams already use to ship and operate agents.

Designed to build trust at scale

Agent Evaluation is designed for organizations that carry real accountability for the agents they deploy. That means evals need to fit into existing workflows, help gather compliance documentation, and produce results that hold up to scrutiny.

  • Versioned and auditable results

Every evaluation run produces a structured record, including the test set used, the user profile that ran it, the date and duration, and the results from each grader for every test case. These records are available in the evaluation history view, where teams can track performance over time and compare results across runs. For regulated industries and compliance-driven deployments, this record is the artifact that can help demonstrate that an agent was tested against defined behavioral standards before reaching users.

  • Identity-based evaluation

Each evaluation run is associated with a selected user profile. The agent is evaluated under that identity, using the same knowledge sources, tools, and connectors that user would access in production. This helps ensure evaluation results reflect real-world behavior rather than a simplified test environment.

  • API-based evaluation

For teams that operate continuous integration and delivery pipelines, Agent Evaluation is available via API. Teams can retrieve test sets, trigger evaluation runs, and track results programmatically, integrating evals directly into existing deployment workflows to assess agent behavior proactively at scale.
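
As a sketch of what that integration might look like, the snippet below triggers an evaluation run from a CI job and polls for the result. The endpoint paths, payload fields, and response shape here are assumptions for illustration, not the documented API surface; consult the Copilot Studio API reference for the actual contract.

```python
# Hypothetical sketch of driving Agent Evaluation from a CI pipeline.
# The base URL, routes, payload fields, and response keys are assumptions
# for illustration only -- check the Copilot Studio API docs for the
# real contract.
import os
import time

import requests

BASE_URL = "https://api.example.com/copilotstudio"  # placeholder host
HEADERS = {"Authorization": f"Bearer {os.environ['EVAL_API_TOKEN']}"}


def run_evaluation(agent_id: str, test_set_id: str) -> dict:
    """Trigger an evaluation run and poll until it completes."""
    resp = requests.post(
        f"{BASE_URL}/agents/{agent_id}/evaluations",
        headers=HEADERS,
        json={"testSetId": test_set_id},  # assumed payload field
        timeout=30,
    )
    resp.raise_for_status()
    run_id = resp.json()["runId"]  # assumed response key

    while True:
        status = requests.get(
            f"{BASE_URL}/agents/{agent_id}/evaluations/{run_id}",
            headers=HEADERS,
            timeout=30,
        ).json()
        if status["state"] in ("completed", "failed"):
            return status
        time.sleep(10)  # poll interval


if __name__ == "__main__":
    result = run_evaluation("my-agent-id", "regression-test-set")
    # Gate the pipeline on an aggregated pass rate (assumed field).
    if result.get("passRate", 0) < 0.95:
        raise SystemExit("Evaluation below threshold; blocking deployment.")
```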


Running an evaluation: from test case to results

Agent Evaluation in Copilot Studio follows a guided workflow that takes makers from setup to results without leaving the product.

Step 1: Create a test set

Evaluation starts with creating a test set—a collection of questions or scenarios used to assess an agent’s behavior. Makers can build test sets in multiple ways: uploading a CSV with prepared questions and expected responses, writing targeted questions manually, or generating questions from production conversations based on common topics.
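
For the CSV route, a minimal file might look like the following. The column names here are illustrative assumptions; use the header format Copilot Studio expects when you choose the upload option.

```csv
question,expected_response
"What is your return policy?","Items can be returned within 30 days with a receipt."
"Do you ship internationally?","Yes, we ship to most regions; delivery takes 7-14 business days."
```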

To help teams save time configuring test questions, Copilot Studio includes built-in AI generation options:

  • The quick question set generates 10 questions instantly based on the agent’s description, instructions, and capabilities, providing an initial signal with minimal preparation required.
  • The full question set generates up to 100 questions drawn from the agent’s knowledge sources or defined topics, helping teams build broader coverage grounded in the agent’s actual content.

Step 2: Configure evaluation methods

With test cases in place, makers can determine how evaluations measure agent responses by selecting one or more test methods. Built-in methods cover a range of evaluation dimensions, including:

  • General response quality
  • Semantic meaning relative to an expected answer
  • Keyword presence
  • Text similarity
  • Exact match
  • Capability usage

For organizations that need to go beyond these dimensions, Custom Graders (available as a Classification method) let makers encode their organization's policies, quality standards, or other rules directly into the evaluation.

Multiple methods can be combined in a single test run, giving teams a layered view of agent performance.
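
To make a couple of these dimensions concrete, here is a rough sketch of what a keyword-presence check and a text-similarity check compute. These are simplified stand-ins written for illustration, not Copilot Studio's actual graders, which also include LLM-based judgments such as general response quality.

```python
# Simplified, illustrative stand-ins for two evaluation dimensions.
# Not Copilot Studio's implementation.
from difflib import SequenceMatcher


def keyword_presence(response: str, keywords: list[str]) -> bool:
    """Pass if every required keyword appears in the response."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)


def text_similarity(response: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass if the response is lexically close to the expected answer."""
    ratio = SequenceMatcher(None, response.lower(), expected.lower()).ratio()
    return ratio >= threshold


print(keyword_presence("Returns are accepted within 30 days.", ["returns", "30 days"]))
print(text_similarity("Returns accepted within 30 days", "Returns are accepted within 30 days."))
```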


Step 3: Run the evaluation and review results

Once the test set and methods are configured, makers can run the evaluation directly from Copilot Studio. Results appear in a structured table, with each row representing a test case and each column representing an evaluation method.

Pass and fail signals are visible immediately, and the Evaluation summary panel shows aggregated scores across all methods for a given run. Selecting an individual test case opens a detailed view with the agent's full response, the result and explanation from each grader, the expected answer where one was provided, and the knowledge sources the agent used to generate its response.

Because a test set can be saved and reused, evaluation becomes a repeatable quality check across agent versions. When a prompt changes, a knowledge source is updated, or a new capability is added, the same test set can run again, producing consistent, comparable signals that help teams validate changes before they reach end users.
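
As an illustration of that repeatable check, the sketch below compares two runs of the same test set, such as before and after a prompt change, and flags regressions. The pass/fail dictionaries are a stand-in for however your team exports results from the evaluation history view or the API.

```python
# Minimal sketch of comparing two runs of the same test set.
# The dicts map test-case IDs to pass/fail and are illustrative;
# populate them from your exported evaluation results.
def regression_check(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return test-case IDs that passed in the baseline but fail in the candidate run."""
    return [
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    ]


if __name__ == "__main__":
    baseline = {"refund-policy": True, "store-hours": True}
    candidate = {"refund-policy": True, "store-hours": False}
    regressions = regression_check(baseline, candidate)
    if regressions:
        print("Regressed test cases:", ", ".join(regressions))
```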


What's next for Agent Evaluation?

General availability establishes the foundation. From here, the plan is to expand evaluation coverage to support multi-turn conversations, deeper automation, and more of the deployment lifecycle, so organizations can monitor agent reliability at scale.

The goal is evaluation that travels with your agent from first build through ongoing production use. And you can start today. Open the Evaluation tab in Copilot Studio, choose a test method, and run your first evaluation in minutes. No code required.

Log in to Copilot Studio to start evaluating agents—or explore the roadmap to see what's next for Agent Evaluation.
