When you build an agent in Microsoft Copilot Studio, you want confidence that it behaves exactly as intended: answering correctly, using the right tools, and following the logic you designed. Agent Evaluation (generally available) provides this foundation by allowing you to define test sets, run them against your agent, and understand how it performs.
As agents evolve from experimentation into real production scenarios, this foundation becomes part of an ongoing process. Evaluation is no longer a one-time step, but a continuous part of the development lifecycle. Teams are looking to validate changes quickly, track quality over time, and ensure consistent behavior across updates, environments, and use cases.
To support this, evaluation scales alongside your agents. Automated evaluation enables teams to expand their testing coverage, run evaluations more frequently, and establish consistent quality signals across the lifecycle. It brings evaluation closer to the way modern systems are built: iterative, data-driven, and continuously improving.
To realize this at scale, evaluation needs to integrate directly into the workflows and systems you already use.
Now, these same evaluation capabilities are available programmatically through the Power Platform REST API and connectors. Here’s how you can use these Evaluation APIs to automate agent evaluation as part of your development and release workflows.
What you can do with the Evaluation APIs
The Evaluation APIs expose the core evaluation experience as programmable endpoints. Using those endpoints, you can trigger evaluations on demand, integrate evaluations into pipelines and approval workflows, and design processes that rely on the results. Whether you prefer a code-first approach with APIs or a low-code experience using Microsoft Power Automate flows and Copilot Studio agent workflows, you can easily automate when and how evaluations run – and use the results as quality gates.
Here are the capabilities included in the Maker Evaluation API:
| Capability | What it does |
| --- | --- |
| List test sets | Retrieve the test sets configured for your agent |
| Run a test set | Trigger a test set to execute against your agent |
| Poll run status | Poll a running evaluation to see when it completes |
| Retrieve results | Retrieve detailed results including per-test-case scores |
| List historical runs | List all previous evaluation runs for reporting or comparison |
These APIs work with any HTTP client, Python scripts, Azure DevOps pipelines, GitHub Actions, or custom tooling. For teams working in the Power Platform ecosystem, the same actions are available through the Microsoft Copilot Studio certified connector, which integrates directly with Power Automate flows.
When to use Evaluation APIs
The Evaluation APIs exist so you can run evaluations without manually triggering them, letting evaluation happen automatically as part of your pipelines, your flows, or your own tools. By default, runs evaluate the agent’s unpublished (draft) version, which makes this especially useful for CI/CD and pre-publish validation. The Copilot Studio UI is still the right place for one-off, interactive evaluation. Reach for the APIs when you want evaluation to happen on its own.
Here are three common scenarios.
1. Add evaluation to your CI/CD pipeline
When your agent source lives in a repository, every pull request and every merge to main is an opportunity to validate quality before changes reach production. Wire the Evaluation APIs into Azure DevOps, GitHub Actions, or any CI runner: each pipeline run triggers an evaluation, waits for the result, and passes or fails the build based on the score. Quality regressions are caught at PR time, not in production.
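For example, a pipeline step can gate the build on the evaluation score. The sketch below is illustrative only: `run_evaluation` stands in for the trigger-and-poll logic shown later in this post, and the threshold and `overallScore` field name are assumptions you would replace with your own.

```python
import sys

# Hypothetical helper that triggers a run, polls until it finishes,
# and returns the final results payload (see the full sketch later
# in this post for one way to implement it).
from evaluation_client import run_evaluation

THRESHOLD = 0.9  # example quality bar; tune for your agent

results = run_evaluation(test_set_id="<yourTestSetId>")  # placeholder ID
score = results["overallScore"]                          # assumed field name

print(f"Evaluation score: {score:.2f} (threshold {THRESHOLD})")
if score < THRESHOLD:
    sys.exit(1)  # a non-zero exit fails the pipeline step
```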
2. Trigger evaluation from a Power Automate flow
Many events that may affect agent quality happen outside Copilot Studio: a knowledge source is updated in SharePoint, a new article is added to a file library, a Dataverse record changes agent behavior. Use Power Automate (with the Microsoft Copilot Studio certified connector) to listen for these events and kick off an evaluation test run automatically, then route the results to Teams, email, or whichever channel your team watches.
3. Embed evaluation in your own tools
Sometimes you want evaluation as part of a tool you’re already building: a Center of Excellence dashboard tracking quality across many agents, an admin script that confirms every new agent has been evaluated before publish, or a custom integration that adds evaluation to an existing approval workflow. The APIs let you call evaluation programmatically from any system, with whatever logic fits your scenario.
How an evaluation run works through the API
The evaluation flow follows a simple pattern: Trigger → Poll → Get Results.
- Trigger: Send a POST request to start an evaluation run for a specific test set
- Poll: Check the run status until it completes (the execution is asynchronous)
- Get results: Retrieve the score and detailed per-test-case outcomes
Optionally, you can pass an MCS Connection ID when triggering a run. This allows the evaluation to run using an authenticated user context, enabling access to tools and knowledge sources that require authentication. Without it, the evaluation will run anonymously.
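Putting the pattern together, here is a minimal Python sketch. It assumes a bearer token is already available in an environment variable, and the routes, status values, and field names are placeholders; substitute the documented endpoints for your environment and agent from the reference below.

```python
import os
import time
import requests

TOKEN = os.environ["POWER_PLATFORM_TOKEN"]  # acquired via Microsoft Entra ID
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Placeholder routes: substitute the real Evaluation API endpoints for
# your environment, agent, and test set.
TRIGGER_URL = "https://api.powerplatform.com/<...>/testsets/<testSetId>/runs"
RUN_URL = "https://api.powerplatform.com/<...>/runs/{run_id}"

# 1. Trigger: POST starts an evaluation run for a specific test set.
resp = requests.post(
    TRIGGER_URL,
    headers=HEADERS,
    json={"mcsConnectionId": "", "runOnPublishedBot": False},
)
resp.raise_for_status()
run_id = resp.json()["id"]  # assumed field name

# 2. Poll: execution is asynchronous, so check status until it completes.
while True:
    status = requests.get(RUN_URL.format(run_id=run_id), headers=HEADERS).json()
    if status["status"] in ("Completed", "Failed"):  # assumed status values
        break
    time.sleep(30)

# 3. Get results: the completed payload carries the score and
#    per-test-case outcomes.
print(status.get("overallScore"))  # assumed field name
```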
Working with the Evaluation APIs: the key endpoints
Below are the core Evaluation API endpoints available today, starting with how to retrieve test sets and trigger evaluation runs programmatically.
Prerequisites
To call the Evaluation APIs, your Microsoft Entra app registration needs delegated permissions to the Power Platform API:
- Go to https://portal.azure.com
- Go to App registrations and open your app
- Click API permissions
- Click Add a permission
- Click APIs my organization uses
- Search for "Power Platform API" and select it
- Click Delegated permissions
- Expand CopilotStudio
- Select MakerOperations.Read and MakerOperations.ReadWrite
- Click Add permissions
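Once the permissions are granted (and admin-consented where required), you can acquire a delegated token from Python with MSAL. This is a minimal sketch: the scope assumes the Power Platform API resource is https://api.powerplatform.com, and the client and tenant IDs are your own app registration's values.

```python
import msal

# Your app registration's values (placeholders)
CLIENT_ID = "<yourAppClientId>"
TENANT_ID = "<yourTenantId>"

app = msal.PublicClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
)

# Interactive sign-in yields a delegated (user-context) token. The scope is
# an assumption: confirm the Power Platform API resource URI for your tenant.
result = app.acquire_token_interactive(
    scopes=["https://api.powerplatform.com/.default"]
)
token = result["access_token"]  # send as the Authorization: Bearer header
```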
Endpoint 1: Retrieve available test sets
Use this endpoint to list all evaluation test sets defined for a specific agent.
Request:
Expected result:
Returns the list of maker evaluation test sets associated with the agent.
Sample response:
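As an illustration of calling this endpoint from Python (the route below is a placeholder, not the documented path), listing test sets is a single authenticated GET:

```python
import os
import requests

token = os.environ["POWER_PLATFORM_TOKEN"]
# Placeholder URL: substitute the documented "list test sets" route
# for your environment and agent.
url = "https://api.powerplatform.com/<...>/agents/<agentId>/evaluationtestsets"

resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for ts in resp.json().get("value", []):  # assumed response envelope
    print(ts.get("id"), ts.get("name"))  # assumed field names
```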
Endpoint 2: Retrieve a specific test set
Once you have a test set ID, you can fetch its full definition.
Request
Expected result
Returns the full configuration and structure of the selected test set.
Sample response:
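In Python this is the same GET pattern with the test set ID appended; again, the route and field names below are assumptions for illustration:

```python
import os
import requests

token = os.environ["POWER_PLATFORM_TOKEN"]
# Placeholder URL: the list route above with the test set ID appended.
url = "https://api.powerplatform.com/<...>/evaluationtestsets/<testSetId>"

test_set = requests.get(url, headers={"Authorization": f"Bearer {token}"}).json()
print(test_set.get("name"))       # assumed field name
print(test_set.get("testCases"))  # assumed field name for the case list
```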
Endpoint 3: Trigger an evaluation run
This endpoint allows you to programmatically start an evaluation run for a given test set.
The Body consists of a JSON object with the following attributes:
McsConnectionId - string value. If an empty string is provided, the evaluation runs anonymously, meaning tools and knowledge sources are not used. Agents that rely on authenticated connectors, actions, or auth‑gated knowledge sources will therefore produce different (likely worse) evaluation results.
RunOnPublishedBot - optional boolean value, defaults to false, which runs the evaluation against the draft version; set it to true to run against the published version.
EvaluationRunName - optional string value, useful for naming runs in dashboards.
Request
Body
{
  "runOnPublishedBot": {boolean value},
  "mcsConnectionId": "{yourMCSConnectionId}",
  "evaluationRunName": "{yourEvaluationRunName}"
}
Sample request:
Sample response:
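A hedged Python sketch of triggering a run, using a placeholder route and an example run name:

```python
import os
import requests

token = os.environ["POWER_PLATFORM_TOKEN"]
# Placeholder URL: substitute the documented "trigger run" route.
url = "https://api.powerplatform.com/<...>/evaluationtestsets/<testSetId>/runs"

body = {
    "runOnPublishedBot": False,                   # evaluate the draft version
    "mcsConnectionId": "<yourMCSConnectionId>",   # empty string = anonymous run
    "evaluationRunName": "nightly-regression",    # example name
}

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
resp.raise_for_status()
print(resp.json().get("id"))  # assumed field: the run ID to poll
```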
How to obtain mcsConnectionId
- Go to: https://make.powerautomate.com
- Open Connections from the side menu
- Select the relevant Microsoft Copilot Studio connection
- Copy the connection ID from the URL
This connection ID will look something like:
Note: One run at a time
The API returns HTTP 422 if you try to start a run while another is already in progress for the same agent.
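If your automation can race with other runs, it is worth handling the 422 explicitly, for example by waiting and retrying. A sketch under the same placeholder assumptions as the examples above:

```python
import time
import requests

def trigger_with_retry(url, headers, body, attempts=5, wait_s=60):
    """Start a run, retrying while another run is in progress (HTTP 422)."""
    for _ in range(attempts):
        resp = requests.post(url, headers=headers, json=body)
        if resp.status_code != 422:
            resp.raise_for_status()
            return resp.json()
        time.sleep(wait_s)  # another run is active; wait and try again
    raise RuntimeError("Evaluation still blocked by an in-progress run")
```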
Endpoint 4: Get evaluation run status and results
After triggering a run, use the returned run ID to retrieve status and results.
Request
Expected result
Returns the run status and, once the run completes, the evaluation results.
Sample response:
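For illustration, reading the status and per-test-case outcomes might look like the following; the route and all field names are assumptions to adapt to the actual response schema:

```python
import os
import requests

token = os.environ["POWER_PLATFORM_TOKEN"]
# Placeholder URL: substitute the documented run-status route.
url = "https://api.powerplatform.com/<...>/runs/<runId>"

run = requests.get(url, headers={"Authorization": f"Bearer {token}"}).json()
print(run.get("status"), run.get("overallScore"))  # assumed field names
for case in run.get("testCaseResults", []):        # assumed field name
    print(case.get("name"), case.get("score"))
```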
Endpoint 5: List previous evaluation runs
This endpoint is useful for tracking trends, building dashboards, and supporting automated decision logic.
Request
Expected result
Returns an array of previous evaluation runs, each with the same schema as the run details API.
Sample response:
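A simple trend report over past runs could look like this sketch; the route, response envelope, and field names are assumptions:

```python
import os
import requests

token = os.environ["POWER_PLATFORM_TOKEN"]
# Placeholder URL: substitute the documented "list runs" route.
url = "https://api.powerplatform.com/<...>/agents/<agentId>/evaluationruns"

runs = requests.get(url, headers={"Authorization": f"Bearer {token}"}).json()
# Print a score history for dashboards or trend checks.
for run in runs.get("value", []):  # assumed response envelope
    print(run.get("createdOn"), run.get("evaluationRunName"), run.get("overallScore"))
```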
Start using the Evaluation APIs today
Pick a test set, call the API, and see what your agent scores. That first run gives you a baseline. From there, you can automate evaluations into your workflow, set thresholds, and build the checks that make sense for your team. The APIs are available now. Start simple, and build from there.
Sign into Copilot Studio to get started today.