Blog Post

Azure AI Foundry Blog
5 MIN READ

Introducing Evaluation API on Azure OpenAI Service

GinaLee
Microsoft
Apr 24, 2025

We are excited to announce the new Evaluations (Evals) API in Azure OpenAI Service! The Evaluation API lets you test and improve model outputs directly through API calls, giving developers a simple, customizable way to programmatically assess model quality and performance within their development workflows.

What is the Azure OpenAI Evaluation API?

Evaluation has often been a manual, time-consuming process, especially when trying to scale applications across diverse domains. With the Evaluation API, Azure OpenAI Service provides a standardized and systematic approach to:

  • Assess model performance on a wide range of test criteria with custom test cases
  • Measure model quality improvements across multiple prompt iterations
  • Automate quality assurance by making evaluation part of CI/CD pipelines

The new Azure OpenAI Evaluations API allows developers to programmatically define evaluation test criteria and create evaluation runs directly through API calls.

Why Evaluation?

Evaluation is the process of validating the outputs from your large language models, and it is an essential step in measuring their performance and inference quality. Through evaluation, you can test models against expected input/output pairs and assess performance across key metrics such as accuracy, reliability, and resilience.

This is especially important for fine-tuned models, where assessing the performance gains (or losses) from training shows how well the base model was fine-tuned for specific tasks or verticals.

How does this fit into the Model Distillation experience?

Model distillation uses LLM outputs to fine-tune smaller, more efficient models. This technique allows the smaller models to perform comparably to the larger LLMs, particularly on the tasks they were fine-tuned for, while significantly cutting down on both cost and latency.

Azure OpenAI Service distillation involves three main components:

  • Stored Completions: Capture and store input-output pairs from models like GPT-4o to generate datasets for evaluation and fine-tuning. The feature offers an interface for reviewing, filtering, and exporting data based on predefined criteria.
  • Evaluation: Create and run custom evaluations to measure model performance using data from Stored Completions or existing datasets.
  • Fine-tuning: Use datasets from Stored Completions in fine-tuning jobs and run evaluations on fine-tuned models.

Now, with the launch of the Evaluation API, you can combine it with the Stored Completions API and the Fine-tuning API to create a code-first distillation experience: retrieve training data from Stored Completions, fine-tune your model, evaluate it for quality, and iterate as needed.

How to use the Evaluation API

The Evaluation API in Azure OpenAI Service has two primary concepts: `Eval` and `Eval Run`.

  • Eval: Contains information about evaluation configuration, such as data source, schema definition, testing criteria, metadata
  • Eval Run: Contains information about evaluation execution, such as the reference to the performed Eval, specific data samples, model responses

To get started with the Evaluation API, you will first create an `Eval` task by specifying a `data_source_config` and `testing_criteria`.

  • data_source_config: Describes your test data, such as the JSON schema each test record will conform to
  • testing_criteria: Describes what type of test the evaluation will perform, such as “String check” or “Text similarity”, along with template references to ground-truth values in your test data (for example {{ item.correct_label }}) and to model-generated outputs (for example {{ sample.output_text }})
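
To make this concrete, here is a rough sketch of creating such an `Eval` task with the openai Python SDK against an Azure OpenAI resource. The exact client surface (assumed here to be `client.evals.create`), the API version, and the endpoint and key handling are placeholders; adapt them to your environment and the current API reference.

# Sketch: create an Eval with a custom data-source schema and a string-check grader.
# The api_version, endpoint, and the client.evals surface are assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="<api-version-with-evals-support>",
)

categorization_eval = client.evals.create(
    name="Categorization text eval",
    data_source_config={
        "type": "custom",
        # JSON schema that each row of the test data will conform to
        "item_schema": {
            "type": "object",
            "properties": {
                "input_text": {"type": "string"},
                "correct_label": {"type": "string"},
            },
            "required": ["input_text", "correct_label"],
        },
        # Lets testing criteria reference the model output via {{ sample.output_text }}
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "string_check",
            "name": "Match output to human label",
            "input": "{{ sample.output_text }}",
            "reference": "{{ item.correct_label }}",
            "operation": "eq",
        }
    ],
)
print(categorization_eval.id)

The service responds with an `Eval` object similar to the following: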
{
  "object": "eval",
  "id": "eval_67e321d23b54819096e6bfe140161184",
  "data_source_config": {
    "type": "custom",
    "schema": { ... }
  },
  "testing_criteria": [
    {
      "name": "Match output to human label",
      "id": "Match output to human label-c4fdf789-2fa5-407f-8a41-a6f4f9afd482",
      "type": "string_check",
      "input": "{{ sample.output_text }}",
      "reference": "{{ item.correct_label }}",
      "operation": "eq"
    }
  ],
  "name": "xxxxx",
  "created_at": 1742938578,
  "metadata": {}
}

Once you have created an `Eval` task, you will create an `Eval Run` task, which references the `Eval` configuration by its task ID. You will also need to prepare and reference your test data, which should contain both test inputs and ground-truth labels to compare model outputs against.

There are several ways to provide test data for eval runs, but it may be convenient to upload a JSONL file that contains data in the schema we specified when we created our `Eval` task. 
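
As a rough sketch, each JSONL line can wrap one test record in an `item` object matching the schema above, and you can then upload the file and start a run. The snippet below continues with the `client` and `categorization_eval` from the earlier sketch; the file name, developer prompt, and the assumed `client.files.create` / `client.evals.runs.create` surface are illustrative placeholders.

# Sketch: upload JSONL test data and start an Eval Run against the Eval created above.
# One way to shape each JSONL line, given the {{ item.* }} template references:
# {"item": {"input_text": "My monitor will not turn on.", "correct_label": "Hardware"}}

test_file = client.files.create(
    file=open("categorization_tests.jsonl", "rb"),
    purpose="evals",  # assumption: evals-specific file purpose
)

eval_run = client.evals.runs.create(
    categorization_eval.id,
    name="Categorization text run",
    data_source={
        "type": "completions",
        "source": {"type": "file_id", "id": test_file.id},
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "developer",
                    "content": {"type": "input_text", "text": "You are an expert in categorizing IT support tickets."},
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {"type": "input_text", "text": "{{item.input_text}}"},
                },
            ],
        },
        "model": "gpt-4.1",  # the model whose responses will be generated and graded
    },
)

The create call returns a queued `eval.run` object similar to the one below: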

{
    "object": "eval.run",
    "id": "evalrun_xxxx",
    "eval_id": "eval_xxxx",
    "report_url": "https://ai.azure.com/resource/evaluation/xxxxx",
    "status": "queued",
    "model": "gpt-4.1",
    "name": "Categorization text run",
    "created_at": 1743015028,
    "result_counts": { ... },
    "per_model_usage": null,
    "per_testing_criteria_results": null,
    "data_source": {
        "type": "completions",
        "source": {
            "type": "file_id",
            "id": "filexxxx"
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "developer",
                    "content": {
                        "type": "input_text",
                        "text": "You are an expert in...."
                    }
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "{{item.input_text}}"
                    }
                }
            ]
        },
        "model": "gpt-4.1",
        "sampling_params": null
    },
    "error": null,
    "metadata": {}
}

Once you have created both the `Eval` task and the `Eval Run` task, the eval run is queued and executes asynchronously as it processes every row in your data set.
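
A simple way to wait for completion is to poll the run's status, for example as in the sketch below (reusing the objects from the earlier sketches; the `client.evals.runs.retrieve` signature is an assumption based on the SDK's usual parent-ID keyword pattern).

import time

# Sketch: poll the run until it leaves the queued/in-progress states.
run = client.evals.runs.retrieve(eval_run.id, eval_id=categorization_eval.id)
while run.status in ("queued", "in_progress"):
    time.sleep(10)
    run = client.evals.runs.retrieve(eval_run.id, eval_id=categorization_eval.id)

print(run.status)         # e.g. "completed" or "failed"
print(run.result_counts)  # aggregate pass/fail totals
print(run.report_url)     # link to the run report in the Azure AI Foundry portal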

After the evaluation run is completed, you can fetch the complete evaluation results using your `Eval` task and `Eval Run` task IDs. The response will look something like this:

{
  "object": "eval.run",
  "id": "evalrun_xxxx",
  "eval_id": "eval_xxxx",
  "report_url": "https://ai.azure.com/resource/evaluation/xxx",
  "status": "completed",
  "model": "gpt-4.1",
  "name": "Categorization text run",
  "created_at": 1743015028,
  "result_counts": {
    "total": 3,
    "errored": 0,
    "failed": 0,
    "passed": 3
  },
  "per_model_usage": [
    {
      "model_name": "gpt-4o-2024-08-06",
      "invocation_count": 3,
      "prompt_tokens": 166,
      "completion_tokens": 6,
      "total_tokens": 172,
      "cached_tokens": 0
    }
  ],
  "per_testing_criteria_results": [
    {
      "testing_criteria": "Match output to human label-40d67441-5000-4754-ab8c-181c125803ce",
      "passed": 3,
      "failed": 0
    }
  ],
  "data_source": {
    "type": "completions",
    "source": {
      "type": "file_id",
      "id": "file-J7MoX9ToHXp2TutMEeYnwj"
    },
    "input_messages": {
      "type": "template",
      "template": [
        {
          "type": "message",
          "role": "developer",
          "content": {
            "type": "input_text",
            "text": "xxxxxx"
          }
        },
        {
          "type": "message",
          "role": "user",
          "content": {
            "type": "input_text",
            "text": "{{item.input_text}}"
          }
        }
      ]
    },
    "model": "gpt-4.1",
    "sampling_params": null
  },
  "error": null,
  "metadata": {}
}
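
Beyond the aggregate counts and the per-criteria summary, you may also want the per-row results, for example to collect the rows that failed before your next prompt or fine-tuning iteration. Assuming the SDK exposes the run's output items (sketched here as `client.evals.runs.output_items.list`; the field names may differ from the current API reference):

# Sketch: list per-row results for the completed run.
# The output_items surface and the item fields used below are assumptions.
output_items = client.evals.runs.output_items.list(
    eval_run.id,
    eval_id=categorization_eval.id,
)

for item in output_items:
    # Each output item corresponds to one row of the test data and records
    # whether that row passed the testing criteria.
    print(item.id, item.status)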

Resources

The Azure OpenAI Service Evaluation API gives developers greater flexibility, efficiency, and control over the model evaluation process, strengthening model quality validation and performance. We are excited to see how our customers will leverage this API to create high-quality models and applications.

Below are some additional resources to reference for the overall evaluation experience:

Updated Apr 25, 2025
Version 2.0