In the previous article, we explored why evaluations are crucial and how they can help you choose the right model for your specific industry, domain, or app-level data. We also introduced the "bulk-run" feature in AI Toolkit for Visual Studio Code, which allows you to automate parts of the human evaluation process.
In this article, we'll take things a step further by using a more capable model to evaluate the responses of a less capable one. For example, you might compare older versions of a model against a newer, more powerful version, or evaluate a fine-tuned small language model (SLM) using a larger model like GPT-4o.
You can access this functionality through the "Evaluations" option in the Tools menu of the AI Toolkit for Visual Studio Code extension (see below). But before we start using it, let's take a moment to understand the different types of evaluation methods available for assessing responses from large language models.
Evaluators
When testing AI models, it's not enough to just look at outputs manually. Evaluators help us systematically measure how well a model is performing across dimensions such as relevance, coherence, and fluency, as well as more specific metrics such as grammatical quality and similarity to ground truth. Below is a brief overview of the key evaluators commonly used:
- Coherence - Evaluates how naturally and logically a model's response flows. It checks whether the answer makes sense in context and follows a consistent train of thought. Required columns: query, response
- Fluency - Assesses grammatical correctness and readability. A fluent response reads smoothly, like something a human would write. Required columns: response
- Relevance - Checks how well the response answers the original question or prompt. It's all about staying on topic and being helpful. Required columns: query, response
- Similarity - Measures how similar the model's response is to a reference (ground truth), taking both the question and answer into account. Required columns: query, response, ground_truth
- BLEU (Bilingual Evaluation Understudy) - A popular metric that compares how closely the model's output matches reference texts using n-gram overlaps. Required columns: response, ground_truth
- F1 Score - Calculates the overlap of words between the model's output and the correct answer, balancing precision and recall (see the sketch after this list). Required columns: response, ground_truth
- GLEU (Google-BLEU) - Similar to BLEU but optimized for sentence-level evaluation. It uses n-gram overlap to assess how well the output matches the reference. Required columns: response, ground_truth
- METEOR - Goes beyond simple word overlap by aligning synonyms and related phrases, while also weighing precision, recall, and word order. Required columns: response, ground_truth
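To make the word-overlap idea behind the F1 Score concrete, here is a minimal, hypothetical sketch of how such a metric can be computed. This is not the AI Toolkit's own implementation, which may tokenize and normalize text differently:

```python
# Hypothetical illustration of a token-level F1 score (word-overlap metric).
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Compute F1 over word overlap between a response and a reference answer."""
    pred_tokens = response.lower().split()
    ref_tokens = ground_truth.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Count how many tokens the two texts share (respecting multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted words that are correct
    recall = overlap / len(ref_tokens)      # share of reference words that were produced
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))  # 1.0
```

BLEU, GLEU, and METEOR follow the same general idea but work on n-grams (and, for METEOR, synonyms and word order) rather than single tokens.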
Using Evaluations
Now that we have an overview of the evaluators, let's use a sample dataset to run an evaluation. Open Visual Studio Code and select the AI Toolkit extension. In the AI Toolkit extension, click on the Tools menu > Evaluations, and you should see a window like the one below:
You can either create a new evaluation or a new evaluation run (see the blue button at the top right of the screen). If you create a new evaluation, you can choose one or more of the evaluators described above. You can use the sample dataset or your own dataset. Just be aware that if you run a large dataset of your own, you might hit the rate limit for GitHub Models if you choose those for evaluating the output. You can create your own dataset in the JSONLines format we discussed in the earlier part of this blog post, as shown in the example below.
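As a reminder, each line of a JSONLines dataset is a standalone JSON object whose keys match the required columns of the evaluators you select. A minimal example with made-up content, covering the query, response, and ground_truth columns, might look like this:

```json
{"query": "What is the capital of France?", "response": "The capital of France is Paris.", "ground_truth": "Paris is the capital of France."}
{"query": "Who wrote Pride and Prejudice?", "response": "Jane Austen wrote Pride and Prejudice.", "ground_truth": "Pride and Prejudice was written by Jane Austen."}
```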
In addition to using your own dataset to evaluate the model, you can also use your own Python evaluators. Click on the Evaluators tab and you should see the following screen.
Using the Create Evaluation button (highlighted in blue in the top right-hand corner of the pane), you can create and add your own evaluator; the fields are self-explanatory. A rough idea of what such an evaluator might look like is sketched below.
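The exact interface the toolkit expects is defined by the fields you fill in when creating the evaluator, so treat the following as a hypothetical sketch rather than the official template: a plain Python function that receives a response and a ground truth and returns a numeric score.

```python
# Hypothetical custom evaluator sketch -- the actual interface AI Toolkit expects
# may differ; adapt the function name and parameters to the fields in the
# Create Evaluation form.
import re

def keyword_coverage(response: str, ground_truth: str) -> float:
    """Score from 0 to 1: fraction of ground-truth words that appear in the response."""
    keywords = set(re.findall(r"\w+", ground_truth.lower()))
    if not keywords:
        return 0.0
    found = {word for word in re.findall(r"\w+", response.lower()) if word in keywords}
    return len(found) / len(keywords)
```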
Evaluation run
Let's now run the evaluation; you should see something like the output below. You can see the line-by-line input (from the JSONLines dataset you used) and the output against each of the evaluators you selected. You can also follow the details of the run in the output pane below as the evaluations execute: each evaluation starts once per evaluator and runs through every line in your dataset. You might occasionally see errors due to rate limiting, which the AI Toolkit executor can retry automatically.
You can see the scores for each of the evaluators by scrolling horizontally. Where necessary, you can also back up these scores with human evaluations, especially in fields where domain expertise is important and the risk of harm from errors is higher.
Evaluations play a key role in understanding and selecting models and in improving their performance across tasks and domains. By using a mix of automated and human-in-the-loop evaluators, you can get a clearer picture of your model's strengths and weaknesses. Start small, measure often, and let the data guide your AI application iterations.
Further reading
- Selecting and upgrading models – Part 1: Using Evaluations
- Evaluating generative AI applications - https://aka.ms/evaluate-genAI
- AI Toolkit Samples
- Generative AI for Beginners guide - https://microsoft.github.io/generative-ai-for-beginners
- AI toolkit for VSCode Marketplace - https://aka.ms/AIToolkit
- Docs for AI toolkit - https://aka.ms/AIToolkit/doc
- AI Spark series - https://developer.microsoft.com/en-us/reactor/events/25040/