The AI landscape is evolving at breakneck speed, with new models released constantly. How can organizations keep up — and more importantly, how can they use the latest advancements in their existing AI applications? This is where evaluations come in. In this two-part series, we’ll explore why evaluations matter — both when building new AI applications and when replacing older models to use newer features and improvements.
AI models are powerful tools, but without proper evaluation, we can’t know how well they work, where they fail, or whether they’re safe and reliable. Evaluations are systematic processes for assessing the performance, reliability, and efficacy of generative AI applications. They help us measure a model’s accuracy, fairness, robustness, relevance, and efficiency. It’s the difference between building something that simply "works in theory" and something that can be trusted in the real world.
Evaluations also need to be customized for your AI application and domain. The people developing the application typically understand the domain and the characteristics of the problem far better than any general-purpose model does, so the model has to be evaluated against those domain-specific requirements. While benchmarks are useful, they are not the final word on the quality of a model’s responses.
A helpful analogy is to compare model benchmarks and evaluations to a human’s academic degree and job interviews. Benchmarks are like degrees — they certify general competence through standardized tests and scores. But job interviews are still necessary because real-world work is more specialized and nuanced. Similarly, model evaluations go beyond benchmarks to test how well a model performs in specific domains and under practical conditions.
Understanding the need for rigorous evaluations
AI model evaluations serve multiple crucial purposes:
- Performance Verification: Evaluations provide a systematic way to measure an AI model's capabilities and limitations. They go beyond surface-level metrics, delving deep into how accurately and consistently a model performs across various scenarios. For instance, a language model might be tested on its ability to understand context, generate coherent responses, and handle nuanced communication challenges.
- Bias Detection and Mitigation: One of the most critical aspects of AI evaluations is identifying potential biases. AI models can inadvertently perpetuate or amplify societal biases present in their training data. Comprehensive evaluations help researchers detect these biases, allowing for targeted interventions and more equitable AI systems (a minimal per-group accuracy sketch appears below).
- Safety and Reliability Assessment: As AI systems become more complex and are deployed in sensitive domains like healthcare, finance, and autonomous systems, evaluations become paramount. They help identify potential risks, unexpected behaviors, and edge cases that could compromise system reliability or user safety.
Evaluations are especially significant in domains where the risk of harm to humans is high, and in highly regulated sectors such as healthcare, insurance, and finance. A language model might generate impressive text, yet evaluation can reveal issues such as bias, fabricated information, or poor performance in less common languages or in particular domains with their own jargon and nuances. Similarly, a computer vision model might perform well on clean images but fail when images are blurry or taken in different lighting conditions.
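To make the bias-detection point more concrete, the sketch below computes per-group accuracy over a labelled evaluation set; a large gap between groups is a signal worth investigating. The field names (group, label, prediction) and the toy records are illustrative assumptions, not a prescribed schema.

```python
from collections import defaultdict

def per_group_accuracy(records):
    """Compute accuracy for each group in an evaluation set.

    Each record is assumed to carry (hypothetical field names):
      - "group": the demographic or domain slice the example belongs to
      - "label": the expected answer
      - "prediction": the model's answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["group"]] += 1
        if rec["prediction"] == rec["label"]:
            correct[rec["group"]] += 1
    return {group: correct[group] / total[group] for group in total}

# Toy usage: a noticeable accuracy gap between slices flags potential bias.
records = [
    {"group": "age_18_30", "label": "approve", "prediction": "approve"},
    {"group": "age_18_30", "label": "deny", "prediction": "deny"},
    {"group": "age_60_plus", "label": "approve", "prediction": "deny"},
    {"group": "age_60_plus", "label": "deny", "prediction": "deny"},
]
print(per_group_accuracy(records))  # {'age_18_30': 1.0, 'age_60_plus': 0.5}
```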
Common types of evaluations include:
- Benchmark tests (e.g., GPT models on MMLU or BIG-bench) - As explained earlier, these tests provide a broad assessment of the model’s overall performance across the various domains represented in the benchmark dataset.
- Robustness checks (e.g., adversarial attacks on image classifiers). Robustness checks ensure that an AI model can handle unexpected, noisy, or slightly altered inputs without failing or producing incorrect outputs. In other words, they test how stable and reliable a model is beyond its ideal training conditions. Your evaluation datasets should typically include edge cases, outliers, and prompts or contexts that produced problematic output in earlier models.
- Human evaluations (e.g., rating AI-generated content for coherence and helpfulness). Domain experts understand nuances and scenarios that automated metrics miss, which makes human evaluations especially important in high-risk sectors.
- Real-world testing (e.g., monitoring self-driving car models in diverse traffic conditions). Your evaluation dataset should reflect real-world usage. To achieve this, you can set up a pipeline that channels data from real-world interactions into your evaluation datasets, after removing PII and other sensitive information (see the sketch after this list).
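Tying the robustness and real-world-testing points together, here is a minimal sketch of one way to assemble such a dataset: production conversations are scrubbed of obvious PII and combined with hand-curated edge cases into a JSON Lines file. The regular expressions, file name, and field name are illustrative assumptions; this is not a complete PII solution, and real pipelines should use a dedicated PII-detection step with human review.

```python
import json
import re

# Very rough PII patterns for illustration only.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),        # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US-style phone numbers
]

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with a placeholder token."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def build_eval_dataset(real_world_texts, edge_case_texts, out_path="eval_dataset.jsonl"):
    """Write scrubbed real-world samples plus curated edge cases as JSON Lines."""
    with open(out_path, "w", encoding="utf-8") as f:
        for text in real_world_texts:
            f.write(json.dumps({"conversation": scrub_pii(text)}) + "\n")
        for text in edge_case_texts:
            f.write(json.dumps({"conversation": text}) + "\n")

build_eval_dataset(
    real_world_texts=["Call me at 555-123-4567 about the refund."],
    edge_case_texts=["", "A single ambiguous word: bank"],
)
```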
Bulk run features in the AI Toolkit
While you can use the playground to test model output for specific combinations of system message, parameters, and query, doing so by hand quickly becomes tedious. Bulk run helps you automate this process. You first create a dataset in the JSON Lines format, which places one well-formed JSON object on each line, as shown below.
{"conversation":"Person A: Hey, are we still on for the trip this weekend?\r\nPerson B: Absolutely! I can't wait. Have you thought about where we should go?\r\nPerson A: I was thinking about that lake we visited last summer. It was really nice there.\r\nPerson B: That sounds great! Should we take my car or yours?\r\nPerson A: Let’s take mine, it has better mileage. Plus, I'll bring some snacks for the drive.\r\nPerson B: Perfect choice! Do you think we need to book a campsite in advance?\r\nPerson A: Good idea. I'll check online to see if there are any available spots.\r\nPerson B: Awesome! Let’s meet tomorrow to finalize everything before we head out."} {"conversation":"Person A: Hi there! Did you see the latest episode of that series we both love?\r\nPerson B: Hey! Yes, I just finished watching it last night. Can you believe the twist at the end?\r\nPerson A: I know, right? I didn't see that coming at all! What did you think about the character's decision?\r\nPerson B: Honestly, I think it was a poor choice. It really caught me off guard.\r\nPerson A: Same here! I thought they had learned from their mistakes. Should we discuss it more over coffee?\r\nPerson B: Sounds like a plan! Let’s meet up this weekend then. I’ll bring the latest gossip too!"} {"conversation":"Friend A: Hey! Are we still going to see that new action movie tonight? \nFriend B: Yes, definitely! I can't wait to see it. What time do you want to meet? \nFriend A: How about 7 PM? We can grab some dinner first. \nFriend B: Sounds great! Do you have a place in mind? \nFriend A: Yeah, there's that new pizza place that just opened up near the theater. \nFriend B: Perfect! I’ve heard good things about it. Do you want me to buy the tickets online? \nFriend A: Yes, that would be awesome! Let me know how much it is so I can send you my share. \nFriend B: Will do! I’ll see you at 6:30 then? \nFriend A: See you then!"}
In the screenshot above, the highlighted feature in the left pane activates the bulk run. In the top-right corner, you can see the menu where you can import or export datasets. Additionally, you have the option to use a model to generate or regenerate dataset entries based on your specifications and even append these new entries to an existing dataset.
In the screenshot below, I’ve opened the model pane on the right, where I can load a specific model running either locally or remotely via GitHub or Ollama. I can also adjust parameters like temperature, top-p, and response length to control the model’s behavior. The bulk run feature is particularly useful for generating evaluation data — it allows you to run the same dataset through different models and compare their outputs. By switching models in the pane, you can easily observe how responses vary (a sketch of comparing two exported runs appears after the next paragraph). This comparison helps you refine your evaluation dataset and lets human reviewers assess outputs against known ideal answers, taking domain expertise and subtle differences in response quality into account. It’s especially valuable when fine-tuning or selecting a model for a specific application.
Additionally, you have the option to execute each row individually and review the model's responses for evaluation purposes. Alternatively, you can perform a batch evaluation by running all rows simultaneously.
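If you export the results of two bulk runs over the same dataset, the comparison step can also happen outside the toolkit. The sketch below assumes two hypothetical JSON Lines exports that pair each conversation with the model’s response, and prints them side by side for a human reviewer; the file names and the response field are assumptions about the export, not a documented AI Toolkit format.

```python
import json

def load_responses(path: str) -> list[dict]:
    """Load one model's bulk-run export (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def side_by_side(path_a: str, path_b: str) -> None:
    """Print each prompt with both models' responses for manual review."""
    runs_a = load_responses(path_a)
    runs_b = load_responses(path_b)
    for row_a, row_b in zip(runs_a, runs_b):
        print("PROMPT:  ", row_a["conversation"][:80], "...")
        print("MODEL A: ", row_a["response"])
        print("MODEL B: ", row_b["response"])
        print("-" * 60)

# Hypothetical export file names from two bulk runs over the same dataset.
side_by_side("bulk_run_phi3.jsonl", "bulk_run_llama3.jsonl")
```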
Bulk run is an excellent way to automate some of the tedious parts of human evaluation when using the same data or prompt with multiple models. Now that we have seen how evaluations can be run with the bulk run feature, let’s touch on some real-world examples where evaluations are especially useful.
Real-World Examples of AI Evaluations
The importance of evaluations becomes clear through concrete examples:
- Medical Diagnosis AI: A medical AI model designed to detect disease must be rigorously evaluated across diverse patient populations, ensuring accuracy, fairness, and reliability across different demographics, age groups, and medical conditions. Doctors or qualified medical personnel are often part of the evaluation loop here.
- Language Translation Systems: Evaluations for translation models go beyond simple word-for-word accuracy. They assess contextual understanding, cultural nuances, idiomatic expressions, and the ability to maintain the original message's intent and tone. Humans are better at catching cultural nuances and pop-culture references, which can be highly context-specific.
- Document Classification and Contract Analysis Models: These are evaluated for accuracy on historical annotated contracts, but also tested on edge cases such as unusual contract structures or shifts in legal language. Human legal experts are often part of the evaluation loop.
Evaluations are multi-dimensional and can help tune the model, the architecture, the guardrails and the AI application to achieve desired results in a particular domain. In the next article, we will explore the evaluation features in greater depth, where a more powerful model will assist in rating another model’s responses to further automate some scenarios alongside human evaluation.
Further reading
- Evaluating generative AI applications - https://aka.ms/evaluate-genAI
- Generative AI for Beginners guide - https://microsoft.github.io/generative-ai-for-beginners
- AI toolkit for VSCode Marketplace - https://aka.ms/AIToolkit
- Docs for AI toolkit - https://aka.ms/AIToolkit/doc
- Part 2: Evaluators