Evaluation Flows for Large Language Models (LLMs) in Azure AI Studio
Published May 28 2024 06:44 AM

Large Language Models (LLMs) are incredibly useful for generating natural language texts for tasks like summarization, translation, question answering, and text generation. However, they aren't always perfect and can sometimes produce outputs that are inaccurate, irrelevant, biased, or even harmful. That's why it's super important to evaluate the outcomes from LLMs to ensure they meet the quality and ethical standards required for their intended use.


Imagine you're using an LLM to help create content for a website. Without proper evaluation, the model might generate text that doesn't quite fit the tone you're looking for, or worse, it might include biased or incorrect information. This is where evaluation flows come in handy. These systematic procedures help you assess and improve the LLM's outputs, making it easier to spot and fix errors, biases, and potential risks. Plus, evaluation flows can provide valuable feedback and guidance, helping developers and users align the LLM's performance with business goals and user expectations. By incorporating evaluation steps, you can ensure a more user-friendly and reliable experience for everyone involved.


In this article, we will explain what evaluation flows are and how to implement them in Azure AI Studio. We will start by pointing out the motivations for evaluating outcomes from LLMs and provide some examples of business situations where the absence of evaluation can cause problems for the business. Then, we will describe the main components and steps of evaluation flows and show how to use Azure AI Studio to create and execute evaluation flows for LLMs. Finally, we will discuss some best practices and challenges of evaluation flows and provide some resources for further learning.


Motivations for Evaluating Outcomes from LLMs 

Evaluating the outcomes from Large Language Models (LLMs) is crucial for several reasons. First and foremost, it ensures the quality and accuracy of the texts they generate. Imagine using an LLM to create product descriptions. Without proper evaluation, the model might produce descriptions that are misleading, inaccurate, or irrelevant to the product's features. By evaluating the LLM's outputs, you can catch and correct these errors, improving the overall quality and accuracy of the text.


Another important aspect is ensuring the ethical and social responsibility of the generated texts. LLMs can sometimes produce biased, offensive, harmful, or even illegal content. For instance, if an LLM is used to write news articles, it might inadvertently generate text that is racist, sexist, or defamatory. Evaluating the outputs helps identify and mitigate these biases and risks, ensuring the texts are ethical and socially responsible.


It's essential to ensure that the LLM's outputs align with business goals and user expectations. Picture an LLM generating marketing emails. Without evaluation, these emails might come across as too formal, too casual, or just too generic, missing the mark entirely. By assessing the outputs, you can optimize their impact and relevance, making sure they effectively meet the business's objectives and resonate with the target audience.


Failing to evaluate LLM outputs can lead to serious problems for a business. For instance, if the generated texts are low-quality, unethical, or irrelevant, customers and users may lose trust and interest. Consider an LLM that produces fake or biased product reviews. Customers would likely stop trusting these reviews and might even turn to competitors.


Moreover, if the LLM generates harmful, offensive, or illegal content, the business could face legal, regulatory, or social repercussions. Imagine an LLM generating defamatory or false news articles; the business could end up facing lawsuits, fines, or boycotts, severely damaging its reputation and credibility.


Finally, the effectiveness of LLM outputs directly impacts a business's competitive advantage and profitability. If the texts aren't persuasive, personalized, or engaging—like in marketing emails—the business might fail to boost sales, conversions, or retention rates, ultimately losing its edge in the market.


Best Practices and Challenges of Evaluation Flows 

Evaluation flows for LLMs are not trivial or straightforward, and they involve various best practices and challenges that users should be aware of and address, such as: 


  • Defining clear and realistic evaluation goals and objectives. Users should specify what they want to evaluate, why they want to evaluate, and how they want to evaluate, and align their evaluation goals and objectives with the business goals and user expectations. 
  • Choosing appropriate and reliable evaluation data and metrics. Users should select data and metrics that are representative, diverse, and sufficient for the evaluation task, and ensure that they are relevant, reliable, and valid.
  • Choosing appropriate and robust evaluation methods and actions. Users should select methods and actions that are robust and scalable for the evaluation task, and ensure that the evaluation results and their impact are transparent, explainable, and accountable.
  • Conducting iterative and continuous evaluation. Users should conduct evaluation in an iterative and continuous manner, and update and refine their evaluation data, metrics, methods, and actions based on the feedback and findings from the evaluation. 
  • Collaborating and communicating with stakeholders. Users should collaborate and communicate with various stakeholders, such as developers, users, customers, and regulators, and involve them in the evaluation process and outcomes, and address their needs, concerns, and expectations. 

Evaluation flows are an essential and valuable part of the LLM lifecycle. They help users ensure the quality, ethics, and effectiveness of LLM outputs and achieve the desired outcomes of their LLM use cases. In general, there are several ways to implement evaluation flows, but the best strategy will always rely on proper tools for managing, deploying, and monitoring both the LLM's behavior and its evaluation flows. And we have the proper tool to do that.


How to evaluate LLM models using Azure AI Studio?

Azure AI Studio is a cloud-based platform that enables users to build, deploy, and manage LLM models and applications. It provides various features and tools that support evaluation flows for LLMs, letting you:


  • Create custom evaluation datasets
  • Perform qualitative and quantitative evaluations
  • Use pre-built and custom metrics to assess model performance


It also offers capabilities for human-in-the-loop evaluations, which allow for human feedback to be incorporated into the model evaluation process. This comprehensive evaluation approach ensures that users can effectively measure and improve the performance of their LLM models and applications within the Azure ecosystem.

First of all, you need an Azure AI Studio environment. We encourage you to take a look at this documentation about the initial steps to create an AI Studio hub in your Azure environment.
As soon as you have your AI Studio hub, you can create an Evaluation project on your AI Studio project management page:

You just have to provide a name and select the scenario for the evaluation project:




You can also select a Prompt Flow for the evaluation project. Although this is optional, we suggest using Prompt Flow as the orchestrator because it manages the connections and requirements of each evaluation step, handles logging, and keeps each step of the flow properly decoupled. In this example, we use a Prompt Flow solution to generate the model's answers to the questions, which are then evaluated against the ground truth. Feel free to use your own assistant here. The basic idea is to generate an answer for each input (you can also generate the answers in advance and include them in the dataset):
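If you prefer to pre-compute the answers and ship them with the dataset, that step could look roughly like the sketch below. It is only an illustration: `add_answers` and `echo_assistant` are hypothetical names, and the stub stands in for whatever assistant or flow endpoint you actually call.

```python
from typing import Callable, Dict, List


def add_answers(rows: List[Dict[str, str]],
                assistant: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return copies of the rows with an 'answer' column filled by the assistant."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row["answer"] = assistant(row["question"])
        enriched.append(new_row)
    return enriched


def echo_assistant(question: str) -> str:
    """Hypothetical stand-in for a real assistant or prompt flow call."""
    return f"Stub answer to: {question}"


# Made-up sample row, for illustration only.
rows = [{"question": "What is Azure AI Studio?",
         "ground_truth": "A cloud-based platform for building LLM applications."}]
answered = add_answers(rows, echo_assistant)
```

Keeping the assistant behind a plain callable makes it easy to swap the stub for a real client without touching the rest of the evaluation data preparation.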




Next, it's time to select the dataset for the evaluation project. In the Azure-Samples/llm-evaluation (github.com) repository, we provide some methods to synthetically generate the data based on a retrieval index. Note that, since the task at hand is an evaluation based on reference texts and/or well-defined text outcomes, the dataset should be a set of questions and answers that allows the flow to tag what is proper and improper, and to measure how well a given response fits the provided reference.


Since these metrics are computed against reference values for a well-defined set of texts (for QnA tasks, question-answer pairs), a limited number of texts that covers a wide range of situations is sufficient, so all you need is a single file containing these reference question and answer pairs. Here is an example of a QnA dataset. In this case, it contains only question/ground_truth pairs, since the answer will be provided by the Prompt Flow assistant. Use Add your dataset to upload the file:
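As a rough sketch, a file of this shape could be produced with a few lines of Python. The sample rows, the file name, and the choice of JSON Lines as the file format are assumptions made for illustration; check which formats your AI Studio project accepts before uploading.

```python
import json
import tempfile
from pathlib import Path

# Illustrative rows; the questions and answers are made up for this sketch.
samples = [
    {"question": "What is the capital of France?", "ground_truth": "Paris"},
    {"question": "How many days are in a leap year?", "ground_truth": "366"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical file name, written to the system temp directory.
dataset_path = Path(tempfile.gettempdir()) / "qna_eval_dataset.jsonl"
write_jsonl(samples, dataset_path)
loaded = read_jsonl(dataset_path)
```

Reading the file back before uploading is a cheap way to confirm that every row carries both the question and ground_truth columns.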




In our example, we use a Prompt Flow solution to provide the model's answers to the questions, which are then evaluated against the ground truth. For this, we select the dataset column ${data.question}.

As in any evaluation job, you'll need to define which metrics you want to apply. Not only can you rely on Azure AI Studio's built-in metrics, but you can also include extra metrics and strategies, including those based on evaluating embeddings. To do that, proceed as follows:




We use GPT-4 as the model for the evaluation project because of its stronger reasoning capabilities, but the right choice depends on the task at hand. You can select the model from the list of available models. We also added Risk and safety metrics to mitigate potential risks of model misuse.
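To make the idea of an extra, embedding-based metric more concrete, here is a minimal sketch of a cosine-similarity score between an answer and its ground truth. The `toy_embedding` function is a hypothetical stand-in (plain word counts) for a real embedding model, and none of these names come from Azure AI Studio.

```python
import math
from collections import Counter
from typing import Dict


def toy_embedding(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_similarity(answer: str, ground_truth: str) -> float:
    """Score an answer against its reference using the toy embedding."""
    return cosine_similarity(toy_embedding(answer), toy_embedding(ground_truth))


same = embedding_similarity("Paris is the capital", "paris is the capital")
different = embedding_similarity("Paris is the capital", "bananas grow on trees")
```

With a real embedding model, only `toy_embedding` would change; the cosine computation stays the same. Word counts capture no semantics, so this stub should not be used to judge actual answers.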


As mentioned earlier, we use the Prompt Flow solution to provide the model's answers to the questions. This is optional, and you could simply use the default engine for the evaluation, but in this case we map the flow's output to the GPT similarity metric so that the model's answers are evaluated against the ground truth:
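Conceptually, this mapping step connects each metric input to either a dataset column or a flow output. The sketch below illustrates that idea with a tiny placeholder resolver. It is not how Azure AI Studio implements the mapping; the `${source.path}` placeholder form, the `run.outputs.answer` path, and all function names are assumptions of this sketch.

```python
import re
from typing import Any, Dict

# Matches placeholders of the assumed form ${a.b.c}, e.g. ${data.question}.
PLACEHOLDER = re.compile(r"^\$\{([\w.]+)\}$")


def lookup(path: str, sources: Dict[str, Any]) -> Any:
    """Walk a dotted path such as 'run.outputs.answer' through nested dicts."""
    value: Any = sources
    for key in path.split("."):
        value = value[key]
    return value


def resolve_mapping(mapping: Dict[str, str],
                    sources: Dict[str, Any]) -> Dict[str, Any]:
    """Replace each placeholder with the value it points to in `sources`."""
    resolved = {}
    for target, expression in mapping.items():
        match = PLACEHOLDER.match(expression)
        # Anything that is not a placeholder is passed through as a literal.
        resolved[target] = lookup(match.group(1), sources) if match else expression
    return resolved


# Made-up row and flow output, for illustration only.
sources = {
    "data": {"question": "What is the capital of France?", "ground_truth": "Paris"},
    "run": {"outputs": {"answer": "The capital of France is Paris."}},
}
mapping = {
    "question": "${data.question}",
    "ground_truth": "${data.ground_truth}",
    "answer": "${run.outputs.answer}",
}
metric_inputs = resolve_mapping(mapping, sources)
```

The resolved dictionary is what a similarity metric would receive: the question, the reference answer from the dataset, and the answer produced by the flow.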




Finally, you can submit the evaluation project and view the results.




In our example dataset, we deliberately provided two samples with wrong ground truth answers to test the GPT-similarity metric. Notice that the results show low scores for these two samples. This is expected, since the ground truth answers are incorrect.





We can see that the Similarity scores for the two incorrect samples are very low because of the incorrect ground truth labels. In the real world, the situation is usually reversed: the model produces wrong answers while the ground truth values are correct. Finally, the main dashboard of your Azure AI Studio project gives the average score for each metric.
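If you export the per-sample results, the same kind of summary can be reproduced offline. The sketch below averages each metric and flags low-scoring samples for review. The scores, the column names (`gpt_similarity`, `coherence`), the 1-5 scale, and the threshold are all made up for illustration and are not taken from the run shown above.

```python
from statistics import mean
from typing import Dict, List

# Made-up per-sample scores on an assumed 1-5 scale.
results = [
    {"id": 0, "gpt_similarity": 5, "coherence": 5},
    {"id": 1, "gpt_similarity": 1, "coherence": 4},
    {"id": 2, "gpt_similarity": 4, "coherence": 5},
    {"id": 3, "gpt_similarity": 1, "coherence": 5},
]


def average_scores(rows: List[Dict[str, int]],
                   metrics: List[str]) -> Dict[str, float]:
    """Average each metric over all rows, like a dashboard summary."""
    return {metric: mean(row[metric] for row in rows) for metric in metrics}


def flag_low(rows: List[Dict[str, int]], metric: str,
             threshold: int) -> List[int]:
    """Return the ids of rows whose score is at or below the threshold."""
    return [row["id"] for row in rows if row[metric] <= threshold]


averages = average_scores(results, ["gpt_similarity", "coherence"])
needs_review = flag_low(results, "gpt_similarity", threshold=2)
```

Looking at the flagged samples individually matters because a low score can come from either side: a wrong model answer or, as in the example above, a wrong ground truth.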





