This blog series has several versions, each covering different aspects and techniques. Check out the following resources:
Fine-tuning a model can sometimes lead to unintended or undesired responses. To ensure that the model remains safe and effective, it's important to evaluate it. This evaluation helps to assess the model's potential to generate harmful content and its ability to produce accurate, relevant, and coherent responses. In this tutorial, you will learn how to evaluate the safety and performance of a fine-tuned Phi-3 / Phi-3.5 model integrated with Prompt flow in Azure AI Studio.
Here is an Azure AI Studio's evaluation process.
For more detailed information and to explore additional resources about Phi-3 and Phi-3.5, please visit the Phi-3CookBook.
Series1: Introduction to Azure AI Studio's Prompt flow evaluation
Series2: Evaluating the Phi-3 / Phi-3.5 model in Azure AI Studio
To ensure that your AI model is ethical and safe, it's crucial to evaluate it against Microsoft's Responsible AI Principles. In Azure AI Studio, safety evaluations allow you to evaluate an your model’s vulnerability to jailbreak attacks and its potential to generate harmful content, which is directly aligned with these principles.
Image Source: Evaluation of generative AI applications
Before beginning the technical steps, it's essential to understand Microsoft's Responsible AI Principles, an ethical framework designed to guide the responsible development, deployment, and operation of AI systems. These principles guide the responsible design, development, and deployment of AI systems, ensuring that AI technologies are built in a way that is fair, transparent, and inclusive. These principles are the foundation for evaluating the safety of AI models.
Microsoft's Responsible AI Principles include:
Fairness and Inclusiveness: AI systems should treat everyone fairly and avoid affecting similarly situated groups of people in different ways. For example, when AI systems provide guidance on medical treatment, loan applications, or employment, they should make the same recommendations to everyone who has similar symptoms, financial circumstances, or professional qualifications.
Reliability and Safety: To build trust, it's critical that AI systems operate reliably, safely, and consistently. These systems should be able to operate as they were originally designed, respond safely to unanticipated conditions, and resist harmful manipulation. How they behave and the variety of conditions they can handle reflect the range of situations and circumstances that developers anticipated during design and testing.
Transparency: When AI systems help inform decisions that have tremendous impacts on people's lives, it's critical that people understand how those decisions were made. For example, a bank might use an AI system to decide whether a person is creditworthy. A company might use an AI system to determine the most qualified candidates to hire.
Privacy and Security: As AI becomes more prevalent, protecting privacy and securing personal and business information are becoming more important and complex. With AI, privacy and data security require close attention because access to data is essential for AI systems to make accurate and informed predictions and decisions about people.
Accountability: The people who design and deploy AI systems must be accountable for how their systems operate. Organizations should draw upon industry standards to develop accountability norms. These norms can ensure that AI systems aren't the final authority on any decision that affects people's lives. They can also ensure that humans maintain meaningful control over otherwise highly autonomous AI systems.
To learn more about Microsoft's Responsible AI Principles, visit the What is Responsible AI?.
In this tutorial, you will evaluate the safety of the fine-tuned Phi-3 / Phi-3.5 model using Azure AI Studio's safety metrics. These metrics help you assess the model's potential to generate harmful content and its vulnerability to jailbreak attacks. The safety metrics include:
Evaluating these aspects ensures that the AI model does not produce harmful or offensive content, aligning it with societal values and regulatory standards.
To ensure that your AI model is performing as expected, it's important to evaluate its performance against performance metrics. In Azure AI Studio, performance evaluations allow you to evaluate your model's effectiveness in generating accurate, relevant, and coherent responses.
Image Source: Evaluation of generative AI applications
In this tutorial, you will evaluate the performance of the fine-tuned Phi-3 / Phi-3.5 model using Azure AI Studio's performance metrics. These metrics help you assess the model's effectiveness in generating accurate, relevant, and coherent responses. The performance metrics include:
These metrics help you evaluate the model's effectiveness in generating accurate, relevant, and coherent responses.
This tutorial is a follow up to the previous blog posts, "Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow: Step-by-Step Guide" and "Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow in Azure AI Studio." In these posts, we walked through the process of fine-tuning a Phi-3 / Phi-3.5 model in Azure AI Studio and integrating it with Prompt flow.
In this tutorial, you will deploy an Azure OpenAI model as an evaluator in Azure AI Studio and use it to evaluate your fine-tuned Phi-3 / Phi-3.5 model.
Before you begin this tutorial, make sure you have the following prerequisites, as described in the previous tutorials:
You will use the test_data.jsonl file, located in the data folder from the ULTRACHAT_200k dataset downloaded in the previous blog posts, as the dataset to evaluate the fine-tuned Phi-3 / Phi-3.5 model.
If you followed the low-code approach described in "Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow in Azure AI Studio", you can skip this exercise and proceed to the next one. However, if you followed the code-first approach described in "Fine-Tune and Integrate Custom Phi-3 Models with Prompt Flow: Step-by-Step Guide" to fine-tune and deploy your Phi-3 / Phi-3.5 model, the process of connecting your model to Prompt flow is slightly different. You will learn this process in this exercise.
To proceed, you need to integrate your fine-tuned Phi-3 / Phi-3.5 model into Prompt flow in Azure AI Studio.
You need to create a Hub before creating the Project. A Hub acts like a Resource Group, allowing you to organize and manage multiple Projects within Azure AI Studio.
Sign in Azure AI Studio.
Select All hubs from the left side tab.
Select + New hub from the navigation menu.
Perform the following tasks:
Select Next.
In the Hub that you created, select All projects from the left side tab.
Select + New project from the navigation menu.
Enter Project name. It must be a unique value.
Select Create a project.
To integrate your custom Phi-3 / Phi-3.5 model with Prompt flow, you need to save the model's endpoint and key in a custom connection. This setup ensures access to your custom Phi-3 / Phi-3.5 model in Prompt flow.
Visit Azure ML Studio.
Navigate to the Azure Machine learning workspace that you created.
Select Endpoints from the left side tab.
Select endpoint that you created.
Select Consume from the navigation menu.
Copy your REST endpoint and Primary key.
Visit Azure AI Studio.
Navigate to the Azure AI Studio project that you created.
In the Project that you created, select Settings from the left side tab.
Select + New connection.
Select Custom keys from the navigation menu.
Perform the following tasks:
Select Add connection.
You have added a custom connection in Azure AI Studio. Now, let's create a Prompt flow using the following steps. Then, you will connect this Prompt flow to the custom connection to use the fine-tuned model within the Prompt flow.
Navigate to the Azure AI Studio project that you created.
Select Prompt flow from the left side tab.
Select + Create from the navigation menu.
Select Chat flow from the navigation menu.
Enter Folder name to use.
Select Create.
You need to integrate the fine-tuned Phi-3 / Phi-3.5 model into a Prompt flow. However, the existing Prompt flow provided is not designed for this purpose. Therefore, you must redesign the Prompt flow to enable the integration of the custom model.
In the Prompt flow, perform the following tasks to rebuild the existing flow:
Select Raw file mode.
Delete all existing code in the flow.dag.yml file.
Add the folling code to flow.dag.yml.
inputs:
input_data:
type: string
default: "Who founded Microsoft?"
outputs:
answer:
type: string
reference: ${integrate_with_promptflow.output}
nodes:
- name: integrate_with_promptflow
type: python
source:
type: code
path: integrate_with_promptflow.py
inputs:
input_data: ${inputs.input_data}
Select Save.
Add the following code to integrate_with_promptflow.py to use the custom Phi-3 / Phi-3.5 model in Prompt flow.
import logging
import requests
from promptflow import tool
from promptflow.connections import CustomConnection
# Logging setup
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
level=logging.DEBUG
)
logger = logging.getLogger(__name__)
def query_phi3_model(input_data: str, connection: CustomConnection) -> str:
"""
Send a request to the Phi-3 / Phi-3.5 model endpoint with the given input data using Custom Connection.
"""
# "connection" is the name of the Custom Connection, "endpoint", "key" are the keys in the Custom Connection
endpoint_url = connection.endpoint
api_key = connection.key
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {api_key}"
}
data = {
"input_data": [input_data],
"params": {
"temperature": 0.7,
"max_new_tokens": 128,
"do_sample": True,
"return_full_text": True
}
}
try:
response = requests.post(endpoint_url, json=data, headers=headers)
response.raise_for_status()
# Log the full JSON response
logger.debug(f"Full JSON response: {response.json()}")
result = response.json()["output"]
logger.info("Successfully received response from Azure ML Endpoint.")
return result
except requests.exceptions.RequestException as e:
logger.error(f"Error querying Azure ML Endpoint: {e}")
raise
@tool
def my_python_tool(input_data: str, connection: CustomConnection) -> str:
"""
Tool function to process input data and query the Phi-3 / Phi-3.5 model.
"""
return query_phi3_model(input_data, connection)
For more detailed information on using Prompt flow in Azure AI Studio, you can refer to Prompt flow in Azure AI Studio.
Select Chat input, Chat output to enable chat with your model.
Now you are ready to chat with your custom Phi-3 / Phi-3.5 model. In the next exercise, you will learn how to start Prompt flow and use it to chat with your fine-tuned Phi-3 / Phi-3.5 model.
The rebuilt flow should look like the image below:
Select Start compute sessions to start Prompt flow.
Select Validate and parse input to renew parameters.
Select the Value of the connection to the custom connection you created. For example, connection.
Select Chat.
Here's an example of the results: Now you can chat with your custom Phi-3 / Phi-3.5 model. It is recommended to ask questions based on the data used for fine-tuning.
To evaluate the Phi-3 / Phi-3.5 model in Azure AI Studio, you need to deploy an Azure OpenAI model. This model will be used to evaluate the performance of the Phi-3 / Phi-3.5 model.
Sign in to Azure AI Studio.
Navigate to the Azure AI Studio project that you created.
In the Project that you created, select Deployments from the left side tab.
Select + Deploy model from the navigation menu.
Select Deploy base model.
Select Azure OpenAI model you'd like to use. For example, gpt-4o.
Select Confirm.
Visit Azure AI Studio.
Navigate to the Azure AI Studio project that you created.
In the Project that you created, select Evaluation from the left side tab.
Select + New evaluation from the navigation menu.
Select Prompt flow evaluation.
perform the following tasks:
Select Next.
perform the following tasks:
Select Next.
perform the following tasks to configure the performance and quality metrics:
perform the following tasks to configure the risk and safety metrics:
Select Next.
Select Submit to start the evaluation.
The evaluation will take some time to complete. You can monitor the progress in the Evaluation tab.
The results presented below are intended to illustrate the evaluation process. In this tutorial, we have used a model fine-tuned on a relatively small dataset, which may lead to sub-optimal results. Actual results may vary significantly depending on the size, quality, and diversity of the dataset used, as well as the specific configuration of the model.
Once the evaluation is complete, you can review the results for both performance and safety metrics.
Performance and quality metrics:
Risk and safety metrics:
You can scroll down to view Detailed metrics result.
By evaluating your custom Phi-3 / Phi-3.5 model against both performance and safety metrics, you can confirm that the model is not only effective, but also adheres to responsible AI practices, making it ready for real-world deployment.
You have successfully evaluated the fine-tuned Phi-3 model integrated with Prompt flow in Azure AI Studio. This is an important step in ensuring that your AI models not only perform well, but also adhere to Microsoft's Responsible AI principles to help you build trustworthy and reliable AI applications.
Image Source: Evaluation of generative AI applications
Cleanup your Azure resources to avoid additional charges to your account. Go to the Azure portal and delete the following resources:
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.