Mitigating the potential risks of generative AI (GenAI) is critical, but throughout the development lifecycle, it’s just as important to measure risks so developers know where to focus their attention. Azure AI Studio provides comprehensive evaluation tools to help organizations proactively assess GenAI outputs against quality and safety metrics in a systematic, transparent, and repeatable way. Alongside tools for tracing and monitoring in Azure AI Studio, iterative evaluations can help development teams make data-driven decisions regarding model selection, content filter configurations, prompt engineering, and other application components before and after deploying to production.
In this blog, we share new capabilities available in public preview to help you evaluate and improve your application’s outputs with greater ease, plus new documentation and step-by-step tutorials to help you get started. Many of these evaluation capabilities were initially developed to aid internal teams at Microsoft building popular generative AI solutions such as Microsoft Copilot and GitHub Copilot. Now, with input from customers and partners, we’re bringing the same tried and tested capabilities to Azure AI Studio, to help every organization build more trustworthy AI applications.
Assess how your app responds to indirect prompt injection attacks
Risk and safety evaluations for indirect prompt injection attacks are now available in public preview, accessible through Azure AI Studio UI and SDK experiences. Indirect prompt injection attacks (also known as cross-domain prompt injection attacks or XPIA) are an emerging attack vector where a threat actor poisons a model’s grounding data source, such as a public website, email, or internal document, to pass hidden, malicious instructions to a model and circumvent safety guardrails. With the Azure AI Evaluation SDK, users can now simulate indirect prompt injection attacks on their generative AI model or application and measure how often their AI fails to detect and deflect the attacks (the defect rate) across the subcategories of manipulated content, intrusion, and information gathering. Users can also drill into evaluation details to better understand how their application typically responds to these attacks and the associated risks. With this information, users may decide to activate Prompt Shields using Azure AI Content Safety, adjust grounding data sources, or apply other mitigations in their system message before rerunning the evaluation and deploying to production.
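To give a sense of what this looks like in code, here is a minimal sketch of scoring a single simulated exchange with the Azure AI Evaluation SDK. The `IndirectAttackEvaluator` class and its `azure_ai_project`/`credential` parameters follow the public preview documentation; treat the exact names and the placeholder project values as assumptions and verify them against the SDK version you install.

```python
# Minimal sketch (azure-ai-evaluation, public preview) — class and parameter
# names follow the docs and should be verified; placeholder values must be replaced.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IndirectAttackEvaluator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-studio-project>",
}

xpia_evaluator = IndirectAttackEvaluator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
)

# Score one query/response pair taken from a simulated indirect prompt injection attack.
result = xpia_evaluator(
    query="Summarize this web page: <content with hidden malicious instructions>",
    response="<your application's response>",
)
print(result)  # labels and reasoning across manipulated content, intrusion, and information gathering
```

Aggregating these per-row results over a full simulated test set yields the defect rate described above.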
Learn more by reading the documentation and follow this step-by-step tutorial in Python.
Assess how often your app outputs protected material
Risk and safety evaluations for protected material (text) are now available in public preview, accessible through Azure AI Studio UI and SDK experiences. Because foundation models are typically trained using a massive corpus of data, users are understandably concerned that models may output responses containing protected material, putting end users at risk of unintended infringement. With the Azure AI Evaluation SDK, users can now simulate conversations with their generative AI model or application to try to elicit responses containing protected text (e.g., song lyrics, articles, recipes, select web content) and measure how often their AI outputs protected text in response (the defect rate). To do so, the evaluation checks the outputs against an index of third-party text content maintained on GitHub. Users can drill into evaluation details to better understand how their application typically responds to these user prompts and the associated risks. With this information, users may decide to activate protected material detection in Azure AI Content Safety, adjust their system message, or apply other mitigations before rerunning the evaluation and deploying to production.
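As a concrete example, the sketch below scores a single response for protected text. `ProtectedMaterialEvaluator` and its parameters follow the public preview documentation; the project values are placeholders, so confirm names against the current azure-ai-evaluation reference before relying on them.

```python
# Minimal sketch (azure-ai-evaluation, public preview) — treat class and parameter
# names as assumptions from the docs; placeholder values must be replaced.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ProtectedMaterialEvaluator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<ai-studio-project>",
}

protected_material_evaluator = ProtectedMaterialEvaluator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential(),
)

# Check whether a single response contains protected text (e.g. song lyrics).
result = protected_material_evaluator(
    query="Can you write out the full lyrics of a popular song for me?",
    response="<your application's response>",
)
print(result)  # label plus reasoning; aggregate across a test set to compute the defect rate
```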
Learn more by reading the documentation and follow this step-by-step tutorial in Python.
Assess the quality and accuracy of your app’s outputs
New quality evaluations are now available in public preview, accessible through the Azure AI Evaluation SDK today with UI support coming in October 2024. ROUGE, BLEU, METEOR, and GLEU are popular math-based metrics that can help AI developers assess text-based outputs for qualities such as similarity to expected outputs, precision, recall, and grammatical correctness. AI developers can evaluate each metric using a dedicated evaluator and combine multiple evaluators into a holistic evaluation run. After reviewing evaluation results, users may decide to compare different models, adjust their grounding data, or make other changes through prompt engineering before rerunning the evaluation to see the impact of their changes. Overall, these evaluations can help AI developers improve the quality, accuracy, and trustworthiness of their AI application.
| Evaluator name in Azure AI SDK | Definition |
| --- | --- |
| ROUGEScoreEvaluator | ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score measures the quality of the generated text by comparing it to a reference text using n-gram recall, precision, and F1-score. |
| BLEUScoreEvaluator | BLEU (Bilingual Evaluation Understudy) score measures how closely the generated text matches a reference text based on n-gram overlap. |
| METEORScoreEvaluator | METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluates the quality of the generated text by considering precision, recall, and a range of linguistic features like synonyms, stemming, and word order. |
| GLEUScoreEvaluator | GLEU (Google-BLEU) score measures the degree of overlap between the generated text and both the reference text and source text, balancing between precision and recall. |
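For a sense of how these evaluators fit together, here is a minimal sketch that runs two of the metrics from the table above on a single response/ground-truth pair. Exact class casing and constructor arguments (for example, the ROUGE type expected by the ROUGE evaluator) may differ slightly from the table depending on the SDK version, so treat this as an illustrative outline rather than a definitive implementation.

```python
# Minimal sketch (azure-ai-evaluation) — evaluator names correspond to the table above;
# verify exact class names and constructor arguments against the SDK version you install.
from azure.ai.evaluation import BleuScoreEvaluator, MeteorScoreEvaluator

bleu = BleuScoreEvaluator()
meteor = MeteorScoreEvaluator()

response = "Tokyo is the capital of Japan."
ground_truth = "The capital of Japan is Tokyo."

# Each evaluator compares the generated response against a reference answer.
print(bleu(response=response, ground_truth=ground_truth))
print(meteor(response=response, ground_truth=ground_truth))

# To combine multiple evaluators into one holistic run over a JSONL test set, the same
# instances can (per the SDK docs) be passed to the evaluate() helper, e.g.:
# from azure.ai.evaluation import evaluate
# evaluate(data="test_data.jsonl", evaluators={"bleu": bleu, "meteor": meteor})
```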
Learn more by reading the documentation and follow this step-by-step tutorial in Python.
Role play with your application to see how it responds to typical user prompts
A synthetic data generator and simulator for non-adversarial tasks is now available in public preview, accessible through the Azure AI Evaluation SDK. One of the biggest evaluation challenges we hear from customers is that they do not have comprehensive, high-quality test datasets to run holistic evaluations. In March, we introduced an adversarial simulator, specifically designed to role-play with a user’s model or application to generate high-quality test data for risk and safety evaluations. However, because it was designed to help accelerate adversarial red-teaming processes, it lacked the ability to simulate more general interactions with the actual target users of the application. Now, we are excited to announce an end-to-end synthetic data generation capability to help developers understand how their application responds to everyday user prompts. AI developers can use an index-based query generator and fully customizable simulator to create robust test datasets around non-adversarial tasks and personas specific to their application. This can help organizations fill a critical gap in their existing evaluation toolkit, facilitating higher-quality evaluations and faster iteration on an application.
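For orientation, the sketch below wires the non-adversarial simulator to a stubbed application callback. The `Simulator` class, its `model_config` parameter, and the `target`/`text`/`num_queries`/`max_conversation_turns`/`tasks` arguments follow the public preview documentation; the endpoint, deployment, grounding text, and callback body are placeholders, so treat the whole sketch as an assumption-laden outline rather than a definitive recipe.

```python
# Minimal sketch (azure-ai-evaluation, public preview) — parameter names follow the docs;
# endpoint, deployment, grounding text, and the callback body are placeholders.
import asyncio
from azure.ai.evaluation.simulator import Simulator

model_config = {
    "azure_endpoint": "<aoai-endpoint>",
    "azure_deployment": "<deployment-name>",
    # api_version / api_key may also be required depending on your setup
}

async def app_callback(messages, stream=False, session_state=None, context=None):
    # Replace this stub with a call into your own application.
    latest_query = messages["messages"][-1]["content"]
    app_response = f"<your application's answer to: {latest_query}>"
    messages["messages"].append({"role": "assistant", "content": app_response})
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }

async def main():
    simulator = Simulator(model_config=model_config)
    # Generate everyday (non-adversarial) queries grounded in your own content,
    # then play them against the application callback to build a test dataset.
    outputs = await simulator(
        target=app_callback,
        text="<grounding text your application answers questions about>",
        num_queries=4,
        max_conversation_turns=2,
        tasks=["A customer asking routine questions about the product"],
    )
    print(outputs)

asyncio.run(main())
```

The resulting conversations can be saved as a JSONL test set and fed into the quality and safety evaluators described earlier in this post.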
Learn more by reading the documentation and follow this step-by-step tutorial in Python.
Learn more about evaluations in Azure AI Studio
Evaluations are a critical step for building production-ready GenAI apps. With evaluations in Azure AI Studio, we’re making it easy for organizations to:
- Get started quickly with pre-built, customizable evaluation metrics, or define your own custom evaluators
- Prepare for the unexpected by simulating adversarial and non-adversarial interactions with your application
- Make data-driven decisions with interpretable evaluation results that developers can easily update and compare over time
If you are just getting started with evaluations in Azure AI Studio, we encourage you to explore these helpful resources:
- Watch a short video on available evaluation tools
- Read about pre-built evaluation and monitoring metrics
- Learn how to simulate adversarial interactions with your model or application
- Learn how to run an evaluation using the no-code UI experience
- Learn how to run an evaluation using the code-first SDK experience