Azure AI Foundry Blog

Introducing Model Distillation in Azure OpenAI Service

MahsaRouzbahman
Nov 19, 2024

We're excited to introduce the upcoming release of the Model Distillation feature in Azure OpenAI Service. This feature gives developers a seamless, integrated workflow for managing the entire distillation pipeline directly within the Azure OpenAI platform.

The new offering includes two UI experiences that support an end-to-end distillation flow: Stored Completions (coming soon) and Azure OpenAI Evaluation (public preview).

The Stored Completions experience features API updates to support per-request traffic logging and a user-friendly interface for reviewing, filtering, and exporting collected data. The Azure OpenAI Evaluation experience offers a UI-based approach to score data based on predefined criteria.

Together, these two experiences create a comprehensive distillation process: collecting live traffic from Azure OpenAI endpoints, filtering and subsetting that traffic in the Stored Completions UI, exporting it to the Evaluation UI for quality scoring, and finally fine-tuning on the collected data, or on a subset selected by evaluation score. This end-to-end flow helps customers efficiently manage and optimize their data, enhancing their overall experience.

What is Model Distillation?

Model distillation empowers developers to use the outputs of large, complex models to fine-tune smaller, more efficient ones. This technique allows the smaller models to perform just as well on specific tasks, all while significantly cutting down on both cost and latency.

Historically, model distillation has been a complex, multi-step challenge. Developers had to manually coordinate numerous operations across various disconnected tools, from generating datasets to fine-tuning models and evaluating performance improvements. This made the process not only time-consuming but also prone to errors.

But not anymore! Our new offering simplifies the entire process, making it effortless for developers to iterate and refine their models.

What are the components of the Distillation flow?

Stored Completions: Easily generate datasets for distillation by capturing and storing input-output pairs from models like GPT-4o through our API. This allows you to build datasets with your production data for evaluating and fine-tuning models.
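As a rough sketch of what capturing traffic looks like, the snippet below builds a chat completion request with `store=True` and metadata tags, assuming the `openai` Python SDK. The deployment name, system prompt, and tag values are hypothetical examples, not prescribed settings.

```python
# A minimal sketch of logging production traffic for Stored Completions.
# `store=True` persists the input/output pair, and the metadata tags are
# what you can filter on later in the Stored Completions UI.

def build_stored_request(article_text: str) -> dict:
    """Build kwargs for a chat completion that will be stored."""
    return {
        "model": "gpt-4o",  # your GPT-4o deployment name
        "store": True,      # persist this input/output pair
        "metadata": {"task": "news-sentiment", "source": "rss-feed"},
        "messages": [
            {"role": "system",
             "content": "Classify the sentiment of the article as "
                        "positive, negative, or neutral."},
            {"role": "user", "content": article_text},
        ],
    }

# With a configured client, the call would look something like:
# client = AzureOpenAI(azure_endpoint=..., api_key=..., api_version=...)
# completion = client.chat.completions.create(**build_stored_request(text))

kwargs = build_stored_request("Shares rallied after strong earnings.")
```

Because the tags travel with each request, you can later slice your production data by task or source without any extra bookkeeping.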

Evaluation: Create and run custom evaluations to measure model performance on specific tasks. Azure OpenAI Evaluation provides an integrated way to measure performance, using data from Stored Completions or existing datasets. Azure OpenAI Evaluation can also be utilized on its own to assess model performance for your specific use cases.

Fine-tuning: Stored Completions and Azure OpenAI Evaluation are fully integrated with Azure OpenAI fine-tuning. Use datasets created with Stored Completions in your fine-tuning jobs, and run evaluations on fine-tuned models using Azure OpenAI Evaluation.

How does Distillation work End to End?

Let’s work together on a distillation user scenario: distilling from production for a news sentiment use case.

Imagine a company using an AI-powered platform to monitor news in real-time, tracking sentiment around its brand, products, and industry trends. Behind the scenes, the GPT-4o model is analyzing news articles to detect positive, negative, or neutral sentiment.

As the number of news sources and updates grows, so do the platform's operating costs, and with each new data source, processing times begin to slow. Now, imagine if the platform could maintain GPT-4o's intelligence and accuracy while reducing both costs and response times.

Model distillation to the rescue! We capture GPT-4o's real-time interactions with news articles, building a rich dataset of responses. With this data, we can distill GPT-4o's power into a smaller, faster model that delivers high-quality answers at a fraction of the cost.

Here's how you can do it:

Capture model responses: Using Stored Completions, you can capture data at the request level and optionally include metadata tags that let you filter your completions later. After a period of time, navigate to the Stored Completions UI, where you can review and interact with this traffic through a user-friendly interface: select events within a given time frame, requests from a specific model, free-text matches in prompts or responses, and user-specified tags. Once you are happy with your selection, you can export the data to Fine-Tune/Distill or to Evaluation.

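To make the filtering step concrete, here is a local sketch of the kind of selection the Stored Completions UI performs. The records and field names are hypothetical stand-ins for what the UI exposes (model, metadata tags, response text).

```python
# Hypothetical exported records, shaped like stored completions.
records = [
    {"model": "gpt-4o",
     "metadata": {"task": "news-sentiment"},
     "response": "positive"},
    {"model": "gpt-4o",
     "metadata": {"task": "news-sentiment"},
     "response": "Cannot analyze: the article is only a link."},
    {"model": "gpt-4o-mini",
     "metadata": {"task": "chat"},
     "response": "hello"},
]

# Filter by model and tag, plus a free-text search in responses,
# mirroring the filters available in the Stored Completions UI.
selected = [
    r for r in records
    if r["model"] == "gpt-4o"
    and r["metadata"].get("task") == "news-sentiment"
    and "link" not in r["response"]
]
```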
Evaluate the captured data:

From our investigations in the Stored Completions UI, we have noticed that sometimes the GPT-4o model is unable to provide any sentiment analysis because the news articles only contain a link without any content. We aim to capture these cases and exclude them from the data.

Let's export the data to Evaluation to make sure we have good-quality data for distillation.

You can preview your data and name your evaluation job. Optionally, you can enable "Generate responses" to add responses using a new model or prompt. In our example, we will skip the "Generate responses" step.

Then, choose your testing criteria to assess the effectiveness of each output. There are nine categories, from simple string checks to advanced model-based graders like factuality or semantic similarity. Model-based graders require a model deployment as part of the grader setting.

Let's begin with a quick string check to ensure that the GPT-4o model in production returns a sentiment answer. In our case, we want to check whether the output contains a positive, negative, or neutral label. You can add other testing criteria as needed.

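The logic behind this string check can be sketched as a small helper. The real grader is configured in the Evaluation UI; this function is a hypothetical local equivalent, shown only to illustrate what the criterion verifies.

```python
# Does the output contain one of the three sentiment labels?
# This mirrors the string check configured in the Evaluation UI.
def contains_sentiment(output: str) -> bool:
    labels = ("positive", "negative", "neutral")
    return any(label in output.lower() for label in labels)
```

An output like "Sentiment: Positive" passes, while a response that only echoes a bare article link fails, which is exactly the failure mode we want to filter out.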
Export the good examples to Fine-tune: Looking at the evaluation results, the GPT-4o model provided a sentiment answer for about 98% of the data. However, in some cases the model was not able to provide a sentiment.

Now, you can select only the data with a "Passed" status and export it to fine-tuning. This will allow you to fine-tune a GPT-4o-mini model with high-quality, accurate data as part of the Distillation flow. Choose GPT-4o-mini as the base model for distillation, set a validation set if applicable, configure the hyperparameters for the model, review everything, and finally, hit 'Fine-tune' to initiate the fine-tuning job.
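Under the hood, each exported "Passed" completion maps onto the chat fine-tuning data format: one JSON object per line, each holding the full message exchange. The sketch below shows that shape; the system prompt is a hypothetical example matching the sentiment scenario.

```python
import json

def to_training_line(article_text: str, sentiment: str) -> str:
    """One JSONL record in the chat fine-tuning format."""
    record = {
        "messages": [
            {"role": "system",
             "content": "Classify the sentiment of the article as "
                        "positive, negative, or neutral."},
            {"role": "user", "content": article_text},
            {"role": "assistant", "content": sentiment},
        ]
    }
    return json.dumps(record)

line = to_training_line("Markets fell sharply today.", "negative")
```

The export step produces this file for you, so in practice you only review it, but knowing the format helps when assembling a separate validation set.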

Once the model is fine-tuned, we can review the loss and accuracy metrics and then deploy the selected model.

How did the distilled model perform? Now that you have a fine-tuned GPT-4o-mini model, it's time to evaluate your distilled model and compare it with the original model you had in production. With Azure OpenAI Evaluation and the "Generate responses" step, you can effortlessly create a prompt and compare the model's response with your ground-truth data. This time, we can use a new grader like semantic similarity, which measures the degree of similarity between the model's response and the reference.

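To build intuition for what a semantic-similarity score captures, here is a toy illustration using cosine similarity over embedding vectors. The actual grader is model-based and configured in the Evaluation UI; the vectors below are made-up stand-ins for embeddings of a response and its reference.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embedding vectors; in practice these would come from an
# embedding model, and the grader reports how close they are.
response_vec = [0.9, 0.1, 0.2]
reference_vec = [0.85, 0.15, 0.25]
score = cosine(response_vec, reference_vec)
```

Responses that mean the same thing as the reference land near 1.0 even when the wording differs, which is why this grader suits free-text comparisons better than exact string matching.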
Let's dive into the comparison of the distilled GPT-4o-mini model, the GPT-4o-mini base model, and the original GPT-4o model. The original GPT-4o model achieved a score of 81.4%, while the fine-tuned GPT-4o-mini model reached a close 81.3%. The base GPT-4o-mini model, on the other hand, scored 76.2%. The fine-tuned model demonstrates excellent performance, comparable to the model in production. Best of all, with Stored Completions you can log more data and use continuous fine-tuning to further enhance performance.

In summary, by leveraging the Stored Completions, Fine-Tuning, and Evaluation features in Azure OpenAI Service, you can automate the generation of high-quality datasets ready for training. This innovation makes experimentation and evaluation not only easier but also more efficient. Our new offering simplifies the entire process, making it effortless for developers to iterate and refine their models.

Get ready to transform the way you develop models with Azure OpenAI.

Want to learn more?

Updated Nov 19, 2024
Version 1.0