Azure AI services Blog

Announcing Fine Tuning for GPT-4, AI Studio Support, and API improvements with Azure OpenAI Service

AliciaFrame
May 21, 2024

 

We love big announcements - and today is full of them! We’re announcing the public preview of GPT-4 fine tuning, along with automated evaluations that assess fine-tuned models for the potential to produce harmful outputs. GPT-4, along with all our other fine-tunable Azure OpenAI models, will be available through Azure AI Studio, so you can fine tune open-source (OSS) and Azure OpenAI models in the same UI, alongside other Azure AI Studio capabilities. And if that wasn’t enough, we’re also releasing updates to our API that give customers more control over the fine-tuning process.

In this blog, we'll talk about:

  • GPT-4 Fine Tuning, and a case study with NI (now part of Emerson)
  • Automated evaluations to detect and prevent harmful content in the training data and outputs of fine-tuned models
  • AI Studio support for Azure OpenAI fine tuning, and
  • New API features including checkpointing
     

GPT-4 Fine Tuning Public Preview 

GPT-4 is our most advanced fine tunable model yet: it outperforms GPT-35-Turbo on a variety of tasks, and thanks to its alignment training it shows improved factuality, steerability, and instruction following. Today, we're making GPT-4 fine tuning available in public preview, so you can now customize it with your own training data! 
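
If you want a feel for what that looks like from code, here is a minimal sketch using the OpenAI Python SDK against an Azure OpenAI resource. The endpoint and key environment variable names follow common conventions, but the API version, training file name, and model identifier below are assumptions - check the Azure OpenAI fine-tuning documentation for the values that apply to your resource and region.

```python
# Minimal sketch: submitting a GPT-4 fine-tuning job with the OpenAI Python SDK
# against an Azure OpenAI resource. The API version, file name, and model name
# are assumptions; check the Azure OpenAI fine-tuning docs for current values.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-05-01-preview",  # assumed preview API version
)

# Training data is JSONL, one chat-formatted example per line, for instance:
# {"messages": [{"role": "system", "content": "..."},
#               {"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("train.jsonl", "rb"),  # hypothetical training file
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4",  # assumed identifier for the GPT-4 fine-tuning preview
)
print(job.id, job.status)
```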

 

 

Like our other fine tunable models, we charge an hourly rate for training and deployment, with token-based billing for inferencing using the resulting model.

 

Model                  List Price (as of May 21, 2024)
GPT-4 Training         $102 / hour
GPT-4 Hosting          $5 / hour
GPT-4 Input Tokens     $0.03 / 1K tokens
GPT-4 Output Tokens    $0.06 / 1K tokens
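
As a rough illustration of how the hourly and token-based meters combine, here is a back-of-envelope estimate using the list prices above. The training duration, hosting period, and token volumes are entirely hypothetical numbers chosen for the example.

```python
# Back-of-envelope cost estimate using the list prices above.
# Training hours, hosting hours, and token volumes are hypothetical.
TRAIN_PER_HOUR = 102.00   # GPT-4 training, $/hour
HOST_PER_HOUR = 5.00      # GPT-4 hosting, $/hour
INPUT_PER_1K = 0.03       # $ per 1K input tokens
OUTPUT_PER_1K = 0.06      # $ per 1K output tokens

training_hours = 4        # hypothetical training run
hosting_hours = 24 * 30   # one month of hosting
input_tokens = 5_000_000  # hypothetical monthly traffic
output_tokens = 1_000_000

total = (
    training_hours * TRAIN_PER_HOUR        # 408
    + hosting_hours * HOST_PER_HOUR        # 3,600
    + input_tokens / 1000 * INPUT_PER_1K   # 150
    + output_tokens / 1000 * OUTPUT_PER_1K # 60
)
print(f"Estimated cost: ${total:,.2f}")    # $4,218.00
```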

 

We’ve had GPT-4 fine tuning in private preview for six months, with a cohort of early adopters testing its capabilities on tasks including translation, code generation, steering content, tone, and formatting, and even adapting the model to new domains. We were impressed by the breadth of applications, as well as by how successful those use cases were. To call out one highlight...

 

Case Study: NI (now part of Emerson’s Test & Measurement business) 

NI (now part of Emerson) provides software-connected automated test and measurement systems. They used fine tuning to adapt GPT-4 to their graphical programming language (LabVIEW), which was not part of the base model's original training data: 

“With GPT-4 fine tuning, we were able to improve our AI's ability to understand and generate LabVIEW code, at least twice as well on internal metrics than the best prompt engineering, open-source fine-tuned models, or highly sophisticated retrieval augmentation systems we have experimented with, even those built on top of base GPT 4 Turbo. 

 This fine-tuned model has significantly faster response times, is easier to operate, and is more cost effective when deployed at scale. We are excited at what appears to be possible with models such as GPT 4 and their ability to fine tune and layer on additional techniques.” 

 

Alejandro Barreto, Chief Software Engineer, Technology and Innovation Office, NI (now part of Emerson)

 

Responsible AI for GPT-4 Fine Tuning

One of the primary worries customers express about fine tuning is the risk of catastrophic forgetting - unintentionally degrading the built-in mechanisms designed to reduce the output of harmful content. And of course we want to prevent bad actors from intentionally trying to adapt powerful models for malicious outcomes. At Microsoft, Responsible AI is at the heart of our service, and with GPT-4 fine tuning, we’ve taken extra steps to evaluate fine-tuned models for harmful content risks.

Training Data

Before training starts, your data is evaluated for potentially harmful content (violence, sexual, hate and fairness, self-harm: see category definitions here). If harmful content is detected above the specified severity level, your training job will fail, and you will receive a message informing you of the categories of failure so you can adjust your training data set before proceeding.

Model Evaluation

After training completes but before the fine-tuned model is available for deployment, the resulting model is evaluated for potentially harmful responses using Azure’s built-in risk and safety metrics. Using the same approach to testing that we use for the base large language models, our evaluation capability simulates a conversation with your fine-tuned model to assess the potential to output harmful content, again using specified harmful content categories (violence, sexual, hate and fairness, self-harm).  

 

If a model is found to generate harmful content above an acceptable rate, you will be informed that your model is not available for deployment, along with information about the specific categories of harm detected. You can then revise your training data or use case and resubmit your fine-tuning job. 

 

These evaluations are all performed in accordance with the data privacy practices that apply to all Azure OpenAI fine tuning (learn more at Data, privacy, and security for Azure OpenAI Service), and you aren’t charged for fine-tuning jobs that fail during evaluation.
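
If you drive fine tuning through the API rather than the UI, both the training data check and the post-training model evaluation surface as a failed job. A minimal polling sketch might look like the following; the job ID is a placeholder, the error fields assume the OpenAI-compatible job object, and the client reads its endpoint, key, and API version from environment variables.

```python
# Sketch: polling a fine-tuning job and reading failure details, e.g. when the
# training-data check or the post-training harmful-content evaluation fails.
# Assumes AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and OPENAI_API_VERSION
# are set in the environment; the job ID is a placeholder.
import time
from openai import AzureOpenAI

client = AzureOpenAI()
job_id = "ftjob-0000000000"  # placeholder

while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

if job.status == "failed" and job.error:
    # The message describes why the job failed, including any flagged content
    # categories, so you can revise your data or use case and resubmit.
    print(f"Failed: {job.error.code} - {job.error.message}")
else:
    print(f"Finished with status: {job.status}")
```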



Now Available in AI Studio

Beyond Azure OpenAI, Microsoft offers a range of model capabilities through the Azure AI Studio, with access to additional open and proprietary models including LLaMa, Phi, and Mistral, as well as capabilities ranging from application development to automated evaluation and metrics. We’re launching Azure OpenAI fine tuning in AI Studio so you can fine tune your favorite Azure OpenAI models side by side with LLaMa, Phi, and more.

 

In Azure AI Studio, you can identify fine tunable models directly from their model cards or by using the “fine tuning” tool option when you’re working on a project. Select your model, upload your data, and choose your hyperparameters; when your model is ready, deploy it for inferencing.

API Updates and More

As if all of that wasn’t enough, we’re also releasing the ability to deploy model checkpoints, measure full validation metrics during your training runs, and set the seed property of a job to ensure repeatability. 
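
The seed, for example, is set when you create the job. Here is a minimal sketch assuming the OpenAI-compatible fine-tuning surface used above; the file IDs are placeholders, and the client reads its endpoint, key, and API version from environment variables.

```python
# Sketch: creating a fine-tuning job with a fixed seed for repeatability.
# File IDs are placeholders; assumes AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY,
# and OPENAI_API_VERSION are set in the environment.
from openai import AzureOpenAI

client = AzureOpenAI()

job = client.fine_tuning.jobs.create(
    model="gpt-4",                  # assumed model identifier
    training_file="file-abc123",    # placeholder training file ID
    validation_file="file-def456",  # placeholder; enables the validation metrics below
    seed=42,                        # same seed + same data -> repeatable runs
)
print(job.id, job.seed)
```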

 


Deployable Checkpoints

With our latest API update, in addition to creating the final fine-tuned model, we create one checkpoint at the end of each training epoch and return the last 3 checkpoints from your training run. Each checkpoint is a deployable model, which can be used for inferencing. You may want to deploy a checkpoint when your final model shows signs of overfitting (diverging accuracy between training and validation data, results that are less diverse than expected). 
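
Assuming the OpenAI-compatible checkpoints endpoint, a sketch of listing the checkpoints saved for a finished job might look like this; the job ID is a placeholder, and each checkpoint's model name can then be deployed the same way as the final fine-tuned model.

```python
# Sketch: listing the deployable checkpoints kept for a finished fine-tuning job.
# Assumes AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and OPENAI_API_VERSION
# are set in the environment; the job ID is a placeholder.
from openai import AzureOpenAI

client = AzureOpenAI()
job_id = "ftjob-0000000000"  # placeholder

checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints.data:
    # Each checkpoint records the step it was taken at, its metrics, and a
    # model name (fine_tuned_model_checkpoint) that can be deployed for inferencing.
    print(cp.step_number, cp.fine_tuned_model_checkpoint, cp.metrics)
```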

 

Full Validation Metrics

To better understand the performance of your model (and each checkpoint), you'll now see full validation metrics reported for each epoch in your results.csv file as well. Typically, we sample from the validation data to measure loss and accuracy, but now we will run the full validation set at each epoch to give you a better perspective on the quality of your fine tuned model, and the progress at each step.
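
As a rough sketch, you could pull those per-epoch rows out of results.csv with pandas. The column names used here (full_valid_loss, full_valid_mean_token_accuracy) are assumptions, so check the header of the file your own job produces.

```python
# Sketch: inspecting full-validation metrics from a fine-tuning job's results.csv.
# Column names such as "full_valid_loss" are assumptions - verify against your file.
import pandas as pd

results = pd.read_csv("results.csv")  # hypothetical local copy of the job's results

# Full-validation rows are typically populated only at epoch boundaries,
# so drop the steps where they are empty.
full_valid = results.dropna(subset=["full_valid_loss"])
print(full_valid[["step", "full_valid_loss", "full_valid_mean_token_accuracy"]])
```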

 

Resources:
