Following our recent update on the new features and capabilities of the Azure OpenAI (AOAI) service, this blog focuses on fine tuning with function calling. We’ll do a deep dive into how fine tuning with function calling works, when you might want to use it, and provide an in-depth case study using stock price data.
In this blog we’ll talk about…
Function calling refers to the capability to define and describe calls to external application programming interfaces (APIs). With function calling, you can instruct your language model to utilize these APIs when appropriate, based on the context provided by the prompt. This functionality expands the LLM's abilities by allowing it to interact with external services, access additional data sources, or perform specific tasks beyond its built-in capabilities.
On AOAI, the newest versions of OpenAI's gpt-35-turbo and gpt-4 now support function calling. When functions are provided in a request, the model evaluates the context to decide whether any should be used and, if so, returns a JSON object with the function name and arguments. The newest models also support parallel function calls, executing multiple calls together and reducing the number of API requests for better performance.
Typical scenarios where function calling is applied include retrieving real-time data (such as stock prices), querying additional data sources, and performing specific tasks that go beyond the model's built-in capabilities.
Please note, function calling generates the API call as required but does not execute it. Instead, your application executes the call and returns the response to the language model. This approach lets you manage external calls and keep control over your application's interactions.
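To make this concrete, here is a minimal sketch of defining a single stock-price function and letting the model decide whether to call it. It assumes the openai Python SDK (v1.x) pointed at an Azure OpenAI deployment; the function name, parameters, and deployment name are placeholders rather than the exact definitions used later in this post.

```python
import os
from openai import AzureOpenAI

# Hypothetical function definition for a stock price lookup.
functions = [
    {
        "name": "get_current_stock_price",
        "description": "Get the current stock price for a company",
        "parameters": {
            "type": "object",
            "properties": {
                "symbol": {
                    "type": "string",
                    "description": "The stock ticker symbol, e.g. MSFT",
                }
            },
            "required": ["symbol"],
        },
    }
]

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-12-01-preview",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # your deployment name
    messages=[{"role": "user", "content": "What is Microsoft trading at right now?"}],
    functions=functions,
)

message = response.choices[0].message
if message.function_call:
    # The model returns the function name and JSON-encoded arguments;
    # executing the call is left to your application.
    print(message.function_call.name, message.function_call.arguments)
```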
Fine tuning with function calling teaches your model how – and when – to call external APIs. gpt-35-turbo (0613) and newer models support function calling in both training data and inferencing, so both customized models and base models can make calls to external APIs. Fine tuning with function calling offers several important benefits, including more accurate and consistent function calls, fewer hallucinated calls, and shorter prompts (and therefore lower token costs) – all of which we explore in the case study below.
Fine tuning with function calling is currently available for the gpt-35-turbo (0613) and gpt-35-turbo-16k (1106) models. With support for function calling, you can incorporate functions into your training data, and have your fine-tuned model make function calls.
Besides the dataset, the experience of fine tuning with function calling is the same as fine tuning any other model. See the documentation for more details.
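For reference, starting a job can look roughly like the sketch below, assuming the same AzureOpenAI client as in the earlier snippet; the training file name, model name, and API version are placeholders you would adapt to your own resource.

```python
# Sketch only: upload a JSONL training file that includes function definitions,
# then create the fine-tuning job. File and model names are placeholders.
training_file = client.files.create(
    file=open("stock_functions_train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-35-turbo-0613",
)
print(job.id, job.status)
```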
To demonstrate the utility of function calling with fine-tuned models, let’s use a real problem as a case study. We want to build a chatbot that retrieves stock prices from an external API in response to user inquiries. With just the base model, we identified two challenges: (1) the model does a poor job of distinguishing real companies from fake ones, and (2) our function definitions were very long, which dramatically increased our tokens per prompt.
We’ll explore how we can use fine tuning, with function calling, to improve the model’s accuracy and performance. For each scenario, we’ll build a training dataset, compare the fine-tuned model to the base model, and measure the improvement from fine tuning.
Once we’ve created a fine-tuned model that meets our needs, we'll put it all together by developing a basic application that allows users to check stock prices for different companies. We will use the yfinance Python library to retrieve current stock prices.
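As a rough sketch of that last piece, a small helper like the one below (the function name is our own placeholder) can read the latest closing price with yfinance:

```python
import yfinance as yf

def get_current_stock_price(symbol: str) -> float:
    """Return the most recent closing price for a ticker symbol via yfinance."""
    history = yf.Ticker(symbol).history(period="1d")
    return float(history["Close"].iloc[-1])

print(get_current_stock_price("MSFT"))  # e.g. the latest Microsoft close
```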
A common problem with large language models is hallucination – providing plausible but false responses. With function calling, hallucinations can happen when the model calls a function in the wrong context or provides incorrect information for the function call.
We evaluated whether the base model was able to correctly identify fake companies and respond appropriately, instead of trying to quote a stock price. Our test dataset consists of 10 samples: 5 fake and 5 real companies. Even though we provided a clear system message not to make assumptions (asking for clarification if the exact stock ticker symbol isn’t found), the base model struggled to differentiate between fake and real companies. Please see the example below, where the base model generated a fake symbol for Titan Robotics and output a function call.
Inference with base model - gpt-35-turbo (0613)
{"role": "user", "content": "What was the closing price of Titan Robotics' stock last Friday"},
We need to teach the model when to make function calls – and when to decline. Fine tuning to the rescue!
To address hallucination and enhance accuracy, we created a training dataset with function calling capabilities. Each line of this dataset includes a set of "messages" (from the user, system, and assistant roles) paired with our stock functions. We included fake company examples, with appropriate responses, to teach our model how to identify and respond to those requests. Our dataset consists of 96 samples.
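Concretely, each JSONL line pairs the conversation with the available functions. As a rough illustration (the system message, function definition, and wording below are placeholders rather than our exact training data), a real-company example and a fake-company example might look like:

{"messages": [{"role": "system", "content": "Don't make assumptions about stock ticker symbols. Ask for clarification if needed."}, {"role": "user", "content": "What is the current stock price of Microsoft?"}, {"role": "assistant", "function_call": {"name": "get_current_stock_price", "arguments": "{\"symbol\": \"MSFT\"}"}}], "functions": [{"name": "get_current_stock_price", "description": "Get the current stock price for a company", "parameters": {"type": "object", "properties": {"symbol": {"type": "string", "description": "The stock ticker symbol"}}, "required": ["symbol"]}}]}

{"messages": [{"role": "system", "content": "Don't make assumptions about stock ticker symbols. Ask for clarification if needed."}, {"role": "user", "content": "What is the current stock price of Titan Robotics?"}, {"role": "assistant", "content": "I couldn't find a ticker symbol for Titan Robotics. Could you confirm the company name or provide its ticker?"}], "functions": [{"name": "get_current_stock_price", "description": "Get the current stock price for a company", "parameters": {"type": "object", "properties": {"symbol": {"type": "string", "description": "The stock ticker symbol"}}, "required": ["symbol"]}}]}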
We trained a gpt-35-turbo (0613) model using a combination of different hyperparameters and evaluated it with the same test dataset. While our base model did a poor job of distinguishing between real and fake companies, our fine-tuned model intelligently identified invalid company entries. Please see the Titan Robotics example for reference.
The table below illustrates the outcomes of the test dataset evaluation. It clearly demonstrates how fine tuning can reduce hallucinations and deliver more accurate and reliable results.
| Test Dataset | Base Model gpt-35-turbo (0613) | Fine-Tuned Model gpt-35-turbo (0613) finetuned |
| --- | --- | --- |
| Real Companies | 5/5 examples detected correctly | 5/5 examples detected correctly |
| Fake Companies | 0/5 examples detected correctly | 4/5 examples detected correctly |
| Overall Accuracy | 50% | 90% |
| Hallucination Accuracy | 0% | 80% |
While the fine-tuned model is not perfect, it is significantly better than the base model. Depending on your use case, and your need for accuracy, you may choose to fine tune with more data to get even better performance.
The inclusion of functions in the system message directly impacts token usage. As the number of functions grows, so does the number of tokens within the system message, resulting in verbose prompts and increased costs. Fine tuning lets you shorten your function definitions – for example, by trimming the descriptions and parameter details that the base model would otherwise need in every prompt. Without fine tuning, the model may struggle to use a function correctly once that additional information is removed; with fine tuning, you can show the model when and how to call the function without explaining as much in the prompt.
For our two stock functions, we achieved a noteworthy 55% reduction in tokens by eliminating the description fields from both the functions and their parameters, and by emptying the properties field (keeping the field, but as an empty dictionary) in each function's parameters object. Below is an illustration of the verbose function and its shortened counterpart.
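This sketch uses the same hypothetical get_current_stock_price function as the earlier snippets; the exact definitions in the case study may differ, but the transformation is the same: drop the description fields and empty the properties dictionary.

```python
# Hypothetical illustration of the shortening described above.
verbose_function = {
    "name": "get_current_stock_price",
    "description": "Get the current stock price for a company",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {
                "type": "string",
                "description": "The stock ticker symbol, e.g. MSFT for Microsoft",
            }
        },
        "required": ["symbol"],
    },
}

shortened_function = {
    "name": "get_current_stock_price",
    "parameters": {
        "type": "object",
        "properties": {},  # kept, but emptied to save tokens
    },
}
```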
To kickstart our testing process, we first establish a baseline. We'll proceed in three phases: the base model with the full verbose functions, the base model with the shortened functions, and a fine-tuned model with the shortened functions.
Let’s begin by establishing the base model with the full verbose functions as our baseline. The base model, gpt-35-turbo (0613), exhibited 100% accuracy on our test dataset, indicating its ability to generate the correct function call when provided with complete prompts. However, when we transitioned to the shortened functions while keeping the base model unchanged, accuracy dropped to 0%: the model failed to detect any samples correctly and provided empty arguments in all 10 samples.
{"role": "user", "content": "what is the current price of Uber?"}
{"role": "user", "content": "What was the highest price that Walmart's stock reached last quarter?"}
To investigate whether fine tuning could address this issue, we constructed a dataset comprising 100 samples containing both shortened stock functions. We experimented with various combinations of system messages and hyperparameters to improve the accuracy of the fine-tuned model. Ultimately, we fine-tuned a model that achieved 100% accuracy on our test dataset when using the shortened functions. Please refer to the table summary below and the output of the fine-tuned model for further details.
{"role": "user", "content": "what is the current price of Uber?"}
{"role": "user", "content": "What was the highest price that Walmart's stock reached last quarter?"}
| | Base Model + Verbose | Base Model + Short | FT Model |
| --- | --- | --- | --- |
| Accuracy | 100% | 0% | 100% |
| Number of tokens | 230 | 108 | 108 |
Calculating the total cost of ownership: do shorter prompts save money?
When considering the cost trade-off between fine tuning with shortened functions and using the base model with the full verbose functions, it is essential to assess factors such as the number of requests and the associated costs. The base model has a higher per-prompt cost, due to prompt length, but with fine tuning we pay for both tokens and hosting the model. For our stock use case, the plot below compares the cost of fine tuning versus the base model: with many requests per day, fine tuning is less expensive than the base model!
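A back-of-the-envelope calculation makes the trade-off concrete. The sketch below is not an official cost calculator: the per-token price and hosting fee are placeholders to be replaced with current Azure OpenAI pricing, and only the prompt token counts come from the table above.

```python
# Placeholder prices - substitute current Azure OpenAI pricing for your region.
BASE_PROMPT_TOKENS = 230      # verbose functions in every prompt (from the table above)
FT_PROMPT_TOKENS = 108        # shortened functions with the fine-tuned model
COMPLETION_TOKENS = 30        # assumed average completion length
PRICE_PER_1K_TOKENS = 0.0015  # placeholder $/1K tokens (assumes the same rate for both models)
FT_HOSTING_PER_DAY = 3.00     # placeholder $/day hosting fee for the fine-tuned deployment

def daily_costs(requests_per_day: int) -> tuple[float, float]:
    """Return (base model cost, fine-tuned model cost) in dollars per day."""
    base = requests_per_day * (BASE_PROMPT_TOKENS + COMPLETION_TOKENS) / 1000 * PRICE_PER_1K_TOKENS
    ft = requests_per_day * (FT_PROMPT_TOKENS + COMPLETION_TOKENS) / 1000 * PRICE_PER_1K_TOKENS
    return base, ft + FT_HOSTING_PER_DAY

for n in (1_000, 10_000, 100_000):
    base, ft = daily_costs(n)
    print(f"{n:>7} requests/day: base=${base:.2f}, fine-tuned=${ft:.2f}")
```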
Function calling only creates the call to an external API – it doesn’t execute it. To actually execute the request, you’ll need to extract the function name and arguments from the LLM response and proceed to call the function with those arguments. The function's output is in JSON format, which is then passed back to gpt-35-turbo to generate an appropriate result message for the user.
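Here is a minimal sketch of that round trip, reusing the hypothetical client, functions list, and get_current_stock_price helper from the earlier snippets (the deployment name is a placeholder):

```python
import json

messages = [{"role": "user", "content": "What is the current price of Microsoft?"}]

# Step 1: the model proposes a function call.
response = client.chat.completions.create(
    model="gpt-35-turbo",  # or your fine-tuned deployment name
    messages=messages,
    functions=functions,
)
message = response.choices[0].message

if message.function_call:
    # Step 2: our application executes the call with the model's arguments.
    args = json.loads(message.function_call.arguments)
    price = get_current_stock_price(**args)

    # Step 3: send the function result back so the model can phrase the answer.
    messages.append({
        "role": "assistant",
        "content": None,
        "function_call": {
            "name": message.function_call.name,
            "arguments": message.function_call.arguments,
        },
    })
    messages.append({
        "role": "function",
        "name": message.function_call.name,
        "content": json.dumps({"price": price}),
    })

    final = client.chat.completions.create(model="gpt-35-turbo", messages=messages)
    print(final.choices[0].message.content)
```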
Although we may have made it look easy, getting quality examples that worked and were better than the base models required a lot of iteration and experimentation. We ran many trial models to identify the best performing one for each use case. A few recommendations, based on our experience:
When deploying your applications, consider
Want to learn more?
Customize a model with fine tuning
Fine tuning and function calling
Azure AI Samples - Finetuning with Function Calling