Introduction
As organizations strive to leverage the power of LLMs on their own data, three prominent strategies have emerged: Retrieval-Augmented Generation (RAG), fine-tuning, and a combination of the two (hybrid). Although all of these approaches hold the potential to tailor AI responses to an organization's own data, each presents distinct advantages and challenges.
This blog outlines the advantages and disadvantages of the RAG and fine-tuning methodologies for solving business use cases, and focuses on implementing fine-tuned models in a cost-optimized way across a few use case scenarios.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a methodology that combines the power of retrieval-based and generative AI systems to enhance the performance of Large Language Models (LLMs). This approach retrieves information from a large corpus of data that can be used to augment the knowledge and responses of an LLM.
Pros:
Cons:
Fine-tuning
Most LLMs, such as the OpenAI GPT models, Llama, Falcon, and Mistral, offer the capability to fine-tune them with specific datasets for tasks like text generation, classification, etc. Fine-tuning is particularly beneficial when the objective is to perform tasks that require a level of specificity and customization that general models may not readily provide, or where the guiding information (context data) is too voluminous or intricate to be encapsulated within a single prompt. Fine-tuning can be achieved in different ways, such as full fine-tuning of all model weights or parameter-efficient techniques like LoRA.
Pros:
Cons:
Most customers choose the RAG approach over fine-tuning because of the disadvantages mentioned above, even though the latter can provide more accurate, better-quality responses.
In this blog, we explore scenarios where fine-tuned models can be hosted for inferencing through on-demand deployment, which avoids continuous hosting of the deployed models and thereby reduces the hosting charges drastically. We will use Azure OpenAI base models to describe the solution approaches for fine-tuning and hosting.
We will use the term "fine-tuning" irrespective of the type of fine-tuning employed, since the focus of this blog is to optimize the hosting charges while still using the fine-tuned model.
Let us recap the steps for fine-tuning an Azure OpenAI model and deploying it. For Azure OpenAI models, the fine-tuning operations can be performed using the REST API or the SDK.
1. Prepare the dataset: The dataset must be annotated based on the specific task, and the format must be JSONL encoded in UTF-8, with a file size of less than 100 MB. Depending on the type of Azure OpenAI base model (completion or chat), the data must be prepared with "prompt" and "completion" fields in the former case, and "messages" with the corresponding "roles" in the latter. The minimum number of training samples can be as low as 10, but that might not be sufficient for good-quality output; as a best practice, provide at least 50 good-quality training examples. While increasing the number of samples, ensure that they are of the highest quality and representative of the data. Data pruning is therefore a key step before training, as low-quality samples would otherwise result in worse responses. Pruning also helps optimize the size of the training data, which directly affects the training time and hence the cost; the training cost on Azure varies with the base model and the training time.
For chat models, the training and validation data must be prepared as messages with the system, user, and assistant roles and the corresponding content. The content for the "system" role typically remains the same across samples, while the "user" and "assistant" content captures the information with which the Azure OpenAI chat model is to be fine-tuned. For completion models, the data must be prepared as "prompt" and "completion" pairs. To automate the fine-tuning process, there should be some preliminary checks on the training and validation files: the number of training samples, the format of the data, whether the total token count of each individual sample is within the corresponding model's token limit, and so on. These checks can be included as part of the CI/CD pipeline so that appropriate corrections can be made before starting the fine-tuning, as illustrated in the sketch below.
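As an illustration, a minimal pre-flight validation could look like the following sketch. The file names, the 50-sample minimum, and the 4,096-token limit are assumptions for this example; substitute the actual limits of your chosen base model.

import json
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by gpt-35-turbo

def validate_jsonl(path: str, min_samples: int = 50, max_tokens: int = 4096) -> None:
    # Load one JSON object per line, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f if line.strip()]
    assert len(samples) >= min_samples, f"only {len(samples)} samples; need {min_samples}"
    for i, sample in enumerate(samples):
        messages = sample.get("messages")
        assert messages, f"sample {i} has no 'messages' key"
        roles = [m["role"] for m in messages]
        assert roles[0] == "system" and "assistant" in roles, f"sample {i}: unexpected roles {roles}"
        # Check that the whole conversation fits within the model's token limit.
        tokens = sum(len(encoding.encode(m["content"])) for m in messages)
        assert tokens <= max_tokens, f"sample {i} has {tokens} tokens (> {max_tokens})"
    print(f"{path}: {len(samples)} samples passed the checks")

validate_jsonl("training.jsonl")    # hypothetical file names
validate_jsonl("validation.jsonl")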
2. Fine-tune the base model: The training and validation datasets can be uploaded and the fine-tuning job created using the SDK, the REST API, or Azure OpenAI Studio.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2023-12-01-preview",  # This API version or later is required to access fine-tuning for turbo/babbage-002/davinci-002
)

training_file_name = 'your training file (jsonl)'
validation_file_name = 'your validation file (jsonl)'

# Upload the training and validation dataset files to Azure OpenAI with the SDK.
training_response = client.files.create(file=open(training_file_name, "rb"), purpose="fine-tune")
training_file_id = training_response.id
validation_response = client.files.create(file=open(validation_file_name, "rb"), purpose="fine-tune")
validation_file_id = validation_response.id
print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)

# Create the fine-tuning job; pass the uploaded file IDs (the variables, not string literals).
job_response = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="base-model-name",
    hyperparameters={
        "n_epochs": 3,                    # integer; adjust for your dataset
        "batch_size": 1,                  # integer; adjust for your dataset
        "learning_rate_multiplier": 0.1,  # recommended range is 0.02 to 0.2
    },
)
job_id = job_response.id

# Check the status of the fine-tuning job.
response = client.fine_tuning.jobs.retrieve(job_id)
print("Job ID:", response.id)
print("Status:", response.status)
print(response.model_dump_json(indent=2))
3. Test performance of the model:
Each fine-tuning job generates a result file called results.csv that contains various metrics and statistics about your customized model's performance. You can find the file ID for the result file in the list of your customized models, and use the Python SDK to retrieve the file ID and download the result file for further analysis. Details such as step, training loss, training token accuracy, validation loss, and validation accuracy are provided so you can perform a sanity check that training proceeded smoothly: the loss should decrease and the accuracy should increase. The result file can be downloaded with the SDK, as sketched below.
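A minimal sketch of downloading results.csv, assuming the client and job_id from Step 2:

# Retrieve the completed job and download its results.csv for analysis.
response = client.fine_tuning.jobs.retrieve(job_id)
result_file_id = response.result_files[0]
content = client.files.content(result_file_id)
with open("results.csv", "wb") as f:
    f.write(content.read())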
4. Deploy the model to an endpoint:
The model can be deployed using the SDK, the REST API, or Azure OpenAI Studio. Deploying a model requires authorization, an API version, and a request URL. The code snippet below shows token-based authorization, where an authorization token must be generated first. The status of a deployment can be obtained from the response to the deployment request.
import json
import os
import requests

# An Azure access token is required for this management-plane call
# (for example, generated with `az account get-access-token`).
token = os.getenv("TEMP_AUTH_TOKEN")
subscription = "<YOUR_SUBSCRIPTION_ID>"
resource_group = "<YOUR_RESOURCE_GROUP_NAME>"
resource_name = "<YOUR_AZURE_OPENAI_RESOURCE_NAME>"
model_deployment_name = "custom-deployment-name"  # Name you will use to reference the model when making inference calls.

deploy_params = {'api-version': "2023-05-01"}
deploy_headers = {'Authorization': 'Bearer {}'.format(token), 'Content-Type': 'application/json'}
deploy_data = {
    "sku": {"name": "standard", "capacity": 1},
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": "<fine_tuned_model>",  # Retrieve this value from the fine_tuned_model field of the completed fine-tuning job.
            "version": "1"
        }
    }
}
deploy_data = json.dumps(deploy_data)

request_url = f'https://management.azure.com/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{model_deployment_name}'

print('Creating a new deployment...')
r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)
print(r)
print(r.reason)
print(r.json())
Once the deployment is completed, the model is ready for inferencing. Whether the requests are ad hoc, continuous, or batch, the model remains hosted 24x7, which incurs hosting charges on Azure. The hosting charge varies between models and is billed for the time the model stays hosted.
Scenarios and proposed solutions:
Scenario 1:
Batch processing scenarios:
Business use cases like summarizing reviews, post-call analytics, extracting relevant data or aspects from reviews, or analyzing data to gain insights (such as identifying trends, patterns, or areas for improvement) are usually executed through batch processing. Executing these use cases with out-of-the-box models might require complex prompt engineering; however, they can be handled efficiently by fine-tuned models tailored to the specific requirements, resulting in improved accuracy.
In batch processing scenarios, the processing of information happens at a specific, pre-determined time. Hence, once we create a fine-tuned model following Steps 1-3 above, an external time-based trigger can be set to start the deployment of the fine-tuned model.
The deployment of the model can be done using the SDK (Step 4) or the REST API. The external time-based trigger for deployment can be implemented in different ways, for example with a timer-triggered Azure Function or a scheduled Logic App.
The deployment status can be obtained from the provisioning state returned in the response to the REST API call. When "provisioningState" == "Succeeded", the service is ready for inferencing; this can be checked with a simple polling loop like the sketch below.
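A minimal polling sketch, reusing the request_url, deploy_params, and deploy_headers variables from the deployment snippet in Step 4:

import time

# Poll the deployment until it is ready for inferencing.
while True:
    state = requests.get(request_url, params=deploy_params, headers=deploy_headers) \
                    .json()["properties"]["provisioningState"]
    print("provisioningState:", state)
    if state == "Succeeded":
        break  # the endpoint can now serve inference requests
    if state in ("Failed", "Canceled"):
        raise RuntimeError(f"deployment ended in state {state}")
    time.sleep(30)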
In the majority of batch processing scenarios, the data to be processed is stored in a data store (e.g., blob storage or a database); the batch run is complete once all the data has been inferenced with the fine-tuned model and the extracted data/analysis has been persisted back to a data store. Subsequently, the model deployment can be deleted using the REST API or the Azure CLI, for example with the snippet below.
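A minimal deletion sketch (same variables as the deployment snippet above); deleting the deployment stops the hosting charges while the fine-tuned model itself remains available for future redeployment:

# Delete the on-demand deployment once the batch run is complete.
r = requests.delete(request_url, params=deploy_params, headers=deploy_headers)
print(r.status_code)  # a 2xx status indicates the deletion request was accepted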
Scenario 2:
Near-real-time scenarios: In certain business use cases, extraction/processing/generation of data must happen in real time during the operating hours of the business units. Examples include a human-agent-led customer support call where uploaded data must be processed to extract relevant information or retrieve information to address user queries, or analyzing customer inquiries and routing them to the appropriate support agents or departments during business hours.
In these scenarios, the fine-tuned model deployment can be started by an external trigger a few minutes before the start of business hours. The fine-tuned model is then ready to perform the specific tasks during business hours, persisting the extracted/processed/generated data to a data store depending on the task. After business hours, another external time-based trigger can delete the fine-tuned model deployment, as sketched below.
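A minimal sketch of such time-based triggers, assuming an Azure Functions app (Python v2 programming model). The schedules and the create_deployment()/delete_deployment() helpers are hypothetical; the helpers would wrap the management-plane PUT and DELETE calls shown earlier.

# function_app.py -- a sketch; deployment_helpers is a hypothetical module
# wrapping the PUT/DELETE calls from Step 4 and the deletion snippet above.
import azure.functions as func
from deployment_helpers import create_deployment, delete_deployment

app = func.FunctionApp()

# Deploy the fine-tuned model shortly before business hours (08:45, Mon-Fri).
@app.timer_trigger(schedule="0 45 8 * * 1-5", arg_name="timer")
def deploy_before_business_hours(timer: func.TimerRequest) -> None:
    create_deployment()

# Delete the deployment after business hours (18:15, Mon-Fri) to stop hosting charges.
@app.timer_trigger(schedule="0 15 18 * * 1-5", arg_name="timer")
def delete_after_business_hours(timer: func.TimerRequest) -> None:
    delete_deployment()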
Scenario 3:
There are certain use cases with ad hoc requests that are not limited to business hours, but where the response/outcome is not expected in real time. In these cases, the user does not wait for the outcome but expects it to be sent to the user or to the next workflow, or persisted in a location from which it can be downloaded later. Examples include processing large documents (using Azure OpenAI models that support a large input context, after relevant preprocessing steps, or using RAG to find the relevant paragraphs), e.g., legal documents, contracts, or RFPs, to extract relevant information or analyze them against a list of submitted queries; or automated employee performance evaluation that analyzes quantitative and qualitative aspects before posting the result to the manager for final review. Since these requests are ad hoc and their timing cannot be anticipated, time-based triggers are not useful; instead, we should look at event-based triggering of the fine-tuned model deployment. As soon as the user clicks a submit button or uploads a document, the deployment of the corresponding fine-tuned model must be triggered, as in the sketch below.
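For event-based triggering, a blob-triggered Azure Function is one option. The sketch below (same assumptions and hypothetical helper as the previous snippet, plus an assumed "uploads" container) starts the deployment as soon as a document lands in storage:

# A sketch of an event-based trigger: deploy the fine-tuned model when a
# document is uploaded; processing proceeds once provisioningState is "Succeeded".
import azure.functions as func
from deployment_helpers import create_deployment  # hypothetical module

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="uploads/{name}", connection="AzureWebJobsStorage")
def on_document_upload(blob: func.InputStream) -> None:
    print(f"Received {blob.name}; triggering on-demand deployment")
    create_deployment()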
The blocks are briefly explained below:
Step 1:
Step 2:
Case 1: Status == Succeeded/Creating/Accepted
Case 2: Status != Succeeded/Creating/Accepted
The approach mentioned in this blog will help reduce the hosting charges of fine-tuned Azure OpenAI models, since they are not hosted 24x7. These deployment strategies can also be extended to other fine-tuned LLMs from the Azure Machine Learning model catalog where the use case matches one of the scenarios mentioned above, avoiding continuous hosting of the models.