Motivations for Small Language Models (SLMs)
In this example, we are going to learn how to fine-tune Phi-3-mini-4k-instruct using QLoRA (from the paper "QLoRA: Efficient Finetuning of Quantized LLMs") together with Flash Attention. QLoRA is an efficient fine-tuning technique that quantizes a pretrained language model to 4 bits, freezes it, and attaches small “Low-Rank Adapters”, which are the only weights that get fine-tuned. According to the paper, this enables fine-tuning of models with up to 65 billion parameters on a single 48 GB GPU while preserving the task performance of full 16-bit fine-tuning, and the authors report state-of-the-art results on chatbot benchmarks.
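To make the memory argument concrete, here is a back-of-the-envelope sketch. The parameter count (about 3.8 billion) and hidden size (3072) are assumptions about a Phi-3-mini-sized model, not figures given in this post, and the helper names are made up for illustration. It only counts weight storage and ignores activations, optimizer state, and quantization overhead.

```python
# Rough numbers for QLoRA on a Phi-3-mini-sized model.
# ASSUMPTIONS (not from this post): ~3.8e9 parameters, hidden size 3072.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to store the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer:
    an (r x d_in) matrix plus a (d_out x r) matrix."""
    return r * (d_in + d_out)

NUM_PARAMS = 3.8e9
HIDDEN = 3072
RANK = 16  # the LoRA rank used in the training script of this post

print(f"16-bit weights: {weight_memory_gb(NUM_PARAMS, 16):.1f} GB")
print(f" 4-bit weights: {weight_memory_gb(NUM_PARAMS, 4):.1f} GB")

frozen = HIDDEN * HIDDEN
added = lora_params(HIDDEN, HIDDEN, RANK)
print(f"LoRA adds {added:,} trainable parameters next to {frozen:,} frozen ones ({added / frozen:.2%})")
```

The point of the sketch is the ratio: the adapters are a small fraction of each layer, so almost all of the memory goes to the frozen base weights, which is exactly the part that 4-bit quantization shrinks.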
[Step 1: Preparation]
Let's prepare the dataset. In this case we are going to download the ultrachat dataset.
from datasets import load_dataset
from random import randrange
# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
Let's take a smaller slice of the dataset and split it into training and test examples. To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function that takes a sample and returns a string in our instruction format (in the training script further down, apply_chat_template plays this role).
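import os
# Hold out 20% of the samples for evaluation
dataset = dataset.train_test_split(test_size=0.2)
# Make sure the output folder exists before writing the JSON Lines files
os.makedirs("data", exist_ok=True)
train_dataset = dataset['train']
train_dataset.to_json("data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json("data/eval.jsonl")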
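The formatting function mentioned above is not shown in this post. As a rough illustration only, a hand-written version might look like the sketch below. The function name `format_sample`, the sample conversation, and the `<|user|>` / `<|assistant|>` / `<|end|>` tags are assumptions made for this example; the tokenizer's own chat template remains the authoritative format.

```python
# Minimal sketch of a formatting function (hypothetical; not part of the original post).
# It turns one chat sample with a "messages" list into a single training string.

def format_sample(sample: dict) -> str:
    """Render every turn as a role tag, the content, and an end tag, one turn per block."""
    parts = []
    for message in sample["messages"]:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>")
    return "\n".join(parts)

# Made-up sample that mimics the list-of-messages layout of a chat dataset
example = {
    "messages": [
        {"role": "user", "content": "What is QLoRA?"},
        {"role": "assistant", "content": "A memory-efficient fine-tuning method."},
    ]
}

print(format_sample(example))
```

Relying on the tokenizer's template instead of a hand-written function avoids drift between the training format and the format the model expects at inference time.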
The training and test splits are now saved in JSON Lines format. Next, let’s load the Azure ML SDK, which will help us create the necessary components.
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
Now let's create the workspace client.
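# Workspace details (also reused later when configuring the deployment)
subscription_id = "Enter your subscription_id"
resource_group = "Enter your resource_group"
workspace = "Enter your workspace name"
credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    # Use the workspace described by a local config.json, if one is available
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Otherwise fall back to the explicit workspace details above
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)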
Here let's create a custom training environment.
from azure.ai.ml.entities import Environment, BuildContext
env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)
Let’s look at the conda.yml file referenced above.
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=24.0
  - pip:
    - bitsandbytes==0.43.1
    - transformers~=4.41
    - peft~=0.11
    - accelerate~=0.30
    - trl==0.8.6
    - einops==0.8.0
    - datasets==2.19.1
    - wandb==0.17.0
    - mlflow==2.13.0
    - azureml-mlflow==1.56.0
    - torchvision==0.18.0
[Step 2: Training]
Let's look at the training script. We are going to use the method introduced in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is: quantize the pretrained model to 4 bits and freeze it, attach small trainable low-rank adapter (LoRA) layers, and fine-tune only those adapters.
%%writefile src/train.py
import os
#import mlflow
import argparse
import sys
import logging
import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
logger = logging.getLogger(__name__)
###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
}
peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}
train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)
###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Log on each process a small summary
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")
################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'
##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example
def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)
    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )
    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )
    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)
    #############
    # Save model
    #############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))
def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()
    # add arguments
    # NOTE: main() only reads --train-file, --eval-file and --model-dir.
    # The remaining flags are parsed but the training hyper-parameters
    # actually come from training_config above.
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )
    # parse args
    args = parser.parse_args()
    # return args
    return args
# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
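Note that the script above imports BitsAndBytesConfig but never passes a quantization config when loading the model, so as written it trains LoRA adapters on top of a bf16 model. To get the 4-bit base model that QLoRA describes, a quantization config is typically added to the model keyword arguments. The sketch below only assembles the settings as a plain dictionary so that it runs without any GPU libraries; the name `quant_settings` and the specific values (NF4, double quantization, bf16 compute) are assumptions based on common QLoRA setups, not something this post specifies.

```python
# Sketch: 4-bit settings commonly used for QLoRA (assumed values, not from this post).
quant_settings = {
    "load_in_4bit": True,                  # store the frozen base weights in 4 bits
    "bnb_4bit_quant_type": "nf4",          # NormalFloat4 data type described in the QLoRA paper
    "bnb_4bit_use_double_quant": True,     # also quantize the quantization constants
    "bnb_4bit_compute_dtype": "bfloat16",  # dtype for the matmuls; use torch.bfloat16 in train.py
}

# In train.py this could be wired in roughly as follows (kept as comments so that
# this snippet runs without torch/transformers installed):
#   bnb_config = BitsAndBytesConfig(
#       **{**quant_settings, "bnb_4bit_compute_dtype": torch.bfloat16}
#   )
#   model_kwargs["quantization_config"] = bnb_config

for key, value in quant_settings.items():
    print(f"{key} = {value}")
```

Whether 4-bit loading is worth it for a model of this size depends on the GPU; on an 80 GB A100 the bf16 model already fits, so the main benefit would be headroom for longer sequences or larger batches.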
Let’s create a training compute.
from azure.ai.ml.entities import AmlCompute
# If you have a specific compute size to work with, change it here. By default we use a 1 x A100 compute
compute_cluster_size = "Standard_NC24ads_A100_v4"  # 1 x A100 (80GB)
# If you already have a GPU cluster, put its name here. Otherwise a new one is created with this name
compute_cluster = "gpu-a100"
try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # For multi node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print(f"Error while creating the compute cluster: {e}")
Some helpful tips:
Now let's submit a command job that runs the above training script on the AML compute we just created.
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration
job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),
        epoch=1,
        batchsize=64,
        lr=0.01,
        momentum=0.9,
        prtfreq=200,
        output="./outputs",
    ),
    code="./src",  # local path where the code is stored
    compute="gpu-a100",
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/52",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)
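The command string above has to line up with the flags that train.py defines, and a long hand-written string is easy to get wrong. One way to keep the two in sync is to generate the string from a single mapping. The helper below is a hypothetical sketch (the names `build_command` and `FLAG_TO_INPUT` are assumptions for illustration); it only builds the string with the `${{inputs.<name>}}` placeholders and does not call Azure ML.

```python
# Hypothetical helper: build the job command from one flag -> input-name mapping.
FLAG_TO_INPUT = {
    "--train-file": "train_file",
    "--eval-file": "eval_file",
    "--epochs": "epoch",
    "--batch-size": "batchsize",
    "--learning-rate": "lr",
    "--momentum": "momentum",
    "--print-freq": "prtfreq",
    "--model-dir": "output",
}

def build_command(script: str, flag_to_input: dict) -> str:
    """Return 'accelerate launch <script>' followed by one '--flag ${{inputs.name}}' pair per entry."""
    parts = ["accelerate", "launch", script]
    for flag, input_name in flag_to_input.items():
        parts.append(flag)
        # Plain concatenation avoids having to escape the braces in an f-string
        parts.append("${{inputs." + input_name + "}}")
    return " ".join(parts)

command_str = build_command("train.py", FLAG_TO_INPUT)
print(command_str)
```

The resulting string could then be passed as the `command=` argument, with the keys of the `inputs` dictionary matching the values of the mapping.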
Let's look at the job outputs.
# check if the `trained_model` output is available
job_name = returned_job.name
print("pipeline job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)
[Step 3: Endpoint]
Once the model is fine-tuned, let's register the model produced by the job in the workspace so that we can create an endpoint for it.
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-3-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)
Let's create the endpoint.
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)
# Placeholders (not defined elsewhere in this post): set these for your workspace
endpoint_name = "Enter your endpoint name"
uai_id = ""  # resource ID of a user-assigned managed identity; leave empty to skip
# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception:
    # Create an online endpoint if it doesn't exist
    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )
    # Trigger the endpoint creation
    try:
        workspace_ml_client.begin_create_or_update(endpoint).wait()
        print("\n---Endpoint created successfully---\n")
    except Exception as err:
        raise RuntimeError(
            f"Endpoint creation failed. Detailed Response:\n{err}"
        ) from err
Once the endpoint is created we can go ahead and create the deployment.
# Initialize deployment parameters
deployment_name = "phi3-deploy"
sku_name = "Standard_NCs_v3"
REQUEST_TIMEOUT_MS = 90000
# Placeholder (not defined elsewhere in this post): client ID of the user-assigned identity, if any
uai_client_id = ""
deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}
For inferencing we will use a different base image.
from azure.ai.ml.entities import Model, Environment
env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)
Let's deploy the model.
from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)
deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    #code_configuration=code_configuration,
    environment=env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)
# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
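The probe settings above are generous because loading a multi-gigabyte model can take a while. As a rough way to reason about them, the sketch below estimates how long a container gets before a probe gives up, assuming the probe waits `initial_delay` seconds, then checks every `period` seconds, and fails after `failure_threshold` consecutive failed checks. The function name is hypothetical and the formula is an approximation that ignores the per-check timeout.

```python
def max_startup_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time before a probe gives up on a container that never responds."""
    return initial_delay + period * failure_threshold

# Values taken from the ProbeSettings used in the deployment above
budget = max_startup_seconds(initial_delay=500, period=100, failure_threshold=30)
print(f"Probe gives up after roughly {budget} s (about {budget / 60:.0f} min)")
```

If deployments time out while the model is still loading, raising `initial_delay` or `failure_threshold` widens this budget; lowering them makes a broken container fail faster.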
If you want to delete the deployment and the endpoint, use the code below.
# Delete the deployment first and wait for it to finish, then delete the endpoint
workspace_ml_client.online_deployments.begin_delete(name=deployment_name,
                                                    endpoint_name=endpoint_name).wait()
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name)
You can try this out hands-on with the code snippets above; for a more refined version of the code, please refer to the GitHub repository.
I hope this tutorial helps you fine-tune and deploy the Phi-3 model in Azure ML Studio.
Hope you like the blog. Please clap and follow me if you like to read more such blogs coming soon.