Finetune Small Language Model (SLM) Phi-3 using Azure Machine Learning

mrajguru · ‎May 05 2024

Overview

Motivations for Small Language Models (SLMs)

Efficiency: SLMs are computationally more efficient, requiring less memory and storage, and can operate faster due to fewer parameters to process.
Cost: Training and deploying SLMs is less expensive, making them accessible to a wider range of businesses and suitable for applications in edge computing.
Customizability: SLMs are more adaptable to specialized applications and can be fine-tuned for specific tasks more readily than larger models.
Under-Explored Potential: While large models have shown clear benefits, the potential of smaller models trained with larger datasets has been less explored. SLM aims to showcase that smaller models can achieve high performance when trained with enough data.
Inference Efficiency: Smaller models are often more efficient during inference, which is a critical aspect when deploying models in real-world applications with resource constraints. This efficiency includes faster response times and reduced computational and energy costs.
Accessibility for Research: By being open-source and smaller in size, SLM is more accessible to a broader range of researchers who may not have the resources to work with larger models. It provides a platform for experimentation and innovation in language model research without requiring extensive computational resources.
Advancements in Architecture and Optimization: SLM incorporates various architectural and speed optimizations to improve computational efficiency. These enhancements allow SLM to train faster and with less memory, making it feasible to train on commonly available GPUs.
Open-Source Contribution: The authors of SLM have made the model checkpoints and code publicly available, contributing to the open-source community and enabling further advancements and applications by others.
End-User Applications: With its excellent performance and compact size, SLM is suitable for end-user applications, potentially even on mobile devices, providing a lightweight platform for a wide range of applications.
Training Data and Process: The SLM training process is designed to be effective and reproducible, using a mixture of natural language data and code data, aiming to make pre-training accessible and transparent.

In this example, we are going to learn how to fine-tune phi-3-mini-4k-instruct using QLoRA: Efficient Finetuning of Quantized LLMs with Flash Attention. QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.
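To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It only counts weight storage (no activations, gradients, or optimizer state), and the 3.8B parameter count and the 3072 x 3072 layer shape are illustrative assumptions rather than values read from the model.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d_in x d_out layer."""
    # The adapter consists of two matrices: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


# Assumption: roughly 3.8 billion parameters for a Phi-3-mini sized model.
APPROX_PARAMS = 3_800_000_000

bf16_gib = weight_memory_gib(APPROX_PARAMS, 16)
four_bit_gib = weight_memory_gib(APPROX_PARAMS, 4)
print(f"16-bit weights: {bf16_gib:.1f} GiB, 4-bit weights: {four_bit_gib:.1f} GiB")

# Assumption: a single 3072 x 3072 linear layer adapted with rank 16.
full_layer_params = 3072 * 3072
adapter_params = lora_param_count(3072, 3072, 16)
print(f"full layer: {full_layer_params:,} params, adapter: {adapter_params:,} params")
```

The point of the sketch is the ratio: 4-bit storage needs a quarter of the 16-bit weight memory, and the adapter trains only a small fraction of the parameters of the layer it is attached to.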

Hands-on lab

[Step 1: Preparation]

Let's prepare the dataset. In this case we are going to download the ultrachat dataset.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

Let's split this shorter version of the dataset into training and test sets and save both as JSON Lines files. To instruction-tune our model, we need to convert each structured example into a single training string; the training script does this later with a formatting function that takes a sample and applies the tokenizer's chat template.

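import os

# Make sure the output folder exists, then split and save as JSON Lines
os.makedirs("data", exist_ok=True)
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json("data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json("data/eval.jsonl")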
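Before uploading the files, it can help to sanity-check them. The sketch below is a hypothetical helper (validate_chat_jsonl is not part of datasets or the Azure ML SDK) that assumes each line is a JSON object with a non-empty "messages" list whose items carry "role" and "content" keys, which is the shape the training script relies on. The demo writes a throwaway file so the block runs on its own.

```python
import json
import tempfile
from pathlib import Path


def validate_chat_jsonl(path):
    """Return the number of valid records; raise ValueError on the first bad one."""
    count = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"{path}:{line_no}: missing or empty 'messages'")
            for msg in messages:
                if not {"role", "content"} <= set(msg):
                    raise ValueError(f"{path}:{line_no}: message lacks role/content")
            count += 1
    return count


# Demo on a temporary file with one made-up conversation
sample = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    n_valid = validate_chat_jsonl(demo_path)
print(f"valid records: {n_valid}")
```

In the lab you would call validate_chat_jsonl("data/train.jsonl") and validate_chat_jsonl("data/eval.jsonl") instead of the demo file.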

With the training and test datasets saved in JSON Lines format, let's load the Azure ML SDK. This will help us create the necessary components.

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

Now let's create the workspace client.

credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    subscription_id= "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace= "Enter your workspace name"
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)

Here let's create a custom training environment.

from azure.ai.ml.entities import Environment, BuildContext
env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)

Let’s look at the conda.yml file referenced above.

name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=24.0
  - pip:
    - bitsandbytes==0.43.1
    - transformers~=4.41
    - peft~=0.11
    - accelerate~=0.30
    - trl==0.8.6
    - einops==0.8.0
    - datasets==2.19.1
    - wandb==0.17.0
    - mlflow==2.13.0
    - azureml-mlflow==1.56.0 
    - torchvision==0.18.0

[Step 2: Training]

Let's look at the training script. We are going to use the method introduced in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” by Tim Dettmers et al. QLoRA is a technique that reduces the memory footprint of large language models during finetuning without sacrificing performance. The TL;DR of how QLoRA works is:

Quantize the pretrained model to 4 bits and freeze it.
Attach small, trainable adapter layers. (LoRA)
Finetune only the adapter layers, while using the frozen quantized model for context.
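The three steps above can be sketched in plain Python on a toy weight matrix. This is a simplified illustration, not the QLoRA implementation: it uses uniform absmax rounding to integers in [-7, 7] instead of the NF4 data type from the paper, and all matrices and values are made up.

```python
def quantize_absmax_4bit(weights):
    """Map floats to integers in [-7, 7] plus one scale (simplified, not NF4)."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate float weights from the quantized integers."""
    return [q * scale for q in q_values]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w_q, scales, lora_a, lora_b, x, alpha, rank):
    """y = dequant(W) x + (alpha / rank) * B (A x); only A and B would be trained."""
    w_deq = [dequantize(row, s) for row, s in zip(w_q, scales)]
    base = matvec(w_deq, x)                      # frozen, quantized path
    update = matvec(lora_b, matvec(lora_a, x))   # small trainable path
    scaling = alpha / rank
    return [b + scaling * u for b, u in zip(base, update)]


# Step 1: quantize and freeze a toy 2 x 3 weight matrix, one scale per row
W = [[0.7, -0.3, 0.0], [0.14, 0.07, -0.21]]
quantized = [quantize_absmax_4bit(row) for row in W]
w_q = [q for q, _ in quantized]
scales = [s for _, s in quantized]

# Step 2: attach a rank-1 adapter; B starts at zero so the output is unchanged
rank, alpha = 1, 2
lora_a = [[0.1, 0.2, 0.3]]  # rank x d_in
lora_b = [[0.0], [0.0]]     # d_out x rank
x = [1.0, 2.0, 3.0]
base_only = matvec([dequantize(q, s) for q, s in zip(w_q, scales)], x)
y_start = lora_forward(w_q, scales, lora_a, lora_b, x, alpha, rank)

# Step 3: pretend training changed B; the frozen weights stay untouched
lora_b = [[0.5], [-0.5]]
y_tuned = lora_forward(w_q, scales, lora_a, lora_b, x, alpha, rank)
print(y_start, y_tuned)
```

Because B is initialised to zero, the adapted model starts out identical to the frozen quantized model, and fine-tuning only moves the small A and B matrices.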

%%writefile src/train.py

import os
#import mlflow
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

logger = logging.getLogger(__name__)


###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs":{"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
    }

peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}
train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log on each process a small summary
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example



def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()


    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)


    # ############
    # # Save model
    # ############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))

def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)
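The script above sets learning_rate to 5.0e-06, lr_scheduler_type to cosine, and warmup_ratio to 0.2. As a rough sketch of what that combination means, the function below reproduces a linear warmup followed by cosine decay. It is written from our understanding of that schedule, not copied from the library, and total_steps=100 is an assumed value chosen only for the demo.

```python
import math


def lr_at_step(step, total_steps, base_lr=5.0e-06, warmup_ratio=0.2):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr to 0 along half a cosine wave
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # assumption for illustration
for step in (0, 10, 20, 60, 100):
    print(f"step {step:3d}: lr = {lr_at_step(step, total_steps):.2e}")
```

With these settings the first 20% of steps are spent warming up, so very short runs spend a noticeable share of training below the configured learning rate.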

Let’s create a training compute.

from azure.ai.ml.entities import AmlCompute

# If you have a specific compute size to work with, change it here. By default we use a 1 x A100 SKU.
compute_cluster_size = "Standard_NC24ads_A100_v4"  # 1 x A100 (80GB)
# If you already have a GPU cluster, put its name here. Otherwise a new one is created with this name.
compute_cluster = "gpu-a100"
try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # For multi-node training set this to an integer value greater than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        # Surface the failure reason instead of hiding it
        print(f"Error creating the compute cluster: {e}")

Some helpful tips:

LoRA rank does not need to be high (e.g., r=256). In our experience, 8 or 16 is enough as a baseline.
If the training dataset is small, it is better to set alpha equal to rank. Setting alpha to 2*rank or 4*rank often makes training unstable on small datasets.
Keep the learning rate small when using LoRA. Learning rates such as 1e-3 or 2e-4 are not recommended. We start with 8e-4 or 5e-5.
Before increasing the batch size, check that you have enough GPU memory, because OOM (out of memory) errors may occur when the context length is long, such as 8K. Gradient checkpointing reduces memory use, and gradient accumulation has the effect of increasing the batch size.
If you are constrained by batch size and memory, don't stick with Adam, including low-bit Adam. Adam requires additional GPU memory to store the first and second moment estimates. SGD (stochastic gradient descent) converges more slowly but does not occupy extra GPU memory.
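Two of the tips above come down to simple arithmetic, sketched below. The helper names are hypothetical, the 25 million trainable parameters is an assumed figure, and the estimate counts only optimizer state stored in 32-bit floats (two values per parameter for Adam, none for plain SGD without momentum).

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_state_bytes(trainable_params, states_per_param, bytes_per_state=4):
    """Extra memory the optimizer keeps alongside the trainable weights."""
    return trainable_params * states_per_param * bytes_per_state


# Values from the training script: batch size 4 per device, no accumulation
print("as configured:", effective_batch_size(4, 1))
# Same memory per step, but 8 accumulation steps behave like a batch of 32
print("with accumulation:", effective_batch_size(4, 8))

trainable = 25_000_000  # assumption: trainable adapter parameters
adam_bytes = optimizer_state_bytes(trainable, states_per_param=2)
sgd_bytes = optimizer_state_bytes(trainable, states_per_param=0)
print(f"Adam state: {adam_bytes / 1024**2:.0f} MiB, plain SGD state: {sgd_bytes} bytes")
```

With LoRA the trainable parameter count is small, so the optimizer-state saving matters most when you train many more parameters than the adapters alone.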

Now let's submit a command job that runs the above training script on the AML compute we just created.

from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),        
        epoch=1,
        batchsize=64,
        lr = 0.01,
        momentum = 0.9,
        prtfreq = 200,
        output = "./outputs"
    ),
    code="./src",  # local path where the code is stored
    compute = 'gpu-a100',
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/52",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job  = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)
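Since the command string and the inputs dictionary are maintained by hand, a typo in one of the ${{inputs.name}} placeholders is easy to miss until the job fails. The sketch below is a hypothetical local check (check_command_inputs is not an Azure ML SDK function) that compares the placeholders in a command string with the declared input names. The command shown is a shortened version of the one above.

```python
import re


def check_command_inputs(command_text, inputs):
    """Return (missing, unused) input names for a command with ${{inputs.x}} placeholders."""
    referenced = set(re.findall(r"\$\{\{\s*inputs\.(\w+)\s*\}\}", command_text))
    declared = set(inputs)
    return sorted(referenced - declared), sorted(declared - referenced)


command_text = (
    "accelerate launch train.py --train-file ${{inputs.train_file}} "
    "--eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}}"
)
declared_inputs = {"train_file": "data/train.jsonl", "eval_file": "data/eval.jsonl", "epoch": 1}
missing, unused = check_command_inputs(command_text, declared_inputs)
print("missing:", missing, "unused:", unused)

# A deliberately misspelled placeholder is reported on both sides
bad_missing, bad_unused = check_command_inputs(
    "train.py --batch-size ${{inputs.batch_size}}", {"batchsize": 64}
)
print("missing:", bad_missing, "unused:", bad_unused)
```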

Let's look at the job output.

# inspect the outputs recorded for the completed job
job_name = returned_job.name
print("job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)

[Step 3: Endpoint]

Once the model is finetuned, let's register the model produced by the job in the workspace so that we can create an endpoint.

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-3-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)

Let's create the endpoint.

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

# Placeholder values: replace with your own endpoint name and, optionally,
# the resource ID of a user-assigned managed identity (leave empty to skip).
endpoint_name = "Enter your endpoint name"
uai_id = ""

# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception:
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

# Trigger the endpoint creation
try:
    workspace_ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err

Once the endpoint is created we can go ahead and create the deployment.

# Initialize deployment parameters

deployment_name = "phi3-deploy"
sku_name = "Standard_NCs_v3"

REQUEST_TIMEOUT_MS = 90000

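# subscription_id and resource_group are the values used for the workspace client above;
# uai_client_id is the client ID of your user-assigned managed identity, if you use one.
deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}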

For inferencing we will use a different base image.

from azure.ai.ml.entities import Model, Environment
env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)

Let's deploy the model.

from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    #code_configuration=code_configuration,
    environment = env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
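The probe settings above are generous because loading a language model can take a while. As a rough sketch, assuming the probe values are in seconds and ignoring the per-probe timeout, the time a container gets before it is considered failed can be estimated as the initial delay plus one period per allowed failure. The helper name is hypothetical.

```python
def probe_grace_seconds(initial_delay, period, failure_threshold):
    """Approximate time before a never-healthy container is marked as failed."""
    return initial_delay + period * failure_threshold


# Values taken from the ProbeSettings in the deployment above
grace = probe_grace_seconds(initial_delay=500, period=100, failure_threshold=30)
print(f"startup grace period: about {grace} s ({grace / 60:.0f} min)")

# The request timeout is configured separately, in milliseconds
request_timeout_s = 90000 / 1000
print(f"per-request timeout: {request_timeout_s:.0f} s")
```

If the deployment takes much less time to become healthy in your tests, you can lower these values to detect failures sooner.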

If you want to delete the deployment and the endpoint, please see the code below.

# Delete the deployment first and wait for it to finish, then delete the endpoint
workspace_ml_client.online_deployments.begin_delete(name = deployment_name, 
                                                    endpoint_name = endpoint_name).wait()
workspace_ml_client.online_endpoints.begin_delete(name = endpoint_name).wait()

You can follow along with the code snippets above; if you want to work with more refined code, please refer to the GitHub repository.

Hope this tutorial helps you fine-tune and deploy the Phi-3 model in Azure ML Studio.

Hope you like the blog. Please clap and follow me if you would like to read more blogs like this.
