In the rapidly evolving landscape of artificial intelligence, fine-tuning small language models (SLMs) for use-case-specific workloads has become increasingly essential. The motivation lies in the need for lower latency, reduced memory footprint, and improved accuracy, all while maintaining cost-effectiveness. This blog delves into the reasons for fine-tuning SLMs for function calling, key considerations, and a practical guide to implementing fine-tuning on Azure Machine Learning.
Why Fine-Tune Small Language Models?
1. Lower Latency and Reduced Memory Footprint: Smaller models with fewer weights inherently offer faster processing times due to reduced matrix multiplication operations. This lower latency is crucial for real-time applications where speed is paramount. Additionally, these models reduce the memory footprint, making them ideal for deployment in resource-constrained environments.
2. Cost Efficiency: Fine-tuning smaller models is more cost-effective than training large models from scratch. It reduces the computational resources required, thereby lowering operational costs. This makes it a viable option for startups and enterprises looking to optimize their AI expenditure.
3. Improved Accuracy: By tailoring a model to a specific function-calling use case, you can achieve higher accuracy. Fine-tuning allows the model to learn the intricacies of function-calling, thereby providing more relevant and precise outputs.
4. Smaller Token Footprint: Smaller models combined with efficient token handling reduce the number of tokens processed per request, which further improves processing speed and resource usage.
Key Considerations for Fine-Tuning
a. Selection of the Right Base Model: Choosing the appropriate base model is crucial. Evaluate industry benchmarks and leaderboards, such as the Berkeley Function-Calling Leaderboard, to guide your selection. Consider factors like model size, which affects GPU VRAM requirements, accuracy, and context length. For this blog post, we will use the Llama-3.2-3B-Instruct model as our base model for fine-tuning.
b. Dataset Preparation: Proper dataset preparation is a cornerstone for successful fine-tuning of SLMs for function-calling tasks. The dataset must be representative of real-world scenarios and cover the full spectrum of use cases you anticipate. For this blog, we will utilize the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, renowned for its comprehensive coverage of simple, multiple, and multi-turn function-calling scenarios across diverse domains.
- Key Steps in Dataset Preparation: Understanding the Complexity of the Use Case
Before diving into the technicalities of dataset preparation, it's essential to understand the complexity of the use case at hand. Is the task limited to function-calling, or does it involve a broader, more generic conversation? If the latter is true, it becomes imperative to ensure that the existing knowledge and capabilities of the language model (SLM) are preserved. The dataset should seamlessly integrate both function-call and non-function-call scenarios to provide a holistic conversational experience.
Differentiating Function-Calling Scenarios
Let's explore the different scenarios that might arise in function-calling applications:
- Single Function-Calling: This scenario involves invoking a single function based on user input. For instance, in the travel industry, a user might ask, "What are the available flights from New York to London on December 10th?" The dataset should include examples that allow the model to extract relevant information and call the flight search function accurately.
- Multiple Function-Calling: Here, the language model must choose one function from a set of possible tools. For example, if a user asks, "Can you book me a hotel or a flight to Paris?" the dataset should provide instances where the model decides between booking a hotel or a flight based on user preferences or additional input.
- Multi-Turn Conversations: This scenario requires tools to be invoked in a sequence based on the conversation's state. Consider a user planning a vacation: "I want to visit Italy. What are my options?" followed by "Book me a flight," and then "Find a hotel in Rome." The dataset should capture the flow of conversation, enabling the model to handle each request in context.
- Parallel Function-Calling: In situations where multiple tools need to be invoked simultaneously, such as booking flights and hotels at the same time, the dataset should include examples that allow the model to manage these parallel tasks effectively. For instance, "Book a flight to Tokyo and reserve a hotel in Shinjuku for the same dates."
- Handling Missing Information: A robust dataset should also include scenarios where the language model needs to ask the user for missing information. For example, if a user simply says, "Book me a flight," the model should prompt, "Could you please specify the destination and dates?" (A minimal example of such a training record is sketched after this list.)
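For illustration, here is a minimal, hand-written training record in the OpenAI-style chat format that the preparation code later in this post produces. The function name, its arguments, and the wording are invented for this example and are not taken from the dataset:

# Hand-written example of a "handling missing information" record in OpenAI-style chat format.
# The function name (search_flights) and its arguments are illustrative only.
example_record = [
    {"role": "system", "content": "You are a helpful assistant with access to the function "
                                  "search_flights(origin, destination, date)."},
    {"role": "user", "content": "Book me a flight."},
    {"role": "assistant", "content": "Could you please specify the destination and dates?"},
    {"role": "user", "content": "To London from New York on December 10th."},
    {"role": "assistant", "content": {"tool_uses": [{
        "recipient_name": "functions.search_flights",
        "parameters": {"origin": "New York", "destination": "London", "date": "December 10th"},
    }]}},
]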
c. Compute Selection
Ensure your compute setup has adequate VRAM to accommodate model weights, gradients, and activations. The compute should be tailored to your model size and batch size requirements.
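As a rough back-of-the-envelope check (assuming bf16/fp16 weights and LoRA, so only the small adapter carries gradients and optimizer state), the snippet below estimates the VRAM consumed by the frozen base weights of a roughly 3B-parameter model; activations, adapter state, and framework overhead come on top, so leave generous headroom:

# Rough VRAM estimate for the frozen base weights of a ~3B-parameter model in bf16.
# Activations, adapter gradients/optimizer state, and CUDA overhead are not included.
params = 3.2e9          # approximate parameter count (illustrative)
bytes_per_param = 2     # bf16/fp16
weights_gb = params * bytes_per_param / 1024**3
print(f"Base weights alone: ~{weights_gb:.1f} GB")  # ~6 GB; an 80 GB A100 leaves ample headroom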
d. Hyperparameter Selection: The selection of hyperparameters is a critical step that can significantly influence the performance of a model. Hyperparameters, unlike the model’s parameters, are not learned from the data but are set before the training process begins. Choosing the right hyperparameters can lead to faster convergence and higher accuracy, making this an area that demands careful attention.
Hyperparameters can be thought of as the settings or knobs that you, as the model trainer, can adjust to tailor the training process. These include the learning rate, batch size, the architecture of layers, and more. One of the leading methodologies for fine-tuning models is LoRA (Low-Rank Adaptation), which has gained popularity due to its efficiency and effectiveness.
LoRA is a technique that allows for the efficient adaptation of large language models by introducing low-rank matrices during the training process. This approach reduces the number of trainable parameters, leading to faster convergence and reduced computational costs.
When using LoRA, two primary hyperparameters to consider are:
- Rank: This represents the dimensionality of the low-rank matrices. It is a critical factor influencing the model’s capacity to learn nuanced patterns.
- Alpha: This is a scaling factor applied to the low-rank updates, typically set to be 2-4 times the rank value.
A good starting point for these parameters might be a rank of 8 and an alpha of 16, but these values should be tailored based on the model's complexity and the specific task at hand.
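Translated into code, these two knobs map directly onto a PEFT configuration. The sketch below is illustrative only (the training script later in this post passes equivalent values through Unsloth's get_peft_model); the dropout value and target modules are assumptions, not prescriptions:

# Illustrative LoRA starting point: rank 8, alpha 16.
# In standard LoRA the low-rank update is scaled by alpha / rank (here 2x).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                     # rank of the low-rank update matrices
    lora_alpha=16,           # scaling factor applied to the update
    lora_dropout=0.05,       # illustrative regularization value
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common attention projections
)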
e. Optimize Context Length: Another significant aspect of model fine-tuning, especially in function-calling scenarios, is the management of context length. In these prompts, we often provide detailed information such as function names, descriptions, and argument types, which consume a substantial number of tokens. Efficiently managing this context can lead to performance gains without sacrificing accuracy.
Iterative Experimentation with Context Details: To optimize context length, an iterative experimentation approach is recommended:
- Baseline Experiment: Start by including all possible details—function descriptions, argument types, and more. This serves as your baseline for comparison.
- Simplified Contexts: Gradually remove elements from the context:
- First Iteration: Retain only the function names and arguments, omitting descriptions.
- Second Iteration: Remove the arguments, keeping just the function names.
- Final Iteration: Test the model's performance without any function names or arguments.
By incrementally simplifying the context, you can identify the minimal amount of information the model needs to maintain accuracy while reducing token consumption.
While conducting these experiments, it is advantageous to utilize previous checkpoints. Instead of starting from the base model for each iteration, use the trained model from the previous step as a starting point. This approach can save time and computational resources, allowing for more efficient experimentation.
Fine-Tuning on Azure: Step-by-Step
Now let's run the fine-tuning job while adhering to the guidelines and instructions shared above:
1. Create an Azure Machine Learning Workspace: An Azure Machine Learning workspace is your control center for managing all the resources you need to train, deploy, automate, and manage machine learning models. It serves as a central repository for your datasets, compute resources, and models. To get started, you can create a workspace through the Azure portal by navigating to the Azure Machine Learning service and selecting "Create new workspace." Ensure you configure resource group, workspace name, region, and other necessary settings.
2. Create a Compute Instance: To run your Python notebook and execute scripts, you need a compute instance. This virtual machine in Azure Machine Learning allows you to perform data preparation, training, and experimentation. Go to the "Compute" section in your workspace, select "Create," and choose a compute instance that fits your needs, ensuring it has the necessary specifications for your workload.
3: Dataset Preparation: For this blog, we'll use the glaiveai/glaive-function-calling-v2 dataset from Hugging Face, which includes simple, multi-turn function calling and generic conversations across various domains. The dataset needs to be formatted to be compatible with the OpenAI format:
- Convert each conversation into a chat_template format.
- Assign roles as 'system', 'user', or 'assistant'.
- Remove "<|endoftext|>” string and if the response is a function-call, replace the “<functioncall>” string and add role as tool so that LLM knows when to stop responding and wait for function execution results
import re
import json

def parse_conversation(input_string):
    ROLE_MAPPING = {"USER": "user", "ASSISTANT": "assistant", "SYSTEM": "system", "FUNCTION RESPONSE": "tool"}
    # Regular expression to split the conversation on SYSTEM, USER, ASSISTANT and FUNCTION RESPONSE turns
    pattern = r"(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):"
    # Split the input string and keep the delimiters
    parts = re.split(pattern, input_string)
    # Initialize the list to store conversation entries
    conversation = []
    # Iterate over the parts, skipping the first empty string
    for i in range(1, len(parts), 2):
        role = parts[i].strip()
        content = parts[i + 1].strip()
        content = content.replace("<|endoftext|>", "").strip()
        if content.startswith('<functioncall>'):  # build structured data for the function call
            # try to turn the function call from raw text into structured data
            content = content.replace('<functioncall>', '').strip()
            # strip the stray single quotes around the braces so the payload parses as valid JSON
            clean_content = content.replace("'{", '{').replace("'}", '}')
            data_json = json.loads(clean_content)
            # Make it compatible with the OpenAI prompt format
            func_call = {'recipient_name': f"functions.{data_json['name']}", 'parameters': data_json['arguments']}
            content = {'tool_uses': [func_call]}
        # Append a dictionary with the role and content to the conversation list
        conversation.append({"role": ROLE_MAPPING[role], "content": content})
    return conversation
import os
from datasets import load_from_disk

def prepare_dataset(tokenizer, args):
    # Create the cache_dir
    cache_dir = "./outputs/dataset"
    os.makedirs(cache_dir, exist_ok=True)
    # Load the dataset from disk
    train_dataset = load_from_disk(args.train_dir)
    eval_dataset = load_from_disk(args.val_dir)
    column_names = list(train_dataset.features)

    def apply_chat_template(examples):
        conversations = []
        for system, chat in zip(examples["system"], examples["chat"]):
            try:
                system_message = parse_conversation(system)
                chat_message = parse_conversation(chat)
                message = system_message + chat_message
                conversations.append(message)
            except Exception as e:
                print(e)
        text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in conversations]
        return {"text": text}

    # Process the datasets and drop unused columns (separate cache files avoid collisions between the splits)
    processed_train_dataset = train_dataset.map(apply_chat_template, cache_file_name=f"{cache_dir}/cache_train.arrow", batched=True, remove_columns=column_names)
    processed_eval_dataset = eval_dataset.map(apply_chat_template, cache_file_name=f"{cache_dir}/cache_eval.arrow", batched=True, remove_columns=column_names)
    return processed_train_dataset, processed_eval_dataset
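The script above reads pre-split datasets from disk via args.train_dir and args.val_dir. A minimal sketch of how those splits might be produced from the Hugging Face dataset is shown below; the split ratios, seed, and DATA_DIR layout are illustrative assumptions:

# Illustrative: download the glaive dataset, carve out train/val/test splits,
# and save them in the on-disk layout prepare_dataset() expects.
from datasets import load_dataset

DATA_DIR = "./dataset"  # assumed local output directory
raw = load_dataset("glaiveai/glaive-function-calling-v2", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)
val_test = splits["test"].train_test_split(test_size=0.5, seed=42)

splits["train"].save_to_disk(f"{DATA_DIR}/train")
val_test["train"].save_to_disk(f"{DATA_DIR}/val")
val_test["test"].save_to_disk(f"{DATA_DIR}/test")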
4: Create a Data Asset: Azure Machine Learning allows you to register datasets as data assets, making them easily manageable and reusable:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError

def get_or_create_data_asset(ml_client, data_name, data_local_dir, update=False):
    try:
        latest_data_version = max([int(d.version) for d in ml_client.data.list(name=data_name)])
        if update:
            raise ResourceExistsError('Found Data asset, but will update the Data.')
        else:
            data_asset = ml_client.data.get(name=data_name, version=latest_data_version)
            logger.info(f"Found Data asset: {data_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError) as e:
        data = Data(
            path=data_local_dir,
            type=AssetTypes.URI_FOLDER,
            description=f"{data_name} for fine tuning",
            tags={"FineTuningType": "Instruction", "Language": "En"},
            name=data_name
        )
        data_asset = ml_client.data.create_or_update(data)
        logger.info(f"Created/Updated Data asset: {data_name}")
    return data_asset

train_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_train", data_local_dir=f"{DATA_DIR}/train", update=True)
val_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_val", data_local_dir=f"{DATA_DIR}/val", update=True)
test_data = get_or_create_data_asset(ml_client, f"{AZURE_DATA_NAME}_test", data_local_dir=f"{DATA_DIR}/test", update=True)
5: Create an Environment: While Azure provides built-in environments for common use cases, creating a custom environment tailored to your specific needs can be beneficial.
An environment in Azure ML is essentially a containerized setup that defines the software, libraries, and other dependencies required to run your machine learning workload.
Why Use Environments?
- Reproducibility: By defining an environment, you ensure that your training and inference processes are reproducible, with the same configuration used every time.
- Consistency: Environments help maintain consistency across different runs and teams, reducing "it works on my machine" problems.
- Portability: They encapsulate your dependencies, making it easier to move and share your ML projects across different Azure services or even with external collaborators.
%%writefile {CLOUD_DIR}/train/Dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2
USER root
# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update && apt-get -y upgrade
RUN pip install --upgrade pip
RUN apt-get install -y openssh-server openssh-client
# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir
RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation
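The Dockerfile copies a requirements.txt that is not reproduced here. A plausible minimal version, inferred from the libraries the training script imports, might look like the following; treat it as a starting point and pin the versions you have actually validated:

# Illustrative requirements.txt for the training environment (versions intentionally unpinned).
transformers
datasets
accelerate
trl
peft
unsloth
bitsandbytes
mlflow
azureml-mlflow
wandb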
from azure.ai.ml.entities import Environment, BuildContext

def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False):
    try:
        latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)])
        if update:
            raise ResourceExistsError('Found Environment asset, but will update the Environment.')
        else:
            env_asset = ml_client.environments.get(name=env_name, version=latest_env_version)
            print(f"Found Environment asset: {env_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError) as e:
        print(f"Exception: {e}")
        env_docker_image = Environment(
            build=BuildContext(path=docker_dir),
            name=env_name,
            description="Environment created from a Docker context.",
        )
        env_asset = ml_client.environments.create_or_update(env_docker_image)
        print(f"Created Environment asset: {env_name}")
    return env_asset

env = get_or_create_docker_environment_asset(ml_client, azure_env_name, docker_dir=f"{CLOUD_DIR}/train", update=False)
Reference: training.ipynb
6: Create a Training Script: Your training script handles the fine-tuning process and logs metrics using MLflow, which is tightly integrated with Azure Machine Learning. This involves loading the dataset, configuring the model and LoRA adapters, and tracking and logging metrics such as training and evaluation loss.
import os
import sys
import logging
import datasets
import torch
import transformers
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

logger = logging.getLogger(__name__)

# load_model() and prepare_dataset() are defined elsewhere in train.py

def main(args):
    ###################
    # Hyper-parameters
    ###################
    # Only overwrite environ if wandb params were passed
    if len(args.wandb_project) > 0:
        os.environ['WANDB_API_KEY'] = args.wandb_api_key
        os.environ["WANDB_PROJECT"] = args.wandb_project
    if len(args.wandb_watch) > 0:
        os.environ["WANDB_WATCH"] = args.wandb_watch
    if len(args.wandb_log_model) > 0:
        os.environ["WANDB_LOG_MODEL"] = args.wandb_log_model
    use_wandb = len(args.wandb_project) > 0 or ("WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0)

    training_config = {
        "per_device_train_batch_size": args.train_batch_size,  # Controls the batch size per device
        "per_device_eval_batch_size": args.eval_batch_size,    # Controls the batch size for evaluation
        "gradient_accumulation_steps": args.grad_accum_steps,
        "warmup_ratio": args.warmup_ratio,                     # Controls the ratio of warmup steps
        "learning_rate": args.learning_rate,
        "fp16": not torch.cuda.is_bf16_supported(),
        "bf16": torch.cuda.is_bf16_supported(),
        "optim": "adamw_8bit",
        "lr_scheduler_type": args.lr_scheduler_type,
        "output_dir": args.output_dir,
        "logging_steps": args.logging_steps,
        "logging_strategy": "epoch",
        "save_steps": args.save_steps,
        "eval_strategy": "epoch",
        "num_train_epochs": args.epochs,
        # "load_best_model_at_end": True,
        "save_only_model": False,
        "seed": 0
    }

    peft_config = {
        "r": args.lora_r,
        "lora_alpha": args.lora_alpha,
        "lora_dropout": args.lora_dropout,
        "bias": "none",
        # "target_modules": "all-linear",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        "modules_to_save": None,
        "use_gradient_checkpointing": "unsloth",
        "use_rslora": False,
        "loftq_config": None,
    }

    checkpoint_dir = os.path.join(args.output_dir, "checkpoints")
    train_conf = TrainingArguments(
        **training_config,
        report_to="wandb" if use_wandb else "azure_ml",
        run_name=args.wandb_run_name if use_wandb else None,
    )
    model, tokenizer = load_model(args)
    model = FastLanguageModel.get_peft_model(model, **peft_config)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = train_conf.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log a small summary on each process
    logger.warning(
        f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
        + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
    )
    logger.info(f"Training/evaluation parameters {train_conf}")
    logger.info(f"PEFT parameters {peft_config}")

    # Load the dataset
    train_dataset, eval_dataset = prepare_dataset(tokenizer, args)

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field="text",
        packing=False  # Packing can make training 5x faster for shorter responses
    )

    # Show current memory stats
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    logger.info(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    logger.info(f"{start_gpu_memory} GB of memory reserved.")

    # Resume from the most recent checkpoint if one exists
    last_checkpoint = None
    if os.path.isdir(checkpoint_dir):
        checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)]
        if len(checkpoints) > 0:
            checkpoints.sort(key=os.path.getmtime, reverse=True)
            last_checkpoint = checkpoints[0]
    trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint)

    #############
    # Evaluation
    #############
    tokenizer.padding_side = "left"
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(eval_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

    ############
    # Save model
    ############
    os.makedirs(args.model_dir, exist_ok=True)
    if args.save_merged_model:
        print("Save PEFT model with merged 16-bit weights")
        model.save_pretrained_merged("outputs", tokenizer, save_method="merged_16bit")
    else:
        print(f"Save PEFT model: {args.model_dir}/model")
        model.save_pretrained(f"{args.model_dir}/model")
    tokenizer.save_pretrained(args.model_dir)
Reference: train.py
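main() calls a load_model() helper that is not reproduced above. A minimal sketch of what it might look like with Unsloth is shown below; the argument names (args.model_name_or_path, args.max_seq_length) and the 4-bit loading choice are assumptions, kept consistent with how main() uses the returned objects:

# Hypothetical load_model() helper, consistent with how main() uses it.
from unsloth import FastLanguageModel

def load_model(args):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=args.model_name_or_path,  # e.g. the Llama-3.2-3B-Instruct base model
        max_seq_length=args.max_seq_length,  # long enough for tool definitions plus the conversation
        dtype=None,                          # auto-select bf16/fp16 based on the GPU
        load_in_4bit=True,                   # QLoRA-style loading to fit comfortably on a single GPU
    )
    return model, tokenizer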
7: Create the Compute Cluster: For this experiment, we are using Standard_NC24ads_A100_v4, which has a single A100 GPU with 80 GB of VRAM. Select the compute based on the model size and batch size.
from azure.ai.ml.entities import AmlCompute

### Create the compute cluster
try:
    compute = ml_client.compute.get(azure_compute_cluster_name)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {azure_compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        tier = 'LowPriority' if USE_LOWPRIORITY_VM else 'Dedicated'
        compute = AmlCompute(
            name=azure_compute_cluster_name,
            size=azure_compute_cluster_size,
            tier=tier,
            max_instances=1,  # For multi-node training set this to an integer value greater than 1
        )
        ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print(f"Error creating the compute cluster: {e}")
8: Submit the Fine-Tuning Job
With everything set up, you can now submit your fine-tuning job:
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        # train_dir=Input(type="uri_folder", path=DATA_DIR),  # Get data from a local path
        train_dir=Input(path=f"{AZURE_DATA_NAME}_train@latest"),  # Get data from the Data asset
        val_dir=Input(path=f"{AZURE_DATA_NAME}_val@latest"),
        epoch=d['train']['epoch'],
        train_batch_size=d['train']['train_batch_size'],
        eval_batch_size=d['train']['eval_batch_size'],
    ),
    code=f"{CLOUD_DIR}/train",  # local path where the code is stored
    compute=azure_compute_cluster_name,
    command="python train_v3.py --train_dir ${{inputs.train_dir}} --val_dir ${{inputs.val_dir}} --train_batch_size ${{inputs.train_batch_size}} --eval_batch_size ${{inputs.eval_batch_size}}",
    # environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/77",  # Use a built-in Environment asset
    environment=f"{azure_env_name}@latest",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,  # For multi-GPU training set this to an integer value greater than 1
    },
)
returned_job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(returned_job.name)
9: Monitor Training Metrics: After initiating the job, keep an eye on the output for key metrics like training loss and evaluation loss. Since we've logged the results to MLflow, which is seamlessly integrated with Azure Machine Learning, we can easily review the loss function by navigating to the metrics tab within the jobs section.
Key Takeaways:
- Both the training and evaluation loss decrease significantly in the initial steps, suggesting effective learning.
- The gradual reduction in loss in subsequent steps indicates that the model continues to refine its parameters, but at a slower rate.
- The consistency in the downward trend for both training and evaluation loss implies that the model is not overfitting and is generalizing well to new data. However, the slight uptick towards the end in the evaluation loss might need monitoring to ensure it doesn't indicate overfitting at later stages.
Overall, the results look promising, so let's go ahead and register the model.
10: Register the Model: After fine-tuning, register the model to make it available for deployment:
from azureml.core import Workspace, Run
import os

# Connect to your workspace
ws = Workspace.from_config()
experiment_name = 'experiment_name'
run_id = 'job_name'
run = Run(ws.experiments[experiment_name], run_id)

# Register the model
model = run.register_model(
    model_name=d["serve"]["azure_model_name"],  # the name the model will be registered under
    model_path="outputs"                        # the path to the model files in the run's outputs
)

# Create a local directory to save the outputs
local_folder = './model_v2'
os.makedirs(local_folder, exist_ok=True)

# Download the entire outputs folder
run.download_files(prefix='outputs', output_directory=local_folder)
Step 11: Deploy the Model to a Managed Online Endpoint: Managed online endpoints provide a seamless way to deploy models without managing underlying infrastructure. They offer scalability, versioning, and easy rollback compared to deploying on an Azure Kubernetes Service (AKS) cluster.
11a. Build the environment: To deploy the model to a managed online endpoint, first create the environment with the required dependencies and the web server for inference.
%%writefile {CLOUD_DIR}/serve/Dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2
# Install pip dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir
# Inference requirements
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888
# support Deepspeed launcher requirement of passwordless ssh login
RUN apt-get update
RUN apt-get install -y openssh-server openssh-client
RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation
Reference: serving.ipynb
11b. Create a serving script: Creating a serve script for inference is a crucial step in deploying your machine learning model to a production environment. This script handles incoming requests, processes input data, runs the model inference, and returns the results. In Azure Machine Learning, the serve script is part of the deployment package for your model, typically used in conjunction with a managed endpoint or a Kubernetes service.
A serve script in Azure ML typically consists of two main functions:
- init(): This function initializes the model and any other necessary resources. It is called once when the deployment is first loaded.
- run(data): This function is called every time a request is made to the deployed model. It processes the incoming data, performs inference using the model, and returns the results.
import os
import re
import json
import torch
import base64
import logging
from io import BytesIO
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def init():
    """
    This function is called when the container is initialized/started, typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching the model in memory.
    """
    global model
    global tokenizer
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_name_or_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "outputs"
    )
    model_kwargs = dict(
        trust_remote_code=True,
        device_map={"": 0},  # place the full model on GPU 0
        torch_dtype="auto"
    )
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    logging.info("Loaded model.")

def run(json_data: str):
    logging.info("Request received")
    data = json.loads(json_data)
    input_data = data["input_data"]
    params = data['params']
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    output = pipe(input_data, **params)
    result = output[0]["generated_text"]
    logging.info(f"Generated text : {result}")
    json_result = {"result": str(result)}
    return json_result
Reference: score.py
11c. Create a managed online endpoint and deploy the model to the endpoint: Creating an endpoint and deploying your model on Azure Machine Learning is the final step to make your model accessible for real-time inference. This process involves setting up a service that can handle incoming requests, execute the model, and return the results.
Why Create an Endpoint?
An endpoint is a network-accessible interface that allows external applications or users to interact with your deployed machine learning model. Creating an endpoint is crucial for the following reasons:
- Accessibility: Endpoints make your model accessible over the internet or within a secured network, enabling other applications, services, or users to send requests and receive responses.
- API Integration: By exposing your model as a RESTful API, endpoints facilitate integration with various applications, allowing seamless communication and data exchange.
- Load Management: An endpoint can manage requests from multiple clients, handling concurrent requests and distributing the load appropriately.
- Security: Endpoints provide mechanisms for authentication and authorization, ensuring that only authorized users can access the model.
- Scalability: Azure-managed endpoints can automatically scale based on demand, ensuring that your model can handle varying workloads without manual intervention.
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

azure_endpoint_name = d['serve']['azure_endpoint_name']

# Check if the endpoint already exists in the workspace
try:
    endpoint = ml_client.online_endpoints.get(azure_endpoint_name)
    print("---Endpoint already exists---")
except:
    # Create an online endpoint if it doesn't exist
    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=azure_endpoint_name,
        description=f"Test endpoint for {model.name}",
    )

# Trigger the endpoint creation
try:
    ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err
Why Deploy a Model?
Deployment is the process of transferring your trained machine learning model from a development environment to a production environment where it can serve real-time predictions. Deployment is critical because:
- Operationalization: Deployment operationalizes your model, moving it from an experimental or development phase to a live environment where it can deliver value to end-users or systems.
- Resource Allocation: Deploying a model involves configuring the necessary compute resources (such as CPU, memory, and GPUs) to ensure optimal performance during inference.
- Environment Consistency: During deployment, the model is packaged with its dependencies in a consistent environment, ensuring reproducibility and minimizing discrepancies between development and production.
- Monitoring and Maintenance: Deployment sets up the infrastructure to monitor the model's performance, usage, and health, allowing for ongoing maintenance and updates.
- Version Control: Deployment allows you to manage and update different versions of your model, providing flexibility to roll back or switch to newer versions as needed.
from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

azure_deployment_name = f"{d['serve']['azure_deployment_name']}-v1"

deployment = ManagedOnlineDeployment(
    name=azure_deployment_name,
    endpoint_name=azure_endpoint_name,
    model=model,
    instance_type=azure_compute_cluster_size,
    instance_count=1,
    # code_configuration=code_configuration,
    environment=env,
    scoring_script="score.py",
    code_path=f"./{CLOUD_DIR}/inference",
    # environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(max_concurrent_requests_per_instance=20,
                                           request_timeout_ms=90000, max_queue_wait_ms=60000),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err

endpoint.traffic = {azure_deployment_name: 100}
endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)
Step 12: Run Inference on Sample Data: Test the deployed model using sample data that expects function calls:
import json
import os

sample = {
    "input_data": [
        {'role': 'system', 'content': 'You are an helpful assistant who has access to the following functions to help the user, you can use the functions if needed- { "name": "calculate_shipping_cost", "description": "Calculate the cost of shipping a package", "parameters": { "type": "object", "properties": { "weight": { "type": "number", "description": "The weight of the package in pounds" }, "destination": { "type": "string", "description": "The destination of the package" } }, "required": [ "weight", "destination" ] }}}"'},
        {'role': 'user', 'content': 'Can you help me with shipping cost for a package?'},
        {'role': 'assistant', 'content': 'Sure! I can help you with that. Please provide me with the weight and destination of the package.'},
        {'role': 'user', 'content': 'The weight of the package is 10 pounds and the destination is New York.'}
    ],
    "params": {
        "temperature": 0.1,
        "max_new_tokens": 512,
        "do_sample": True,
        "return_full_text": False
    }
}

# Dump the sample data into a json file
request_file = "./request.json"  # local path for the request payload
with open(request_file, "w") as f:
    json.dump(sample, f)

result = ml_client.online_endpoints.invoke(
    endpoint_name=azure_endpoint_name,
    deployment_name=azure_deployment_name,
    request_file=request_file
)

result_json = json.loads(result)
result = result_json['result']
print(result)
Step 13: Compare with Base Model: Now, let's run the same sample through the base model to observe the difference in performance. While the fine-tuned model does a perfect job of generating a response with the right function and arguments, the base model struggles to produce the desired output.
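One convenient way to run the same sample through the base model is a local transformers pipeline. This is a hedged sketch: it assumes you have access to the gated base checkpoint and a transformers version recent enough to accept chat-formatted message lists, and it reuses the messages and generation parameters from the request above:

# Illustrative local comparison against the un-tuned base model.
from transformers import pipeline

base_pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")  # assumes access to the gated model
base_output = base_pipe(sample["input_data"], max_new_tokens=512, do_sample=True,
                        temperature=0.1, return_full_text=False)
print(base_output[0]["generated_text"])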
Step 14: Rerun the fine-tuning job with function descriptions removed from the system message: Now, let's rerun the experiment, but this time we will drop the function descriptions from the dataset to optimize context length.
def remove_desc_from_prompts(data):
    system_message = data['system']
    pattern = r'"description":\s*"[^"]*",?\n?'
    # Remove the "description" fields
    cleaned_string = re.sub(pattern, '"description":"",', system_message)
    return cleaned_string

## Update the system message by removing the function and argument descriptions
train_dataset = train_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])
test_dataset = test_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])
val_dataset = val_dataset.map(lambda x: {"updated_system": remove_desc_from_prompts(x)}, remove_columns=["system"])

train_dataset.save_to_disk(f"{DATA_DIR}/train")
test_dataset.save_to_disk(f"{DATA_DIR}/test")
val_dataset.save_to_disk(f"{DATA_DIR}/val")
Reference: preprocess.py
As the results show, removing the function descriptions doesn't degrade model performance; instead, this fine-tuned version requires fewer input tokens, yielding a significant reduction in token consumption along with improved latency.
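To quantify the savings on your own data, you can compare prompt token counts with and without descriptions using the base model's tokenizer. In the sketch below, system_message stands for one raw system prompt from the dataset, and the tokenizer id assumes access to the gated base model:

# Illustrative: measure how many prompt tokens removing the descriptions saves.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
original_tokens = len(tokenizer(system_message)["input_ids"])
trimmed_tokens = len(tokenizer(remove_desc_from_prompts({"system": system_message}))["input_ids"])
print(f"Prompt tokens: {original_tokens} -> {trimmed_tokens} "
      f"({100 * (original_tokens - trimmed_tokens) / original_tokens:.0f}% fewer)")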
Step 15: Further Exploration: In subsequent experiments, consider removing the arguments, or even the function names themselves, and evaluate how performance holds up.
Conclusion
This blog post has walked through the process of fine-tuning an SLM for function-calling on Azure Machine Learning. By following these steps, you can effectively tailor a model to meet specific functional requirements.
You can access the full code here.
For a deeper dive into evaluating fine-tuned models, including metrics and code samples, check out the next blog post. By leveraging Azure's powerful tools, you can streamline the development and deployment of machine learning models, making them more efficient and effective for your specific tasks.
Reference:
Fine tuning for function calling | OpenAI Cookbook
Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn
michaelnny/Llama3-FunctionCalling: Fine-tune Llama3 model to support function calling
Fine Tuning LLMs for Function Calling w/Pawel Garbacki - YouTube
slm-innovator-lab/2_slm-fine-tuning-mlstudio at main · Azure/slm-innovator-lab