Fine-Tuning DeepSeek-R1-Distill-Llama-8B with PyTorch FSDP, QLoRA on Azure Machine Learning
Large Language Models (LLMs) have demonstrated remarkable capabilities across various industries, revolutionizing how we approach tasks like legal document summarization, creative content generation, and customer sentiment analysis. However, general-purpose models often need to be adapted before they excel in a specific domain. This is where fine-tuning comes in: it lets us tailor LLMs to unique requirements and improve their performance on targeted tasks. In this blog post, we'll explore the process of fine-tuning the DeepSeek-R1-Distill-Llama-8B model, highlighting the advantages of using PyTorch Fully Sharded Data Parallel (FSDP) and Quantization-Aware Low-Rank Adaptation (QLoRA) techniques in conjunction with the Azure Machine Learning platform.

Why Fine-Tuning Matters

In some cases, LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model can be a useful technique to adapt it to the desired goal and improve its quality and reliability.

Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.

Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, and/or incorrect evaluation metrics and criteria.

DeepSeek-R1 is an open-source language model that excels at text-based tasks, including creative writing, question answering, editing, and summarization. It is particularly strong in reasoning-intensive tasks like coding, math, and explaining scientific concepts. DeepSeek-R1 stands out due to its mixture-of-experts (MoE) architecture and use of reinforcement learning, achieving high performance with greater efficiency and lower costs compared to other models. It has 671 billion parameters across multiple expert networks, but only 37 billion are active in a single forward pass. DeepSeek-R1 uses reinforcement learning (RL) to generate a chain of thought (CoT) before delivering its final answer. To make these capabilities more accessible, DeepSeek has distilled its R1 outputs into several smaller models based on the Qwen and Llama architectures.

Qwen-based distilled models: 1.5B, 7B, 14B, and 32B
Llama-based distilled models: 8B and 70B

DeepSeek-R1-Distill-Llama-8B is a distilled large language model (LLM) based on the Llama architecture, created using outputs from the larger DeepSeek-R1 model. Through knowledge distillation, the reasoning patterns of the 671-billion-parameter DeepSeek-R1 model are transferred into a smaller, more efficient model. With only 8 billion parameters, DeepSeek-R1-Distill-Llama-8B is computationally efficient while retaining a significant portion of the original model's performance. It is fine-tuned from models like Llama-3.1-8B-Instruct, achieving high performance across multiple benchmarks. This distilled model offers a balance of performance and resource requirements, improving inference speed and reducing computational costs, making it cost-effective for production deployments.
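Before committing to a full fine-tuning run, it can be useful to sanity-check the base model locally. The snippet below is a minimal sketch (not part of the original walkthrough) that loads the distilled checkpoint with the Hugging Face transformers library and generates a short reasoning-style completion; it assumes a machine with a sufficiently large GPU and the accelerate package installed so that device_map="auto" can place the weights.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Load the tokenizer and model; device_map="auto" places weights on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Quick smoke test: ask a reasoning-style question and inspect the chain-of-thought output
inputs = tokenizer("Explain, step by step, why the sky appears blue.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))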
PyTorch FSDP: Scaling Fine-Tuning with Data Parallelism

PyTorch Fully Sharded Data Parallel (FSDP) is a distributed training framework that addresses the challenges of fine-tuning large models by sharding model parameters, optimizer states, and gradients across multiple GPUs. This technique enables you to train models with billions of parameters on systems with limited GPU memory.

QLoRA: Efficient Fine-Tuning with Quantization and Low-Rank Adaptation

Quantization-Aware Low-Rank Adaptation (QLoRA) is a parameter-efficient fine-tuning technique that reduces memory usage and accelerates training by quantizing the base model weights and training only a small set of Low-Rank Adaptation (LoRA) parameters, making fine-tuning faster and more memory efficient.

Azure Machine Learning: Your Platform for Scalable Fine-Tuning

Azure Machine Learning provides a robust platform for fine-tuning LLMs, offering a comprehensive suite of tools and services to streamline the process.

Scalable Compute: Azure Machine Learning Compute provides virtual machines (VMs) that run parts of the distributed deep learning job, auto-scaling as necessary. Azure Machine Learning compute clusters can schedule tasks, collect results, adjust resources to actual loads, and manage errors[5]. VMs that participate in the cluster can be GPU-enabled to accelerate deep learning calculations.

Data Storage: Azure offers standard and premium blob storage options for storing training data and execution logs. Premium blob storage provides the high-performance access to training data that distributed training requires.

Experiment Tracking: Azure Machine Learning provides tools for tracking and managing your fine-tuning experiments, allowing you to monitor performance metrics and reproduce your results.

Hands-on lab

Now let's fine-tune the model and deploy it on Azure Machine Learning. First, we set up an Azure Machine Learning (ML) client using DefaultAzureCredential for authentication; the script imports the necessary libraries and handles exceptions during ML client initialization.

# import required libraries
"""
This script sets up an Azure Machine Learning (ML) client using the DefaultAzureCredential for authentication.
It imports necessary libraries and handles exceptions during the ML client initialization.

Modules imported:
- time: Provides various time-related functions.
- azure.identity: Provides authentication capabilities with DefaultAzureCredential and InteractiveBrowserCredential.
- azure.ai.ml: Contains classes and functions for interacting with Azure ML services, including MLClient, Input, pipeline, load_component, command, Data, Environment, BuildContext, Model, Output, and AssetTypes.
- azure.core.exceptions: Contains exceptions for handling resource-related errors.
- os: Provides a way to interact with the operating system.

Variables:
- credential: An instance of DefaultAzureCredential used for authenticating with Azure services.
- ml_client: An instance of MLClient initialized using the provided credentials. If the initialization fails, an exception is caught and printed.
""" import time from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential from azure.ai.ml import MLClient, Input from azure.ai.ml.dsl import pipeline from azure.ai.ml import load_component from azure.ai.ml import command from azure.ai.ml.entities import Data, Environment, BuildContext from azure.ai.ml.entities import Model from azure.ai.ml import Input from azure.ai.ml import Output from azure.ai.ml.constants import AssetTypes from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError import os credential = DefaultAzureCredential() ml_client = None try: ml_client = MLClient.from_config(credential) except Exception as ex: print(ex) Now lets install some libraries required to download the dataset and run the openai client. %conda run -n azureml_py310_sdkv2 pip install datasets==3.2.0 openai Lets create our training environment. os.makedirs("environment_train", exist_ok=True) Lets build our docker environment. %%writefile environment_train/Dockerfile FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu121-py310-torch22x:biweekly.202501.3 USER root # support Deepspeed launcher requirement of passwordless ssh login RUN apt-get update && apt-get -y upgrade RUN pip install --upgrade pip RUN apt-get install -y openssh-server openssh-client # Install pip dependencies COPY requirements.txt . RUN pip install -r requirements.txt --no-cache-dir RUN MAX_JOBS=4 pip install flash-attn==2.6.3 --no-build-isolation Let’s also specify our requirements.txt %%writefile environment_train/requirements.txt transformers==4.48.2 peft==0.14.0 accelerate==1.3.0 bitsandbytes==0.45.1 datasets==3.2.0 evaluate==0.4.3 huggingface_hub[hf_transfer] safetensors>=0.5.2 sentencepiece==0.2.0 scikit-learn==1.6.1 tokenizers>=0.21.0 py7zr Once we specify both lets create the AML custom training environment. env_name = "deepseek-training" env_docker_image = Environment( build=BuildContext(path = "environment_train", dockerfile_path="Dockerfile"), name=env_name, description="Environment created for llm fine-tuning.", version="1" ) env_asset_train = ml_client.environments.create_or_update(env_docker_image) While the training environment is ready let’s start with the dataset preparation. from datasets import load_dataset import pandas as pd dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en") df = pd.DataFrame(dataset['train']) df = df.iloc[0:2000] df.head() Here is quick snapshot of what the dataset looks like Noe lets split the dataset into train and test for validation. from sklearn.model_selection import train_test_split train, test = train_test_split(df, test_size=0.1, random_state=42) print("Number of train elements: ", len(train)) print("Number of test elements: ", len(test)) Let’s create the prompt template to run the finetuning process. In this case we have used COT prompt template. # custom instruct prompt start prompt_template = f""" <|begin▁of▁sentence|> You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response. 
<|User|> {{question}} <|Assistant|> <think> {{complex_cot}} </think> {{answer}} <|end▁of▁sentence|> """ # template dataset to add prompt to each sample def template_dataset(sample): sample["text"] = prompt_template.format(question=sample["Question"], complex_cot=sample["Complex_CoT"], answer=sample["Response"]) return sample Let’s run the mapping of this prompt through the whole dataset and create train and test jsonl files.. from datasets import Dataset, DatasetDict from random import randint train_dataset = Dataset.from_pandas(train) test_dataset = Dataset.from_pandas(test) dataset = DatasetDict({"train": train_dataset, "test": test_dataset}) train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features)) print(train_dataset[randint(0, len(dataset))]["text"]) test_dataset = dataset["test"].map(template_dataset, remove_columns=list(dataset["test"].features)) train_dataset.to_json(f"data/train.jsonl") test_dataset.to_json(f"data/eval.jsonl") Now let’s start creating our training script. os.makedirs("src_train", exist_ok=True) write the train.py which uses both Qlora and PyTorch FSDP. %%writefile src_train/train.py import os import argparse import sys import logging from accelerate import Accelerator import datetime from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training import torch from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed import transformers import traceback from huggingface_hub import snapshot_download from datasets import load_dataset def download_model(model_name): print("Downloading model ", model_name) os.makedirs("/tmp/tmp_folder", exist_ok=True) snapshot_download(repo_id=model_name, local_dir="/tmp/tmp_folder") print(f"Model {model_name} downloaded under /tmp/tmp_folder") def init_distributed(): # Initialize the process group torch.distributed.init_process_group( backend="nccl", # Use "gloo" backend for CPU timeout=datetime.timedelta(seconds=5400) ) local_rank = int(os.environ["LOCAL_RANK"]) torch.cuda.set_device(local_rank) return local_rank def main(args): model_name = args.model_name_or_path train_ds = load_dataset('json', data_files=args.train_file, split='train') test_ds = load_dataset('json', data_files=args.eval_file, split='train') per_device_train_batch_size=args.train_batch_size per_device_eval_batch_size=args.eval_batch_size gradient_accumulation_steps=args.grad_accum_steps learning_rate=args.learning_rate num_train_epochs=args.epochs lora_r=8 lora_alpha=16 lora_dropout=0.1 fsdp="full_shard auto_wrap offload" fsdp_config={ 'backward_prefetch': 'backward_pre', 'cpu_ram_efficient_loading': True, 'offload_params': True, 'forward_prefetch': False, 'use_orig_params': False } gradient_checkpointing=False merge_weights=True seed=42 token=None model_dir = args.model_dir if torch.cuda.is_available() and (torch.cuda.device_count() > 1 or int(os.environ.get("SM_HOST_COUNT", 1)) > 1): # Call this function at the beginning of your script local_rank = init_distributed() # Now you can use distributed functionalities torch.distributed.barrier(device_ids=[local_rank]) os.environ.update({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) set_seed(seed) accelerator = Accelerator() if token is not None: os.environ.update({"HF_TOKEN": token}) accelerator.wait_for_everyone() if int(os.environ.get("SM_HOST_COUNT", 1)) == 1: if accelerator.is_main_process: download_model(model_name) else: download_model(model_name) accelerator.wait_for_everyone() model_name = "/tmp/tmp_folder" 
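# At this point the checkpoint has been downloaded locally to /tmp/tmp_folder, so the
# tokenizer and model below are loaded from disk rather than the Hugging Face Hub.
# Next steps: load the tokenizer, reuse the EOS token as the padding token, and tokenize
# the train/eval splits before configuring quantization, FSDP, and LoRA.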
tokenizer = AutoTokenizer.from_pretrained(model_name) # Set Tokenizer pad Token tokenizer.pad_token = tokenizer.eos_token with accelerator.main_process_first(): # tokenize and chunk dataset lm_train_dataset = train_ds.map( lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features) ) print(f"Total number of train samples: {len(lm_train_dataset)}") if test_ds is not None: lm_test_dataset = test_ds.map( lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features) ) print(f"Total number of test samples: {len(lm_test_dataset)}") else: lm_test_dataset = None torch_dtype = torch.bfloat16 # Defining additional configs for FSDP if fsdp != "" and fsdp_config is not None: bnb_config_params = { "bnb_4bit_quant_storage": torch_dtype } model_configs = { "torch_dtype": torch_dtype } fsdp_configurations = { "fsdp": fsdp, "fsdp_config": fsdp_config, "gradient_checkpointing_kwargs": { "use_reentrant": False }, "tf32": True } else: bnb_config_params = dict() model_configs = dict() fsdp_configurations = dict() bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype, **bnb_config_params ) model = AutoModelForCausalLM.from_pretrained( model_name, trust_remote_code=True, quantization_config=bnb_config, attn_implementation="flash_attention_2", use_cache=not gradient_checkpointing, cache_dir="/tmp/.cache", **model_configs ) if fsdp == "" and fsdp_config is None: model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing) if gradient_checkpointing: model.gradient_checkpointing_enable() config = LoraConfig( r=lora_r, lora_alpha=lora_alpha, target_modules="all-linear", lora_dropout=lora_dropout, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, config) trainer = transformers.Trainer( model=model, train_dataset=lm_train_dataset, eval_dataset=lm_test_dataset if lm_test_dataset is not None else None, args=transformers.TrainingArguments( per_device_train_batch_size=per_device_train_batch_size, per_device_eval_batch_size=per_device_eval_batch_size, gradient_accumulation_steps=gradient_accumulation_steps, gradient_checkpointing=gradient_checkpointing, logging_strategy="steps", logging_steps=1, log_on_each_node=False, num_train_epochs=num_train_epochs, learning_rate=learning_rate, bf16=True, ddp_find_unused_parameters=False, save_strategy="no", output_dir="outputs", **fsdp_configurations ), data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False), ) if trainer.accelerator.is_main_process: trainer.model.print_trainable_parameters() trainer.train() if trainer.is_fsdp_enabled: trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT") if merge_weights: output_dir = "/tmp/model" # merge adapter weights with base model and save # save int 4 model trainer.model.save_pretrained(output_dir, safe_serialization=False) if accelerator.is_main_process: # clear memory del model del trainer torch.cuda.empty_cache() # load PEFT model model = AutoPeftModelForCausalLM.from_pretrained( output_dir, torch_dtype=torch.float16, low_cpu_mem_usage=True, trust_remote_code=True, ) # Merge LoRA and base model and save model = model.merge_and_unload() model.save_pretrained( model_dir, safe_serialization=True, max_shard_size="2GB" ) else: trainer.model.save_pretrained( model_dir, safe_serialization=True ) if accelerator.is_main_process: tokenizer.save_pretrained(model_dir) accelerator.wait_for_everyone() def parse_args(): # setup 
argparse parser = argparse.ArgumentParser() # curr_time = datetime.now().strftime("%Y-%m-%d_%H:%M:%S") # hyperparameters parser.add_argument("--model_name_or_path", default="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", type=str, help="Input directory for training") parser.add_argument("--train_file", type=str, help="Input data for training") parser.add_argument("--eval_file", type=str, help="Input data for eval") parser.add_argument("--epochs", default=1, type=int, help="number of epochs") parser.add_argument("--train_batch_size", default=2, type=int, help="training - mini batch size for each gpu/process") parser.add_argument("--eval_batch_size", default=4, type=int, help="evaluation - mini batch size for each gpu/process") parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps") parser.add_argument("--learning_rate", default=2e-4, type=float, help="learning rate") parser.add_argument("--save_merged_model", type=bool, default=False) parser.add_argument("--model_dir", type=str, default="./", help="output directory for model") # parse args args = parser.parse_args() # return args return args if __name__ == "__main__": #sys.argv = [''] args = parse_args() main(args) Next step is to create a compute cluster on which the training will run. azure_compute_cluster_name = "a100-compute" azure_compute_cluster_size = "Standard_NC24ads_A100_v4" USE_LOWPRIORITY_VM = True from azure.ai.ml.entities import AmlCompute ### Create the compute cluster try: compute = ml_client.compute.get(azure_compute_cluster_name) except Exception as ex: try: tier = "LowPriority" if USE_LOWPRIORITY_VM else "Dedicated" compute = AmlCompute( name=azure_compute_cluster_name, size=azure_compute_cluster_size, tier=tier, max_instances=1, # For multi node training set this to an integer value more than 1 ) ml_client.compute.begin_create_or_update(compute).wait() except Exception as e: print(e) Once the compute is ready, lets run the training job. from azure.ai.ml import command from azure.ai.ml import Input from azure.ai.ml.entities import ResourceConfiguration str_command = "" str_command += "python train.py --train_file ${{inputs.train_file}} --eval_file ${{inputs.eval_file}} \ --epochs ${{inputs.epoch}} --train_batch_size ${{inputs.train_batch_size}} \ --eval_batch_size ${{inputs.eval_batch_size}} --model_name_or_path ${{inputs.model_name_or_path}} \ --model_dir ${{inputs.model_dir}} --save_merged_model ${{inputs.save_merged_model}}" job = command( inputs=dict( train_file=Input( type="uri_file", path="data/train.jsonl", ), eval_file=Input( type="uri_file", path="data/eval.jsonl", ), epoch=1, train_batch_size=2, eval_batch_size=1, model_name_or_path="deepseek-ai/DeepSeek-R1-Distill-Llama-8B", model_dir="./outputs", save_merged_model = True ), code="./src_train", # local path where the code is stored compute=azure_compute_cluster_name, command=str_command, environment=env_asset_train, distribution={ "type": "PyTorch", "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1 }, ) returned_job = ml_client.jobs.create_or_update(job) ml_client.jobs.stream(returned_job.name) Once the training is completed, lets register the model as a custom model type. 
from azure.ai.ml.entities import Model from azure.ai.ml.constants import AssetTypes run_model = Model( path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/outputs/", name="deepseekr1-dist-llama8bft", description="Model created from run.", type=AssetTypes.CUSTOM_MODEL, ) model = ml_client.models.create_or_update(run_model) Once the model is registered the next step is to deploy the same as Online Managed Endpoint. from azure.ai.ml.entities import ( ManagedOnlineEndpoint, IdentityConfiguration, ManagedIdentityConfiguration, ) endpoint_name = "deepseekr1-dist-llama8bft-ep" # Check if the endpoint already exists in the workspace try: endpoint = ml_client.online_endpoints.get(endpoint_name) print("---Endpoint already exists---") except: # Create an online endpoint if it doesn't exist # Define the endpoint endpoint = ManagedOnlineEndpoint( name=endpoint_name, description=f"Test endpoint for {model.name}" ) # Trigger the endpoint creation try: ml_client.begin_create_or_update(endpoint).wait() print("\n---Endpoint created successfully---\n") except Exception as err: raise RuntimeError( f"Endpoint creation failed. Detailed Response:\n{err}" ) from err Let’s define the deployment name , SKU type of the VM and Request timeout parameter. # Initialize deployment parameters deployment_name = "deepseekr1-dist-llama8bftd-eploy" sku_name = "Standard_NC24ads_A100_v4" REQUEST_TIMEOUT_MS = 90000 os.makedirs("environment_inf", exist_ok=True) Lets create the environment for our inference . %%writefile environment_inf/Dockerfile FROM vllm/vllm-openai:latest ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS Let’s build the environment with the docker file created above. from azure.ai.ml.entities import Environment, BuildContext env_docker_image = Environment( build=BuildContext(path="environment_inf", dockerfile_path= "Dockerfile"), name="vllm-custom", description="Environment created from a Docker context.", inference_config={ "liveness_route": { "port": 8000, "path": "/health", }, "readiness_route": { "port": 8000, "path": "/health", }, "scoring_route": { "port": 8000, "path": "/", }, }, ) env_asset_inf = ml_client.environments.create_or_update(env_docker_image) Once our environment for inference server is ready let’s do the deployment. Lets define some environment variables model_path = f"/var/azureml-app/azureml-models/{model.name}/{model.version}/outputs" env_vars = { "MODEL_NAME": model_path, "VLLM_ARGS": "--max-model-len 16000 --enforce-eager", } deployment_env_vars = {**env_vars} Lets do the deployment now. import time from azure.ai.ml.entities import ( OnlineRequestSettings, CodeConfiguration, ManagedOnlineDeployment, ProbeSettings, Environment ) t0 = time.time() deployment = ManagedOnlineDeployment( name= deployment_name, endpoint_name=endpoint_name, model=model, instance_type=sku_name, instance_count=1, environment_variables=deployment_env_vars, environment=env_asset_inf, request_settings=OnlineRequestSettings( max_concurrent_requests_per_instance=2, request_timeout_ms=50000, max_queue_wait_ms=60000 ), liveness_probe=ProbeSettings( failure_threshold=5, success_threshold=1, timeout=10, period=30, initial_delay=120 ), readiness_probe=ProbeSettings( failure_threshold=30, success_threshold=1, timeout=2, period=10, initial_delay=120, ), ) # Trigger the deployment creation try: ml_client.begin_create_or_update(deployment).wait() except Exception as err: raise RuntimeError( f"Deployment creation failed. 
Detailed Response:\n{err}"
    ) from err

endpoint.traffic = {deployment_name: 100}
endpoint_poller = ml_client.online_endpoints.begin_create_or_update(endpoint)

Our endpoint is now deployed. Let's start testing it.

endpoint_results = endpoint_poller.result()
endpoint_name = endpoint_results.name
keys = ml_client.online_endpoints.get_keys(name=endpoint_name)
primary_key = keys.primary_key
url = os.path.join(endpoint_results.scoring_uri, "v1")
endpoint_name = (
    endpoint_results.name if endpoint_name is None else endpoint_name
)
keys = ml_client.online_endpoints.get_keys(name=endpoint_name)

Once we have the API key, we can use the OpenAI client to stream tokens from the endpoint.

from openai import OpenAI

vllm_client = OpenAI(base_url=url, api_key=primary_key)

# Create your prompt
system_message = """You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response."""

user_message = f"""A 3-week-old child has been diagnosed with late onset perinatal meningitis, and the CSF culture shows gram-positive bacilli. What characteristic of this bacterium can specifically differentiate it from other bacterial agents?"""

response = vllm_client.chat.completions.create(
    model=model_path,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.7,
    max_tokens=4000,
    stream=True,  # Stream the response
)

print("Streaming response:")
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "content"):
        print(delta.content, end="", flush=True)

Conclusion

Fine-tuning the DeepSeek-R1-Distill-Llama-8B model with PyTorch FSDP and QLoRA on Azure Machine Learning offers a powerful approach to customising LLMs for specific tasks. By leveraging the scalability and efficiency of these techniques, you can unlock the full potential of LLMs and drive innovation in your domain. Hope you liked the blog. Do like and follow for more such content.

Thanks,
Manoranjan Rajguru, AI Global Black Belt

Scalable and Efficient Fine-Tuning of LLM on Azure ML
https://github.com/james-tn/llm-fine-tuning/tree/main/opensource_llm/single_step

Co-Author: Mohamad AL jazaery

Why Scalable and Efficient Fine-Tuning Matters

Faster iterations, shorter time-to-value: In today’s competitive AI landscape, time is of the essence. The faster you can fine-tune a model, the quicker you can validate ideas, test hypotheses, and bring solutions to market.

High-performance GPU machines are costly: High-performance GPUs and compute clusters don’t come cheap, and their availability is often limited. Efficient fine-tuning techniques, such as model sharding and distributed training, maximize the utilization of these precious resources—ensuring that you get the most out of your infrastructure investment.

Choosing the Right Azure ML GPU Compute for the Job: NC or ND?

Not all GPU compute is created equal, and choosing the right SKU can make or break your training efficiency.

ND Series: Ideal for distributed training across multiple nodes, thanks to InfiniBand (IB) connectivity that ensures high-speed communication between nodes, for example when pretraining an LLM or fine-tuning a very large model (~70B parameters).
NC Series: Suited to small and medium workloads that need no heavy inter-node communication, such as LLM inferencing or mid-size LLM fine-tuning.

Azure GPU Machine Options by Scenario:

Scenario | Common model size | Training approach | Recommended Azure compute
Small-scale fine-tuning | < 3B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB)
Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB)
Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLINK, 32 GB, InfiniBand)
Large-scale fine-tuning (single machine) | 10–30B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB)
Distributed training for very large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand)
Very large model training (single machine) | up to 70B parameters | Full or parameter-efficient | NCads_H100_v5 (H100 NVL, 94 GB)
Massive-scale distributed training | > 70B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand)

Distributed Efficient Training: A Quick Guide

When scaling fine-tuning tasks, choosing the right distributed training method is key:

DDP (Data Parallelism): Works well when the entire model fits on a single GPU. It replicates the model across multiple GPUs and splits the data for parallel processing. See experiment 1 in the following section.
Model Parallelism: A game-changer for massive models that don’t fit on a single GPU. It shards not only the data but also the model parameters and optimizer states across multiple GPUs, enabling efficient training of models like LLaMA-70B on low-memory GPUs. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.

Memory Optimization Techniques (a minimal configuration sketch follows this list):

Gradient Checkpointing: Reduces memory by recomputing activations during the backward pass, trading memory for additional computation.
Mixed Precision Training: Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability. Supported by both frameworks.
Quantization (DeepSpeed exclusive): Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
Offloading (DeepSpeed exclusive): Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
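To make these knobs concrete, here is a minimal, illustrative sketch of how mixed precision, gradient checkpointing, and FSDP sharding are typically switched on through Hugging Face TrainingArguments; the specific values are assumptions for illustration, not the settings used in the experiments below.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,                     # mixed precision (BF16) cuts activation and gradient memory
    gradient_checkpointing=True,   # recompute activations in the backward pass instead of storing them
    fsdp="full_shard auto_wrap",   # shard parameters, gradients, and optimizer states across GPUs
    fsdp_config={"backward_prefetch": "backward_pre", "cpu_ram_efficient_loading": True},
)

DeepSpeed-exclusive features such as INT8 quantization and CPU/NVMe offloading would typically be configured through a DeepSpeed JSON config passed via the deepspeed argument instead of the fsdp flags.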
Our Experiments: Pushing the Limits of Scalability

Experiment 1: Distributed Training on Multiple Nodes using DDP

We conducted an experiment to fine-tune the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

Azure ML Job YAML Definition

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./ # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19
command: >
  accelerate launch
  --num_processes 16 # gpu per machine * num of machines
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
compute: azureml:ndv2-cluster
resources:
  instance_count: 2 # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1 # Number of processes per node

Results: As we increased the number of nodes from one to three, throughput increased proportionally. This indicates that the system scaled efficiently with the addition of more nodes, maintaining a close-to-linear improvement in throughput.

Experiment 2: Model Parallelism using FSDP

Fine-tuning a 70B-parameter model on GPUs with only 16 GB of memory might sound impossible, but we made it happen using FSDP (Fully Sharded Data Parallelism) on Azure ML with a cluster of multiple NDv2-V100 nodes. By distributing not only the data but also the model parameters and optimizer states across multiple nodes, we unlocked the power of full sharding.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./ # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4
command: >
  accelerate launch
  --config_file "configs/fsdp_config.yaml"
  --num_processes 32
  --num_machines 4
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 4 # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1 # Number of processes per node

Key Takeaways:

Memory Efficiency: Full sharding enabled us to fine-tune the LLaMA-70B model on V100 GPUs despite their limited memory.
Connectivity Matters: The InfiniBand (IB) connectivity of ND nodes played a critical role in ensuring smooth communication across GPUs, making this feat possible.

Conclusion

Scalable and efficient fine-tuning is the key to unlocking the true potential of Large Language Models. By leveraging distributed training techniques, such as FSDP and DDP, and optimizing compute resources on Azure ML, researchers and practitioners can overcome the challenges of training massive models—reducing costs, accelerating time-to-value, and driving AI innovation. Access the code and start experimenting here!

Future work: The second part will focus on real-world pipeline setups, including end-to-end model training, hyperparameter optimization, and testing. The third part will dive into deploying trained models for practical use. Future posts may explore best practices for specific fine-tuning scenarios and techniques.

Unlocking Function Calling with vLLM and Azure Machine Learning
Introduction In this post, we’ll explain how to deploy LLMs on vLLM using Azure Machine Learning’s Managed Online Endpoints for efficient, scalable, and secure real-time inference. Next, we will look at function calling, and how vLLM's engine can support you to achieve that. To get started, let’s briefly look into what vLLM and Managed Online Endpoints are. You can find the full code examples on vllm-on-azure-machine-learning. vLLM vLLM is a high-throughput and memory-efficient inference and serving engine designed for large language models (LLMs). It optimizes the serving and execution of LLMs by utilizing advanced memory management techniques, such as PagedAttention, which efficiently manages attention key and value memory. This allows for continuous batching of incoming requests and fast model execution, making vLLM a powerful tool for deploying and serving LLMs at scale. vLLM supports seamless integration with popular Hugging Face models and offers various decoding algorithms, including parallel sampling and beam search. It also supports tensor parallelism and pipeline parallelism for distributed inference, making it a flexible and easy-to-use solution for LLM inference (see full docs). Managed Online Endpoints in Azure Machine Learning Managed Online Endpoints in Azure Machine Learning provide a streamlined and scalable way to deploy machine learning models for real-time inference. These endpoints handle the complexities of serving, scaling, securing, and monitoring models, allowing us to focus on building and improving your models without worrying about infrastructure management. HuggingFace Model Deployment Let’s go through deploying a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints. For this, we’ll use a custom Dockerfile and configuration files to set up the deployment. As a model, we’ll be using meta-llama/Llama-3.1-8B-Instruct on a single Standard_NC24ads_A100_v4 instance. Step 1: Create a custom Environment for vLLM on AzureML First, we create a Dockerfile to define the environment for our model. For this, we’ll be using vllm’s base container that has all the dependencies and drivers included: FROM vllm/vllm-openai:latest ENV MODEL_NAME facebook/opt-125m ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server --model $MODEL_NAME $VLLM_ARGS The idea here is that we can pass a model name via an ENV variable, so that we can easily define which model we want to deploy during deployment time. Next, we log into our Azure Machine Learning workspace: az account set --subscription <subscription ID> az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group> Now, we create an environment.yml file to specify the environment settings: $schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json name: vllm build: path: . dockerfile_path: Dockerfile Then let’s build the environment: az ml environment create -f environment.yml Step 2: Deploy the AzureML Managed Online Endpoint Time for deployment, so let’s first create an endpoint.yml file to define the Managed Online Endpoint: $schema: https://azuremlsdk2.blob.core.windows.net/latest/managedOnlineEndpoint.schema.json name: vllm-hf auth_mode: key Let’s create it: az ml online-endpoint create -f endpoint.yml For the next step, we’ll need the address of the Docker image address we created. 
We can quickly get it from AzureML Studio -> Environments -> vllm: Finally, we create a `deployment.yml file to configure the deployment settings and deploy our desired model from HuggingFace via vLLM: $schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json name: current endpoint_name: vllm-hf environment_variables: MODEL_NAME: meta-llama/Llama-3.1-8B-Instruct # define the model name using the identifier from HG VLLM_ARGS: "--enable-auto-tool-choice --tool-call-parser llama3_json" HUGGING_FACE_HUB_TOKEN: xxxxxxxxxxxxxx # Add your HF API key here environment: image: xxxxxxxxx.azurecr.io/azureml/azureml_xxxxxxxxxxx # Replace with your own image inference_config: liveness_route: port: 8000 path: /ping readiness_route: port: 8000 path: /health scoring_route: port: 8000 path: / instance_type: Standard_NC24ads_A100_v4 instance_count: 1 request_settings: request_timeout_ms: 60000 max_concurrent_requests_per_instance: 16 liveness_probe: initial_delay: 10 period: 10 timeout: 2 success_threshold: 1 failure_threshold: 30 readiness_probe: initial_delay: 120 period: 10 timeout: 2 success_threshold: 1 failure_threshold: 30 Since vLLM does not support separate probes for readiness and liveness, we’ll need to make sure that the model has fully loaded before the fire the first probe. This is why we increased readiness_probe.initial_delay to 120s. For larger models, we should also follow vLLM’s documentation for using tensor parallel inference (model on single node but spanning multiple GPUs) by adding --tensor-parallel-size <NUM_OF_GPUs> to VLLM_ARGS. Since we’re using a single A100 GPU in our example (Standard_NC24ads_A100_v4), this is not required though. The request_settings depend a bit on our instance type/size and might require some manual tuning to get the model run properly and efficiently. Goal is to find a good tradeoff between concurrency (max_concurrent_requests_per_instance) and queue time in order to avoid either hitting request_timeout_ms from the endpoint side, or any HTTP-timeouts on the client side. Both these scenarios result in HTTP 429, and the client would need to implement exponential backoff (e.g. via tenacity library). Lastly, we can deploy the model: az ml online-deployment create -f deployment.yml --all-traffic By following these steps, we have deployed a HuggingFace model on Azure Machine Learning’s Managed Online Endpoints, ensuring efficient and scalable real-time inference. Time to test it! Step 3: Testing the deployment# First, let’s get the endpoint’s scoring uri and the api keys: az ml online-endpoint show -n vllm-hf az ml online-endpoint get-credentials -n vllm-hf For completion models, we can then call the endpoint using this Python code snippet: import requests url = "https://vllm-hf.polandcentral.inference.ml.azure.com/v1/completions" headers = { "Content-Type": "application/json", "Authorization": "Bearer xxxxxxxxxxxx" } data = { "model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "San Francisco is a", "max_tokens": 200, "temperature": 0.7 } response = requests.post(url, headers=headers, json=data) print(response.json()) Response: { "id": "cmpl-98d658cf-6310-4c87-a24f-723dda6db176", "object": "text_completion", "created": 1738267352, "model": "meta-llama/Llama-3.1-8B-Instruct", "choices": [ { "index": 0, "text": " top tourist destination known for its iconic Golden Gate Bridge, steep hills, vibrant neighborhoods, and cultural attractions. 
The city is a haven for foodies, with a diverse range of cuisines available, from seafood to Mexican to Chinese and more.\nOne of the best ways to experience San Francisco is by taking a ride on a historic cable car, which offers stunning views of the city and its surroundings. Explore the historic Fisherman's Wharf, a bustling waterfront district filled with seafood restaurants, street performers, and souvenir shops.\nVisit the vibrant neighborhoods of Haight-Ashbury and the Mission District, known for their colorful street art, independent shops, and lively music scenes. Take a stroll through Golden Gate Park, a sprawling urban park that features gardens, lakes, and walking and biking trails.\n\nThe city has a thriving arts and culture scene, with numerous museums, galleries, and performance venues. The San Francisco Museum of Modern Art (SFMOMA) is one of the largest modern art museums in", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 5, "total_tokens": 205, "completion_tokens": 200, "prompt_tokens_details": null } } Works! Function Calling Function calling in the context of large language models (LLMs) refers to the model's ability to dynamically generate and call structured functions based on context, user input, or specific task requirements. It enables seamless interaction with APIs, databases, or external tools while leveraging the model's reasoning capabilities. vLLM provides an OpenAI-compatible server that supports the Completions, Chat Completions, and Embeddings APIs. For instance, it enables developers to seamlessly integrate models into existing workflows. Developers can use the official OpenAI Python client or any HTTP client to interact with vLLM, making it straightforward to integrate into existing workflows. Before running the code, ensure you have the OpenAI library installed by executing: pip install openai The following code demonstrates the function-calling capabilities of vLLM using an example where the assistant retrieves information about historical events based on a provided date: Lets go through it step by step 1. Defining a Custom Function: A query_historical_event function is defined, containing a dictionary of fictional historical events. This function serves as a callable endpoint for vLLM to retrieve information based on a user-specified date. 
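# The function below is the "tool" the model can request: a simple in-memory lookup of
# fictional events keyed by date, standing in for what would normally be an API or database call.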
def query_historical_event(date): fictional_historical_events = { "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.", "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.", "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.", "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.", "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.", "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.", "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.", "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.", "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.", "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally." } return fictional_historical_events.get(date, f"No historical event information available for {date}.") 2. Tool Integration: The function is wrapped in a tools definition, which includes metadata such as the function’s name, description, and expected parameters (e.g., the date in YYYY-MM-DD format). tools = [ { "function": { "name": "query_historical_event", "description": "Provides information about a historical event that occurred on a specified date.", "parameters": { "type": "object", "properties": { "date": { "type": "string", "description": "The date of the event in YYYY-MM-DD format." }, }, "required": ["date"] } } } ] 3. Conversation Workflow: The conversation starts with a system message setting the assistant's role and a user query about a specific date. The assistant evaluates the query and decides if the custom function is needed. messages = [ {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."}, {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"}, ] 4. Function Call Handling: If the assistant determines that the function is required, it: Parses the function call and extracts the necessary parameters (e.g., date). Executes the query_historical_event function with the provided arguments and returns the result to the user. 
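# Send the query together with the tool schema; if the model decides the tool is needed,
# the response contains tool_calls whose JSON-encoded arguments we parse and execute locally.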
chat_response = client.chat.completions.create( model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, temperature=0.7, max_tokens=1024, top_p=0.9, frequency_penalty=0.5, presence_penalty=0.6, tools=tools, tool_choice='auto' ) if chat_response.choices[0].message.tool_calls: date_argument = json.loads( chat_response.choices[0].message.tool_calls[0].function.arguments) date = date_argument.get("date", None) response = query_historical_event(date) print("Assistant response:", response) else: print("Assistant response:", chat_response.choices[0].message.content) Example Workflow User Query: "Can you tell me what happened on August 15, 2009?" Assistant Function Call: The assistant identifies the query’s intent and calls query_historical_event with the argument date="2009-08-15". Response: The function retrieves the event: "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours." Full Code from openai import OpenAI import json # Set up API client with the vLLM server settings openai_api_key = <your-deployment-key> openai_api_base = "https://vllm-hf.eastus2.inference.ml.azure.com/v1/" client = OpenAI(api_key=openai_api_key, base_url=openai_api_base) def query_historical_event(date): fictional_historical_events = { "1805-03-21": "On March 21, 1805, the Treaty of Varis signed by several European powers established the first coordinated effort to protect migratory bird species.", "1898-07-10": "On July 10, 1898, the Great Illumination Act was passed in London, mandating the installation of electric streetlights across all major cities in the United Kingdom.", "1923-09-05": "On September 5, 1923, the International Academy of Innovation was founded in Zurich, Switzerland, promoting global collaboration in scientific research.", "1940-02-14": "On February 14, 1940, the first underwater train tunnel connecting two countries was completed between France and the United Kingdom.", "1954-11-08": "On November 8, 1954, the Global Weather Watch Program was launched, pioneering the use of satellites for monitoring Earth's climate systems.", "1977-06-30": "On June 30, 1977, the first fully solar-powered town, Solaria, was inaugurated in Arizona, setting a benchmark for renewable energy communities.", "1983-12-12": "On December 12, 1983, the Universal Language Project introduced a simplified global auxiliary language intended to foster cross-cultural communication.", "1994-04-23": "On April 23, 1994, the Oceanic Research Pact was signed, marking a commitment by 40 nations to share oceanographic research and preserve marine ecosystems.", "2009-08-15": "On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours.", "2020-01-10": "On January 10, 2020, the World Clean Air Initiative achieved its milestone goal of reducing urban air pollution levels in 50 major cities globally." } return fictional_historical_events.get(date, f"No historical event information available for {date}.") tools = [ { "function": { "name": "query_historical_event", "description": "Provides information about a historical event that occurred on a specified date.", "parameters": { "type": "object", "properties": { "date": { "type": "string", "description": "The date of the event in YYYY-MM-DD format." 
}, }, "required": ["date"] } } } ] messages = [ {"role": "system", "content": "You are a knowledgeable assistant that can retrieve information about historical events."}, {"role": "user", "content": "Can you tell me what happened on August 15, 2009?"}, ] chat_response = client.chat.completions.create( model="meta-llama/Llama-3.1-8B-Instruct", messages=messages, temperature=0.7, max_tokens=1024, top_p=0.9, frequency_penalty=0.5, presence_penalty=0.6, tools=tools, tool_choice='auto' ) if chat_response.choices[0].message.tool_calls: date_argument = json.loads(chat_response.choices[0].message.tool_calls[0].function.arguments) date = date_argument.get("date", None) response = query_historical_event(date) print("Assistant response:", response) else: print("Assistant response:", chat_response.choices[0].message.content) Response: Tool has been called with date: 2009-08-15 Assistant response: On August 15, 2009, the first international digital art exhibition was hosted simultaneously in Tokyo, Berlin, and New York, linked by live virtual tours You've successfully implemented function calling using your deployed Llama-3.1-8B model. Conclusion To wrap up, deploying large language models on vLLM with Azure Machine Learning Managed Online Endpoints is a simple and effective way to enable real-time AI-powered applications. By following the steps shared—from setting up the environment to testing the deployment—you can quickly integrate advanced models like Llama-3.1-8B-Instruct into your workflows. With vLLM's optimized performance and support for function calling, your applications can handle complex tasks and interact with other systems seamlessly. This setup helps you build smarter, faster, and more scalable AI solutions.556Views0likes0CommentsUnlocking the Power of Synthetic Data for Fine-Tuning and Evaluation
In the rapidly evolving field of large language models (LLMs) and small language models (SLMs), fine-tuning and evaluation often present unique challenges. Whether the objective is to optimize models for function-calling use cases or to validate multi-agent workflows, one thing remains constant: the need for high-quality, diverse, and contextually relevant data. But what happens when real-world data is either unavailable, incomplete, or too sensitive to use? Enter synthetic data—a powerful tool for accelerating the journey from experimentation to deployment. In this blog, we’ll explore how synthetic data can address critical challenges, why it’s indispensable for certain scenarios, and how Azure AI’s Evaluator Simulator Package enables seamless generation of synthetic interaction data to simulate user personas and scenarios. The Growing Need for Synthetic Data in LLM Development Fine-tuning or evaluating an LLM/SLM for specific use cases often requires vast amounts of labeled data tailored to the task at hand. However, sourcing such data comes with hurdles: Data Scarcity: Real-world interaction data for niche use cases may not exist in sufficient quantity. Privacy Concerns: User interactions may contain sensitive information, making direct use of this data problematic. Scenario Testing: Real-world data rarely accounts for edge cases or extreme scenarios that models must handle gracefully. Synthetic data solves these problems by creating controlled, customizable datasets that reflect real-world conditions—without the privacy risks or availability constraints. Synthetic Data for Function-Calling Use Cases Function-calling in LLMs involves executing API calls based on natural language inputs. For example, users might ask a travel app to “find flights to Paris under $500.” Fine-tuning models for such use cases requires training them on structured, intent-rich inputs paired with corresponding API call structures. Synthetic data can: Simulate diverse intents: Generate variations of user queries across languages, styles, and preferences. Provide structured outputs: Automatically align these queries with the required API call schema for training or evaluation. Include edge cases: Test how models respond to ambiguous or incomplete queries. Model evaluation post fine-tuning presents another set of challenges where we need trusted data to evaluate the performance. Hence, having synthetic data generated by a superior model followed by human screening filtering out noise can provide a rich and diverse data to compare the performance of fine-tuned vs base models. Synthetic Data in Multi-Agent Workflow Evaluation Multi-agent workflows involve multiple models (or agents) collaborating to achieve a shared goal. A restaurant recommendation system, for example, may feature one agent parsing user preferences, another querying a knowledge graph, and a third crafting human-like responses. Synthetic data can: Simulate complex user personas: From foodies to budget-conscious travelers, generating interactions that test the robustness of multi-agent collaboration. Recreate realistic workflows: Model intricate agent-to-agent interactions, complete with asynchronous communication and fallback mechanisms. Stress-test failure scenarios: Ensure agents recover gracefully from errors, misunderstandings, or timeouts. Multi-agent workflows often rely on hybrid architectures that combine SLMs, LLMs, domain-specific models, and fine-tuned systems to balance cost, latency, and accuracy. 
Synthetic data generated by a superior model can serve as a baseline for evaluating nuances like agent orchestration and error recovery. Azure AI Evaluator Simulator: A Game-Changer Azure AI's Evaluator Simulator Package offers a robust framework for generating synthetic interaction data tailored to your application needs. By simulating diverse user personas and scenarios, it provides: Realistic Simulations: Emulate a wide range of user behaviors, preferences, and intents, making it ideal for creating datasets for function-calling and multi-agent workflows. Customizability: Tailor simulations to reflect domain-specific nuances, ensuring data relevance. Efficiency: Automate data generation at scale, saving time and resources compared to manual annotation. How It Works The Azure AI Evaluation SDK’s Simulator class is designed to generate synthetic conversations and simulate task-based interactions. The module allows you to configure different personas—such as tech-savvy users, college grads, enterprise professionals, customers, supply chain managers, procurement manager, finance admin etc each interacting with your application in unique ways. You can also define the tasks that each of these users are trying to accomplish like shopping for a family event, manging inventory, preparing financial reports etc. Here’s how it operates: Model Configuration: Initialize the simulator with your model’s parameters (e.g., temperature, top_p, presence_penalty). Input Preparation: Provide input data (e.g., text blobs) for context, such as extracting text from a Wikipedia page. Prompt Optimization: Use the query_response_generating_prompty_override to customize how query-response pairs are generated. User Prompt Specification: Define user behavior using the user_simulating_prompty_override to align simulations with specific personas. Target Callback Specification: Implement a callback function that connects the simulator with your application. Simulation Execution: Run the simulator to generate synthetic conversations based on your configurations. By following these steps, developers can create robust test datasets, enabling thorough evaluation and fine-tuning of their AI applications. Example: Synthetic Data for an E-Commerce Assistant Bot Let’s walk through an example of generating synthetic data for an e-commerce assistant bot. This bot can perform tasks such as acting as a shopping assistant, managing inventory, and creating promo codes. Before we get started, make sure to install azure-ai-evaluation package to follow along Step 1: Define Functions and APIs Start by defining the core functions the bot can invoke, such as search_products, fetch_product_details, and add_to_cart. These functions simulate real-world operations. Please refer functions and function_list to access the complete list of functions and function definitions. Step 2: Configure the Simulator model_config = { "azure_endpoint": azure_endpoint, "azure_api_key": azure_api_key, "azure_deployment": azure_deployment, } from azure.ai.evaluation.simulator import Simulator simulator = Simulator(model_config=model_config) Next connect the simulator to the application. 
For this, establish the client and implement a callback function that invokes the application and facilitates interaction between the simulator and the app:

```python
import json

from typing import List, Dict, Any, Optional
from functions import *
from function_list import function_list
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider


def call_to_ai_application(query: str) -> str:
    # logic to call your application
    # use a try except block to catch any errors
    system_message = "Assume the role of e-commerce assistant designed for multiple roles. You can help with creating promo codes, tracking their usage, checking stock levels, helping customers make shopping decisions and more. You have access to a bunch of tools that you can use to help you with your tasks. You can also ask the user for more information if needed."
    completion = client.chat.completions.create(
        model=azure_deployment,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": query},
        ],
        max_tokens=800,
        temperature=0.1,
        top_p=0.2,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        tools=function_list,
        tool_choice="auto",
    )
    message = completion.choices[0].message
    # print("Message : ", message)
    # change this to return the response from your application
    return message


async def callback(
    messages: List[Dict],
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = None
    # call your endpoint or ai application here
    response = call_to_ai_application(query)
    # we are formatting the response to follow the openAI chat protocol format:
    if response.tool_calls:
        prev_messages = messages["messages"]
        func_call_messages = []
        tool_calls = response.tool_calls
        # Add the tool calls to the messages
        for tool_call in tool_calls:
            formatted_response = {"role": "assistant", "function_call": tool_call.function.to_dict()}
            func_call_messages.append(formatted_response)
        # Execute the APIs and add the responses to the messages
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = tool_call.function.arguments
            func = globals().get(function_name)
            if callable(func):
                result = json.dumps(func(**json.loads(function_args)))
                # formatted_response = {"content": result, "role": "tool", "name": function_name}
                formatted_response = {"role": "function", "content": result, "name": function_name}
                func_call_messages.append(formatted_response)
            else:
                print("Function {} not found".format(function_name))
        # Second API call: get the final response from the model
        final_response = client.chat.completions.create(
            model=azure_deployment,
            messages=prev_messages + func_call_messages,
        )
        final_response = {"content": final_response.choices[0].message.content, "role": "assistant"}
        func_call_messages.append(final_response)
        # Stringify func_call messages to store in session state
        func_call_messages = create_content_from_func_calls(func_call_messages)
        func_call_messages = {"role": "assistant", "content": func_call_messages}
        messages["messages"].append(func_call_messages)
        # messages["messages"].append(final_response)
        return {"messages": messages["messages"], "stream": stream, "session_state": session_state}
    else:
        formatted_response = {
            "content": response.content,
            "role": "assistant",
        }
        messages["messages"].append(formatted_response)
        return {
            "messages": messages["messages"],
"stream": stream, "session_state": session_state, "context": context} We have used two helper functions here : create_content_from_func_calls : It creates a string content from a list of function call dictionaries. This merges all the internal messages invoking function calls into a single string. This is needed as the simulator module ignores all internal context and only retains the latest response. split_content : Split a string content into a list of dictionaries based on specified separators. This is required for post-processing step to split the string comprising of function-call and function-response into separate messages each with its own role and content. Step 3: Define the Tasks Use the Azure AI Evaluation SDK to configure the simulator with user personas and tasks, such as: A marketing manager creating a promo code and tracking its usage. A customer making a purchase using the promo code. An inventory manager checking stock levels. Step 4: Customize user persona Internally, the SDK has a prompty file that defines how the LLM which simulates the user should behave. The SDK also offers an option for users to override the file, to support your own prompty files. Let’s override this file to build a user persona who engages in an interactive conversation with the bot and asks follow up questions while responding to bot’s response basis his persona and requirement system: You must behave as a user who wants accomplish this task: {{ task }} and you continue to interact with a system that responds to your queries. If there is a message in the conversation history from the assistant, make sure you read the content of the message and include it your first response. Your mood is {{ mood }} Make sure your conversation is engaging and interactive. Output must be in JSON format Here's a sample output: { "content": "Here is my follow-up question.", "role": "user" } Step 5 : Generate and Store Outputs: Run the simulator to generate synthetic data. You can specify the "num_conversation_turns" that defines the predetermined number of conversation turns to simulate. outputs = await simulator( target=callback, text="Assume the role of e-commerce assistant designed for multiple roles. You can help with creating promo codes, tracking their usage, checking stock levels, helping customers make shopping decisions and more. You have access to a bunch of tools that you can use to help you with your tasks. You can also ask the user for more information if needed.", num_queries=3, max_conversation_turns=5, tasks=tasks, user_simulator_prompty=user_override_prompty, user_simulator_prompty_kwargs=user_prompty_kwargs, ) Step 6 : Review and Save the Outputs Let's look at the output for one of the tasks We can see how the simulator engages in an interactive conversation with the application to accomplish the desired task and all the interaction between app and simulator is captured in the final output. Let's store the output in a file with open("output.json", "w") as f: json.dump(final_outputs, f) Conclusion Synthetic data transcends being a mere substitute for real-world data—it’s a strategic asset for fine-tuning and evaluating LLMs. By enabling precise control over data generation, synthetic datasets empower developers to simulate user behaviors, test edge cases, and optimize models for specific workflows. With tools like Azure AI’s Evaluator Simulator, generating this data has never been more accessible or impactful. 
Conclusion

Synthetic data transcends being a mere substitute for real-world data; it is a strategic asset for fine-tuning and evaluating LLMs. By enabling precise control over data generation, synthetic datasets empower developers to simulate user behaviors, test edge cases, and optimize models for specific workflows. With tools like Azure AI's Evaluator Simulator, generating this data has never been more accessible or impactful. Whether you're building models for function-calling, orchestrating multi-agent systems, or tackling niche use cases, synthetic data ensures you're equipped to deliver reliable, high-performing solutions, regardless of complexity. Start leveraging synthetic data today and unlock the full potential of your LLM projects! You can access the full code here.

References

azureai-samples/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Input_Text at main · Azure-Samples/azureai-samples
How to generate synthetic and simulated data for evaluation - Azure AI Foundry | Microsoft Learn
Generate Synthetic QnAs from Real-world Data on Azure | Microsoft Community Hub
How to use function calling with Azure OpenAI Service - Azure OpenAI Service | Microsoft Learn
Fine-tuning function calls with Azure OpenAI Service - Azure AI services | Microsoft Learn

Meta's next generation model, Llama 3.1 405B is now available on Azure AI
Microsoft, in collaboration with Meta, is launching Llama 3.1 405B, now available via Azure AI's Models as a Service, and is also introducing fine-tuned versions of Llama 3.1 8B and 70B. Leverage powerful AI for synthetic data generation and distillation, and access these models and more through Azure AI Studio and popular developer tools like prompt flow, OpenAI, LangChain, LiteLLM, and more. Streamline development and enhance efficiency with Azure AI.

Discover the Azure AI Training Profiler: Transforming Large-Scale AI Jobs
Meet the AI Training Profiler

Large-scale AI training can be complicated, especially in distributed environments like healthcare, finance, and e-commerce, where the need for accuracy, speed, and massive data processing is crucial. Efficiently managing hardware resources, ensuring smooth parallelism, and minimizing bottlenecks are crucial for optimal performance. The AI Training Profiler, powered by PyTorch Profiler in Azure Machine Learning, is here to help! By giving you detailed visibility into hardware and software metrics, this tool helps you spot inefficiencies, make the best use of resources, and scale your training workflows like a pro.

Why Choose the AI Training Profiler?

Running large AI training jobs on distributed infrastructure is inherently complex, and inefficiencies can quickly escalate into increased costs and delays in deploying models. The AI Training Profiler addresses these issues by providing a comprehensive breakdown of compute resource usage throughout the training lifecycle. This enables users to fine-tune and streamline their AI workflows, yielding several key benefits:

Improved Performance: Identify bottlenecks and inefficiencies, such as slow data loading or underutilized GPUs, to enhance training throughput.
Reduced Costs: Detect idle or underused resources, thereby minimizing compute time and hardware expenses.
Faster Debugging: Leverage real-time monitoring and intuitive visualizations to troubleshoot performance issues swiftly.

Key Features of the AI Training Profiler

GPU Core and Tensor Core Utilization
The profiler meticulously tracks GPU kernel execution, reporting utilization metrics such as time spent on forward and backward passes, tensor core operations, and other computation-heavy tasks. This detailed breakdown enables users to pinpoint under-utilized resources and optimize kernel execution patterns.

Memory Profiling
Memory Allocation and Peak Usage: Monitors GPU memory usage throughout the training process, offering insights into underutilized or over-allocated memory.
CUDA Memory Footprint: Visualizes memory consumption during forward/backward propagation and optimizer steps to identify bottlenecks or fragmentation.
Page Fault and Out-of-Memory Events: Detects critical events that could slow training or cause job failures due to insufficient memory allocation.

Kernel Execution Metrics
Kernel Execution Time: Provides per-kernel timing, breaking down execution into compute-bound and memory-bound operations, allowing users to discern whether performance bottlenecks stem from inefficient kernel launches or memory access patterns.
Instruction-level Performance: Measures IPC (Instructions Per Cycle) to understand kernel-level performance and identify inefficient operations.

Distributed Training
Communication Primitives: Captures inter-GPU and inter-node communication patterns, focusing on the performance of primitives like AllReduce, AllGather, and Broadcast in multi-GPU training. This helps users identify communication bottlenecks such as imbalanced data distribution or excessive communication overhead.
Synchronization Events: Measures the time spent on synchronization barriers between GPUs, highlighting where parallel execution is slowed by synchronization.

Getting Started with the Profiling Process

Using the AI Training Profiler is a breeze! Activate it when you launch a job, either through the CLI or our platform's user-friendly interface.
Here are the three environment variables you need to set:

Enable/Disable the Profiler: ENABLE_AZUREML_TRAINING_PROFILER: 'true'
Configure Trace Capture Duration: AZUREML_PROFILER_RUN_DURATION_MILLISECOND: '50000'
Delay the Start of Trace Capturing: AZUREML_PROFILER_WAIT_DURATION_SECOND: '1200'

Once your training job is running, the profiler collects metrics and stores them centrally. After the run, this data is analyzed to give you visual insights into critical metrics like kernel execution times.

Use Cases

The AI Training Profiler is a game-changer for fine-tuning large language models and other extensive architectures. By ensuring efficient GPU utilization and minimizing distributed training costs, this tool helps organizations get the most out of their infrastructure, whether they're working on cutting-edge models or refining existing workflows. In conclusion, the AI Training Profiler is a must-have for teams running large-scale AI training jobs. It offers the visibility and control needed to optimize resource utilization, reduce costs, and accelerate time to results. Embrace the future of AI training optimization with the AI Training Profiler and unlock the full potential of your AI endeavors.

How to Get Started?

The feature is available as a preview; just set the environment variables above and start using the profiler! A sketch of one way to pass them to a job is shown below. Stay tuned for a future repository with many samples that you can use as well!
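As a rough illustration of setting these variables when submitting a job with the Azure ML Python SDK v2 (the subscription, workspace, compute, environment, and script names are placeholders you would replace with your own):

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the workspace (identifiers below are placeholders).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Command job that runs the training script with the profiler enabled.
job = command(
    code="./src",                             # folder containing train.py (placeholder)
    command="python train.py",
    environment="azureml:my-training-env:1",  # placeholder environment
    compute="gpu-cluster",                    # placeholder compute target
    environment_variables={
        "ENABLE_AZUREML_TRAINING_PROFILER": "true",
        "AZUREML_PROFILER_RUN_DURATION_MILLISECOND": "50000",
        "AZUREML_PROFILER_WAIT_DURATION_SECOND": "1200",
    },
)

ml_client.jobs.create_or_update(job)
```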
Supercharge Your Deep Learning Workflows with NVIDIA Nsight Systems

Machine learning (ML), deep learning (DL), and AI workloads are becoming more complex, necessitating efficient resource utilization and time management. NVIDIA Nsight Systems is a powerful performance analysis tool designed to optimize these workloads by providing insights into application behavior on GPUs and CPUs. This blog post discusses the importance of optimizing ML/DL workloads to improve training times and productivity and provides an overview of NVIDIA Nsight Systems, including its key UI components. The post offers a practical example of optimizing a deep learning model using the FashionMNIST dataset. Initial profiling with Nsight Systems reveals bottlenecks in data handling rather than GPU computation. By increasing batch size, parallelizing data loading, and reducing fork operations, the training time is significantly reduced from 28 seconds to just 2.2 seconds. Further optimizations include enabling Automatic Mixed Precision (AMP) and using DistributedDataParallel for multi-GPU setups. By leveraging these optimizations and profiling tools like NVIDIA Nsight Systems, ML/DL workloads can achieve substantial performance improvements, leading to faster training times and more efficient use of resources.
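The optimizations summarized above are standard PyTorch techniques. As a brief, generic sketch (not the post's actual FashionMNIST code), larger batches, worker-based data loading, and automatic mixed precision might be wired up as follows, with the script profiled under Nsight Systems using a command like nsys profile -o report python train.py:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy data and model purely for illustration.
dataset = TensorDataset(torch.randn(10_000, 784), torch.randint(0, 10, (10_000,)))
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Larger batches and parallel data loading reduce CPU-side bottlenecks.
loader = DataLoader(dataset, batch_size=512, num_workers=4, pin_memory=True)

# Automatic Mixed Precision (AMP) speeds up GPU math where supported.
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

for features, labels in loader:
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=device.type == "cuda"):
        loss = loss_fn(model(features), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```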
Accelerate enterprise GenAI application development with tracing in Azure AI Foundry

We are excited to announce the public preview of tracing in Azure AI Foundry, a powerful capability designed to enhance monitoring and debugging for your machine learning models and applications. Tracing allows you to gain deeper insights into the performance and behavior of your models, to help ensure they operate efficiently and effectively.

Enable comprehensive monitoring and analysis of your application's execution

Tracing allows you to trace application processes from input to output, review intermediate results, and measure execution times. Additionally, detailed logs for each function call in your workflow are accessible. You can inspect parameters, metrics, and outputs of each AI model used, for easier debugging and optimization of your application. The Azure AI Foundry SDK supports tracing to various endpoints, including local viewers (Prompty trace viewer and Aspire dashboard), Azure AI Foundry, and Azure Monitor Application Insights. This flexibility helps you integrate tracing with any application, facilitating testing, evaluation, and deployment across different orchestrations and existing GenAI frameworks.

Key Capabilities

Basic debugging
In situations where your application encounters an error, the trace functionality becomes extremely useful. It allows you to delve into the function causing the error, assess the frequency of exceptions, and troubleshoot using the provided exception message and stack trace.

Detailed execution logs
Tracing captures detailed traces of your model's execution, including data preprocessing, feature extraction, model inference, and post-processing steps. These details provide valuable insights into the inner workings of your models, helping you identify bottlenecks and optimize performance. For example, understanding the call flow of an application is crucial for complex AI systems where multiple components and services interact. By enabling tracing, developers can identify bottlenecks, understand dependencies, and optimize the flow for better performance.

Performance metrics
In addition to execution logs, tracing collects key performance metrics, such as latency and token utilization. These metrics allow you to monitor the efficiency of your models and make data-driven decisions to improve their performance. Building monitoring dashboards with the data collected from tracing can provide real-time visibility into the system's health. These dashboards can track key performance indicators (KPIs), provide alerts on anomalies, and help ensure that the AI services are running as expected.

Error tracking
Tracing helps you identify and troubleshoot errors in your models by capturing detailed error logs. Whether it's a data preprocessing issue or a model inference error, tracing provides the information you need to diagnose and fix problems quickly. This is particularly useful for capturing runtime exceptions, such as rate-limiting, which are critical for maintaining the reliability of your applications.

Evaluations and user feedback
You can attach evaluation metrics and user feedback to traces via online evaluation capabilities in Azure AI Foundry. Online evaluation allows you to incorporate real-world performance data and user insights into your monitoring process, to assess whether your models meet the desired quality standards. The Azure AI Foundry SDK simplifies the process of downstream evaluation, facilitating continuous improvement and validation of AI models against real-world data.
Additionally, capturing user evaluations and interactions can provide insights into how users are engaging with the AI features, to inform user-centric improvements.

Visualize Traces

Azure AI Foundry provides robust tools for visualizing traces, both for local debugging and production-level monitoring. You can use these tools to gain a better understanding of your model's behavior and performance. The visualization capabilities include:

Local debugging: Visualize traces during development to identify and resolve issues early, helping ensure that models are optimized before deployment.
Visualize the data via the Azure AI Foundry portal and Azure Monitor: In the post-deployment phase, developers often want to delve deeper into their applications' performance to optimize it further. For instance, you might want to monitor your GenAI application's performance, usage, and costs. In this scenario, the trace data for each request, the aggregated metrics, and user feedback become vital. Tracing seamlessly integrates with Azure Monitor, allowing you to visualize and analyze your model's performance metrics and logs using a customizable dashboard in Azure Monitor Application Insights. This integration provides a holistic view of your model's health and performance, enabling you to make informed decisions.

Getting Started

To start using tracing in Azure AI Foundry and Azure Monitor, follow these simple steps:

Log Traces: Enable tracing via the Azure AI SDK to capture traces for the model inference API.
Configure Logging: Set up the logging configuration to capture the desired level of detail for your model's execution.
Enable Tracing in AI Studio: In your Azure AI project, navigate to Tracing and enable the feature for your models.
Monitor and Analyze: Use Azure Monitor to visualize and analyze the collected logs and metrics, gaining insights into your model's performance.

A generic sketch of instrumenting application code with OpenTelemetry and exporting the spans to Application Insights is included at the end of this post.

Find detailed guidance in our documentation:

Overview of tracing capabilities in Azure AI Foundry
Learn how to implement and use tracing with the Azure AI Foundry SDK
Visualize your traces
Build production-ready GenAI apps with Azure AI Foundry

Want to learn about more ways to build and monitor enterprise-ready GenAI applications? Here are other exciting announcements from Microsoft Ignite to support your GenAIOps workflows:

New ways to evaluate generative AI outputs for quality and safety
New ways to monitor performance with Azure AI Foundry and Azure Monitor

Whether you're joining in person or online, we can't wait to see you at Microsoft Ignite 2024. We'll share the latest from Azure AI and go deeper into best practices for GenAIOps with these sessions:

Microsoft Ignite Keynote
Multi-agentic GenAIOps from prototype to production with dev tools
Azure AI and the dev toolchain you need to infuse AI in all your apps
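As referenced above, here is a minimal, generic sketch of the instrumentation itself, using OpenTelemetry with the Azure Monitor exporter. It is not the Azure AI Foundry SDK's own tracing helper; the package names are real, but the connection string, span name, and attributes are placeholders, and you should follow the documentation linked above for the Foundry-specific setup.

```python
import os

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Send OpenTelemetry traces to Application Insights
# (connection string comes from your Application Insights resource).
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

tracer = trace.get_tracer(__name__)


def answer_question(question: str) -> str:
    # Wrap an application step in a span so inputs, outputs, and timing show up in traces.
    with tracer.start_as_current_span("answer_question") as span:
        span.set_attribute("app.question_length", len(question))
        # ... call your model or agent here (placeholder) ...
        answer = "placeholder answer"
        span.set_attribute("app.answer_length", len(answer))
        return answer


print(answer_question("What does tracing capture?"))
```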