Azure AI Foundry Blog

Scalable and Efficient Fine-Tuning of LLM on Azure ML

JamesN (Microsoft)
Jan 23, 2025

As Large Language Models (LLMs) continue to revolutionize industries with their ever-expanding capabilities, fine-tuning them efficiently and at scale has become a critical challenge. Fine-tuning is how we adapt these general-purpose models to excel at specific tasks, whether that's summarizing legal documents, generating creative copy, or analyzing sentiment in customer reviews. However, the massive size and computational demands of LLMs make fine-tuning no small feat. In this blog, we'll dive into strategies for scalable fine-tuning on Azure Machine Learning (Azure ML) and share experimental results that demonstrate the power of distributed training. Two highlights from our experiments:

  • 3x faster fine-tuning by scaling from one machine to three.
  • A 70B-parameter model fine-tuned successfully on older V100 GPUs, a feat made possible by efficient distributed training techniques.

Curious to try it yourself? The code is available here:

https://github.com/james-tn/llm-fine-tuning/tree/main/opensource_llm/single_step

Co-Author: Mohamad AL jazaery

Why Scalable and Efficient Fine-Tuning Matters

Faster Iterations, Shorter Time-to-Value:

In today’s competitive AI landscape, time is of the essence. The faster you can fine-tune a model, the quicker you can validate ideas, test hypotheses, and bring solutions to market.

High-end GPU machines are costly:

High-performance GPUs and compute clusters don’t come cheap, and their availability is often limited. Efficient fine-tuning techniques, such as model sharding and distributed training, maximize the utilization of these precious resources—ensuring that you get the most out of your infrastructure investment.

 

Choosing the Right Azure ML GPU Compute for the Job: NC or ND?

Not all GPU computes are created equal, and choosing the right SKU can make or break your training efficiency.

  • ND Series: Ideal for distributed training across multiple nodes, such as pretraining LLMs or fine-tuning very large models (~70B parameters), thanks to InfiniBand (IB) connectivity that ensures high-speed communication between nodes.
  • NC Series: Best for small and medium workloads that don't require heavy inter-node communication, such as LLM inferencing or fine-tuning mid-sized LLMs.

Azure GPU Machine Options by Scenario:

| Scenario | Common model size | Training Approach | Recommended Azure Compute |
| --- | --- | --- | --- |
| Small-scale fine-tuning | < 3B parameters | Parameter-efficient tuning | NCas_T4_v3 (Tesla T4, 16 GB) |
| Medium-scale fine-tuning | 1–5B parameters | Full or parameter-efficient | NCs_v3 (Tesla V100, 16 GB) |
| Distributed training for medium models | 5–10B parameters | Full fine-tuning | ND_v2 (Tesla V100 NVLINK, 32 GB, InfiniBand) |
| Large-scale fine-tuning (single machine) | 10–30B parameters | Full or parameter-efficient | NC_A100_v4 (A100, 40 GB) |
| Distributed training for very large models | 20–70B parameters | Full fine-tuning | NDasrA100_v4 (A100, 80 GB, HDR InfiniBand) |
| Very large model training (single machine) | up to 70B parameters | Full or parameter-efficient | NCads_H100_v5 (H100 NVL, 94 GB) |
| Massive-scale distributed training | > 70B parameters | Full fine-tuning | ND-H100-v5 (H100, 80 GB, scale-out InfiniBand) |
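
To make this concrete, here is a minimal sketch of an Azure ML CLI v2 compute definition for an NDv2 cluster like the one used in our experiments. The cluster name matches the azureml:ndv2-cluster referenced in the job YAMLs later in this post; the VM size and instance counts are illustrative assumptions you would adjust for your own quota and workload:

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: ndv2-cluster          # referenced as azureml:ndv2-cluster in the job definitions below
type: amlcompute
size: Standard_ND40rs_v2    # NDv2: 8x Tesla V100 (32 GB) per node, InfiniBand-connected
min_instances: 0            # scale to zero when idle to control cost
max_instances: 4            # enough nodes for the multi-node experiments below
tier: dedicated

The cluster can then be created with the Azure ML CLI, for example via az ml compute create --file pointing at this definition.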

 

 

Distributed Efficient Training: A Quick Guide

When scaling fine-tuning tasks, choosing the right distributed training method is key:

  • DDP (Distributed Data Parallel): Works well when the entire model fits on a single GPU. It replicates the model across multiple GPUs and splits the data for parallel processing. See Experiment 1 in the following section.
  • Model Parallelism: A game-changer for massive models that don't fit on a single GPU. It shards not only the data but also the model parameters and optimizer states across multiple GPUs, enabling efficient training of models like LLaMA-70B on GPUs with limited memory. Both FSDP and DeepSpeed excel at implementing advanced forms of model parallelism and memory optimization.
  • Memory Optimization Techniques
    • Gradient Checkpointing: Reduces memory by recomputing activations during the backward pass, trading memory for additional computation.
    • Mixed Precision Training: Reduces memory usage by using FP16 or BF16 instead of FP32, accelerating training while maintaining numerical stability.
      Supported by both frameworks.
    • Quantization (DeepSpeed Exclusive): Uses INT8 precision for weights and activations, dramatically reducing memory and compute requirements.
    • Offloading (DeepSpeed Exclusive): Offloads optimizer states and model parameters to CPU or NVMe, freeing up GPU memory for computation.
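
Most of these optimizations can be switched on through configuration rather than code changes. As an illustrative sketch (the values below are assumptions, not settings from our experiments), a Hugging Face Accelerate config that combines DeepSpeed ZeRO stage 3 with CPU offloading and mixed-precision training could look like this:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16              # use fp16 instead on GPUs without bfloat16 support (e.g., V100)
num_machines: 1
num_processes: 8                   # one process per GPU on a single 8-GPU node
deepspeed_config:
  zero_stage: 3                    # shard parameters, gradients, and optimizer states across GPUs
  offload_optimizer_device: cpu    # move optimizer states to host memory
  offload_param_device: cpu        # move parameters to host memory when not in use
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: true            # initialize large models directly in sharded form

Gradient checkpointing, by contrast, is typically enabled in the training script itself, for example via gradient_checkpointing_enable() on a Hugging Face Transformers model.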

 

 

Our Experiments: Pushing the Limits of Scalability

Experiment 1: Distributed Training on Multiple Nodes using DDP

We conducted an experiment to fine-tune the Llama-3.1-8B model using LoRA (Low-Rank Adaptation) on Azure ML NDv2-V100 nodes. The goal was to evaluate the efficiency of fine-tuning across different numbers of nodes (1, 2, and 3) and observe the impact on training time and throughput.

Azure ML Job YAML Definition

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml/models/mistralai-Mistral-7B-v01/versions/19
# --num_processes = GPUs per machine * number of machines
command: >
  accelerate launch
  --num_processes 16
  --num_machines 2
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 2  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
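
For readers more familiar with Accelerate config files than launch flags, the arguments passed to accelerate launch above map directly to config fields (DDP corresponds to Accelerate's MULTI_GPU distributed type). A hypothetical equivalent for the two-node run is sketched below; the rank, address, and port values are normally supplied per node at launch time, which is why the job passes $NODE_RANK, $MASTER_ADDR, and $MASTER_PORT instead:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU   # plain data parallelism (DDP): a full model replica per GPU
num_machines: 2               # nodes in the cluster
num_processes: 16             # 8 GPUs per NDv2 node * 2 nodes
machine_rank: 0               # set per node (passed as $NODE_RANK in the job above)
main_process_ip: 10.0.0.4     # example address; resolved from $MASTER_ADDR at runtime
main_process_port: 29500      # example port; resolved from $MASTER_PORT at runtime
mixed_precision: fp16         # V100 GPUs support fp16 but not bf16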

 

Results:

As we increased the number of nodes from one to three, throughput increased proportionally. This indicates that the system scaled efficiently with the addition of more nodes, maintaining a close-to-linear improvement in throughput.

 
Experiment 2: Model Parallelism using FSDP

Fine-tuning a 70B-parameter model on GPUs with only 32 GB of memory might sound impossible, but we made it happen using FSDP (Fully Sharded Data Parallel) on Azure ML with a cluster of multiple NDv2 V100 nodes. By distributing not only the data but also the model parameters and optimizer states across multiple nodes, we unlocked the power of full sharding.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
code: ./  # Path to your training script and related files
inputs:
  model_dir:
    path: azureml://registries/azureml-meta/models/Llama-3.3-70B-Instruct/versions/4
command: >
  accelerate launch
  --config_file "configs/fsdp_config.yaml"
  --num_processes 32
  --num_machines 4
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  train.py
compute: azureml:ndv2-cluster
resources:
  instance_count: 4  # Number of nodes for distributed training
distribution:
  type: pytorch
  process_count_per_instance: 1  # Number of processes per node
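
The job above hands Accelerate a configs/fsdp_config.yaml file whose contents aren't reproduced in this post. For orientation, a minimal sketch of such a config, assuming full sharding and fp16 mixed precision (V100 GPUs don't support bf16), might look like the following; treat the individual options as illustrative defaults rather than the exact file we used:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: fp16                            # V100s: fp16 rather than bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD             # shard parameters, gradients, and optimizer states
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP  # wrap each transformer block as an FSDP unit
  fsdp_state_dict_type: SHARDED_STATE_DICT       # save checkpoints in sharded form
  fsdp_offload_params: false                     # set to true to offload parameters to CPU if memory is still tight
  fsdp_sync_module_states: true                  # broadcast weights from rank 0 so all shards start consistent
  fsdp_use_orig_params: true                     # keep original parameter objects visible to the optimizer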

Key Takeaways:

  • Memory Efficiency: Full sharding enabled us to fine-tune the LLaMA-70B model on V100 GPUs despite their limited memory.
  • Connectivity Matters: The InfiniBand (IB) connectivity of ND nodes played a critical role in ensuring smooth communication across GPUs, making this feat possible.

 

Conclusion

Scalable and efficient fine-tuning is the key to unlocking the true potential of Large Language Models. By leveraging distributed training techniques, such as FSDP and DDP, and optimizing compute resources on Azure ML, researchers and practitioners can overcome the challenges of training massive models—reducing costs, accelerating time-to-value, and driving AI innovation.

Access the code and start experimenting here!

 

Future work:
The second part will focus on real-world pipeline setups, including end-to-end model training, hyperparameter optimization, and testing. The third part will dive into deploying trained models for practical use. Future posts may explore best practices for specific fine-tuning scenarios and techniques.

Updated Apr 14, 2025
Version 10.0