Azure AI Foundry Blog

SLM Model Weight Merging for Federated Multi-tenant Requirements

Apr 24, 2025

Authors: Jyotsna Ravi (Jyotsna.Ravi@microsoft.com), Sudarsan Lakshminarayanan (sudarsanl@microsoft.com), Srinath Gopalakrishnan (Srinath.Gopalakrishnan@microsoft.com)

Model merging is a technique for combining the parameters of multiple models, typically fine-tuned variants of a common base model, into a single unified model. In the context of Small Language Models (SLMs), which are lightweight and efficient, merging lets us maintain variants of a domain-specialized base model for different tenant-specific requirements (for example, fine-tuning the base model on a tenant's own dataset) and transfer what those variants learn back into the base model, without exposing the data used for the tenant-specific fine-tuning.

Model merging operates at the parameter level, using techniques such as weighted averaging, SLERP (Spherical Linear Interpolation), task arithmetic, or advanced methods like TIES, yielding a model that preserves both the general abilities of the base model and the nuanced strengths acquired during fine-tuning.

This approach has gained attention as organizations increasingly face scenarios where the ability to fine-tune LLMs is readily available, but the closed nature of commercial LLMs and constraints on training data, due to privacy regulations or intellectual property concerns, make it difficult to take advantage of tenant-specific datasets. Traditional approaches to creating multi-purpose models typically require centralized access to all training data, which presents a substantial barrier in privacy-sensitive contexts. Data-independent knowledge fusion through model merging offers a solution by working directly in the parameter space of the SLMs: because SLMs are open-weight, users are free to host the models and fine-tune their weights as their requirements dictate.

Consider enterprise multi-tenant scenarios where a service provider hosts a base or fine-tuned model (let's call this the "central model") that customers can access from their tenants. Customers can bring tenant-specific domain data to fine-tune the central model, so different customers end up with their own fine-tuned models in their tenants, and the central model can then be updated by merging the tenant-specific models back in. This brings the knowledge from the tenant models into the central model without accessing the data used for fine-tuning.

In scenarios where customer deployments need fine-tuned models based on specific customer data, we can fine-tune a central/base model on customer-specific datasets. In the financial or healthcare domains, there can also be regulatory requirements to keep separate models for each customer.

  • We can have a “central” base Small Language Model fine-tuned on a specific domain dataset that is customer/tenant agnostic.
  • This model can be further fine-tuned for a specific customer using their own dataset (with PII purged if needed).
  • We can merge the customer-specific model deployments with the central model to transfer knowledge from the tenant-specific models, without directly using the data behind them.
  • For better control over the accuracy of the “central” base model, a validation run can confirm that each iteration improves on the previous version of the model.
  • The updated “central” base model can then serve as the basis for further tenant-specific fine-tuning.
  • It is imperative to version the datasets, models, and pipelines so that an audit and governance system is in place.
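The validation and versioning steps above can be sketched as a small gate: a merged candidate for the central model is accepted as a new version only if it improves on the previous one, and every accepted version is recorded for audit. This is a minimal, illustrative sketch; the `evaluate` metric, the reference point, and the registry structure are all made-up stand-ins for a real validation suite and model registry.

```python
# Hypothetical validation-gated update loop for the "central" model.
# Models are represented as flat lists of floats for illustration only.

def evaluate(model):
    """Stand-in metric: real systems would run a held-out validation suite.
    Here, higher is better (negative squared distance to a made-up target)."""
    target = [1.0, 0.4, 2.6, -1.0]  # hypothetical reference point
    return -sum((m - t) ** 2 for m, t in zip(model, target))

def gated_update(registry, candidate):
    """Append the candidate as a new version only if validation improves."""
    current = registry[-1]["model"]
    if evaluate(candidate) > evaluate(current):
        registry.append({"version": registry[-1]["version"] + 1,
                         "model": candidate})
        return True
    return False

# Version 1 is the initial central model; each accepted merge adds a version.
registry = [{"version": 1, "model": [1.0, 0.0, 2.0, -1.0]}]
```

Keeping rejected candidates out of the registry means every recorded version is, by construction, an improvement over its predecessor, which is exactly the audit trail the governance requirement calls for.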

Approaches for weight update

Several technical approaches have been developed for model merging, each with distinct advantages.

  • Weight Averaging: This is one of the simplest and most common methods. It involves averaging the weights of the models being merged, either as a simple average or a weighted average in which models with better performance or greater relevance to the target task are given more weight. On its own, this method often leads to suboptimal performance because it fails to account for the relative importance of different parameters.
    • More sophisticated approaches include Fisher-weighted averaging, which uses the Fisher Information Matrix to estimate the importance of each parameter in the model and assigns weights to the models accordingly.
  • Task Vectors: This technique involves identifying "task vectors" in the weight space that correspond to specific tasks or fine-tuning objectives. Merging can then be performed by combining these task vectors, allowing for more targeted and controlled merging.  
    • When we provide input-output pairs as in-context examples, language models can infer the mapping from inputs to outputs and understand the task. LLMs implicitly compress this mapping into a latent activation called the task vector. For example, if we give examples of country names and the corresponding currency names, the model can encode the relationship between them.
    • A task vector is used to encapsulate the adjustments needed by a model to specialize in a specific task. It is derived from the differences between a pre-trained model's parameters and those fine-tuned for a particular task. Task Arithmetic algorithms compute a task vector for each individual task, using the set of model parameters. These task vectors are then aggregated to form a multi-task vector. Subsequently, the multi-task vector is combined with the pre-trained model parameters to obtain the final multi-task model.
  • SLERP (Spherical Linear Interpolation)
    • Normalize: The input vectors are normalized to unit length and placed on the sphere’s surface, so they represent directions rather than magnitudes.
    • Calculate the angle between the two vectors. This angle tells us how far apart the points are on the sphere.
    • Decide on the blending factor between the two vectors. A halfway blend yields a point exactly midway between them along the sphere. SLERP uses the angle and the chosen blend amount to find the new point along the curve of the sphere.
  • TIES (TrIm, Elect Sign & Merge)
    • Trim: This initial step involves refining the task-specific models by trimming unnecessary parameters, focusing the model on essential elements for each task.
    • Elect Sign of Parameters: In this step, the algorithm selects the appropriate signs for the parameters, ensuring that the integrated model parameters are optimally oriented for multi-task learning.
    • Disjoint Merge: Finally, the method performs a disjoint merge to combine the task-specific parameters into a single cohesive task vector.
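The four approaches above can be illustrated with a toy sketch. Real checkpoints are dictionaries of tensors; here each "model" is a flat list of floats with made-up values, and the TIES trimming ratio and the specific tenant numbers are assumptions for the example, not part of any real system.

```python
import math

# Toy "models": hypothetical 4-parameter base and two fine-tuned variants.
base     = [1.0, 0.0, 2.0, -1.0]
tenant_a = [1.5, 0.5, 2.0, -1.0]
tenant_b = [0.5, 0.0, 3.0, -1.0]

def weighted_average(models, weights):
    """Weight Averaging: per-parameter (weighted) mean of the models."""
    total = sum(weights)
    return [sum(w * m[i] for m, w in zip(models, weights)) / total
            for i in range(len(models[0]))]

def task_vector(finetuned, base):
    """Task arithmetic: tau = theta_finetuned - theta_base."""
    return [f - b for f, b in zip(finetuned, base)]

def add_task_vectors(base, vectors, scale=1.0):
    """theta_merged = theta_base + scale * sum_i tau_i."""
    return [b + scale * sum(v[i] for v in vectors)
            for i, b in enumerate(base)]

def slerp(v0, v1, t):
    """Spherical linear interpolation between two parameter vectors."""
    norm = lambda v: math.sqrt(sum(x * x for x in v)) or 1.0
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm(v0) * norm(v1))
    theta = math.acos(max(-1.0, min(1.0, dot)))  # angle between the vectors
    if theta < 1e-6:                              # nearly parallel: plain lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s = math.sin(theta)
    return [(math.sin((1 - t) * theta) * a + math.sin(t * theta) * b) / s
            for a, b in zip(v0, v1)]

def ties_merge(base, finetuned_models, keep_ratio=0.5):
    """TIES: trim small updates, elect a sign per parameter, disjoint-merge."""
    taus = [task_vector(m, base) for m in finetuned_models]
    trimmed = []
    for tau in taus:  # Trim: keep only the largest-magnitude fraction.
        k = max(1, int(len(tau) * keep_ratio))
        threshold = sorted((abs(x) for x in tau), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold and x != 0 else 0.0
                        for x in tau])
    merged = []
    for i in range(len(base)):
        vals = [t[i] for t in trimmed if t[i] != 0.0]
        if not vals:
            merged.append(base[i])
            continue
        sign = 1.0 if sum(vals) >= 0 else -1.0     # Elect sign per parameter.
        agree = [v for v in vals if v * sign > 0]  # Disjoint merge: agreeing
        merged.append(base[i] + sum(agree) / len(agree))  # values only.
    return merged
```

Note how TIES differs from plain averaging on the first parameter: tenant A pushed it up and tenant B pushed it down, so averaging cancels the updates, while TIES elects a sign and keeps only the agreeing update.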

Approach for Multi-tenant implementation

In a federated multi-tenant architecture for language models, a single base model serves multiple tenants, each with potentially unique data and requirements. Model merging can be employed in this architecture as follows:

  1. Tenant-Specific Fine-tuning: Each tenant's data is used to fine-tune a copy of the base model. This results in multiple fine-tuned models, each specialized for a particular tenant's needs. This fine-tuning process can be done in a federated manner, where models are trained locally on tenant data and only model updates are shared, preserving data privacy.
  2. Performance Evaluation: After fine-tuning, each tenant-specific model is evaluated on a relevant validation dataset. This dataset could be specific to the tenant or a shared benchmark dataset. The evaluation metrics will depend on the task, but could include metrics like accuracy, F1-score, perplexity, or BLEU score.  
  3. Performance-Based Merging: Based on the performance evaluations, a decision is made on which tenant-specific models to merge back into the base model. This could involve:
    • Selecting Top-Performing Models: Only the models that achieve a certain performance threshold or rank among the top performers are selected for merging.
    • Weighted Averaging based on Performance: Models are merged using weighted averaging, where the weights are determined by their performance scores. Higher-performing models contribute more to the merged model.
    • Dynamic Merging: The merging process can be dynamic and iterative. After an initial merge, the merged model can be further fine-tuned and re-evaluated, and the merging process can be repeated with potentially different weights or models.
  4. Updating the Base Model: The selected and merged models are combined to update the base model. This updated base model now incorporates knowledge from multiple tenants, potentially improving its overall performance and adaptability.
  5. Serving Tenants: The updated base model can then be used as the starting point for fine-tuning for new tenants or as a generally improved model for all existing tenants.
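Steps 2–4 above (evaluate, select by threshold, merge with performance-proportional weights) can be sketched as follows. The tenant names, validation scores, and parameter values are all hypothetical, and models are flat lists of floats rather than real checkpoints; the 0.7 threshold is an arbitrary example.

```python
# Hypothetical tenant-specific models and their validation scores (step 2).
tenant_models = {
    "tenant_a": [1.5, 0.5, 2.0, -1.0],
    "tenant_b": [0.5, 0.0, 3.0, -1.0],
    "tenant_c": [1.0, 0.2, 2.1, -0.9],
}
scores = {"tenant_a": 0.82, "tenant_b": 0.78, "tenant_c": 0.55}

def merge_by_performance(models, scores, threshold=0.7):
    """Steps 3-4: keep only models above the performance threshold, then
    average them with weights proportional to their validation scores."""
    selected = {n: m for n, m in models.items() if scores[n] >= threshold}
    if not selected:
        raise ValueError("no tenant model met the performance threshold")
    total = sum(scores[n] for n in selected)
    dim = len(next(iter(selected.values())))
    return [sum(scores[n] * m[i] for n, m in selected.items()) / total
            for i in range(dim)]

updated_base = merge_by_performance(tenant_models, scores)
```

With these numbers, tenant_c falls below the threshold and is excluded, while tenant_a contributes slightly more than tenant_b to the updated base model; in the dynamic variant described above, this merge-evaluate cycle would simply be repeated.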

Reference Architecture

Source: Architectural approaches for AI and ML in multitenant solutions - Azure Architecture Center | Microsoft Learn

 

Advantages of Model Merging

  • Efficient Customization: Model merging allows for tenant-specific customization without requiring a separate base model for each tenant, saving storage and computational resources.
  • Knowledge Sharing: Model merging facilitates knowledge sharing across tenants, allowing the base model to benefit from the collective learning of all tenants.
  • Privacy Preservation: As tenants never share raw data—only model updates are exchanged—the federated setup maintains data privacy while still benefiting from diverse local training.
  • Model Robustness: By leveraging performance evaluations during merging, the system can adaptively incorporate the most effective tenant updates, ensuring that the final model is robust across different domains.
  • Performance Enhancement: By selectively merging high-performing models, the overall performance of the base model can be improved; merging models fine-tuned on diverse tenant data leads to an improved base model.

Considerations and Challenges

  • Catastrophic Forgetting: Care must be taken to avoid catastrophic forgetting during fine-tuning and merging. Techniques like regularization or continual learning strategies might be necessary.
  • Tenant Interference: There's a risk of negative transfer or interference between tenants if their data or tasks are too dissimilar. Careful selection of merging strategies and evaluation metrics is important.
  • Evaluation Metrics: Choosing appropriate evaluation metrics that accurately reflect the desired performance for each tenant and for the merged model is crucial.
  • Computational Cost: While model merging can be more efficient than training separate models from scratch, the process of fine-tuning, evaluating, and merging still incurs computational costs.
  • Scalability: As the number of tenants grows, the merging process needs to be scalable and efficient.

 

References:

FlagEmbedding/research/LM_Cocktail at master · FlagOpen/FlagEmbedding

https://arxiv.org/abs/2311.13534

https://arxiv.org/html/2408.07666v4

https://openreview.net/pdf?id=FCnohuR6AnM

Fine, I'll Merge It Myself: A Multi-Fidelity Framework for Automated Model Merging

https://developer.nvidia.com/blog/an-introduction-to-model-merging-for-llms/

SLERP For Model Merging – A Primer

Task Vectors are Cross-Modal

Task Vectors in In-Context Learning: Emergence, Formation, and Benefits

TIES-Merging: Resolving Interference When Merging Models (https://arxiv.org/abs/2306.01708)

https://tanganke.github.io/fusion_bench/algorithms

Architectural approaches for AI and ML in multitenant solutions - Azure Architecture Center | Microsoft Learn

Updated Apr 24, 2025
Version 1.0