Image generated by Stable Diffusion

Merge Large Language Models

Combine Mistral, WizardMath and CodeLlama in one model!

Sergei Savvov
7 min read · Jan 22, 2024


Let’s imagine you have several LLMs: one excels at solving mathematical problems, and another is adept at writing code. Switching between two models is cumbersome, so you can merge them to get the best of both worlds. And it’s indeed possible! You won’t even need a GPU for this task.

Model merging is a novel technique that has recently been gaining popularity. It combines multiple models into a single one. In doing so, you not only retain quality, but you also get additional benefits: the merged model often performs better across tasks than its parents, as the Open LLM Leaderboard clearly demonstrates:

All these models were obtained by merging

In this article, we will explore various merging algorithms, study how you can implement them, and delve into their working principles. We’ll show through examples how to merge models like Mistral, WizardMath, and CodeLlama using the mergekit toolkit. There will be useful links and resources at the end to further explore this topic.

“Short” Summary

  1. Model merging increases the quality of the final model. With the right expertise and model selection, you can achieve SOTA results.
  2. There are numerous merging algorithms, most of which use weighted averages. However, new methods like DARE and TIES have emerged, addressing the limitations of previous techniques.
  3. It’s feasible to merge models with different architectures, for example: LLaMA 2 + Mistral + Wizard.
  4. All algorithms discussed in the article are accessible through the mergekit tool, with more usage examples available in its ‘examples’ folder.

Merge algorithms

There are several algorithms for combining models. Many of them use various combinations of weighted averages. However, some propose different approaches. In this article, I will focus on some algorithms that I found interesting and arrange them in order of increasing complexity.

1. Task Vector Arithmetic

This method introduces a new paradigm for modifying the behavior of neural networks using “task vectors.” These vectors represent directions in the weight space of a pre-trained model, pointing towards improved performance on a specific task.

Vectors can be manipulated through arithmetic operations like negation and addition, allowing for targeted behavior changes in the model:

Schematic illustration of Task Vector Arithmetic
  • Negation to Decrease Performance: Negating a task vector diminishes the model’s performance on the target task while maintaining its behavior on control tasks.
  • Addition for Multi-Task Improvement: Adding task vectors can enhance the model’s performance across multiple tasks simultaneously.
  • Analogical Task Improvement: Combining task vectors from tasks linked by an analogy relationship (“A is to B as C is to D”) can improve performance on the fourth task, even without using any data from it.

Advantages of Task Vector Arithmetic:

  1. Efficient Model Editing: This approach provides a simple and effective way to edit models, enabling performance improvements, bias mitigation, and updating models with new information.
  2. Versatility Across Models and Tasks: The method has been shown to work well with various models and tasks.

In summary, task vector-based model editing offers a novel and versatile approach for controlling and improving the performance of neural network models in various tasks.
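To make this concrete, here is a minimal sketch of task vector arithmetic over raw PyTorch state dicts. It assumes all checkpoints share the same architecture and base weights; the checkpoint names are hypothetical, and this is an illustration, not mergekit’s actual implementation:

import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    # A task vector is the element-wise difference between
    # fine-tuned and pre-trained weights.
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base: dict, vectors: list, scale: float = 1.0) -> dict:
    # Add scaled task vectors to the base weights.
    # A negative scale negates a vector, deliberately suppressing that task.
    merged = {k: v.clone() for k, v in base.items()}
    for vec in vectors:
        for k in merged:
            merged[k] += scale * vec[k]
    return merged

# Hypothetical usage with two fine-tunes of the same base model:
# merged = apply_task_vectors(
#     base, [task_vector(math_ft, base), task_vector(code_ft, base)], scale=0.5
# )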

Article | Code

2. SLERP

SLERP addresses the limitations of traditional weight averaging in model merging. It offers a more nuanced approach, blending models in a way that preserves the unique characteristics and curvature of each parent model in high-dimensional spaces.

Advantages of SLERP:

  • Smooth Transitions: ensures smoother parameter transitions, crucial in high-dimensional vector interpolations.
  • Preservation of Characteristics: maintains the distinct features and curvature of both parent models.
  • Nuanced Blending: accounts for geometric and rotational properties in vector space, resulting in a mixture that accurately reflects the characteristics of both models.

Steps in SLERP implementation:

Schematic illustration of SLERP algorithm operation
  1. Normalization: The input vectors are normalized to unit length, so the interpolation works with directions rather than magnitudes.
  2. Angle Calculation: The angle between the vectors is computed from their dot product, and scale factors are derived from the interpolation factor and this angle.
  3. Vector Weighting and Summation: The original vectors are weighted by these scale factors and summed to obtain the interpolated vector.

SLERP is characterized by its ability to merge models with smooth transitions between parameters while preserving each model’s unique characteristics, making it a preferred method for careful model merging. Note, however, that SLERP is limited to pairwise combinations: it can only merge two models at a time.
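The steps above map directly onto the classic SLERP formula. Below is a minimal NumPy sketch that interpolates two flattened weight tensors; it illustrates the math rather than mergekit’s exact code:

import numpy as np

def slerp(t: float, v0: np.ndarray, v1: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Step 1: normalize copies to unit length (directions, not magnitudes)
    u0 = v0 / (np.linalg.norm(v0) + eps)
    u1 = v1 / (np.linalg.norm(v1) + eps)
    # Step 2: angle between the vectors via the dot product
    dot = np.clip(np.dot(u0, u1), -1.0, 1.0)
    theta = np.arccos(dot)
    # Fall back to linear interpolation when the vectors are nearly parallel
    if np.abs(np.sin(theta)) < eps:
        return (1.0 - t) * v0 + t * v1
    # Step 3: weight the original vectors by the scale factors and sum
    w0 = np.sin((1.0 - t) * theta) / np.sin(theta)
    w1 = np.sin(t * theta) / np.sin(theta)
    return w0 * v0 + w1 * v1

# Hypothetical per-tensor usage:
# merged = slerp(0.5, layer_a.ravel(), layer_b.ravel()).reshape(layer_a.shape)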

Good explanation video | Code

3. TIES

Traditional model merging methods face significant challenges, particularly in handling interference between parameters from different models. This interference leads to a substantial drop in performance when merging multiple models.

To overcome these challenges, the TIES method introduces three steps:

Schematic illustration of TIES algorithm operation
  1. Resetting Parameters: It resets parameters that have only changed marginally during fine-tuning. This step helps in reducing redundancy.
  2. Resolving Sign Conflicts: It resolves conflicts arising from differing signs of parameter values across models.
  3. Selective Merging: It merges only those parameters that align with the final agreed-upon sign.

The TIES-Merging approach has been shown to outperform several existing merging methods in various settings. It effectively addresses interference issues, particularly sign interference, enhancing the overall performance of the merged model.
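For intuition, here is a condensed sketch of the three TIES steps applied to PyTorch state dicts. It assumes all models were fine-tuned from a shared base and skips edge cases that the real implementation handles:

import torch

def ties_merge(base: dict, finetuned: list, density: float = 0.5) -> dict:
    merged = {}
    for name in base:
        # Deltas of each fine-tuned model from the shared base
        deltas = torch.stack([ft[name] - base[name] for ft in finetuned])
        # Step 1 (trim): keep only the top-`density` fraction of each delta by magnitude
        flat = deltas.abs().flatten(start_dim=1)
        k = max(1, int(density * flat.shape[1]))
        thresh = flat.topk(k, dim=1).values[:, -1]
        thresh = thresh.view(-1, *([1] * (deltas.dim() - 1)))
        deltas = torch.where(deltas.abs() >= thresh, deltas, torch.zeros_like(deltas))
        # Step 2 (elect sign): majority sign, weighted by total magnitude
        elected = torch.sign(deltas.sum(dim=0))
        # Step 3 (disjoint merge): average only deltas agreeing with the elected sign
        agree = (torch.sign(deltas) == elected).float()
        merged_delta = (deltas * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1.0)
        merged[name] = base[name] + merged_delta
    return merged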

Article | Code

4. DARE

DARE is a novel approach that requires neither retraining nor GPUs. It merges the parameters of homologous models (models fine-tuned from the same base) to gain new capabilities. It uses an approach similar to TIES, with two main differences:

  1. Pruning of Delta Parameters: identifies and eliminates most delta parameters (the differences between fine-tuned and pre-trained parameters) by setting them to zero. This process does not significantly affect the capabilities of models. Larger models can discard a higher proportion of these parameters.
  2. Rescaling Weights: includes a rescaling step where the weights of the models are adjusted to keep the output expectations approximately unchanged. This involves adding the rescaled weights of the models to the weights of the base model with a scale factor.

The algorithm works according to the following steps:

Schematic illustration of DARE algorithm operation
  1. Pruning: randomly resets most fine-tuned weights to their original pre-trained values, discarding unnecessary parameter changes.
  2. Merging: averages parameters from multiple models to create a single, unified model.
  3. Rescaling: adjusts the merged model’s weights to preserve its expected performance.

In summary, DARE offers a unique and efficient way to merge Language Models by strategically pruning and rescaling parameters, leading to models with enhanced and diverse capabilities without the need for extensive retraining.
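A minimal sketch of the DARE drop-and-rescale step, again assuming models fine-tuned from a shared base (illustrative only, not the reference implementation):

import torch

def dare_delta(finetuned: dict, base: dict, p: float = 0.9) -> dict:
    # Randomly drop a fraction p of delta parameters and rescale the
    # survivors by 1/(1-p) so the expected update stays roughly unchanged.
    delta = {}
    for k in base:
        d = finetuned[k] - base[k]
        keep = (torch.rand_like(d) > p).float()
        delta[k] = d * keep / (1.0 - p)
    return delta

def dare_merge(base: dict, finetuned_models: list,
               p: float = 0.9, weight: float = 0.5) -> dict:
    # Add each model's pruned-and-rescaled delta to the base weights.
    merged = {k: v.clone() for k, v in base.items()}
    for ft in finetuned_models:
        delta = dare_delta(ft, base, p)
        for k in merged:
            merged[k] += weight * delta[k]
    return merged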

Article | Code

Implementation

To merge models, we will use mergekit, a toolkit designed for merging pre-trained language models. It supports all the above-mentioned algorithms and is quite simple to set up. Model merging can be run on just a CPU, although you can speed it up with as little as 8 GB of VRAM.

1. Install mergekit:

python3 -m pip install --upgrade pip
git clone https://github.com/cg123/mergekit.git
cd mergekit && pip install -q -e .

2. Write your merge configuration in a YAML file:

I will combine LLMs with different architectures: Mistral-7b, WizardMath-7b and CodeLlama-7b. Here is my ultra_llm_merged.yaml config:

models:
  - model: mistralai/Mistral-7B-v0.1
    # no parameters necessary for base model
  - model: WizardLM/WizardMath-7B-V1.0
    parameters:
      density: 0.5  # fraction of weights in differences from the base model to retain
      weight:       # weight gradient
        - filter: mlp
          value: 0.5
        - value: 0
  - model: codellama/CodeLlama-7b-Instruct-hf
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
  int8_mask: true
dtype: float16

3. Run merging:

# Flag reference:
#   --allow-crimes       Allow mixing architectures
#   --copy-tokenizer     Copy a tokenizer to the output
#   --out-shard-size 1B  Number of parameters per output shard
#   --low-cpu-memory     Store results and intermediate values on GPU. Useful if VRAM > RAM
#   --write-model-card   Output README.md containing details of the merge
#   --lazy-unpickle      Experimental lazy unpickler for lower memory usage
mergekit-yaml ultra_llm_merged.yaml output_folder \
  --allow-crimes \
  --copy-tokenizer \
  --out-shard-size 1B \
  --low-cpu-memory \
  --write-model-card \
  --lazy-unpickle

That’s it! Now you can deploy this model or upload it directly to Hugging Face.
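For example, the merged checkpoint loads like any other Transformers model (the repo name below is a placeholder; run huggingface-cli login before pushing):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model from mergekit's output folder
model = AutoModelForCausalLM.from_pretrained("output_folder")
tokenizer = AutoTokenizer.from_pretrained("output_folder")

# Optionally push it to the Hugging Face Hub:
# model.push_to_hub("your-username/ultra_llm_merged")
# tokenizer.push_to_hub("your-username/ultra_llm_merged")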

Remember, merging multiple models simultaneously demands significant resources. For the configuration described above, on a system with one A10 GPU (24 GB) and 30 vCPUs, the resource and time commitments were as follows:

  • Downloading Models: Around 5 minutes.
  • Merging Process: Approximately 7 minutes.
  • Peak RAM usage: 30 GB.

Keep in mind that these timings and resource consumption can vary depending on your system’s specifications and the particular models being merged.

What to Read & Useful Links

  1. Merge Large Language Models with mergekit — a great article by Maxime Labonne, which I unfortunately saw too late.
  2. Blending Is All You Need — authors show that blending several ‘small’ models can be better than huge models.
  3. Chronological list of papers about Model Merging that will help you get started with it!
  4. Mixture of Experts Explained — another way of combining multiple models.
  5. Experiments and benchmarks by Omar Sanseviero on model merging. It might be useful to learn additional nuances in the settings and to participate in the discussion.

Conclusion

We’ve explored the key possibilities of merging models and delved into the workings of several algorithms. I believe that in the near future, we’ll see an increasing number of models created through merging. This is a cost-effective way to combine useful skills without the need for fine-tuning.

Thank you for your attention, stay tuned for new articles!

Disclaimer: The information in the article is current as of January 2024, but please be aware that changes may occur thereafter.

Unless otherwise noted, all images are by the author.

If you have any questions or suggestions, feel free to connect on LinkedIn.
