Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

2026-06-22

Machine Learning, Artificial Intelligence
AI summary

The authors propose a way to merge large pretrained transformer models by aligning their internal weights so that the two models can be joined by a straight line in parameter space without a large loss increase along the way. Unlike previous methods that optimize the path from only one model endpoint, their approach has both models learn their transformations jointly, which reduces the loss barrier between them. They report that this works on both language and vision models, keeping performance high along the interpolation path. To the authors' knowledge, this is the first demonstration of near-barrier-free linear connectivity at this scale.

linear mode connectivity, model merging, pretrained transformers, weight transformations, parameter symmetries, linear interpolation path, language models, vision transformers, loss barriers, billion-parameter models
Authors
Tianyi Li, Zhiqiang Shen
Abstract
Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their scalability and effectiveness for large pretrained transformers. We propose a novel and scalable framework that extends LMC-based model merging to billion-parameter pretrained transformers. Our method applies properly parameterized, functionality-preserving weight transformations to align functionally equivalent solutions, and introduces a dual learning procedure in which both models jointly learn their corresponding transformations toward a shared linear interpolation path. This bidirectional optimization substantially reduces interpolation barriers and enables more reliable merging across large-scale architectures. Empirically, we show that our approach achieves near-zero loss barriers on WikiText for medium-sized language models, representing, to our knowledge, the first demonstration of near-barrier-free linear connectivity at this scale. In the vision domain, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path, while modern billion-parameter LLMs exhibit only small loss barriers. These results suggest that properly resolving parameter symmetries enables large pretrained transformers to be connected and merged through simple linear paths with substantially improved interpolation performance. Code: https://github.com/VILA-Lab/Dual-Learned-Matching
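
To make the terms "functionality-preserving weight transformation" and "loss barrier" concrete, the following is a minimal NumPy sketch on a toy two-layer ReLU MLP. It is not the authors' method: it uses only a fixed hidden-unit permutation (the simplest parameter symmetry) instead of learned transformations, it does not implement the dual learning procedure, and the barrier formula (maximum excess loss over the straight line between the endpoint losses) is one commonly used definition that may differ from the paper's. All function names (`mlp`, `permute_hidden`, `interpolate`, `loss_barrier`) are hypothetical.

```python
import numpy as np

def mlp(x, W1, b1, W2):
    """Toy two-layer ReLU MLP: relu(x @ W1.T + b1) @ W2.T."""
    return np.maximum(x @ W1.T + b1, 0.0) @ W2.T

def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units. Rows of W1/b1 and columns of W2 move together,
    so the function computed by the network is unchanged."""
    return W1[perm], b1[perm], W2[:, perm]

def interpolate(params_a, params_b, t):
    """Point at fraction t on the straight line between two parameter sets."""
    return tuple((1 - t) * a + t * b for a, b in zip(params_a, params_b))

def loss_barrier(params_a, params_b, x, y, num_points=11):
    """Max excess MSE along the linear path, relative to the straight line
    joining the two endpoint losses (an assumed, commonly used definition)."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = np.array([
        np.mean((mlp(x, *interpolate(params_a, params_b, t)) - y) ** 2)
        for t in ts
    ])
    baseline = (1 - ts) * losses[0] + ts * losses[-1]
    return float(np.max(losses - baseline))

rng = np.random.default_rng(0)
d_in, hidden, d_out, n = 8, 16, 4, 64

# Model A: random weights. Targets are A's own outputs, so A has zero loss.
params_a = (rng.normal(size=(hidden, d_in)),
            rng.normal(size=hidden),
            rng.normal(size=(d_out, hidden)))
x = rng.normal(size=(n, d_in))
y = mlp(x, *params_a)

# Model B: the same function as A, but with hidden units in a shifted order.
perm = np.roll(np.arange(hidden), 1)
params_b = permute_hidden(*params_a, perm)
same_function = np.allclose(mlp(x, *params_b), y)

# Naive interpolation mixes mismatched hidden units and raises the loss.
naive_barrier = loss_barrier(params_a, params_b, x, y)

# Aligning B to A (undoing the permutation) removes the barrier.
params_b_aligned = permute_hidden(*params_b, np.argsort(perm))
aligned_barrier = loss_barrier(params_a, params_b_aligned, x, y)

print("same function:", same_function)
print("naive barrier:", naive_barrier)
print("aligned barrier:", aligned_barrier)
```

In this toy case the correct alignment is known in advance, so a single model is re-permuted to match the other. The abstract's framework differs in that the transformations are learned, and both endpoints are transformed jointly toward a shared interpolation path.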