SDS-LoRA: Overcoming Anisotropic Gradient Scaling in Low-Rank Adaptation
2026-06-15 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
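The authors study LoRA, a method that adapts large pre-trained models to new tasks by training small low-rank matrices instead of all the weights. They find that when the full fine-tuning gradient is passed back to these matrices, it is scaled unevenly by their singular values, so updates are skewed toward dominant directions while others are suppressed, which makes learning less effective. To fix this, they introduce SDS-LoRA, a parameterization that decouples the singular values from the backward pass so that the gradient flows only through orthonormal bases and is not distorted by scale. Their convergence analysis shows that SDS-LoRA does not depend on the condition number of the low-rank matrices, and their experiments on language and vision benchmarks show better loss convergence and a smaller gap to full fine-tuning than regular LoRA.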
Low-Rank Adaptation (LoRA), fine-tuning, gradient backpropagation, singular values, anisotropic scaling, orthonormal bases, convergence rate, parameterization, natural language processing, computer vision
Authors
Junghun Oh, Sungyong Baik, Kyoung Mu Lee
Abstract
Low-Rank Adaptation (LoRA) enables efficient adaptation of large pre-trained models to downstream tasks by parameterizing weight updates with low-rank matrices. In this paper, we investigate the limitations of the LoRA parameterization from a geometric perspective. Specifically, we show that when a full fine-tuning gradient is backpropagated to the low-rank matrices, it undergoes anisotropic scaling driven by their singular values. We argue that this phenomenon is undesirable because it distorts the full fine-tuning gradient by skewing it toward dominant singular directions while suppressing others. Our analyses demonstrate that anisotropic gradient scaling reduces the effective rank of the low-rank matrices' gradients and results in suboptimal alignment between the full fine-tuning gradient and its low-rank approximation in LoRA, thereby exacerbating the gap to full fine-tuning. To address these limitations, we propose a new low-rank parameterization, SDS-LoRA, which structurally decouples singular values from the backward pass. Our method ensures that the full fine-tuning gradient backpropagates only through the orthonormal bases of the low-rank matrices' subspaces, independent of their scales. Convergence analysis demonstrates that while LoRA's convergence rate degrades with the condition number of the low-rank matrices, SDS-LoRA remains independent of it. Experimental results across natural language and vision benchmarks show that SDS-LoRA improves loss convergence and reduces the gap to full fine-tuning, significantly enhancing adaptation performance.
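The anisotropic scaling described in the abstract can be checked numerically. The sketch below is not the authors' code and does not implement SDS-LoRA, whose exact parameterization is not given in the abstract. It assumes only the standard LoRA form W = W0 + BA, for which the chain rule gives dL/dA = B^T G with G = dL/dW. The dimensions, the spectrum `sigma`, the random stand-in for G, and the use of stable rank as a proxy for effective rank are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # hypothetical layer sizes and LoRA rank

# Build B = U diag(sigma) V^T with a deliberately ill-conditioned spectrum.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # d x r, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((r, r)))  # r x r, orthogonal
sigma = np.array([10.0, 1.0, 0.1, 0.01])
B = U @ np.diag(sigma) @ V.T

# Random stand-in for the full fine-tuning gradient G = dL/dW.
G = rng.standard_normal((d, k))

# Standard LoRA chain rule for W = W0 + B @ A: dL/dA = B^T G.
grad_A = B.T @ G

# In B's singular bases, V^T grad_A = diag(sigma) U^T G, so the component
# of G along each left singular vector u_i is scaled by sigma_i.
coords_full = U.T @ G
coords_lora = V.T @ grad_A
ratio = np.linalg.norm(coords_lora, axis=1) / np.linalg.norm(coords_full, axis=1)


def cosine(X, Y):
    """Cosine similarity between two matrices under the Frobenius inner product."""
    return np.sum(X * Y) / (np.linalg.norm(X) * np.linalg.norm(Y))


def stable_rank(X):
    """Squared Frobenius norm over squared spectral norm (a proxy for effective rank)."""
    return np.linalg.norm(X) ** 2 / np.linalg.norm(X, 2) ** 2


# A-side contribution to the first-order weight update: B @ grad_A = U diag(sigma^2) U^T G.
# Compare its alignment with G against the scale-free projection U U^T G.
cos_lora = cosine(G, B @ grad_A)
cos_orth = cosine(G, U @ coords_full)

print("per-direction scaling:", ratio)
print("alignment with G (LoRA, orthonormal):", cos_lora, cos_orth)
print("stable rank (LoRA grad, unscaled):", stable_rank(grad_A), stable_rank(coords_full))
```

Here the per-direction scaling reproduces `sigma` exactly (up to floating-point error), and by the Cauchy-Schwarz inequality the singular-value-weighted direction is never better aligned with G than the projection onto the orthonormal basis `U`. The projection is used only as a scale-free reference point for comparison, in the spirit of the abstract's description of backpropagating through orthonormal bases; it should not be read as the SDS-LoRA update rule.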