Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

2026-06-02 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors studied a way to shrink big language models after training using a math trick called tensor decomposition, which can make the models use less memory. They tested this method on different types of models and found that it doesn't always work well because the math assumes parts of the model share similarities, but in reality, modern models learn many different kinds of information. Their work helps explain when and why this shrinking method works or doesn't in practice. They also provide the code so others can try it out.

large language models, post-training compression, tensor decomposition, Transformer, Mixture of Experts (MoE), parameterization, model deployment, subspace, heterogeneous representations

Authors

Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov

Abstract

Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving it unclear whether tensorization remains effective in large-scale deployment. We systematically evaluate tensor compression across dense and MoE architectures, establishing performance trade-offs grounded in both empirical and theoretical analysis. We identify a fundamental mismatch between the shared subspaces assumed by tensor decompositions and the heterogeneous representations learned by modern LLMs, thereby delineating their practical limits and clarifying their viable role in large-scale deployment. The code is available at https://github.com/brain-lab-research/TT-LLM.
