A general tensor-structured compression scheme for efficient large language models

2026-05-25 · Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning
AI summary

The authors introduce MixT, a method for shrinking large language models by replacing selected dense linear layers with more efficient building blocks called tensor operators. Because it operates on generic linear projections rather than model-specific components, it may apply across Transformer-based models. They tested MixT on two models, Qwen3-8B and LLaMA2-7B, and found that it largely preserves MMLU accuracy while substantially reducing parameter count, inference and training compute, and peak inference memory. Beyond a model-specific compression level, however, accuracy drops sharply, and this drop coincides with shifts in output entropy, prediction entropy and inter-layer geometry.

Large Language Models, Dense Linear Transformations, Tensor Operators, Transformer, Model Compression, Inference FLOPs, Parameter Reduction, MMLU Accuracy, Entropy, Neural Network Adaptation
Authors
Ying Lu, Peng-Fei Zhou, Qi-Xuan Fang, Pan Zhang, Shi-Ju Ran, Gang Su
Abstract
Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder efficient adaptation and deployment while masking the functional impacts of structural simplification. Here we present Tensor Mixture (MixT), a general tensor-structured compression scheme that replaces targeted dense linear layers with natively executable mixtures of tensor operators. Operating directly on generic linear projections instead of model-specific components, MixT is potentially applicable across Transformer-based LLMs and other dense neural mappings. We evaluate MixT on Qwen3-8B and LLaMA2-7B under a unified recovery protocol, identifying a broad compressible regime in which MMLU accuracy is largely preserved before an abrupt transition at model-specific boundaries. This transition coincides with coordinated shifts in output entropy, prediction entropy and inter-layer geometry. At the LLaMA2-7B transition boundary, MixT reduces full-model parameters by 47.5%, inference FLOPs by 37.1%, training FLOPs by 52.1% and peak inference memory by 60.4%, demonstrating its practical potential for lower-cost LLM compression.
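
To make the idea of swapping a dense linear layer for a mixture of tensor operators more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the form of the operators, so the sketch assumes, purely for illustration, that each operator is a Kronecker product of two small matrices and that the mixture is an unweighted sum. The function name `kron_mixture_apply` and all shapes are hypothetical.

```python
import numpy as np

def kron_mixture_apply(x, factors):
    """Compute y = sum_k kron(A_k, B_k) @ x without building the dense matrix.

    Illustrative assumption: each tensor operator is a Kronecker product.
    x:       vector of length n1 * n2
    factors: list of (A_k, B_k) with A_k of shape (m1, n1), B_k of shape (m2, n2)
    """
    n1 = factors[0][0].shape[1]
    n2 = factors[0][1].shape[1]
    # Row-major reshape matches the index ordering used by np.kron.
    X = x.reshape(n1, n2)
    # Identity used: kron(A, B) @ vec(X) == vec(A @ X @ B.T) for row-major vec.
    Y = sum(A @ X @ B.T for A, B in factors)
    return Y.reshape(-1)

# Hypothetical sizes: a 128 x 128 dense layer replaced by K = 4 operators.
rng = np.random.default_rng(0)
m1, n1, m2, n2, K = 8, 8, 16, 16, 4
factors = [
    (rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
    for _ in range(K)
]
x = rng.standard_normal(n1 * n2)

# Dense reference, materialized here only to check the structured path.
dense = sum(np.kron(A, B) for A, B in factors)
y_fast = kron_mixture_apply(x, factors)

dense_params = dense.size                                  # 128 * 128 = 16384
mixture_params = sum(A.size + B.size for A, B in factors)  # 4 * (64 + 256) = 1280

print(np.allclose(dense @ x, y_fast))
print(dense_params, mixture_params)
```

In this toy setting the structured form stores 1280 values where the dense matrix stores 16384, and the product is computed from the small factors directly. The reductions reported in the abstract come from the authors' own operator design and recovery protocol, which this sketch does not reproduce.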