Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

2026-05-25 • Computation and Language


AI summary

The authors propose Selective Latent Thinking (SLT), a method that makes large language models reason more efficiently at a small cost in accuracy. Instead of compressing all reasoning steps uniformly, SLT keeps precision-critical steps explicit and compresses redundant steps into compact latent representations. The model is trained in three stages to learn which parts to compress and which to keep explicit, balancing speed against correctness. On four mathematical reasoning benchmarks, SLT is more accurate than other latent reasoning methods at comparable compression ratios, and it shortens reasoning chains by 58.4% with a 2.8% accuracy drop relative to explicit chain-of-thought.

Chain-of-Thought (CoT) reasoning • Large Language Models (LLMs) • Autoregressive inference • Latent reasoning • Compression • Decoder • Reinforcement learning • Mathematical reasoning benchmarks • Confidence gating • Trajectory optimization

Authors

Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschoren

Abstract

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, so precision-critical intermediate steps are over-compressed and reasoning accuracy degrades. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT. Our code is available at https://github.com/hunshi34/SLT.
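The abstract does not specify the exact gating rule or reward, so the following is only a toy sketch of the idea under stated assumptions: a precomputed list of per-token confidences stands in for the lightweight decoder's predictions, and a fixed threshold `tau`, lookahead `k`, minimum span `min_len`, and linear per-step cost `lam` are all hypothetical choices. At each position the longest prefix of the drafted window whose confidences all clear `tau` is folded into one latent step; otherwise a single token stays explicit. None of these names come from the SLT repository.

```python
from typing import List, Sequence, Tuple

Action = Tuple[str, int]  # ("latent", n_tokens) or ("explicit", 1)


def accepted_span_length(window: Sequence[float], tau: float) -> int:
    """Length of the longest prefix of `window` whose confidences are all >= tau."""
    n = 0
    for c in window:
        if c < tau:
            break
        n += 1
    return n


def gate_trajectory(conf: Sequence[float], k: int = 4, tau: float = 0.9,
                    min_len: int = 2) -> List[Action]:
    """Toy selective-compression pass over per-token draft confidences.

    Assumption: the (hypothetical) drafter proposes up to k tokens at each
    position. If at least min_len leading tokens clear tau, they become one
    latent step; otherwise one token is emitted as explicit CoT.
    """
    actions: List[Action] = []
    i = 0
    while i < len(conf):
        n = accepted_span_length(conf[i:i + k], tau)
        if n >= min_len:
            actions.append(("latent", n))  # compress the confident span
            i += n
        else:
            actions.append(("explicit", 1))  # uncertain: keep explicit
            i += 1
    return actions


def trajectory_reward(correct: bool, actions: Sequence[Action],
                      lam: float = 0.01) -> float:
    """Hypothetical RL reward: correctness minus a cost per decoding step."""
    return (1.0 if correct else 0.0) - lam * len(actions)


if __name__ == "__main__":
    conf = [0.97, 0.95, 0.99, 0.93, 0.40, 0.55, 0.96, 0.98, 0.91]
    acts = gate_trajectory(conf)
    print(acts)  # [('latent', 4), ('explicit', 1), ('explicit', 1), ('latent', 3)]
    print(trajectory_reward(True, acts))
```

In this example, nine reasoning tokens collapse into four decoding steps, and only the two low-confidence tokens remain explicit. Raising `tau` keeps more steps explicit (more accurate but costlier), while raising `lam` pushes an RL policy toward longer latent spans.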
