HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

2026-06-01 • Machine Learning

Machine Learning, Computation and Language

AI summary

The authors present HMPO, a new method to shorten the long reasoning steps used by large language models without much loss in accuracy. Unlike previous approaches, HMPO automatically decides how much to shorten the reasoning based on experience, smoothly encourages shorter answers, and focuses on keeping answers correct. They trained HMPO only on math problems but found it works well on other tasks like coding and science. Their tests show HMPO can reduce the text length by up to 46% while keeping performance high and lowering training costs compared to older methods.

large language models, chain-of-thought reasoning, reinforcement learning, reward shaping, model compression, median-based budget, cosine decay, multiplicative reward, Mixture-of-Experts (MoE), inference efficiency

Authors

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

Abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
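The abstract names three components but does not give their formulas, so the following is only an illustrative sketch of how such a reward might be assembled, not the authors' implementation. Every function name, the `max_length` cap, the exact cosine schedule, and the choice to give zero reward when no rollout is correct are assumptions made for this example.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over the correct rollouts of one prompt.

    Returns None when no rollout is correct (assumed handling).
    """
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length, budget, max_length):
    """Assumed schedule: 1.0 up to the budget, cosine decay to 0.0 at max_length."""
    if length <= budget:
        return 1.0
    if length >= max_length:
        return 0.0
    progress = (length - budget) / (max_length - budget)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def rollout_rewards(lengths, correct, max_length):
    """Multiplicative reward: correctness (0 or 1) times the length term.

    Because the terms are multiplied, a short but wrong answer earns nothing.
    """
    budget = median_budget(lengths, correct)
    if budget is None:
        return [0.0] * len(lengths)
    return [
        float(c) * cosine_length_reward(n, budget, max_length)
        for n, c in zip(lengths, correct)
    ]


if __name__ == "__main__":
    # Four hypothetical rollouts for one prompt: token counts and correctness.
    print(rollout_rewards([100, 200, 300, 400], [True, True, True, False], 500))
```

In this sketch the budget is recomputed per prompt from the current batch of rollouts, which is one plausible reading of "adaptive"; the paper may aggregate differently.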
