QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
2026-07-01 • Machine Learning
Machine LearningComputation and Language
AI summaryⓘ
The authors explore a new way to generate multiple guesses from language models more efficiently by using correlated samples instead of independent ones. They introduce QuasiMoTTo, which spreads out its guesses evenly using a special sampling method called quasi-Monte Carlo, reducing repetition. This approach maintains the model’s correctness and improves training speed, needing fewer samples to reach the same accuracy on reasoning tasks and reinforcement learning. Their results show that QuasiMoTTo makes better use of computing power by covering a wider range of possible answers in each batch.
language modelsamplingquasi-Monte Carloautoregressive samplinginverse-CDF samplingpolicy-gradient reinforcement learningpass@kbootstrap estimatorsample efficiencyparallel inference
Authors
Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman, Emily B. Fox
Abstract
Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in parallel. We explore this design space as an avenue for improving sample efficiency in scaling inference compute and reinforcement learning (RL). Concretely, we introduce QuasiMoTTo, which uses correlated samples as a drop-in replacement for i.i.d. samples. To generate these samples, QuasiMoTTo uses a reparameterization of autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC); because QMC spreads the uniforms out more evenly than i.i.d., the resulting samples cover the output space with far less redundancy. Even though the batch is correlated, each sample is marginally distributed according to the language model, so we can use the batch for policy-gradient training. Our empirical analysis focuses on understanding how efficiently QuasiMoTTo can turn compute into performance. To evaluate correlated samplers, whose dependence breaks standard pass@k estimators, we first develop an unbiased bootstrap estimator. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples. Strikingly, QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler. We also apply QuasiMoTTo to policy-gradient RL (GRPO) where it matches i.i.d. performance with 50% fewer training steps. These gains come from higher coverage, which yields a stronger learning signal per batch.