Stage-1 Controls the Entropy Regime, Not the Outcome
2026-06-08 • Machine Learning
Machine Learning · Artificial Intelligence · Computer Vision and Pattern Recognition
AI summary
The authors studied a two-step training process for vision-language models, where the first step prepares the model using supervised fine-tuning or on-policy distillation, and the second step uses reinforcement learning. They found that the first step mainly affects the model's policy entropy (a measure of uncertainty in responses) rather than drastically improving final accuracy. While on-policy distillation leads to more diverse answers early on, this advantage mostly disappears after reinforcement learning. Overall, the authors show that the initial training stage influences certain behaviors but has limited impact on the final model performance.
Keywords: vision-language models, supervised fine-tuning, on-policy distillation, reinforcement learning, policy entropy, out-of-domain generalization, model warm-start, pass@16, Qwen2.5-VL-7B, Geometry3K dataset
Authors
Jianxiong Shen
Abstract
Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
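The abstract reports pass@16 contrasts together with problem-level bootstrap intervals but does not say how either is computed. The sketch below is one common way to obtain such numbers, not necessarily the authors' procedure: it assumes the standard unbiased pass@k estimator over n sampled answers per problem and a paired percentile bootstrap that resamples problems. The function names, the number of bootstrap draws, and the toy correct-answer counts are all hypothetical.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for mean(scores_a) - mean(scores_b).

    Problems are resampled with replacement, and the same resampled problems
    are used for both models (a paired, problem-level bootstrap).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same problems")
    rng = random.Random(seed)
    m = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(m) for _ in range(m)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / m)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy, made-up counts of correct answers out of 16 samples per problem.
    correct_a = [0, 3, 16, 1, 0, 7, 2, 0]  # hypothetical "model A"
    correct_b = [0, 1, 16, 0, 0, 5, 0, 0]  # hypothetical "model B"
    pa = [pass_at_k(16, c, 16) for c in correct_a]
    pb = [pass_at_k(16, c, 16) for c in correct_b]
    print("mean pass@16 difference:", sum(pa) / len(pa) - sum(pb) / len(pb))
    print("95% bootstrap interval:", bootstrap_diff_ci(pa, pb))
```

With exactly 16 samples per problem, pass@16 reduces to whether any sample is correct, so each per-problem score is 0 or 1. An interval that includes zero would correspond to the kind of uncertain contrast the abstract describes for the smaller difference.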