Skill-Conditioned Gated Self-Distillation for LLM Reasoning

2026-05-27

Computation and Language, Artificial Intelligence
AI summary

The authors study how to improve large language models' reasoning by learning from skills stored in a "skill bank," where a retrieved skill may sometimes be wrong or irrelevant. They propose Skill-Conditioned Gated Self-Distillation (SGSD), in which several skill-conditioned teachers score the same student answer, much like teachers grading one student's work, and the model learns only from disagreements that prove helpful. The approach avoids blindly copying every skill and focuses on reliable signals, which leads to better performance on math reasoning benchmarks than previous methods. The work shows that experience-derived skills can help even when the skill information is not fully trusted.

On-policy self-distillation, Large language models, Privileged information, Skill bank, Teacher-student learning, Mathematical reasoning benchmarks, Gated objective, Hypothesis validation
Authors
Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao
Abstract
On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while a teacher taking the opposite stance has its signal reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
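The polarity check and gating described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of the mean log-probability gap to decide a teacher's stance, and the band-pass thresholds `low` and `high` are all assumptions made for illustration. It only shows one plausible reading, in which a teacher whose stance agrees with the verifier contributes a positive signal, a teacher with the opposite stance is sign-flipped, and token-level disagreements that are very small (uncertain) or very large (extreme) are masked out.

```python
from typing import List


def teacher_polarity(teacher_logps: List[float],
                     student_logps: List[float],
                     verifier_success: bool) -> int:
    """Hypothetical polarity rule (an assumption, not taken from the paper).

    The teacher "supports" the rollout if, on average, it assigns the
    rollout's tokens a higher log-probability than the student does.
    Returns +1 when the stance agrees with the verifier (supporting a
    success or suppressing a failure), -1 when it disagrees, and 0 when
    the teacher takes no stance.
    """
    gap = sum(t - s for t, s in zip(teacher_logps, student_logps))
    gap /= len(student_logps)
    if gap == 0:
        return 0
    supports = gap > 0
    return 1 if supports == verifier_success else -1


def gate(delta: float, low: float = 0.1, high: float = 2.0) -> float:
    """Hypothetical band-pass gate: keep a token-level disagreement only
    if it is neither too small (uncertain) nor too large (extreme)."""
    return 1.0 if low <= abs(delta) <= high else 0.0


def gated_token_weights(teacher_logps: List[float],
                        student_logps: List[float],
                        verifier_success: bool,
                        low: float = 0.1,
                        high: float = 2.0) -> List[float]:
    """Per-token supervision weights: polarity times gate."""
    polarity = teacher_polarity(teacher_logps, student_logps, verifier_success)
    return [polarity * gate(t - s, low, high)
            for t, s in zip(teacher_logps, student_logps)]


if __name__ == "__main__":
    # Toy per-token log-probabilities for one rollout (made-up numbers).
    student = [-1.0, -1.0, -6.0]
    teacher = [-0.5, -0.98, -1.0]
    print(gated_token_weights(teacher, student, verifier_success=True))
    print(gated_token_weights(teacher, student, verifier_success=False))
```

In this toy setup only the first token carries a moderate disagreement, so it is the only one that receives a nonzero weight; its sign depends on whether the verifier marked the rollout as a success. A multi-teacher pool would apply the same rule to each skill-conditioned teacher and combine the resulting weights, which this sketch leaves out.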