Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

2026-06-22 | Artificial Intelligence

AI summary

The authors study how to better teach a smaller AI model (the student) to solve problems by learning from a larger one (the teacher). They find that reasoning steps carry two kinds of knowledge: decisions about which branch to take next, and evidence that supports those decisions. Their method, DEAR, helps the student learn both by first spotting uncertain decisions and then finding the key supporting evidence, which improves performance on math and coding tasks over standard on-policy distillation. The student thus learns not only which step to take next but also why.

on-policy distillation, reasoning chains, decision tokens, evidence tokens, student entropy, cosine similarity, teacher-student divergence, math benchmarks, code generation
Authors
Jinwei Xiao, Zhuowen Han, Yueqing Sun, Zhengxi Lu, Yuxin Liu, Zhiyuan Yao, Wentao Chen, Qi Gu, Xunliang Cai
Abstract
On-policy distillation transfers reasoning ability through dense token-level supervision, yet the nature of the transferable signal remains unclear. We discover that reasoning chains contain two types of knowledge that require different discovery mechanisms: decisions (where to branch), which surface through student uncertainty, and evidence (intermediate steps that justify decisions), which hides in positions where the student is confident yet wrong. Current methods capture only decisions; the substantive knowledge in evidence tokens remains untransferred. We propose DEAR (Decision-Evidence Aware Reasoning Distillation), which first identifies decisions via student entropy, then discovers their supporting evidence through hidden-state cosine similarity to decision anchors, boosted by teacher-student divergence to prioritize the largest knowledge gaps. Across three student-teacher configurations on math and code benchmarks, DEAR consistently outperforms standard OPD, with up to +2.5pp on competition math and +5.7pp on code generation.
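
The two-stage selection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the top-fraction thresholds, the max-over-anchors similarity, the reverse-KL choice of divergence, the multiplicative boost `sim * (1 + div)`, and every function and parameter name below are assumptions made for illustration only.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def select_tokens(student_logits, teacher_logits, hidden,
                  decision_frac=0.25, evidence_frac=0.25, eps=1e-12):
    """Hypothetical decision/evidence token selection for one sequence.

    student_logits, teacher_logits: (T, V) arrays; hidden: (T, D) student states.
    Returns (decision_idx, evidence_idx) as index arrays.
    """
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    T = p_s.shape[0]

    # Stage 1: decisions are the positions with the highest student entropy.
    entropy = -(p_s * np.log(p_s + eps)).sum(axis=-1)
    k_dec = max(1, int(round(decision_frac * T)))
    decision_idx = np.argsort(-entropy)[:k_dec]

    # Stage 2: cosine similarity of every hidden state to the decision anchors.
    h = hidden / (np.linalg.norm(hidden, axis=-1, keepdims=True) + eps)
    sim = (h @ h[decision_idx].T).max(axis=-1)  # assumption: max over anchors

    # Boost by divergence (assumption: reverse KL, student || teacher), which is
    # large where the student is confident but disagrees with the teacher.
    div = (p_s * (np.log(p_s + eps) - np.log(p_t + eps))).sum(axis=-1)
    score = sim * (1.0 + div)                   # assumption: multiplicative boost
    score[decision_idx] = -np.inf               # decisions are not evidence

    k_ev = min(T - k_dec, max(1, int(round(evidence_frac * T))))
    evidence_idx = np.argsort(-score)[:k_ev]
    return decision_idx, evidence_idx

# Toy sequence: position 0 is uncertain; position 1 is confident but
# disagrees with the teacher and lies close to position 0 in hidden space.
student = np.array([[0., 0., 0.], [10., 0., 0.], [10., 0., 0.], [10., 0., 0.]])
teacher = np.array([[0., 0., 0.], [0., 10., 0.], [10., 0., 0.], [10., 0., 0.]])
states = np.array([[1., 0.], [1., 0.1], [0., 1.], [0., 1.]])

decision_idx, evidence_idx = select_tokens(student, teacher, states)
print(decision_idx.tolist(), evidence_idx.tolist())
```

In the toy sequence, the uniform distribution at position 0 is picked as the decision, and position 1 is picked as evidence because it is both near the decision anchor in hidden space and far from the teacher. A training loop would then presumably up-weight the distillation loss on both index sets, but how DEAR weights them is not stated in the abstract.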