AI summary
The authors address the problem of teaching large language models to decide when to rely on their own parametric knowledge, when to rely on retrieved evidence, and when to abstain from answering. They propose KbSD (Knowledge boundary Self-Distillation), which combines dense token-level supervision with sparse outcome-level rewards. A teacher that shares the student's architecture but additionally receives explicit hints (parametric certainty, retrieval quality, and the ground-truth answer) generates calibrated reasoning demonstrations for the student, so no larger external model is needed. The distillation objective is adapted to each knowledge-state quadrant, using reverse KL, forward KL, or a bidirectional KL depending on whether the quadrant calls for precision, coverage, or both. Experiments on multiple benchmarks show improved task accuracy and fewer hallucinations, with the largest gains in the quadrants where sparse rewards are least informative.
Agentic search, Large language models, Reinforcement learning, Reward sparsity, Knowledge boundary calibration, Self-distillation, KL divergence, Parametric certainty, Retrieval quality, Hallucination mitigation
Abstract
Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration -- deciding when to trust parametric memory, when to rely on retrieved evidence, and when to abstain. Binary rewards can penalize undesirable outcomes, but provide little guidance on the reasoning process required to make calibrated decisions across different knowledge states. To address this, we propose KbSD (Knowledge boundary Self-Distillation), a framework that tackles this limitation through dense token-level supervision, outcome-level sparse rewards, and quadrant-adaptive optimization. KbSD constructs a hint-augmented teacher, architecturally identical to the student, that receives explicit knowledge boundary signals -- including parametric certainty, retrieval quality, and ground-truth answers -- to generate calibrated reasoning demonstrations. This information-asymmetric self-distillation enables dense supervision without requiring a larger external model. To further account for the heterogeneous reasoning distributions across knowledge states, we introduce a quadrant-adaptive distillation objective: reverse KL for concentrated integration, forward KL for diverse refusal, and Pareto-optimal bidirectional KL for asymmetric quadrants requiring both precision and coverage. Experiments on multiple benchmarks show that KbSD consistently improves both task accuracy and hallucination mitigation over strong baselines, with the largest gains appearing in the challenging quadrants where sparse rewards are least informative.
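To make the quadrant-adaptive objective more concrete, the following is a minimal sketch of how the KL direction could be switched per quadrant for a single token position. It is an illustration under stated assumptions, not the authors' implementation: the quadrant labels (`"integration"`, `"refusal"`, `"asymmetric"`), the function names, and the mixing weight `alpha` are hypothetical, and the abstract's Pareto-optimal bidirectional KL is approximated here by a plain weighted sum of the two directions. Forward KL, KL(teacher || student), is mass-covering and so suits diverse targets; reverse KL, KL(student || teacher), is mode-seeking and so suits concentrated targets.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distill_loss(teacher, student, quadrant, alpha=0.5):
    """Per-token distillation loss with a quadrant-dependent KL direction.

    The quadrant labels and `alpha` are hypothetical names for illustration;
    the weighted sum is an assumed stand-in for the bidirectional objective.
    """
    forward = kl(teacher, student)  # mass-covering: student must cover teacher
    reverse = kl(student, teacher)  # mode-seeking: student concentrates on modes
    if quadrant == "integration":
        return reverse
    if quadrant == "refusal":
        return forward
    if quadrant == "asymmetric":
        return alpha * forward + (1.0 - alpha) * reverse
    raise ValueError(f"unknown quadrant: {quadrant!r}")


if __name__ == "__main__":
    # Toy next-token distributions over a three-token vocabulary.
    teacher_probs = [0.7, 0.2, 0.1]
    student_probs = [0.5, 0.3, 0.2]
    for name in ("integration", "refusal", "asymmetric"):
        print(name, distill_loss(teacher_probs, student_probs, name))
```

In a real training loop the two distributions would come from the hint-augmented teacher and the student evaluated on the same token prefix, and the per-token loss would be combined with the sparse outcome-level reward; how those terms are weighted is not specified in the abstract.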