Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation
2026-07-02 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors address the challenge of improving large language models without real-world feedback or costly expert labels, especially in specialized domains. They introduce Neuron On-Policy Self-Distillation (Neuron-OPSD), a method that uses the model's own neuron activations to select training data and to construct the teacher context that guides learning. The model is trained on-policy from this teacher's output distribution, without any ground-truth labels. The method improves performance on specialized tasks while preserving cross-domain generalization and avoiding the calibration problems seen in previous annotation-free methods. This makes the work useful when external supervision or online interaction is costly or infeasible.
Large Language Models, Self-Distillation, Neuron Activations, On-Policy Training, Annotation-Free Learning, Pseudo-Labels, Cross-Domain Generalization, Model Calibration, Offline Reinforcement Learning, Specialized Domain Adaptation
Authors
Zhuowei Chen, Xiang Lorraine Li
Abstract
Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model's own outputs as supervision signals: they construct a teacher via additional context and aggregate predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches have drawbacks: SFT- and GRPO-based variants suffer from out-of-domain performance degradation, while reward-based on-policy RL inflates calibration error. In this paper, we propose Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for annotation-free self-distillation that leverages internal neuron activations to guide both training-data selection and teacher context construction. The model is then trained via on-policy distillation from the teacher distribution, requiring no ground-truth labels at any stage. Across specialized-domain benchmarks, Neuron-OPSD improves in-domain task performance while preserving cross-domain generalization and mitigating calibration collapse relative to prior annotation-free baselines. This framework is particularly relevant to settings where online interaction or external supervision is costly or infeasible, and is conceptually distinct from offline RL approaches that rely on logged, reward-labeled trajectories.
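As a rough illustration of the majority-voting step that the abstract attributes to prior annotation-free methods, the sketch below aggregates the final answers of several rollouts into a single pseudo-label. It is not the Neuron-OPSD procedure: the function name `majority_vote`, the returned agreement ratio, and the example answers are assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(answers):
    """Aggregate rollout answers into a pseudo-label (hypothetical helper).

    Returns the most frequent answer and the fraction of rollouts that
    agree with it. Ties go to the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


# Illustrative final answers from five rollouts on one unlabeled question.
rollouts = ["B", "A", "B", "B", "C"]
label, agreement = majority_vote(rollouts)
print(label, agreement)
```

The agreement ratio is one simple way to see how confident the vote is; whether and how any of the cited methods use such a signal is not stated in the abstract.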