Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

2026-05-25 Computation and Language

AI summary

The authors used a reinforcement learning method (GRPO) to train a language model to answer complex questions with a minimal knowledge-graph tool. Performance peaked early and then collapsed sharply, a pattern that persisted across different reward designs. They attribute the problem to the knowledge-graph interface's lack of informative error feedback, in contrast to other tools whose failures give clearer signals. Supplying gold relations at retrieval time yielded only a small improvement, whereas one iteration of self-distillation raised exact-match accuracy to 40.0%, and doubling the model size from 7B to 14B added little.

Reinforcement Learning, Knowledge Graph, GRPO, Self-Distillation, Knowledge Graph APIs, Complex WebQuestions, Freebase, Reward Design, Language Models, Exact-Match Accuracy
Authors
Tianda Sun, Dimitar Kazakov
Abstract
We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy's tool-grounded answer rate climbs from $3.8\%$ to $9.6\%$ over 250 steps, then collapses to $0\%$ within a single 50-step window -- a \emph{peak-then-collapse} pattern replicated across four seeds. Across seven reward designs, we find four recurring failure modes: adding denser or more targeted proxy rewards shifts the failure mode rather than eliminating it. We argue that a key difference from Python interpreters, web search, and JSON APIs is interface feedback: their failures often leak natural-language signal the model saw in pretraining. A Python traceback names the failing line; an empty Freebase result \texttt{[]} does not. Stripping away that surface exposes a degradation regime that same-family reward redesigns do not fix. A direct oracle ablation rules out relation selection: injecting gold relations at every retrieval call lifts exact-match accuracy by only $+0.20$~pp, and $95.4\%$ of retrieval-dependent errors are retrieval-composition failures rather than answer-extraction failures. As a mitigation, one-iteration self-distillation reaches $40.0\%$ EM at 7B and is capacity-invariant: doubling capacity to 14B improves EM by only $0.25$~pp, and initialization barely matters -- the ceiling appears interface-bound within the 7B--14B range tested.
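
The interface-feedback contrast described in the abstract can be illustrated with a small sketch. It is not the authors' tool API: the toy graph, the entity and relation names, and the helpers `get_neighbors` and `run_python` are all hypothetical, chosen only to show how a failing Python snippet returns a traceback that names the failing line while a mistyped knowledge-graph lookup returns a bare `[]`.

```python
import traceback

# Hypothetical toy graph (not Freebase data): (entity, relation) -> neighbours.
TOY_GRAPH = {
    ("entity:alice", "spouse_of"): ["entity:bob"],
}


def get_neighbors(entity, relation):
    """Hypothetical navigation verb: any unknown pair yields an empty list."""
    return TOY_GRAPH.get((entity, relation), [])


def run_python(source):
    """Run a snippet and return the text feedback a model would observe."""
    try:
        exec(source, {})
        return "ok"
    except Exception:
        # The formatted traceback includes the exception type and line number.
        return traceback.format_exc()


# A failing Python tool call: the feedback names the error and the line.
python_feedback = run_python("x = 1\ny = x / 0\n")

# A failing KG tool call (relation name misspelled): the feedback is just "[]".
kg_feedback = repr(get_neighbors("entity:alice", "spuose_of"))

print(python_feedback)
print(kg_feedback)
```

In this sketch the Python feedback carries natural-language cues (the exception name and the line number), whereas the empty list gives no indication of whether the entity, the relation, or the composition of calls was wrong.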