Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents

2026-06-15 | Artificial Intelligence

Artificial Intelligence, Cryptography and Security, Software Engineering
AI summary

The authors find that large language models (LLMs) usually attend most to the correct tool's definition yet still select the wrong tool, so the error arises at the decision readout rather than from a failure to see the options. Repairing the prompt recovers at most 23% of failures, whereas interventions on the readout recover 59-91%. They also present a training-free selector based on per-segment attention that improves tool selection without gold labels, which supports locating the bottleneck in the model's final choice rather than in what it attends to.

LLM agents, tool selection, attention mechanism, readout layer, prompt engineering, attention bias, representation invariance, training-free selector, BFCL benchmark, Seal-Tools
Authors
Shiyang Chen
Abstract
LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside -- the model's attention to labeled tool-definition segments. On real BFCL failures, by per-candidate attention argmax the model attends most to the correct tool 80% of the time (vs. 21% chance), and the gold is the under-attended segment on only 10%: it looks at the right tool and still picks wrong. This directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation: the failure is at the decision readout, not the harness, and we pin it there three ways. (1) Input vs. readout: repairing the prompt (reordering or duplicating the gold tool) recovers <=23% of failures, while readout-side interventions recover 59-91%. (2) Representation-invariance: two gold-pointed interventions in different representations -- an additive attention-logit bias and a residual-stream steering vector -- recover largely the same failures (per-task Jaccard 0.865 pooled, 0.79-0.91 per model), so the bottleneck is localized to the readout independent of which representation is poked. (3) A training-free, gold-free selector: per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts pooled function-name selection vs. +17.9-pt oracle headroom) and adds +14.9 pts on Seal-Tools; every model positive (exact McNemar p<=8e-4 each). Scopes differ: the causal attention-bias dose-response is bidirectional and monotonic on 10 mask-honoring models (3-32B), the full 0.5-32B span carrying only the correlational diagnostic; the deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop.
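To make the quantities in the abstract concrete, here is a minimal sketch of three of them: a per-segment attention argmax over labeled tool-definition spans, the Jaccard overlap between two sets of recovered tasks, and a two-sided exact McNemar test on discordant counts. It is an illustration under stated assumptions, not the paper's implementation. The function names, the toy attention vector, and the tool names are hypothetical. Summing attention mass per segment, and using one attention vector already averaged over heads and layers, are assumptions; the abstract does not specify the pooling.

```python
from math import comb


def segment_attention_scores(attn, segments):
    """Sum attention mass over each labeled tool-definition segment.

    attn: attention weights from the decision position to each prompt token
          (assumed here to be already averaged over heads and layers).
    segments: dict mapping tool name -> (start, end) half-open token span.
    """
    return {name: sum(attn[start:end]) for name, (start, end) in segments.items()}


def select_by_attention(attn, segments):
    """Gold-free selection: return the tool whose segment gets the most attention."""
    scores = segment_attention_scores(attn, segments)
    return max(scores, key=scores.get)


def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| between two sets of recovered task ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def exact_mcnemar(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks wrong at baseline but right with the selector.
    c: tasks right at baseline but wrong with the selector.
    Under the null, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example (all values hypothetical).
attn = [0.05, 0.10, 0.05, 0.30, 0.25, 0.05, 0.10, 0.10]
segments = {
    "get_weather": (0, 3),
    "get_forecast": (3, 5),
    "send_email": (5, 8),
}
picked = select_by_attention(attn, segments)
overlap = jaccard({1, 2, 3, 4}, {2, 3, 4, 5})
p_value = exact_mcnemar(10, 0)
```

In this toy case the middle segment carries the most attention mass, so it is the one selected. The same per-segment scores are what a "looks at the right tool but picks wrong" diagnostic would compare against the tool the model actually called.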