TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

2026-05-11 · Computation and Language

AI summary

The authors address a problem in AI models that use multiple tools to answer questions about images: it is often unclear which pieces of evidence support each part of the answer. They introduce TRACER, a system that links every sentence in an answer to specific evidence and to how that evidence supports the sentence, making the reasoning easier to check and improve. They also create a benchmark called TRACE-Bench to measure how well systems track this evidence. Their results show that TRACER improves accuracy and reduces unnecessary tool use, indicating that carefully tracking evidence is more helpful than simply making more tool calls. This makes AI reasoning about images more trustworthy and efficient.

multimodal large language models · tool-using agents · provenance gap · TRACER framework · provenance record · semantic support relation · TRACE-Bench · reinforcement learning · multimodal reasoning · answer accuracy
Authors
Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang, He Bai, Yuchen Liu, Jingxuan Wei, Junnan Zhu
Abstract
Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
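The abstract describes a per-sentence provenance record (supporting tool turn, evidence unit, semantic support relation) and four verification checks (schema, tool-turn alignment, source authenticity, relation rationality). The sketch below illustrates what such a record and verifier could look like; the field names, the `verify` function, and the toy rationality rule are assumptions for illustration, not the paper's actual schema or implementation.

```python
from dataclasses import dataclass

# Closed relation space from the abstract: direct reuse, faithful
# condensation, and grounded derivation.
RELATIONS = {"Quotation", "Compression", "Inference"}

@dataclass
class ProvenanceRecord:
    sentence: str    # the generated answer sentence
    tool_turn: int   # index of the supporting tool call in the trajectory
    evidence: str    # evidence unit taken from that tool turn's observation
    relation: str    # one of Quotation | Compression | Inference

def verify(record: ProvenanceRecord, trajectory: list[str]) -> dict[str, bool]:
    """Run the four checks sketched in the abstract on one record.

    `trajectory` is a list of tool observation strings, one per tool turn.
    Returns a mapping from check name to pass/fail.
    """
    checks = {}
    # 1. Schema checking: the relation must come from the closed relation space.
    checks["schema"] = record.relation in RELATIONS
    # 2. Tool-turn alignment: the cited turn must exist in the trajectory.
    checks["alignment"] = 0 <= record.tool_turn < len(trajectory)
    # 3. Source authenticity: the evidence must actually appear in the
    #    cited observation (a simple substring test stands in here).
    checks["authenticity"] = (
        checks["alignment"] and record.evidence in trajectory[record.tool_turn]
    )
    # 4. Relation rationality (toy rule): a Quotation must reuse the evidence
    #    verbatim; Compression/Inference would need a semantic judge.
    if record.relation == "Quotation":
        checks["rationality"] = record.evidence in record.sentence
    else:
        checks["rationality"] = True
    return checks

# Hypothetical two-turn trajectory: an OCR observation and a calculator result.
trajectory = ["OCR output: total = 42 EUR", "calculator: 42 * 2 = 84"]
rec = ProvenanceRecord(
    sentence="The receipt shows total = 42 EUR.",
    tool_turn=0,
    evidence="total = 42 EUR",
    relation="Quotation",
)
print(verify(rec, trajectory))
```

A record that fails any check (e.g. citing a tool turn outside the trajectory, or quoting evidence absent from the cited observation) would be rejected before its provenance is converted into traceability constraints or local credit for reinforcement learning.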