State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs
2026-06-15 • Artificial Intelligence
Artificial Intelligence, Computation and Language
AI summary
The authors created StateGen, a platform that uses AI to generate conversations in which a user talks to a tool-using agent. Four AI roles interact: a user simulator, the agent, a tool simulator, and a judge that scores the conversation. StateGen keeps a single structured record of the facts established during the conversation, and tool results are grounded in that record, so the generated data does not contain tool outputs that contradict it. The system also handles hierarchical setups in which sub-agents are declared as tools and share the same record. The authors report results on 64,698 evaluated conversations and compare StateGen with eight external systems, none of which combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.
Large Language Models, Conversational Data, Tool-augmented Agents, Synthetic Data Generation, State Manager, Multi-turn Dialogue, Tool-call Hallucinations, Hierarchical Multi-agent Systems, Persona Simulation, LLM Judge
Authors
Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra
Abstract
Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.
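The backend-is-truth invariant can be illustrated with a minimal sketch. It is not StateGen's implementation: the class `StateManager`, the tools `get_order_status` and `cancel_order`, and the order data are all hypothetical names chosen for illustration. The assumption is that every simulated tool reads from and writes to one shared state object, so a tool result can never contradict what earlier turns established.

```python
from copy import deepcopy
from typing import Any, Callable, Dict


class StateManager:
    """Hypothetical holder of the single authoritative world state."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self._state = deepcopy(initial)

    def read(self) -> Dict[str, Any]:
        # Hand out a copy so callers cannot mutate the truth by accident.
        return deepcopy(self._state)

    def apply(self, update: Callable[[Dict[str, Any]], None]) -> None:
        # Every write goes through this one method.
        update(self._state)


def get_order_status(state: StateManager, order_id: str) -> Dict[str, Any]:
    # The simulated tool answers only from the shared state.
    orders = state.read()["orders"]
    if order_id not in orders:
        return {"ok": False, "error": "order_not_found"}
    return {"ok": True, "status": orders[order_id]["status"]}


def cancel_order(state: StateManager, order_id: str) -> Dict[str, Any]:
    # Validate against the state first, then record the change in it.
    orders = state.read()["orders"]
    if order_id not in orders:
        return {"ok": False, "error": "order_not_found"}
    if orders[order_id]["status"] == "shipped":
        return {"ok": False, "error": "already_shipped"}

    def _cancel(s: Dict[str, Any]) -> None:
        s["orders"][order_id]["status"] = "cancelled"

    state.apply(_cancel)
    return {"ok": True, "status": "cancelled"}


state = StateManager(
    {"orders": {"A1": {"status": "processing"}, "B2": {"status": "shipped"}}}
)
results = [
    cancel_order(state, "A1"),      # succeeds and updates the state
    get_order_status(state, "A1"),  # reflects the earlier cancellation
    cancel_order(state, "B2"),      # rejected: the state says it shipped
    get_order_status(state, "Z9"),  # rejected: no such order in the state
]
for r in results:
    print(r)
```

Under the same assumption, a sub-agent declared as a tool would be another callable that receives the same `StateManager` instance, which is one way to read the abstract's "all sharing a single state object".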