Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

2026-06-01, Artificial Intelligence

Artificial Intelligence, Machine Learning
AI summary

The authors created GAIATrace, a detailed dataset showing how two advanced AI agents work step-by-step on various tasks. This dataset records every token and action the AI models take, which helps researchers understand AI behavior better. They also made Vidur-Agent, a simulator that replays these records so that AI runs can be evaluated cheaply and reproducibly. Using these tools, the authors studied how different system design choices affect the agents' behavior. Their work makes it easier to analyze and improve complex AI systems.

Agentic AI, Token-level trace, Large Language Models (LLMs), Iterative planning, System evaluation, Simulated environments, Task-level structures, Benchmarking, Reproducibility, AI system design
Authors
Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng
Abstract
Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures, owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and the activity of every major participating LLM, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.