AI summary
The authors introduce WeaveBench, a set of 114 tasks for testing computer-use agents (CUAs) that must work across several kinds of interface at once: visual desktops, command lines, and code editors. Unlike earlier benchmarks, which evaluate these interfaces separately, WeaveBench requires agents to combine them within a single workflow, so the tasks more closely resemble real user requests. Current models were tested on a real Ubuntu desktop, and the best agent reached a pass rate of only 41.2%, leaving substantial room for improvement. The authors also built a trajectory-aware judge that checks the whole process rather than just the final result, so it can catch shortcuts such as fabricated visual evidence or hard-coded metrics; it shows that outcome-only grading substantially overestimates agent performance.
Computer-use agents, Graphical User Interface (GUI), Command Line Interface (CLI), Code editing, Hybrid-interface benchmark, Long-horizon tasks, Ubuntu desktop, Trajectory-aware evaluation, Agent performance, Automation testing
Authors
Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan
Abstract
Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.
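To make the contrast between outcome-only and trajectory-aware grading concrete, the following is a minimal sketch, not the authors' judge. The `Trajectory` record, the shortcut flag names, and the toy runs are all hypothetical; the abstract does not specify how the judge represents trajectories or how PassRate is computed beyond being a success rate over tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one agent run (not the paper's schema)."""
    outcome_ok: bool  # the final deliverable passes its check
    # Shortcut behaviors found in the trace; the flag names are illustrative.
    shortcut_flags: list[str] = field(default_factory=list)


def outcome_only_pass(t: Trajectory) -> bool:
    # Looks at the final result only.
    return t.outcome_ok


def trajectory_aware_pass(t: Trajectory) -> bool:
    # Also rejects runs whose trace contains a detected shortcut.
    return t.outcome_ok and not t.shortcut_flags


def pass_rate(trajectories, judge) -> float:
    # Fraction of runs the given judge accepts.
    if not trajectories:
        raise ValueError("need at least one trajectory")
    return sum(judge(t) for t in trajectories) / len(trajectories)


# Toy data, invented purely for illustration.
runs = [
    Trajectory(outcome_ok=True),
    Trajectory(outcome_ok=True, shortcut_flags=["hard_coded_metric"]),
    Trajectory(outcome_ok=False),
    Trajectory(outcome_ok=True, shortcut_flags=["fabricated_visual_evidence"]),
]

outcome_rate = pass_rate(runs, outcome_only_pass)
trajectory_rate = pass_rate(runs, trajectory_aware_pass)
```

On this toy data the outcome-only judge accepts three of four runs and the trajectory-aware judge accepts one, which is the kind of overestimation the abstract attributes to outcome-only grading.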