LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
2026-06-15 • Artificial Intelligence
AI summary
The authors created LabOSBench, a simulated testing setup for computer agents to control virtual scientific instruments through their webpages. This avoids the high cost and risks of using real instruments while still offering realistic challenges like adjusting settings based on feedback. They included many tasks that mimic real lab workflows and tested current AI agents on them. Their results show existing agents can handle some simple tasks but struggle with complex, step-by-step operations. LabOSBench aims to help improve AI systems that work with scientific tools in a safe and scalable way.
scientific instrumentation, GUI agents, simulation testbed, feedback-driven control, workflow automation, vision-language models, browser-based interface, task benchmarking, parameter tuning, data acquisition
Authors
Anqi Zou, Han Deng, Chengyu Zhang, Junquan Hu, Yu Wang, Yuxiang Xing, Aokai Zhang, Hanling Zhang, Zhaoyang Liu, Ben Fei, Zhihui Wang, Wanli Ouyang
Abstract
Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces and feedback-driven parameter adjustment. Evaluating agents directly on physical high-precision instruments is impractical because of high cost, safety risks, limited accessibility, and the difficulty of ensuring reproducible evaluation. This motivates a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Because it runs directly in a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench comprises 96 subtasks across eight instrument simulators, covering workflows that span sample loading, alignment, parameter tuning, data acquisition, and result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both the subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-use agents toward scientific-instrument control.
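To make the idea of execution-based evaluation at the subtask and end-to-end levels concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. It assumes a simulator exposes its final state as a flat dictionary of numeric readings; the names `Subtask`, `within`, `evaluate`, and the state keys `voltage_kv` and `focus_error` are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: a simulator's final state as a flat mapping of numeric readings.
State = Dict[str, float]


@dataclass
class Subtask:
    """One checkable step of a workflow (hypothetical structure)."""
    name: str
    check: Callable[[State], bool]


def within(key: str, target: float, tol: float) -> Callable[[State], bool]:
    """Build a check that passes when state[key] is within tol of target."""
    def _check(state: State) -> bool:
        return key in state and abs(state[key] - target) <= tol
    return _check


def evaluate(subtasks: List[Subtask], final_state: State) -> dict:
    """Score the final state: per-subtask results plus an all-or-nothing flag."""
    per_subtask = {t.name: t.check(final_state) for t in subtasks}
    return {
        "per_subtask": per_subtask,
        "subtask_success_rate": sum(per_subtask.values()) / len(per_subtask),
        "end_to_end_success": all(per_subtask.values()),
    }


# Illustrative workflow: set a voltage, then bring the focus error near zero.
subtasks = [
    Subtask("set_voltage", within("voltage_kv", 15.0, 0.1)),
    Subtask("focus", within("focus_error", 0.0, 0.05)),
]
final_state = {"voltage_kv": 15.05, "focus_error": 0.2}
report = evaluate(subtasks, final_state)
print(report)
```

In this made-up example the agent sets the voltage correctly but leaves the focus error outside tolerance, so half of the subtasks pass while the end-to-end run fails. The gap between the two scores is the kind of difference that reporting at both levels is meant to expose.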