RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

2026-06-02 • Computation and Language

AI summary

The authors created RealClawBench, a benchmark built from real user sessions to better reflect what people actually ask software agents to do. They tackled challenges such as varying user environments and unclear requests by reconstructing the execution environments and using scorers that can be checked automatically. The benchmark includes 281 tasks that closely match real-world use, and the best of the 14 evaluated models solves only 65.8% of them, so current models still have substantial room to improve. This helps researchers evaluate agents more realistically using real examples.

agent benchmarks, real-world evaluation, execution environments, automatic scoring, Jensen-Shannon divergence, software agents, developer workflows, task reproducibility

Authors

Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

Abstract

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
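
To make the distribution-preservation claim concrete, the following is a minimal sketch of how a Jensen-Shannon divergence between a source pool and a sampled release could be computed over one categorical attribute. It is not the authors' code: the category names, the toy label lists, the helper names, and the choice of log base 2 are all assumptions made for illustration, and the abstract does not state which attributes or log base produced the reported 0.0448.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Empirical probability of each category in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def kl_divergence(p, q, base=2.0):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


# Hypothetical task categories and toy label lists (not from the paper).
categories = ["debugging", "refactoring", "setup", "docs"]
source_pool = ["debugging"] * 50 + ["refactoring"] * 30 + ["setup"] * 15 + ["docs"] * 5
sampled_set = ["debugging"] * 10 + ["refactoring"] * 6 + ["setup"] * 3 + ["docs"] * 1

p = distribution(source_pool, categories)
q = distribution(sampled_set, categories)
print(f"JS divergence (base 2): {js_divergence(p, q):.4f}")
```

With base 2 the divergence lies between 0 (identical distributions) and 1 (disjoint support). A "maximum" divergence, as in the abstract, would presumably be the largest such value over several attributes, each computed this way.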
