WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

2026-06-01 • Artificial Intelligence

AI summary

The authors introduce WorldCoder-Bench, a benchmark of 2,026 tasks that evaluates how well AI models can build interactive 3D worlds with browser technology such as Three.js. The tasks require handling physics, spatial layout, and user controls, whose behavior plays out inside an opaque <canvas> and is invisible to evaluators that inspect only pixels or DOM nodes. They also propose StateProbe, a protocol that runs each generated program in a sandboxed browser and checks its runtime states and transitions against hidden behavioral contracts. Across nine frontier models, the best reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures caused mostly by state-schema drift and broken interaction chains rather than missing scene elements. Cheap or fast models can still provide substantial value on easier domains once cost and time savings are taken into account.

Large Language Models, Three.js, 3D Interactive Worlds, Browser-native 3D, Simulation, Rendering, WebGL, State Verification, Sandboxed Execution, Behavioral Contracts

Authors

Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang, Yongcan Yu, Yubin Wang, Haitao Yang, Yuxiang Zhang, Bin Wang, Ran He, Jian Liang

Abstract

Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque <canvas>. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.
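To make the idea of checking contracts over runtime states and transitions concrete, here is a minimal sketch in Python. It is not StateProbe itself: the abstract does not specify the contract format, the state schema, or the exact definition of verification coverage, so everything below (the `Contract` class, the flat `State` dictionary, the toy trace, and the assumption that coverage is the fraction of contracts satisfied) is hypothetical and serves only as an illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical flat snapshot of runtime state, recorded after each probe action.
State = Dict[str, float]


@dataclass
class Contract:
    """Hypothetical contract: a predicate over (previous, current) snapshots."""
    name: str
    holds: Callable[[State, State], bool]


def verification_coverage(trace: List[State], contracts: List[Contract]) -> float:
    """Fraction of contracts that hold on every consecutive snapshot pair.

    Assumption: coverage = satisfied contracts / total contracts.
    The benchmark's actual definition may differ.
    """
    if not contracts:
        return 0.0
    pairs = list(zip(trace, trace[1:]))
    passed = sum(
        1 for c in contracts if all(c.holds(prev, cur) for prev, cur in pairs)
    )
    return passed / len(contracts)


# Toy trace: a ball falling toward the floor, with a UI slider meant to mirror its height.
trace = [
    {"ball_y": 5.0, "slider_y": 5.0},
    {"ball_y": 3.0, "slider_y": 3.0},
    {"ball_y": 0.0, "slider_y": 3.0},  # the slider stops tracking the ball
]

contracts = [
    # State contract: only the current snapshot matters.
    Contract("ball stays on or above the floor",
             lambda prev, cur: cur["ball_y"] >= 0.0),
    # Transition contract: compares consecutive snapshots.
    Contract("ball never rises while falling",
             lambda prev, cur: cur["ball_y"] <= prev["ball_y"]),
    # Synchronization contract between a control and hidden state.
    Contract("slider mirrors ball height",
             lambda prev, cur: cur["slider_y"] == cur["ball_y"]),
]

coverage = verification_coverage(trace, contracts)
print(f"coverage = {coverage:.3f}")
```

In this toy trace the scene elements are all present and the physics contracts hold, yet the slider falls out of sync with the ball, so one of three contracts fails. That mirrors, in a very simplified way, the failure pattern the abstract describes, where broken interaction chains matter more than missing scene elements.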
