GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction

2026-06-23 • Computer Vision and Pattern Recognition

AI summary

The authors created GeoT2V-Bench, a benchmark that tests whether videos generated by camera-prompted text-to-video models actually depict a single, unchanging 3D scene. Instead of a pass/fail verdict, the benchmark reports detailed scores describing how well each video supports a static 3D reconstruction, including the behavior of the estimated camera trajectory and the agreement between the video and renders of the reconstructed scene. They evaluated 12 open-weight model configurations and found that the different measurements often disagree, which indicates that the benchmark captures complementary kinds of failure. This helps reveal where these models struggle when their output is treated as footage of a real static scene.

text-to-video (T2V), camera-prompted synthesis, 3D reconstruction, camera intrinsics, camera pose estimation, DeformableGS, MedianGS, static scene, reconstruction benchmark, visual plausibility

Authors

Chenrui Fan, Paolo Favaro

Abstract

Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes. For these outputs, visual plausibility is insufficient: the generated frames should also provide coherent multi-view evidence for a single static 3D scene. We introduce GeoT2V-Bench, a reconstruction-based diagnostic benchmark for evaluating whether camera-prompted T2V clips can support explicit rigid 3D reconstruction. Our pipeline estimates per-frame camera intrinsics and poses with VGGT-style geometry estimation, fits DeformableGS, derives a static MedianGS proxy by temporal-median aggregation, and renders this proxy along the estimated camera path. Instead of producing a pass/fail label or a single scalar score, GeoT2V-Bench reports a continuous reconstruction profile covering apparent image motion, estimated trajectory behavior, MedianGS static rendering error, static-render flow agreement, and the gap between flexible and static fits. On a fair-format four-seed evaluation with 3,840 completed reconstructions from 12 open-weight model configurations and 80 GeCo-Eval static-scene prompts, we find that visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree. GeoT2V-Bench therefore captures complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.
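
The temporal-median aggregation and the flexible-vs-static gap mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the data layout (one flattened parameter vector per frame), the function names `temporal_median` and `flexible_static_gap`, and the definition of the gap as a difference of mean per-frame errors are all assumptions made for illustration.

```python
from statistics import median


def temporal_median(per_frame_values):
    """Collapse per-frame parameter vectors into one static vector.

    per_frame_values: list of T frames, each a list of K floats.
    Hypothetical layout: one flattened parameter vector per frame.
    """
    if not per_frame_values:
        raise ValueError("need at least one frame")
    width = len(per_frame_values[0])
    if any(len(frame) != width for frame in per_frame_values):
        raise ValueError("all frames must have the same length")
    # Take the median over the time axis, independently per parameter.
    return [median(column) for column in zip(*per_frame_values)]


def flexible_static_gap(flexible_errors, static_errors):
    """Mean per-frame error of the static fit minus that of the flexible fit.

    Assumed definition: a large positive gap means the flexible model
    explains the frames much better than a single static scene does.
    """
    if not flexible_errors or len(flexible_errors) != len(static_errors):
        raise ValueError("error lists must be non-empty and equally long")
    n = len(flexible_errors)
    return sum(static_errors) / n - sum(flexible_errors) / n


# Toy data: three frames, three parameters; the last parameter jumps in frame 2.
frames = [
    [0.0, 1.0, 2.0],
    [0.1, 1.0, 5.0],
    [0.2, 1.0, 2.1],
]
static_proxy = temporal_median(frames)
gap = flexible_static_gap([0.01, 0.02, 0.03], [0.05, 0.04, 0.06])
```

In this toy example the median suppresses the single-frame outlier in the last parameter, which is the intuition behind using a temporal median as a static proxy: transient deformations are discarded rather than averaged into the scene.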
