CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

2026-06-08 · Computer Vision and Pattern Recognition

AI summary

The authors created CineDance-1M, a big open dataset with long videos that include both sound and pictures, to help computers learn how to make videos with matching audio. Their dataset has detailed, organized labels and was carefully made through a three-step process to ensure quality. They also built CineBench, a set of tests and scores, to check how well models create videos and sound together. To show the dataset's value, they improved an existing model called LTX-2.3 into CineDance, which makes good videos with well-aligned audio. This work aims to help future research in making longer, multi-scene videos with matching audio.

Text-to-Audio-Video (T2AV), multi-shot videos, long-form video generation, dataset curation, narrative parsing, dual-modal captioning, audio-video alignment, human-aligned metrics, LTX-2.3, CineBench
Authors
Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Xiangtai Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao
Abstract
The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.