VideoAgent: All-in-One Framework for Video Understanding and Editing

2026-06-22 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors present VideoAgent, a system for understanding and editing long videos. It uses agents to plan video shots and combines more than thirty specialized editing tools into pipelines that produce videos with a coherent story. Compared with earlier multimodal LLMs and agentic systems, it assembles editing pipelines more reliably (87-95% orchestration success), cuts API costs by 60%, and produces videos that human raters score only 4% below human-made ones. The authors evaluated it on six video categories and released their code online.

video editing, video comprehension, multi-agent system, shot planning, cross-modal retrieval, large language models, video editing pipeline, narrative coherence, API cost optimization, human evaluation
Authors
Hengji Zhou, Lingxuan Huang, Jian Wang, Bing Zhou, Si Wu, Lianghao Xia, Chao Huang
Abstract
Video editing has become essential in digital media creation, yet existing automated systems are restricted to short segment processing and domain-specific tasks. They face two critical limitations: i) inability to handle diverse video comprehension and editing operations, and ii) lack of long-video understanding for coherent narrative creation. We propose VideoAgent, an all-in-one agentic framework addressing these challenges through two key innovations. First, we develop automated video shot creation with shot planning agents for coherent narratives and cross-modal retrieval for aligned visual content. Second, we design a multi-agent orchestration framework integrating over thirty specialized editing agents. Intent parsing filters relevant tools while textual-gradient graph optimization assembles complex editing pipelines. Extensive experiments on our newly proposed VideoEdit benchmark and public datasets demonstrate VideoAgent's superiority over existing multimodal LLMs and agentic systems. VideoAgent achieves 87-95% orchestration success rates while reducing API costs by 60%. Human evaluation across six video categories shows VideoAgent produces professional-quality content approaching human-level performance, with ratings only 4% below human-created videos. We release our code at https://github.com/HKUDS/VideoAgent.
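
As a rough illustration of the cross-modal retrieval idea mentioned in the abstract, the sketch below ranks candidate shots against a text query by cosine similarity in a shared embedding space. It is a minimal sketch under stated assumptions, not VideoAgent's implementation: the abstract does not specify the encoder or scoring function, and the `retrieve_shots` helper, the shot IDs, and the toy three-dimensional embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_shots(query_embedding, shot_embeddings, top_k=2):
    """Hypothetical helper: return the top_k (shot_id, score) pairs, best match first.

    Assumes the text query and the video shots were already embedded into the
    same vector space by some cross-modal encoder (not shown here).
    """
    scored = [
        (shot_id, cosine_similarity(query_embedding, embedding))
        for shot_id, embedding in shot_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-written embeddings standing in for real encoder outputs.
shots = {
    "shot_beach": [0.9, 0.1, 0.0],
    "shot_city": [0.1, 0.9, 0.2],
    "shot_sunset": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a planned shot description

ranked = retrieve_shots(query, shots, top_k=2)
for shot_id, score in ranked:
    print(f"{shot_id}: {score:.3f}")
```

In a real system the embeddings would come from a pretrained text-video encoder, and the ranked shots would be handed to the shot planning stage; both parts are outside the scope of this sketch.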