SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series
2026-06-02 • Computation and Language
Computation and Language, Computer Vision and Pattern Recognition
AI summary
The authors created SagaQA, a test to check if computers can understand full TV series stories by connecting facts from different episodes. Unlike other tests that focus on short clips, SagaQA needs deep thinking across many parts of a show. They also explored different planning methods for how computers solve these puzzles and found that 'hybrid planners' work best for understanding the whole story. This helps show how computers might get better at following complex, ongoing narratives in videos.
multi-hop reasoning, long-form video, TV series, multimodal narrative, agentic methods, planning strategies, parallel planners, sequential planners, hybrid planners, narrative understanding
Authors
Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
Abstract
We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.