A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

2026-06-03 • Computation and Language

AI summary

The authors studied how multimodal large language models (MLLMs) summarize several videos supplied in a single input. They found that a video's position in the input list can change the quality of its summary, even when the video content itself is unchanged. Tests across several video domains and nine models showed that this positional bias varies by model and by domain, and that increasing the visual or generation budget does not uniformly remove it. They also analyzed prompt-level mitigation methods and concluded that multi-video summarization remains sensitive to input order and position.

Multimodal Large Language Models, Multi-video summarization, Positional bias, ActivityNet dataset, Directional Positional Bias, Coverage metric, Middle-Edge Gap, Prompt engineering, Video understanding

Authors

Huangchen Xu, Yuan Wu, Yi Chang

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
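
The abstract does not give formulas for its metrics, so the following is only an illustrative sketch of how slot-level position effects could be quantified. It is not the authors' definition of Coverage, DPB, or MEG. The function names, the choice of "first half minus second half" for a signed gap, and "edge slots minus interior slots" for a middle-edge gap are all assumptions made for illustration, as are the toy scores.

```python
from statistics import mean

def slot_means(runs):
    """Average a per-video quality score by input slot.

    runs[r][s] is the (hypothetical) score of the summary for whichever
    video occupied slot s in run r, e.g. after shuffling the video order.
    """
    n_slots = len(runs[0])
    return [mean(run[s] for run in runs) for s in range(n_slots)]

def directional_gap(means):
    """Assumed signed gap: mean of the first half of slots minus the second half."""
    half = len(means) // 2
    return mean(means[:half]) - mean(means[-half:])

def middle_edge_gap(means):
    """Assumed gap: mean of the two edge slots minus mean of the interior slots.

    Returns None when there are no interior slots (fewer than three slots).
    """
    if len(means) < 3:
        return None
    return mean([means[0], means[-1]]) - mean(means[1:-1])

# Toy four-video scores (made up): edges score higher than the middle slots.
runs = [
    [0.75, 0.50, 0.50, 0.75],
    [0.75, 0.25, 0.25, 0.75],
]
means = slot_means(runs)
print(means)
print(directional_gap(means))
print(middle_edge_gap(means))
```

In this toy case the signed gap is zero because the two halves are symmetric, while the edge-versus-middle gap is positive. That mirrors the abstract's observation that signed directional bias can be small even when middle positions underperform, which is why a single signed measure may not be sufficient on its own.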
