Popcorn: A Configurable Benchmark for Visual Evidence in Multimodal Movie Recommendation

2026-06-08 | Information Retrieval

AI summary

The authors created Popcorn, a tool to help computers recommend movies using different kinds of visual information like full movies, trailers, and thumbnails. They show that these visual sources each provide different clues and aren’t interchangeable when making recommendations. Their tool helps researchers test how to best combine and use these visual cues along with text metadata to improve movie recommendations. They also share their code for others to use.

movie recommendation, multimodal learning, visual embeddings, trailers, thumbnails, full movies, vision-language models, MovieLens, recommendation benchmarks, data fusion
Authors
Ali Tourani, Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia
Abstract
Movies are long-form audiovisual works, yet recommender benchmarks often rely on trailers, thumbnails, or metadata. These sources differ in semantics and scalability: full movies preserve consumption-level evidence, trailers concentrate promotional highlights, and thumbnails provide sparse but catalog-scale visual signals. We present Popcorn, a configurable benchmark for visual evidence in multimodal movie recommendation, combining title-aligned full-movie/trailer embeddings with MovieLens-linked thumbnail features encoded by modern visual and vision-language models. Popcorn standardizes modality assembly, fusion, splitting, evaluation, and LLM-augmented metadata through a single configuration contract. Experiments show that thumbnail VLMs provide strong, scalable item-side evidence, while controlled trailer/full-movie comparisons show that visual evidence sources are not interchangeable: the choice of source and fusion strategy affects ranking accuracy, coverage, diversity, and calibration. The framework is available at https://github.com/RecSys-lab/Popcorn.
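
To make the idea of configurable source selection and fusion concrete, here is a minimal sketch of how a configuration entry might pick visual sources and a fusion strategy for one item. It is not taken from the Popcorn codebase: the config keys (`sources`, `fusion`), the function names, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_concat(sources):
    """Concatenation fusion: normalize each source embedding, then join them."""
    fused = []
    for vec in sources:
        fused.extend(l2_normalize(vec))
    return fused

def fuse_mean(sources):
    """Mean fusion: normalize, then average element-wise (equal dims required)."""
    if len({len(v) for v in sources}) != 1:
        raise ValueError("mean fusion needs embeddings of equal dimension")
    normed = [l2_normalize(v) for v in sources]
    return [sum(col) / len(normed) for col in zip(*normed)]

# Hypothetical registry mapping a config value to a fusion function.
FUSERS = {"concat": fuse_concat, "mean": fuse_mean}

def build_item_vector(embeddings, cfg):
    """Select the configured visual sources for one item and fuse them."""
    selected = [embeddings[name] for name in cfg["sources"]]
    return FUSERS[cfg["fusion"]](selected)

# Toy per-item embeddings (illustrative values, not real features).
item_embeddings = {
    "full_movie": [1.0, 0.0],
    "trailer": [3.0, 4.0],
    "thumbnail": [0.0, 2.0],
}

# Hypothetical configuration: which sources to use and how to combine them.
config = {"sources": ["trailer", "thumbnail"], "fusion": "concat"}

item_vector = build_item_vector(item_embeddings, config)
```

Changing only `config` (for example to `{"sources": ["full_movie", "trailer"], "fusion": "mean"}`) yields a different item representation without touching the rest of the pipeline, which is the kind of controlled comparison between evidence sources that the abstract describes.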