How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

2026-04-10 · Computer Vision and Pattern Recognition

AI summary

The authors study how the way a video language model outputs timestamps affects the accuracy and speed of video temporal grounding systems. They compare three common output methods on the same models, datasets, and training protocol, so that differences can be attributed to the output design alone. Their experiments show that the output method substantially changes both accuracy and computational cost, independent of model size. Continuous temporal decoding offers the best balance between accuracy and efficiency, which makes it a good fit for resource-constrained devices. The results offer guidance for building fast and accurate video grounding systems.

Video Temporal Grounding, Multimodal Large Language Models, Output Paradigm, Continuous Temporal Decoding, Localization Accuracy, Inference Latency, LoRA fine-tuning, Compact Vision-Language Models, Pareto Frontier, System Efficiency
Authors
Shengji Jin, Yuanhao Zou, Victor Zhu, Zhengping Ji, Chen Chen
Abstract
While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. This makes it challenging to isolate the specific impact of the output design. Additionally, as VTG systems are increasingly considered for resource-constrained edge deployment, the trade-off between output formulation and system-level efficiency requires systematic investigation. In this paper, we present a controlled empirical study comparing three dominant VTG output paradigms: Text Numeral Generation, Temporal Token Generation, and Continuous Temporal Decoding. We evaluate these paradigms across identical compact VLMs (SmolVLM2, FastVLM, and Molmo2) using consistent datasets and LoRA fine-tuning protocols. Evaluations on Charades-STA, QVHighlights, and YouCook2 measure both localization accuracy and system efficiency, including inference latency, training throughput, and parameter overhead. Our results demonstrate that the choice of output formulation significantly affects both grounding accuracy and computational cost, independent of model scale. Specifically, the continuous distribution paradigm consistently achieves the most favorable efficiency-accuracy trade-off on the Pareto frontier, delivering robust localization with minimal latency overhead. These findings provide objective empirical guidelines for designing efficient, deployment-ready VTG systems.
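To make the Pareto-frontier claim concrete, the following is a minimal sketch of how a set of (latency, accuracy) measurements could be reduced to its non-dominated configurations. It is not the authors' evaluation code. The `Run` type, the `pareto_frontier` function, the configuration names, and every number below are hypothetical placeholders chosen for illustration, not results from the paper.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One measured configuration (all fields are illustrative assumptions)."""
    name: str
    latency_ms: float  # inference latency, lower is better
    accuracy: float    # localization accuracy, higher is better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run beats on both latency and accuracy."""
    frontier: List[Run] = []
    best_accuracy = float("-inf")
    # Visit runs from fastest to slowest; on equal latency, the more accurate first.
    ordered = sorted(runs, key=lambda r: (r.latency_ms, -r.accuracy))
    for run in ordered:
        # A slower run stays only if it is strictly more accurate than every faster one.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Placeholder measurements, NOT values reported in the paper.
runs = [
    Run("config_a", latency_ms=80.0, accuracy=0.40),
    Run("config_b", latency_ms=100.0, accuracy=0.55),
    Run("config_c", latency_ms=120.0, accuracy=0.50),
    Run("config_d", latency_ms=150.0, accuracy=0.60),
]

frontier_names = [run.name for run in pareto_frontier(runs)]
print(frontier_names)
```

In this toy data, `config_c` is dominated because `config_b` is both faster and more accurate; the other three remain on the frontier. A paradigm that "achieves the most favorable trade-off on the Pareto frontier" in the abstract's sense would be one whose configurations survive this filter while sitting in the low-latency, high-accuracy region.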