Rethinking Model Efficiency: Multi-Agent Inference with Large Models
2026-04-06 • Computer Vision and Pattern Recognition
AI summary
The authors studied how different vision-language models generate answers token by token, a process in which the number of output tokens can dominate latency. They found that bigger models that generate fewer tokens can be faster than, and perform as well as or better than, smaller models that need more tokens. To improve both speed and accuracy, they created a system in which a big model gives short answers but borrows key reasoning from a smaller model when needed. Their tests showed that this approach works well and combines the strengths of big and small models.
vision-language models, large language model, autoregressive decoding, output tokens, latency, multi-agent inference, reasoning tokens, model efficiency, benchmark tasks
Authors
Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian
Abstract
Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. The number of output tokens can therefore become the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms this observation: a large model can achieve performance better than or comparable to that of a small model while producing significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that, by reusing the reasoning tokens from small models, the framework approaches the performance of a large model that performs its own reasoning, which confirms the effectiveness of our proposal.
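The latency argument in the abstract can be sketched with a simple additive cost model: end-to-end latency is roughly the prefill time plus a per-token decode time multiplied by the number of output tokens. The numbers below are purely illustrative assumptions (not measurements from the paper); they only show how a large model with a slower per-token rate can still finish first when it emits far fewer tokens.

```python
def e2e_latency_ms(prefill_ms: float, per_token_ms: float, n_output_tokens: int) -> float:
    """Additive latency model for autoregressive decoding:
    total = prefill cost + per-token decode cost * number of output tokens."""
    return prefill_ms + per_token_ms * n_output_tokens

# Hypothetical numbers: the large model decodes each token more slowly,
# but needs an order of magnitude fewer tokens to answer.
large = e2e_latency_ms(prefill_ms=120.0, per_token_ms=30.0, n_output_tokens=20)   # 120 + 600  = 720 ms
small = e2e_latency_ms(prefill_ms=40.0,  per_token_ms=10.0, n_output_tokens=200)  # 40 + 2000 = 2040 ms

print(large, small)  # the large model finishes well before the small one
```

Under these assumed costs, the large model's total latency (720 ms) is roughly a third of the small model's (2040 ms), despite a 3x slower per-token rate, mirroring the paper's observation that output length, not model size alone, drives end-to-end latency.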