SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
2026-03-24 • Computer Vision and Pattern Recognition; Computation and Language
AI summary
The authors address a limitation of agentic multimodal large language models (MLLMs): they inspect and reason about images over many sequential steps, which makes them slow. They introduce SpecEyes, a method that uses a smaller, faster model to predict what the larger model will conclude and terminate early when possible, speeding things up without losing accuracy. They also propose a way to measure how confident the smaller model is in its predictions, which decides when the shortcut can be trusted. Their experiments show SpecEyes makes these models 1.1 to 3.35 times faster and sometimes more accurate, especially when handling many requests at once.
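The speculate-then-verify idea described above can be sketched as a simple control loop. Everything below is illustrative: `draft_answer`, `run_tool_chain`, and the confidence threshold are placeholders standing in for the small tool-free MLLM, the full agentic pipeline, and the gating criterion, not the paper's actual components.

```python
# Hypothetical sketch of speculative planning with a confidence gate.
# A cheap draft model guesses the answer; only low-confidence cases
# fall back to the expensive tool-calling agent.

def draft_answer(image, question):
    """Fast, tool-free guess plus a confidence score (placeholder logic)."""
    # A real system would run a lightweight MLLM here.
    return "a red bicycle", 0.92

def run_tool_chain(image, question):
    """Slow agentic loop with iterative visual tool calls (placeholder logic)."""
    # A real system would crop, zoom, re-query, etc., over many steps.
    return "a red bicycle"

def answer(image, question, threshold=0.85):
    guess, confidence = draft_answer(image, question)
    if confidence >= threshold:
        # Gate passed: accept the cheap guess and skip the tool chain.
        return guess
    # Gate failed: pay the full sequential cost of the agent.
    return run_tool_chain(image, question)
```

The speedup comes from how often the gate fires: every accepted guess replaces an entire perception-reasoning-tool loop with a single forward pass of the small model.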
Multimodal Large Language Models · Agentic Models · Speculative Execution · Cognitive Gating · Answer Separability · Tool Invocation · Latency · Concurrency · Parallel Funnel · Self-verification
Authors
Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo
Abstract
Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
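One way to read the answer-separability gate is as a margin between the draft model's top candidate answers: when one answer's probability clearly dominates, the speculative plan can be accepted without any oracle label. The metric below is an illustrative proxy under that reading, not the paper's exact formulation.

```python
import math

def answer_separability(logits):
    """Gap between the two most probable candidate answers.

    A large gap means the draft model's top answer is well separated
    from the runner-up, so self-verification can accept it; a near-tie
    signals ambiguity and defers to the full tool chain.
    (Illustrative proxy; the paper's metric may differ.)
    """
    # Numerically stable softmax over candidate-answer logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    return probs[0] - probs[1]

# A confident draft (one dominant candidate) scores high;
# an ambiguous draft (near-tie) scores near zero.
confident = answer_separability([5.0, 1.0, 0.5])
ambiguous = answer_separability([2.0, 1.9, 0.1])
```

Because the score needs only the draft model's own output distribution, the gate requires no ground-truth labels at inference time, which is what makes the self-verification label-free.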