AIR: Adaptive Interleaved Reasoning with Code in MLLMs

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors aim to improve multimodal large language models (MLLMs) by teaching them to interleave reasoning with code, especially on tasks that require complex numerical computation. Whereas previous methods mostly applied tool use to visual perception tasks through predefined heuristics, this approach uses reinforcement learning (RL) to train models on code-augmented computation tasks. The training process combines a two-stage cold-start data construction pipeline, data filtering for the RL dataset, and a group-constrained reward function that helps the model decide when to invoke tools. In the reported experiments, RL training with this reward raises accuracy by an average of 6.1 percentage points on the evaluation benchmarks, and the overall tool-use success rate exceeds 95%.

multimodal large language models, interleaved reasoning, reinforcement learning, numerical computation, tool use, code augmentation, reward function, data filtering, training pipeline, evaluation benchmarks

Authors

Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong

Abstract

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and, because they focus exclusively on visual operations, cannot address numerical computation problems. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning (RL) training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after RL training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall tool-use success rate exceeds 95%. Our data and code are available at: https://github.com/CongHan0808/AIR.git.
