Dynamo: Dynamic Skill-Tool Evolution for Vision-Language Agents
2026-06-29 • Artificial Intelligence
AI summary
The authors introduce Dynamo, a way to improve vision-language models without retraining them. Instead of changing the model weights, Dynamo lets the model learn from its own correct and incorrect attempts on a small set of labeled examples by creating reusable reasoning skills and executable visual tools. These skills and tools target different kinds of bottlenecks (reasoning versus perception) and are saved in a persistent library for future use. Across four visual reasoning benchmarks and five model backbones, Dynamo improved accuracy over direct inference in every setting and recovered most of the gain of task-specific reinforcement learning methods at a fraction of the compute.
vision-language models, visual reasoning, frozen model, training-free adaptation, reusable skills, executable tools, reinforcement learning, benchmark evaluation, model inference
Authors
Yutao Sun, Yanting Miao, Hao-Xuan Ma, Mengyu Zhou, Mingshuai Chen, Tiancheng Zhao, Dexin Wang, Lei Lv, Li Xu, Xiaoxi Jiang, Guanjun Jiang
Abstract
Improving vision-language models (VLMs) on visual reasoning typically requires retraining or hand-designed prompts and tools. We present Dynamo, a training-free framework that adapts a frozen VLM without any weight updates. On a small labeled training subset, the agent inspects its own correct and incorrect attempts and evolves two complementary capabilities: reusable reasoning skills for cognitive bottlenecks, and executable visual tools for perceptual ones. Each generated tool is paired with a skill that specifies when to invoke it, and both capability types accumulate in a persistent library. Across four visual reasoning benchmarks and five VLM backbones, Dynamo improves over direct inference on all 20 model-benchmark settings (avg. +5.6 accuracy). When the tool set is given in advance, the framework learns when to call each tool, and per-step tool choice improves on every tested backbone. Against task-specific RL (VTool-R1, DeepEyes), Dynamo closes 65-99% of the RL gap at a fraction of the compute, and combines additively with RL when available.
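To make the idea of a persistent library concrete, the following is a minimal sketch of one way skills and tools could be stored so that every tool is registered together with the skill that says when to invoke it. It is not the authors' implementation: the names `Skill`, `Library`, `add_skill`, `add_tool`, and the toy `crop` tool are all hypothetical, and the abstract does not specify how skills or tools are actually represented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Skill:
    """Hypothetical record for a reusable reasoning skill."""
    name: str
    guidance: str                    # text the agent would read when the skill applies
    tool_name: Optional[str] = None  # set only when the skill governs a tool


@dataclass
class Library:
    """Hypothetical persistent store that accumulates skills and tools."""
    skills: Dict[str, Skill] = field(default_factory=dict)
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)

    def add_skill(self, skill: Skill) -> None:
        # A skill may only point at a tool that is already registered.
        if skill.tool_name is not None and skill.tool_name not in self.tools:
            raise KeyError(f"unknown tool: {skill.tool_name}")
        self.skills[skill.name] = skill

    def add_tool(self, name: str, fn: Callable[..., object], usage: Skill) -> None:
        # Assumption: a tool is never stored without its paired usage skill.
        if usage.tool_name != name:
            raise ValueError("usage skill must reference the tool it is paired with")
        self.tools[name] = fn
        self.skills[usage.name] = usage


def crop(image, top, left, height, width):
    """Toy visual tool: crop a region from an image stored as nested lists."""
    return [row[left:left + width] for row in image[top:top + height]]


lib = Library()

# A pure reasoning skill (no tool attached).
lib.add_skill(Skill("count_step_by_step",
                    "Enumerate objects one at a time before answering."))

# A visual tool, registered together with the skill that says when to use it.
lib.add_tool("crop", crop,
             Skill("zoom_on_small_text",
                   "Crop the relevant region before reading small text.",
                   tool_name="crop"))

image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
patch = lib.tools["crop"](image, 1, 1, 2, 2)
```

Enforcing the pairing inside `add_tool` is a design choice made for this sketch: it mirrors the abstract's statement that each generated tool comes with a skill specifying when to invoke it, while leaving open how the agent selects among skills at inference time.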