MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

2026-06-01 • Computation and Language

Computation and Language • Artificial Intelligence • Machine Learning

AI summary

The authors address the challenge of turning diverse, noisy online how-to guides, which are written for human readers, into skills that agents can execute on long-horizon tasks. They introduce a benchmark, MMG2Skill-Bench, to measure how well current agents learn from such guides. Their proposed framework, MMG2Skill, compiles guides into editable skills and revises them using root-cause feedback from the agent's own trajectories, without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play, this approach outperforms vanilla baseline agents, whereas prompting agents with raw guides can degrade performance; both structured skill construction and trajectory-driven revision are needed for the gains. On tasks where success can be inferred, analyzer-based early stopping prevents late-stage regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

procedural knowledge, guide-to-skill learning, vision-language models, trajectory feedback, benchmark, skill revision, multimodal data, closed-loop system, early stopping, performance evaluation

Authors

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.
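The closed loop described in the abstract (compile a guide into an editable skill, run a fixed agent conditioned on it, revise the skill from trajectory-level feedback, and stop early once success is inferred) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Skill` and `Feedback` structures, the functions `compile_guide`, `revise`, and `closed_loop`, and the toy agent and analyzer are all hypothetical stand-ins. In particular, the real system uses a VLM agent and a root-cause analyzer, which are replaced below by trivial placeholder functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Skill:
    """Editable skill compiled from a guide (hypothetical structure)."""
    steps: List[str]
    revisions: int = 0


@dataclass
class Feedback:
    """Trajectory-level analysis result (hypothetical structure)."""
    inferred_success: bool
    root_cause: Optional[str] = None


def compile_guide(guide: str) -> Skill:
    # Placeholder compilation: one skill step per non-empty guide line.
    return Skill(steps=[ln.strip() for ln in guide.splitlines() if ln.strip()])


def revise(skill: Skill, feedback: Feedback) -> Skill:
    # Placeholder revision: append a step that addresses the diagnosed root cause.
    return Skill(
        steps=skill.steps + [f"Avoid: {feedback.root_cause}"],
        revisions=skill.revisions + 1,
    )


def closed_loop(
    guide: str,
    run_agent: Callable[[Skill], List[str]],
    analyze: Callable[[List[str]], Feedback],
    max_attempts: int = 5,
) -> Tuple[Skill, int]:
    """Run attempts until the analyzer infers success or the budget is spent.

    No benchmark score is consulted; only the analyzer's reading of the
    trajectory drives revision and early stopping.
    """
    skill = compile_guide(guide)
    attempts = 0
    for _ in range(max_attempts):
        trajectory = run_agent(skill)   # the agent itself stays fixed
        attempts += 1
        feedback = analyze(trajectory)
        if feedback.inferred_success:   # analyzer-based early stopping
            break
        skill = revise(skill, feedback)
    return skill, attempts


# Toy stand-ins (assumptions): the "agent" simply replays the skill steps,
# and the "analyzer" infers success once a corrective step is present.
def toy_agent(skill: Skill) -> List[str]:
    return list(skill.steps)


def toy_analyzer(trajectory: List[str]) -> Feedback:
    if any(step.startswith("Avoid:") for step in trajectory):
        return Feedback(inferred_success=True)
    return Feedback(inferred_success=False, root_cause="acting before the page loads")


if __name__ == "__main__":
    final_skill, used = closed_loop(
        "Open settings\nEnable dark mode", toy_agent, toy_analyzer
    )
    print(used, final_skill.steps)
```

In this toy setup the loop stops after the second attempt instead of spending the full budget, which mirrors the role early stopping plays in the abstract. The sketch also makes the calibration caveat concrete: if `toy_analyzer` reported success incorrectly, the loop would stop on a failed skill.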
