CODESKILL: Learning Self-Evolving Skills for Coding Agents

2026-05-25 • Artificial Intelligence

AI summary

The authors present CODESKILL, a system that helps AI coding agents learn from past experience by distilling it into reusable skills. Instead of relying on fixed rules, CODESKILL uses an LLM-based policy to decide how skills are extracted, organized, and updated. The policy is trained with reinforcement learning, using feedback on both skill quality and actual task success. In their tests, CODESKILL helps coding agents solve more problems than baselines without skill learning or with simpler prompt-based or memory approaches.

coding agents, procedural skills, skill extraction, skill bank, large language model (LLM), reinforcement learning, reward signal, software engineering benchmarks, trajectory, prompt

Authors

Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, Yang Liu

Abstract

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.
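
To make the hybrid reward concrete, the following is a minimal sketch of how a dense rubric score and a sparse pass/fail execution signal might be blended into one scalar. The abstract does not give the actual formula, so the convex combination, the `rubric_weight` parameter, and the names `RewardInputs` and `hybrid_reward` are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RewardInputs:
    # Hypothetical per-criterion rubric scores, each assumed to lie in [0, 1].
    rubric_scores: List[float]
    # Outcome of the frozen downstream agent; None if no run was performed.
    task_passed: Optional[bool]


def hybrid_reward(inputs: RewardInputs, rubric_weight: float = 0.5) -> float:
    """Blend a dense rubric score with a sparse execution signal.

    Assumption: a simple convex combination. The paper may use a
    different weighting or functional form.
    """
    if not 0.0 <= rubric_weight <= 1.0:
        raise ValueError("rubric_weight must be in [0, 1]")

    # Dense term: mean rubric score (0.0 if no criteria were scored).
    scores = inputs.rubric_scores
    dense = sum(scores) / len(scores) if scores else 0.0

    # Sparse term: only available when the downstream agent was executed.
    if inputs.task_passed is None:
        return rubric_weight * dense
    sparse = 1.0 if inputs.task_passed else 0.0
    return rubric_weight * dense + (1.0 - rubric_weight) * sparse
```

In this sketch the dense term gives a learning signal on every extracted skill, while the sparse term ties the reward to whether the downstream agent actually solved the task when that outcome is available.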
