AIP: A Graph Representation for Learning and Governing Agent Skills
2026-06-03 • Artificial Intelligence
Artificial Intelligence, Machine Learning
AI summary
The authors noticed that agent skills are usually written as plain text, which makes it hard for agents to reliably understand and use them. They created the Agent Instruction Protocol (AIP), which turns skills into clear step-by-step graphs with validated scripts and connections, making them easier for agents to follow and improve. Using AIP improved performance on real tasks and made it simpler to find and fix mistakes. This structured approach also supports better oversight and learning of skills over time.
agent skills, execution graph, schema validation, YAML, deterministic scripts, SkillsBench, reinforcement learning, task reward, natural language instructions, skill debugging
Authors
Zachary Blumenfeld, Jim Webber
Abstract
Agent Skills today consist largely of free-form prose, requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, on execution, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and its pass rate from 53% to 67% across 27 real agent tasks from SkillsBench. The gain is statistically significant (Wilcoxon signed-rank p = 0.011): the compiled skills won on 12 tasks, lost on 2, and tied on 13, often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
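
The graph model described in the abstract (steps as nodes, typed input/output edges, validation before execution) can be illustrated with a minimal Python sketch. This is not the AIP schema or the authors' implementation: the class names, the field names (`kind`, `inputs`, `outputs`), the two step kinds, and the example skill are all hypothetical, chosen only to show how typed edges make a skill checkable and addressable node by node.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

# Assumed step kinds: a deterministic script or a natural-language step.
STEP_KINDS = {"script", "prompt"}


@dataclass
class Step:
    """One node of the skill graph (hypothetical field names)."""
    name: str
    kind: str
    inputs: dict = field(default_factory=dict)   # port name -> type name
    outputs: dict = field(default_factory=dict)  # port name -> type name


@dataclass
class Edge:
    """A typed connection from one step's output port to another's input port."""
    src: str
    src_port: str
    dst: str
    dst_port: str


def validate(steps, edges):
    """Return (errors, order): a list of problems and a topological execution order.

    The order is empty when the graph contains a cycle.
    """
    by_name = {s.name: s for s in steps}
    errors = [f"{s.name}: unknown kind {s.kind!r}"
              for s in steps if s.kind not in STEP_KINDS]
    deps = {s.name: set() for s in steps}  # step -> steps it depends on
    for e in edges:
        label = f"{e.src}.{e.src_port} -> {e.dst}.{e.dst_port}"
        src, dst = by_name.get(e.src), by_name.get(e.dst)
        if src is None or dst is None:
            errors.append(f"{label}: unknown step")
            continue
        out_type = src.outputs.get(e.src_port)
        in_type = dst.inputs.get(e.dst_port)
        if out_type is None or in_type is None:
            errors.append(f"{label}: unknown port")
        elif out_type != in_type:
            errors.append(f"{label}: type mismatch ({out_type} vs {in_type})")
        deps[e.dst].add(e.src)
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError:
        errors.append("graph contains a cycle")
        order = []
    return errors, order


# A toy three-step skill: two script-backed steps feeding one prose-backed step.
steps = [
    Step("load_data", "script", outputs={"rows": "table"}),
    Step("clean", "script", inputs={"rows": "table"}, outputs={"rows": "table"}),
    Step("summarize", "prompt", inputs={"rows": "table"}, outputs={"report": "text"}),
]
edges = [
    Edge("load_data", "rows", "clean", "rows"),
    Edge("clean", "rows", "summarize", "rows"),
]

errors, order = validate(steps, edges)
print(errors)  # []
print(order)   # ['load_data', 'clean', 'summarize']
```

Because every error message names a specific step or edge, a failure points at one node rather than at a passage of prose, which is the property the abstract relies on for node-by-node diagnosis and repair. A real specification would presumably carry more than this sketch does, for example script paths and schema versions, none of which are modeled here.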