Transferable Self-Evolving Playbooks for Agentic Security Auditing

2026-06-15 • Cryptography and Security

AI summary

The authors created EvoHunt, a system that automatically improves the security testing instructions (called playbooks) that AI agents use to find software vulnerabilities. Whereas past work relied on humans to write these instructions, EvoHunt uses AI agents to run, score, and revise the playbooks by learning from failures. Their experiments show that evolved playbooks help AI tools find more security issues, in one case outperforming a dedicated commercial product (OpenAI Codex Security), and that they substantially improve weaker AI models. The evolved playbooks also need only minor adaptation to run under a different AI model or harness.

LLM (Large Language Model), vulnerability discovery, security auditing, playbook evolution, Codex, OpenCode, ground truth evaluation, transfer learning, workflow automation, heuristics

Authors

Ziyue Wang, Cheuk Wang Maurice Ng, Chenchen Yu, Strick Sheng, Kaihua Qin, Liyi Zhou

Abstract

An LLM agent for vulnerability discovery and validation is more than a model. It combines three components: an LLM for code analysis; an agent harness, such as Codex or OpenCode, for navigation, tool use, and execution; and an audit playbook, that is, domain-specific procedural knowledge that guides the LLM and harness toward vulnerability discovery. Prior work relies on human-supplied playbooks, including prompt engineering, manual workflows, knowledge bases, and heuristics. This raises two research questions. Acquisition: is human curation necessary, or can playbook creation be automated? Transfer: can an evolved playbook carry the audit procedure over to weaker agents and improve their capability? We present EvoHunt, a playbook evolution environment over open-source repositories for security auditing. Three agents drive the evolution loop: an audit agent rolls out the current playbook and produces findings; an evaluator scores the outcomes against ground truth; and a reviser commits updates to the playbook based on failure analysis. The playbook format is unconstrained: starting from an empty playbook, EvoHunt adds or removes workflows, heuristics, vulnerability knowledge, or domain-specific content. The evolved playbook requires only minor adaptation to run under a different LLM or harness. We evaluate EvoHunt on open-source security advisories. For acquisition, playbook evolution raises the end-to-end exploit rate for Codex/GPT5.4-xhigh 6x, from 1.1% to 6.2%, and the evolved OpenCode/GLM5.1 playbook surpasses OpenAI Codex Security on every metric, with an 11.3% vs. 9.2% target-match rate, showing that open-source evolution can outperform a dedicated commercial product. For transfer, the GLM-evolved playbook gives the strongest student lift: Qwen3.6-27B improves from 2.4% to 6.5%, Qwen3.6-35B-A3B improves from 1.1% to 4.6%, and A3B obtains 2.4x more matches than it does with GPT transfer.
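The three-agent loop described in the abstract (audit, evaluate, revise) can be pictured with a toy sketch. This is not EvoHunt's implementation: the `Playbook` class, the `audit`, `evaluate`, `revise`, and `evolve` functions, and the string-matching "agents" are all hypothetical stand-ins, used here only to show how a playbook that starts empty could grow from failure analysis against ground truth.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    # Hypothetical representation: free-form text entries, starting empty.
    entries: list[str] = field(default_factory=list)


def audit(playbook: Playbook, repo: set[str]) -> set[str]:
    """Stand-in audit agent: reports only bug classes the playbook mentions."""
    return {bug for bug in repo if any(bug in e for e in playbook.entries)}


def evaluate(findings: set[str], ground_truth: set[str]) -> tuple[float, set[str]]:
    """Stand-in evaluator: returns the match rate and the missed bugs."""
    matched = findings & ground_truth
    missed = ground_truth - findings
    rate = len(matched) / len(ground_truth) if ground_truth else 0.0
    return rate, missed


def revise(playbook: Playbook, missed: set[str]) -> Playbook:
    """Stand-in reviser: adds one heuristic per round for a missed bug."""
    if missed:
        playbook.entries.append(f"check for {sorted(missed)[0]}")
    return playbook


def evolve(repo: set[str], ground_truth: set[str], rounds: int):
    """Run audit -> evaluate -> revise for a fixed number of rounds."""
    playbook = Playbook()
    history = []
    for _ in range(rounds):
        findings = audit(playbook, repo)
        rate, missed = evaluate(findings, ground_truth)
        history.append(rate)
        playbook = revise(playbook, missed)
    return playbook, history


# Toy data (assumed): bug classes present in a repository and its advisories.
bugs = {"sql injection", "path traversal", "ssrf"}
playbook, history = evolve(repo=bugs, ground_truth=bugs, rounds=4)
print(playbook.entries)
print(history)
```

In this toy setting the match rate rises by one bug class per round until the playbook covers every advisory. In the system the abstract describes, each stand-in would be an LLM agent, and the reviser could also remove or restructure content rather than only append to it.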
