What Should a Skill Remember? Quality-Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

2026-06-08 • Computation and Language


AI summary

The authors study how to rewrite 'skills'—detailed procedures used by large language model agents—to balance quality and cost effectively. They find that simply making skills shorter doesn't always save cost because some details help with debugging and guiding the process. By testing different rewriting strategies on a benchmark, they show that no single method works best for all tasks, but some methods reduce operating costs without losing quality. Their work suggests skill design should focus on managing operational costs rather than just compressing prompts.

large language models, skills, prompt compression, workflow, API anchoring, debugging, cost-quality trade-off, operational knowledge, SkillsBench

Authors

Qinghua Xing, Yinda Chen, Yaping Jin, Zhenhe Wu, Bohan Lin, Hang Zhou, Xinghao Chen, Hanting Chen, Zhiwei Xiong

Abstract

Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality-cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%; in frozen cross-model transfer, the corresponding reductions average 14.7% and 13.7%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: SkillEE (https://github.com/1Reminding/Skill_EE).
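The cost accounting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that total cost is the sum of a one-time rewriting cost and the downstream agent-token cost, and that a rewrite is acceptable only if the verifier outcome is no worse than with the original skill. The names `RewriteResult`, `relative_reduction`, and `pick_rewrite`, the strategy labels, and all token counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteResult:
    """Outcome of running one task with one version of a skill (hypothetical schema)."""
    strategy: str         # illustrative label for the rewriting strategy
    rewrite_tokens: int   # tokens spent producing the rewritten skill (assumed cost term)
    agent_tokens: int     # tokens the agent spent solving the task with this skill
    verifier_pass: bool   # whether the fixed task verifier accepted the run

    @property
    def total_tokens(self) -> int:
        # Assumption: total cost = rewriting cost + downstream agent-token cost.
        return self.rewrite_tokens + self.agent_tokens


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional cost reduction of candidate relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return (baseline - candidate) / baseline


def pick_rewrite(baseline: RewriteResult, candidates: list[RewriteResult]) -> RewriteResult:
    """Pick the cheapest version whose verifier outcome is no worse than the baseline's."""
    eligible = [c for c in candidates if c.verifier_pass or not baseline.verifier_pass]
    # The original skill stays in the pool, so a rewrite is chosen only if it is cheaper.
    return min([baseline] + eligible, key=lambda r: r.total_tokens)


# Made-up numbers: a shorter skill that triggers extra exploration, an anchored
# rewrite that saves agent tokens, and an aggressive rewrite that fails the verifier.
original = RewriteResult("original", 0, 1000, True)
candidates = [
    RewriteResult("compressed", 50, 1400, True),
    RewriteResult("api_anchored", 80, 820, True),
    RewriteResult("aggressive", 40, 600, False),
]

best = pick_rewrite(original, candidates)
total_saving = relative_reduction(original.total_tokens, best.total_tokens)
agent_saving = relative_reduction(original.agent_tokens, best.agent_tokens)
```

In this toy setup the compressed skill is shorter but costs more overall, and the cheapest run is rejected because it fails the verifier, which mirrors the abstract's point that compression alone is not the right objective.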
