CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

2026-06-03

Computation and Language, Machine Learning
AI summary

The authors argue that balancing prompt accuracy against the inference cost of long prompts is hard to do with a single combined score, because fixing the weighting before the search restricts it to a narrow range of trade-offs. They introduce CRAFT, a method that searches for prompts spanning the accuracy-cost trade-off while spending its limited validation budget only on the most promising candidates. It retains a diverse set of solutions, so users can choose a trade-off after the search. Across six classification and reasoning benchmarks, its retained prompts covered both high-accuracy and low-cost regions, whereas accuracy-only, cost-only, and weighted-sum baselines each concentrated in narrower regions.

prompt tuning, Pareto front, inference cost, scalarization collapse, multi-objective optimization, NSGA-II, cost-accuracy trade-off, large language models, edit proposals, benchmarking
Authors
Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta
Abstract
Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
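
To make the contrast between a Pareto front and a weighted sum concrete, here is a minimal Python sketch. It is not CRAFT itself: the candidate names, accuracies, token counts, and the helper functions `dominates`, `pareto_front`, and `weighted_sum_best` are all hypothetical, chosen only to illustrate the two selection rules described in the abstract.

```python
from typing import List, Tuple

# Hypothetical candidate: (name, validation accuracy, prompt-token cost).
Candidate = Tuple[str, float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def weighted_sum_best(cands: List[Candidate], weight: float, max_tokens: int) -> Candidate:
    """Pick the single candidate maximizing accuracy minus weighted, normalized cost."""
    return max(cands, key=lambda c: c[1] - weight * c[2] / max_tokens)


# Illustrative numbers only, not results from the paper.
candidates: List[Candidate] = [
    ("short", 0.70, 100),
    ("medium", 0.80, 300),
    ("long", 0.85, 900),
    ("bloated", 0.78, 700),  # worse than "medium" on both objectives
]

front = pareto_front(candidates)
print([name for name, _, _ in front])

# Each fixed weight commits to exactly one prompt before the trade-off is known.
for w in (0.5, 0.1):
    print(w, weighted_sum_best(candidates, w, max_tokens=900)[0])
```

In this toy setup the front keeps three prompts at different cost levels, while each fixed weight returns only one of them. That is the sense in which the trade-off becomes a post-search choice when the front is retained.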