HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

2026-06-01 · Computation and Language
AI summary

The authors introduce HarnessForge, a framework that helps large language model (LLM) agents adapt by jointly evolving their execution structure (the harness) and their reasoning behavior (the policy). Whereas prior work adapted either the harness or the policy alone, HarnessForge updates both in coordination so that they remain compatible. Across five benchmarks with Qwen3-4B and Qwen3-8B backbones, this joint adaptation outperforms harness-only and policy-only baselines, with gains of up to 12.0% over the strongest baseline and favorable rollout-efficiency tradeoffs. The results indicate that compatibility between the harness and the reasoning policy is important for adapting LLM agents.

LLM agents, meta-adaptation, harness, policy, co-evolution, execution structure, reasoning behavior, reinforcement learning, agent system, Qwen model
Authors
Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou
Abstract
LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted the external harness or trained the underlying reasoning policy, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness-policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness-policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs. These results demonstrate that harness-policy co-evolution is effective and that executable compatibility between the harness and the reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.
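
To make the harness-policy pair and the alternating co-evolution loop more concrete, here is a minimal toy sketch. It is not the authors' implementation and does not reflect the API of the linked repository. Every name in it (`Harness`, `Policy`, `AgentSystem`, `run_task`, `tailor_harness`, `align_policy`, `co_evolve`, `success_rate`) is hypothetical. It assumes a harness can be reduced to a set of enabled tools and a policy to the set of tools it has been aligned to use, which is a drastic simplification of whatever the paper actually does.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Toy harness: execution structure reduced to a set of enabled tools (assumption)."""
    tools: set = field(default_factory=set)


@dataclass
class Policy:
    """Toy policy: reasoning behavior reduced to the tools it can use well (assumption)."""
    trained_for: set = field(default_factory=set)


@dataclass
class AgentSystem:
    """An agent system viewed as a harness-policy pair."""
    harness: Harness
    policy: Policy


def run_task(system, required_tool):
    """Return None on success, otherwise a fault label describing what went wrong."""
    if required_tool not in system.harness.tools:
        return "missing_tool"      # harness-level fault
    if required_tool not in system.policy.trained_for:
        return "policy_mismatch"   # policy is not aligned with the harness
    return None


def tailor_harness(system, tasks):
    """Toy stand-in for fault-guided harness tailoring: enable tools whose absence caused faults."""
    for tool in tasks:
        if run_task(system, tool) == "missing_tool":
            system.harness.tools.add(tool)


def align_policy(system):
    """Toy stand-in for harness-conditioned policy alignment: align the policy to the current harness."""
    system.policy.trained_for = set(system.harness.tools)


def co_evolve(system, tasks, rounds=2):
    """Alternate harness tailoring and policy alignment for a fixed number of rounds."""
    for _ in range(rounds):
        tailor_harness(system, tasks)
        align_policy(system)
    return system


def success_rate(system, tasks):
    """Fraction of tasks that complete without a fault."""
    return sum(run_task(system, t) is None for t in tasks) / len(tasks)
```

In this toy, starting from a system that only handles one of three required tools, calling `tailor_harness` alone or `align_policy` alone leaves the success rate unchanged, while `co_evolve` resolves all faults. This follows from how the toy is constructed and is not evidence for the paper's results; it only illustrates why updating one side of the pair without the other can leave the two incompatible.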