CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents
2026-06-29 • Artificial Intelligence
Artificial Intelligence, Machine Learning
AI summary
The authors explain that simply ranking AI agents by how much money they make in trading isn’t reliable because market conditions heavily influence returns. They created CLQT, a new way to evaluate trading agents by carefully checking each step they take during trading, like gathering info, deciding, and reflecting. This approach records every decision in a way that can be reviewed later, helping to understand why the agent succeeds or fails. They also developed a score system to measure different skills of the agents and tested their method with real and simulated data. Overall, the authors provide a tool to better understand the abilities and weaknesses of AI trading agents beyond just their profits.
LLM agents, portfolio management, sequential trading, backtesting, alpha, transaction costs, scorecard, audit trail, strategy consistency, capability evaluation
Authors
Bo Qu, Mingguang Chen
Abstract
LLM agents are increasingly cast as autonomous portfolio managers, and benchmarks have moved from financial question-answering to sequential trading. Yet most still rank agents by returns over a fixed window -- a weak proxy, since a period's return is dominated by the market path and apparent alpha can dissolve once look-ahead leakage is controlled. Such a ranking certifies neither sound reasoning, nor a consistent strategy, nor a durable edge. We introduce CLQT, which reframes closed-loop trading evaluation as diagnosis rather than ranking: an instrument that localizes where and why an agent's process succeeds or fails. CLQT is a fully closed-loop, cost-aware, strategy-consistent, temporally-gated environment whose agents run a five-stage cycle: gather, synthesize, allocate, execute, reflect. Each round emits a complete DecisionRound sealed into a recompute-verifiable hash chain, so every metric is reconstructable from the trail. Six pillars form the substrate: a hard TimeGate, institutional transaction- and financing-cost modeling, strategy-consistency scoring, three-tier memory, a Model-Context-Protocol tool layer, and mandate-aware synthesis. The same agent runs as a constrained committee of specialized roles or a single full-autonomy orchestrator, making process scaffolding an experimental variable. From the audit trail we compute a five-axis capability scorecard (APM-CS: Coherence, Acuity, Composure, Discipline, Reliability), with Coherence judged partly by a held-out, out-of-cohort LLM to curb self-preference bias. We validate it on a contamination-controlled multi-model backtest with an ablation grid and a live broker track on unseen, post-cutoff data, against a repeated-run noise floor. CLQT separates outcome from capability, yielding not a model ranking but a durable, extensible map of agent competencies and limitations.
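To make the idea of a recompute-verifiable hash chain concrete, here is a minimal sketch of how per-round records could be sealed and later re-verified. It is an illustration under stated assumptions, not the CLQT implementation: the abstract does not specify the hash function, the serialization, or the DecisionRound schema, so SHA-256, canonical JSON, the genesis value, the field names, and the helpers `seal_round`, `build_chain`, and `verify_chain` are all hypothetical.

```python
import hashlib
import json

# Assumed placeholder "previous hash" for the first round (not from the paper).
GENESIS = "0" * 64


def seal_round(prev_hash: str, record: dict) -> str:
    """Hash one round's record together with the previous round's hash."""
    # Canonical JSON (sorted keys, fixed separators) so the same record
    # always serializes to the same bytes.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(records: list[dict]) -> list[dict]:
    """Link records so that each entry commits to all earlier entries."""
    chain = []
    prev = GENESIS
    for record in records:
        digest = seal_round(prev, record)
        chain.append({"record": record, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash from the stored records and compare."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if seal_round(prev, entry["record"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical records loosely mirroring the five-stage cycle
# (gather, synthesize, allocate, execute, reflect); values are made up.
rounds = [
    {"round": 1, "gather": ["news", "prices"], "synthesize": "neutral",
     "allocate": {"ASSET_A": 0.5, "CASH": 0.5}, "execute": "filled",
     "reflect": "no change"},
    {"round": 2, "gather": ["prices"], "synthesize": "cautious",
     "allocate": {"ASSET_A": 0.3, "CASH": 0.7}, "execute": "filled",
     "reflect": "reduced risk"},
]
chain = build_chain(rounds)
print(verify_chain(chain))
```

Because each hash covers the previous hash as well as the current record, editing any earlier round changes every later hash, so a verifier that recomputes the chain from the stored records can detect after-the-fact changes to the trail.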