A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents
2026-06-29 • Machine Learning
Machine Learning • Computation and Language
AI summary
The authors found that some proprietary large language model (LLM) evaluators quickly become unreliable, sometimes within just weeks. They created a new method called EPC that helps detect when these evaluators stop working properly, by measuring how consistent and stable their judgments are. Testing across different versions and conditions, they saw that some evaluators suddenly lose accuracy, which shows that studies using these tools at a single time point can be misleading. Their work highlights the need to monitor evaluator stability over time rather than trusting one-off results.
Large Language Models, LLM Evaluation, Evaluator Stability, Multimodal Preference Collapse Index, Jensen-Shannon Divergence, Coupling Matrix, Self-Evaluation, Version Drift, Diagnostic Framework
Authors
Liu Zewen
Abstract
Measurements of proprietary LLM evaluators can become invalid within weeks; we document one such case and provide a diagnostic framework for detecting it. We introduce EPC, which comprises the Multimodal Preference Collapse Index (MPCI), an evaluator-indexed coupling matrix, and Jensen-Shannon divergence (JSD), and apply it across eight experimental conditions (N=112 main + N=10 ablation = 122 unique repetitions, all reported). Coupling coefficients range from 0.00 to 1.18 across per-condition means (CV ≈ 0.9, n=8 conditions). Four conditions show strong coupling (N=36; GPT-4o May, GPT-4o-mini, Qwen3.7-plus, DashScope 30r); four collapse to near-zero (N=76; GPT-4o June, qwen-plus N=30, symmetric LR, DeepSeek self-eval). The May-to-June GPT-4o drift, an N=8 re-replication that inverts the study's conclusion, is the most informative measurement: a diagnostic instrument that detects its own instability demonstrates the fragility it was designed to measure. Self-evaluation consistently collapses (97% zero, JSD=0.003), though floor effects are possible. An output-format confound analysis finds a per-strategy aggregate rho=0.89 but a per-instance rho=0.219 (p=0.093); PCI is reported as a preference-convergence metric. We release EPC with all data. The finding is not any single coupling magnitude but the pattern of version-conditional instability that makes single-snapshot evaluator studies unreliable.
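The abstract reports two summary statistics without spelling out their computation: the Jensen-Shannon divergence between preference distributions and the coefficient of variation (CV) across per-condition means. The following is a minimal sketch of the textbook definitions, not the released EPC code. The function names, the base-2 logarithm, the use of the population standard deviation, and the toy inputs are all assumptions made for illustration; the paper may use different conventions.

```python
import math
from statistics import mean, pstdev


def jensen_shannon_divergence(p, q, base=2.0):
    """Textbook JSD between two discrete distributions (hypothetical helper).

    p and q are equal-length sequences of non-negative weights; they are
    normalized to sum to 1 before the divergence is computed.
    With base=2 (an assumption) the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs positive total mass")
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # KL(a || b); terms with a_i = 0 contribute nothing by convention.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def coefficient_of_variation(values):
    """CV = standard deviation / mean.

    Assumption: population standard deviation (pstdev); the source does not
    say whether a sample or population estimate was used.
    """
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mu


if __name__ == "__main__":
    # Toy inputs only; these are NOT values from the study.
    identical = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
    disjoint = jensen_shannon_divergence([1, 0], [0, 1])
    print(identical, disjoint)
    print(coefficient_of_variation([0.0, 2.0]))
```

Under these conventions, a JSD near zero means the two preference distributions are nearly indistinguishable, and a CV close to 1 means the spread of the per-condition means is comparable to their average.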