Open Problems in Constitutional Preference Reconstruction

2026-06-29 | Artificial Intelligence

AI summary

The authors study how short sets of natural-language rules, called 'constitutions', can explain why one language model output is preferred over another. They identify three main issues: the quality of these rules is hard to measure, different ways of applying the same rules agree only about 73% of the time, and different models end up with different constitutions. Refining the rules raises agreement between methods to 78% and lets simple, transparent methods roughly match the accuracy of an LLM judge (66% vs. 67%). Overall, the authors argue that these rules should be evaluated together with the method used to apply them, which matters for understanding LLM-based judgments more broadly.

pairwise preference data, Inverse Constitutional AI (ICAI), language models, constitutions, principle composition, LLM judge, majority vote, PRISM, AlpacaEval, Chatbot Arena
Authors
Eleanor Clifford, Michael Amir, Arduin Findeis, Aaron Zhao, Robert Mullins
Abstract
Pairwise preference data is widely used for training and evaluating language models (e.g., in RLHF), but each datapoint records a \emph{choice}, not the rationale behind it. Methods such as Inverse Constitutional AI (ICAI) attempt to improve interpretability by compressing datasets into short ``constitutions'' of natural-language principles. We argue this framing is under-specified: a flat list of principles is not yet an executable decision rule because it leaves principle composition implicit. We use the pairwise setting as a testbed to empirically characterize three open problems in constitutional methods. First, \emph{principle quality is hard to measure}: coverage and accuracy are useful but incomplete proxies for end-to-end reconstruction. Second, \emph{composition is ambiguous}: holding principles fixed, different executors (LLM judge versus majority vote) agree only $73\%$ of the time. Third, \emph{constitutions differ between LLMs}: cross-model vote agreement is $73\%$, whereas intra-model agreement is $81\%$. Across PRISM, AlpacaEval, and Chatbot Arena, we show that principle refinement (ICAI+) may be a first step towards ameliorating these problems: inter-executor agreement rises to $78\%$, and transparent executors match LLM judge accuracy ($66\%$ vs.\ $67\%$). Our results highlight that constitutions should be evaluated as \emph{constitution--executor systems}, with implications for LLM-as-a-judge methods broadly.
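
To make the inter-executor comparison concrete, the following is a minimal sketch of a majority-vote executor and an agreement-rate calculation. It is not the authors' implementation. The vote encoding (`"A"`, `"B"`, or `None` when a principle does not apply), the tie-handling rule, the function names `majority_vote` and `agreement_rate`, and all toy data, including the stand-in LLM-judge decisions, are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

# Assumed encoding: each principle votes "A", "B", or None (not applicable).
Vote = Optional[str]


def majority_vote(votes: Sequence[Vote]) -> Vote:
    """Return the option most principles prefer, or None on a tie or if no principle applies."""
    counts = Counter(v for v in votes if v is not None)
    if counts["A"] == counts["B"]:
        return None  # assumed tie rule: abstain
    return "A" if counts["A"] > counts["B"] else "B"


def agreement_rate(first: Sequence[Vote], second: Sequence[Vote]) -> float:
    """Fraction of pairs on which two executors return the same decision."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("decision lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


# Toy data: one row of per-principle votes for each preference pair.
per_pair_votes = [
    ["A", "A", "B"],
    ["B", None, "B"],
    ["A", "B", None],
    [None, "A", "A"],
]
vote_decisions = [majority_vote(v) for v in per_pair_votes]

# Hypothetical stand-in for decisions produced by an LLM-judge executor.
judge_decisions = ["A", "B", "B", "B"]

rate = agreement_rate(vote_decisions, judge_decisions)
print(f"inter-executor agreement: {rate:.0%}")
```

In this sketch an abstention counts as a disagreement with any concrete decision. Other conventions are possible, such as excluding abstained pairs, and the choice changes the reported rate, which is one small instance of the composition ambiguity the abstract describes.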