Can LLMs Rank? A Tale of Triads and Triage

2026-06-29, Computers and Society

Computers and Society, Artificial Intelligence
AI summary

The authors study how large language models (LLMs) can help make tough decisions that involve ranking people, such as who gets housing or emergency care first. They focus on how to check whether the LLM's pairwise comparisons (judging two people at a time) are reliable before trusting the final ranking. They describe two complementary checks: one measures how consistent the comparisons are within a single run, and the other measures how much the resulting rankings change across repeated runs. Using these checks on two real-world prioritization tasks, they show that three leading LLMs behave quite differently, and they suggest how practitioners can assess reliability before adopting a model.

large language models, pairwise comparisons, ranking consistency, coefficient of consistency, Kendall's tau, social choice theory, triage, homelessness service allocation, tournament graphs
Authors
Gaurab Pokharel, Shafkat Farabi, Patrick J. Fowler, Sanmay Das
Abstract
From housing allocation for households experiencing homelessness to triage in emergency departments, LLMs are increasingly being considered as judges of consequential decisions that require ranking people for scarce resources. Ranking large groups simultaneously is cognitively demanding and error-prone. A natural solution, drawing on decades of social choice theory, elicits pairwise comparisons and aggregates them into a total order. However, a fundamental question remains when LLMs serve as the pairwise judge: how can a practitioner tell, before committing to a ranking, whether the LLM's judgments are sufficiently consistent to trust the result? We discuss two different ways of identifying consistency. A classical diagnostic, the coefficient of consistency $\zeta$, originally developed to measure judge reliability by counting circular triads in tournament graphs, provides a cheap, model-free measure of intra-run consistency. Various standard measures of distance between rankings, for example Kendall's $\tau$, can measure inter-run variability. We show, in both theory and practice, that these measures are independently valuable, and advocate for using both to assess reliability of rankings. We demonstrate the practical importance of our results across two high-stakes prioritization tasks: homelessness service allocation and emergency department triage. Three different leading LLMs have considerably different performance profiles across these two axes of consistency. We provide guidelines for how practitioners could think about measuring and assessing consistency before committing to a model for ranking or prioritization.
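To make the two diagnostics concrete, here is a minimal sketch in Python. It assumes the coefficient of consistency is the classical Kendall and Babington Smith form, which equals 1 minus the ratio of observed circular triads d to the maximum possible for n items: (n^3 - n)/24 for odd n and (n^3 - 4n)/24 for even n. It also assumes Kendall's tau is the tau-a variant on rankings without ties. The abstract does not spell out the exact definitions the authors use, so these may differ from theirs. The function names and the `wins` matrix layout are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb


def circular_triads(wins):
    """Count circular triads (A beats B, B beats C, C beats A) in a tournament.

    wins[i][j] is True when item i was preferred over item j.
    Assumption: every pair was compared exactly once, so for i != j
    exactly one of wins[i][j] and wins[j][i] is True.
    """
    n = len(wins)
    scores = [sum(1 for j in range(n) if j != i and wins[i][j]) for i in range(n)]
    # Each transitive triad has exactly one item that beats the other two,
    # so there are sum(C(score, 2)) transitive triads; the rest are circular.
    return comb(n, 3) - sum(comb(s, 2) for s in scores)


def coefficient_of_consistency(wins):
    """Classical coefficient of consistency: 1 means no circular triads."""
    n = len(wins)
    if n < 3:
        raise ValueError("need at least 3 items to form a triad")
    d = circular_triads(wins)
    # Maximum possible number of circular triads depends on the parity of n.
    max_d = (n**3 - n) / 24 if n % 2 else (n**3 - 4 * n) / 24
    return 1 - d / max_d


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties).

    Each ranking is a list of items ordered from highest to lowest priority.
    """
    if len(rank_a) < 2 or set(rank_a) != set(rank_b):
        raise ValueError("rankings must cover the same items, at least 2")
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / comb(len(rank_a), 2)


# Intra-run check: a 3-item cycle versus a fully transitive run.
cyclic = [[False, True, False], [False, False, True], [True, False, False]]
transitive = [[False, True, True], [False, False, True], [False, False, False]]
zeta_cyclic = coefficient_of_consistency(cyclic)
zeta_transitive = coefficient_of_consistency(transitive)

# Inter-run check: two runs that each yield a total order but disagree on one pair.
tau_runs = kendall_tau(["a", "b", "c"], ["a", "c", "b"])
```

In this toy example each run in the inter-run check is internally transitive, so its coefficient of consistency would be 1, yet the two runs still disagree on one pair and tau falls below 1. A single number from either measure would therefore not reveal the other kind of inconsistency, which is one way to read the abstract's claim that the two measures are independently valuable.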