Can LLM Rerankers Predict Their Own Ranking Performance?

2026-06-02 | Information Retrieval

Information Retrieval, Computation and Language
AI summary

The authors explore whether large language model (LLM) rerankers can estimate the quality of the rankings they have just produced, without relying on an external predictor applied after retrieval or reranking. They first test training-free methods: measuring how consistent the model's rankings are across repeated samples, and asking the model to state its confidence directly. On TREC Deep Learning 2019-2022 with four LLMs, self-consistency is competitive with the state-of-the-art approach and better calibrated in almost all settings, while directly verbalized confidence is severely overconfident. To fix this, they propose two supervised methods, Verb-Num and Verb-List, that let the reranker produce calibrated quality estimates with only a few additional output tokens.

Query Performance Prediction (QPP), LLM reranker, ranking quality estimation, training-free methods, self-consistency, verbalized confidence, calibration, TREC Deep Learning, supervised learning, ranking evaluation
Authors
Shiyu Ni, Keping Bi, Jiafeng Guo, Jingtong Wu, Zengxin Han, Xueqi Cheng
Abstract
Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019-2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
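
The idea of self-consistency across sampled rankings can be illustrated with a minimal sketch. This is not the authors' metric-specific formulation, which the abstract does not spell out. It assumes a simpler, order-insensitive proxy: the mean pairwise top-k overlap between rankings sampled for the same query, where higher agreement is read as a signal of higher ranking quality. The function names and document IDs are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def topk_overlap(a: Sequence[str], b: Sequence[str], k: int) -> float:
    """Fraction of the top-k documents shared by two rankings (order ignored)."""
    return len(set(a[:k]) & set(b[:k])) / k


def self_consistency(rankings: Sequence[Sequence[str]], k: int = 10) -> float:
    """Mean pairwise top-k overlap across sampled rankings, in [0, 1].

    Assumption: agreement among samples stands in for ranking quality.
    The paper's metric-specific variant may be defined differently.
    """
    pairs = list(combinations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two sampled rankings")
    return sum(topk_overlap(a, b, k) for a, b in pairs) / len(pairs)


# Hypothetical document IDs: three rankings sampled for one query.
samples = [
    ["d1", "d2", "d3", "d4"],
    ["d1", "d3", "d2", "d5"],
    ["d2", "d1", "d3", "d4"],
]
score_at_3 = self_consistency(samples, k=3)  # all three top-3 sets coincide
score_at_4 = self_consistency(samples, k=4)  # the second sample swaps d4 for d5
```

Because the overlap ignores order within the top k, this proxy cannot distinguish rankings that agree on the set but disagree on positions. A rank-aware agreement measure would be closer in spirit to a metric such as NDCG.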