LLM-as-a-Judge for Reliable and Explainable Offline Evaluation in Top-K Recommendation
2026-06-22 • Information Retrieval
AI summary
The authors point out that current methods for testing recommendation systems often rely on guesses about what users like, based only on some of their past actions, which can be misleading and hard to understand. They suggest using large language models (LLMs) to better judge recommendations by interpreting users' written feedback instead of just matching exact items. This new method explains why it gives certain scores by providing clear reasons, making the evaluation easier to trust and understand. Their tests show that this approach is both more reliable and more transparent than traditional methods.
Recommendation Evaluation, Top-K Metrics, Offline Evaluation, User Feedback, Bias in Recommendations, Large Language Models (LLMs), Semantic Matching, Explainability, Relevance Judgments, Black-box Evaluation
Authors
Yue Que, Junyi Zhou, Xiaokun Zhang, Haiming Jin, Qiao Xiang, Chen Ma
Abstract
Recommendation evaluation plays a crucial role in guiding the refinement and deployment of recommender systems. Most existing work relies on offline evaluation using Top-K metrics computed over holdout user behaviors. However, we identify two fundamental limitations that undermine the ability of these metrics to deliver reliable and explainable evaluations. Regarding reliability, offline evaluation treats observed user feedback as a proxy for true preferences and enforces rigid ID matching between that proxy and the recommendations. In practice, feedback collections are inherently shaped by incomplete and biased item exposure, leading to distorted and unreliable assessments. Regarding explainability, Top-K metrics produce only numerical scores without offering meaningful insights to support them, thereby reinforcing the black-box nature of offline evaluation. In this paper, we propose a reliable and explainable LLM-as-a-Judge framework for offline recommendation evaluation. To enhance reliability, we introduce a semantic proxy, derived from users' textual behaviors, to represent their true preferences. This proxy allows for more flexible matching between preferences and recommendations in the semantic space, rather than depending on the holdout feedback. To ensure explainability, the LLM Judge adopts a reasoning-then-scoring process to generate relevance judgments along with explicit rationales. Finally, we aggregate the individual scores into global Top-K metrics to quantify overall recommendation quality, and provide a justification for each preference hit or miss. Extensive experiments demonstrate that the LLM Judge achieves solid reliability, explainability, and robustness in evaluation.
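The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Judgment` record, the `[0, 1]` score scale, the `0.5` hit threshold, the NDCG normalization over the judged list, and the keyword-overlap `keyword_stub_judge` (a stand-in for a real LLM call) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from math import log2
from typing import Callable, List, Tuple


@dataclass
class Judgment:
    item_id: str
    score: float    # relevance in [0, 1]; the scale is an assumption
    rationale: str  # explanation emitted alongside the score


# A judge maps (preference text, item id, item text) to a Judgment.
Judge = Callable[[str, str, str], Judgment]


def keyword_stub_judge(preference_text: str, item_id: str, item_text: str) -> Judgment:
    """Hypothetical stand-in for an LLM call: scores by word overlap."""
    pref_words = set(preference_text.lower().split())
    item_words = set(item_text.lower().split())
    overlap = pref_words & item_words
    score = len(overlap) / len(item_words) if item_words else 0.0
    rationale = f"shared terms: {sorted(overlap)}" if overlap else "no shared terms"
    return Judgment(item_id, score, rationale)


def judge_top_k(preference_text: str, ranked_items: List[Tuple[str, str]],
                judge: Judge, k: int) -> List[Judgment]:
    """Judge each of the first k recommended items against the semantic proxy."""
    return [judge(preference_text, iid, text) for iid, text in ranked_items[:k]]


def hit_at_k(judgments: List[Judgment], threshold: float = 0.5) -> float:
    """1.0 if any judged item reaches the (assumed) relevance threshold."""
    return 1.0 if any(j.score >= threshold for j in judgments) else 0.0


def ndcg_at_k(judgments: List[Judgment]) -> float:
    """Graded NDCG, normalized by the best ordering of the judged items only."""
    dcg = sum(j.score / log2(rank + 2) for rank, j in enumerate(judgments))
    ideal = sorted((j.score for j in judgments), reverse=True)
    idcg = sum(s / log2(rank + 2) for rank, s in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    preference = "I love science fiction and space novel series"
    ranked = [
        ("i1", "space opera novel"),
        ("i2", "cooking recipes"),
        ("i3", "hard science fiction novel"),
    ]
    judged = judge_top_k(preference, ranked, keyword_stub_judge, k=3)
    for j in judged:
        print(j.item_id, round(j.score, 3), j.rationale)
    print("Hit@3:", hit_at_k(judged), "NDCG@3:", round(ndcg_at_k(judged), 3))
```

Averaging `hit_at_k` and `ndcg_at_k` over all users would give global Top-K figures, while the per-item `rationale` strings supply the justification for each hit or miss. Replacing `keyword_stub_judge` with a function that prompts an LLM to reason first and score second would match the reasoning-then-scoring process the abstract describes.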