LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

2026-05-25 | Computation and Language

Computation and Language, Computers and Society, Emerging Technologies
AI summary

The authors tested how well large language models (LLMs) can act as academic reviewers by comparing their reviews with human reviews of nearly 900 research papers. They found that LLMs tend to give higher scores to weaker papers and emphasize different topics than human reviewers, flagging clarity issues less often and reproducibility issues more often. The LLM-generated reviews were longer and used less varied language. They also showed that simple hidden instructions embedded in a paper can trick the models into giving low-scoring papers acceptance-level ratings. The authors suggest that while LLMs can help structure reviews, safeguards are needed against both built-in biases and manipulation.

large language models, peer review, rating calibration, prompt injection, NeurIPS, ICLR, reproducibility, clarity, adversarial attacks, lexical diversity
Authors
Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu
Abstract
Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.
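
To make two of the measured quantities concrete, the sketch below shows one plausible way to compute a rating gap per human-score stratum and a simple lexical diversity score. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the metrics, so the function names `rating_gap_by_bin` and `type_token_ratio`, the binning scheme, the whitespace tokenization, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rating_gap_by_bin(human_scores, llm_scores, bin_edges):
    """Mean (LLM - human) rating gap per human-score bin.

    bin_edges is an ascending list of upper bounds; each paper is placed in
    the first bin whose upper bound is >= its human score. This binning is
    an assumption made for illustration only.
    """
    gaps = defaultdict(list)
    for human, llm in zip(human_scores, llm_scores):
        for edge in bin_edges:
            if human <= edge:
                gaps[edge].append(llm - human)
                break
    return {edge: mean(values) for edge, values in gaps.items()}


def type_token_ratio(text):
    """Unique tokens divided by total tokens (whitespace tokenization).

    A crude lexical diversity proxy; lower values mean more repetition.
    """
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    # Toy data, invented for this example: four papers on a 1-10 scale.
    human = [3, 4, 6, 8]
    llm = [6, 6, 7, 8]
    # Bins: human score <= 4 ("weaker") and human score <= 10 ("stronger").
    print(rating_gap_by_bin(human, llm, bin_edges=[4, 10]))
    print(type_token_ratio("the paper is clear and the paper is sound"))
```

On the toy data, a larger positive gap in the lower bin than in the upper bin would correspond to the overrating of weaker submissions that the abstract describes. Type-token ratio is sensitive to text length, so comparing reviews of very different lengths would call for a length-normalized variant.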