When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews

2026-05-11

Computation and Language · Artificial Intelligence
AI summary

The authors address the problem that scientific peer reviews often contain disagreements that are hard to resolve, especially as the number of submissions and reviews grows. They propose a new way to analyze these disagreements: examining whole reviews rather than isolated sentences, identifying the specific spans of text that contradict each other, and scoring how strong each disagreement is. They also built an expert-annotated dataset, RevCI, to train and evaluate this approach. Their main system, IMPACT, uses multiple cooperating steps to find and judge contradictions, while a smaller distilled model, TIDE, does the same job faster with good accuracy. Their methods outperform previous approaches at detecting and rating reviewer contradictions.
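To make the annotation unit the summary describes concrete, here is a hypothetical RevCI-style record in Python. The field names, the example review text, and the 0-1 intensity scale are all illustrative assumptions; the paper does not publish the dataset schema here.

# A hypothetical RevCI-style record. The schema and the 0-1 intensity
# scale are assumptions for illustration, not the published format.
example = {
    "review_a": "The method is novel and the ablations are thorough.",
    "review_b": "This is a minor variant of prior work, and key ablations are missing.",
    "contradictions": [
        {
            "aspect": "novelty",
            "evidence_a": "The method is novel",
            "evidence_b": "a minor variant of prior work",
            "intensity": 0.8,  # graded disagreement intensity
        },
        {
            "aspect": "experiments",
            "evidence_a": "the ablations are thorough",
            "evidence_b": "key ablations are missing",
            "intensity": 0.9,
        },
    ],
}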

Keywords
peer review, reviewer contradiction, evidence extraction, disagreement intensity, benchmark dataset, multi-agent framework, language model, model distillation, scientific publishing, conflict detection
Authors
Sandeep Kumar, Yash Kamdar, Abid Hossain, Bharti Kumari, Tanik Saikh, Asif Ekbal
Abstract
Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review-level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine-grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert-annotated benchmark of peer-review pairs with evidence-level contradiction annotations and graded intensity labels. We further propose IMPACT, a structured multi-agent framework that integrates aspect-conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.
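As a rough illustration of how the abstract's three stages could fit together, the sketch below wires aspect-conditioned evidence extraction, deliberative reasoning, and adjudication into one loop. Everything concrete here is assumed: the llm callable, the prompts, the aspect list, and the 0-1 intensity scale are placeholders, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # placeholder: any text-in, text-out model call

@dataclass
class ContradictionResult:
    aspect: str            # review aspect under which the conflict arises
    evidence_a: List[str]  # evidence spans extracted from review A
    evidence_b: List[str]  # evidence spans extracted from review B
    intensity: float       # graded disagreement intensity (assumed 0-1)

def impact_sketch(review_a: str, review_b: str,
                  aspects: List[str], llm: LLM) -> List[ContradictionResult]:
    """Toy orchestration of IMPACT's three stages as the abstract
    describes them; prompts and output parsing are placeholders."""
    results: List[ContradictionResult] = []
    for aspect in aspects:
        # Stage 1: aspect-conditioned evidence extraction from each review.
        spans_a = [s for s in llm(
            f"List spans about '{aspect}' in:\n{review_a}").splitlines() if s]
        spans_b = [s for s in llm(
            f"List spans about '{aspect}' in:\n{review_b}").splitlines() if s]
        if not spans_a or not spans_b:
            continue  # no paired evidence, so nothing to adjudicate
        # Stage 2: deliberative reasoning over the paired evidence.
        deliberation = llm(
            f"Do these spans conflict on '{aspect}'? Argue both sides.\n"
            f"A: {spans_a}\nB: {spans_b}")
        # Stage 3: adjudication yields a graded intensity score.
        verdict = llm(
            f"Given this deliberation, output a disagreement intensity "
            f"in [0, 1]:\n{deliberation}")
        try:
            intensity = float(verdict.strip())
        except ValueError:
            intensity = 0.0  # unparseable verdict: treat as no contradiction
        if intensity > 0:
            results.append(
                ContradictionResult(aspect, spans_a, spans_b, intensity))
    return results

TIDE, as described, would replace this multi-call loop with a single forward pass of a small model distilled from IMPACT's outputs, trading some accuracy headroom for much lower inference cost.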