LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews

UKP Lab, TU Darmstadt¹ · RMIT, Melbourne² · University of Hamburg³ · Monash University, Australia⁴
2025

Abstract

Peer review is a cornerstone of quality control in scientific publishing. With increasing workloads, the unintended use of ‘quick’ heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LAZYREVIEW, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset boosts performance by 10–20 points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community.

Motivation


  • Peer review is vital for evaluating scientific work, but increasing workloads often lead reviewers to rely on mental shortcuts.
  • The figure shows a reviewer dismissing a paper as not an “eye-opener” without citing prior work or offering constructive feedback; we refer to such heuristics as “lazy thinking”.
  • According to the ACL 2023 report, 24.3% of author-reported issues pertained to lazy thinking.

Contributions

  • 📊 We introduce LazyReview, a novel dataset comprising 500 expert-annotated and 1,276 silver-annotated review segments, categorized into fine-grained lazy thinking classes.
  • We introduce two novel tasks: coarse-grained lazy thinking detection (lazy or not lazy) and fine-grained lazy thinking detection (e.g., “the results are not novel”); see the sketch after this list.
  • We show that feedback based on lazy thinking annotations significantly improves review writing.
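
The snippet below sketches how the two tasks could be framed as zero-shot prompts to an instruction-tuned LLM. The prompt wording, the label lists, and the query_llm helper are illustrative assumptions, not the paper's exact prompts or taxonomy.

    # Illustrative sketch of the two detection tasks as zero-shot LLM prompts.
    # Label lists, prompt wording, and query_llm() are hypothetical placeholders.

    COARSE_LABELS = ["lazy thinking", "not lazy thinking"]
    FINE_LABELS = [
        "The results are not novel",
        "The authors could also do extra experiment X",
        # ... remaining fine-grained classes from the taxonomy
    ]

    def build_prompt(review_sentence: str, labels: list[str]) -> str:
        """Format a review sentence and a label set into a classification prompt."""
        options = "\n".join(f"- {label}" for label in labels)
        return (
            "You are assessing a peer-review sentence for lazy thinking heuristics.\n"
            f"Sentence: {review_sentence}\n"
            f"Choose exactly one label from:\n{options}\nAnswer:"
        )

    def classify(review_sentence: str, labels: list[str], query_llm) -> str:
        """query_llm is any callable that maps a prompt string to the model's text output."""
        answer = query_llm(build_prompt(review_sentence, labels)).strip()
        # Fall back to the first label if the model's answer is not an exact match.
        return next((l for l in labels if l.lower() in answer.lower()), labels[0])

    # Coarse-grained task: classify(sentence, COARSE_LABELS, query_llm)
    # Fine-grained task:   classify(sentence, FINE_LABELS, query_llm)
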

Dataset Analysis

Out of the 18 lazy thinking classes, the one corresponding to "The authors could also do extra experiment X" is the most frequent. This is unsurprising given the rapid pace of the field and the gradual evolution of NLP into ML.
(Figure: Distribution of classes in our dataset.)
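
As a rough illustration, a class distribution like the one in the figure could be tallied as follows; the record structure and label strings are hypothetical, not the released dataset's schema.

    from collections import Counter

    # Hypothetical annotated segments; field names and labels are illustrative only.
    annotations = [
        {"segment": "The paper is not an eye-opener.", "label": "The results are not novel"},
        {"segment": "Why not also test on dataset X?", "label": "The authors could also do extra experiment X"},
        # ...
    ]

    # Count how often each fine-grained lazy thinking class occurs.
    label_counts = Counter(record["label"] for record in annotations)
    for label, count in label_counts.most_common():
        print(f"{count:4d}  {label}")
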

Findings

  • LLMs are good at detecting coarse-grained lazy thinking but underperform on fine-grained detection. (Figure: comparison of LLMs on the two tasks.)
  • Reviews rewritten with lazy thinking signals (Lazy re-written) achieve a higher win rate against the original reviews in terms of Constructiveness (Constr.), Justification (Justi.), and Adherence (Adh.); a sketch of the win-rate computation follows this list. (Figure: human evaluation of rewritten reviews on Constructiveness, Justification, and Adherence.)
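
A minimal sketch of how such win rates could be computed from pairwise human judgments; the judgment format and criterion names are assumptions for illustration, not the study's exact protocol.

    # Each judgment records which review a human preferred for a given criterion.
    # "rewritten" = lazy-thinking-informed rewrite, "original" = unmodified review.
    judgments = [
        ("Constructiveness", "rewritten"),
        ("Constructiveness", "original"),
        ("Justification", "rewritten"),
        ("Adherence", "rewritten"),
        # ...
    ]

    def win_rates(judgments):
        """Fraction of pairwise comparisons won by the rewritten review, per criterion."""
        totals, wins = {}, {}
        for criterion, winner in judgments:
            totals[criterion] = totals.get(criterion, 0) + 1
            wins[criterion] = wins.get(criterion, 0) + (winner == "rewritten")
        return {c: wins[c] / totals[c] for c in totals}

    print(win_rates(judgments))
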

Bibtex

    @misc{purkayastha2025lazyreviewdatasetuncoveringlazy,
          title={LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews},
          author={Sukannya Purkayastha and Zhuang Li and Anne Lauscher and Lizhen Qu and Iryna Gurevych},
          year={2025},
          eprint={2504.11042},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2504.11042}, 
    }