Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych
¹UKP Lab, TU Darmstadt and Hessian Center for AI (hessian.AI)
²MBZUAI, ³The Allen Institute for AI (AI2)
arXiv 2025

🚨 The Peer Review System is Collapsing


  • 25,000: NeurIPS 2025 submissions, a 60x increase since 2010 [1]
  • 29%: annual submission growth rate since 2017 (CAGR) [1]
  • 1M: projected submissions per year by 2040 at the current rate [1]

🔥 NeurIPS 2021 Consistency Experiment: Two independent committees disagreed on 23% of identical papers, revealing deep reliability issues beyond just capacity constraints [2].

Why Novelty Assessment is Particularly Broken

With 50+ person-years of work needed just for NeurIPS 2025 reviews (assuming 2 hours per paper) [1], overwhelmed reviewers resort to superficial analyses, producing vague feedback like "not novel enough" without clear justification. Novelty assessment is knowledge-intensive: it demands comprehensive awareness of related work and a precise distinction between meaningful innovations and incremental modifications, a task that only gets harder as roughly 20,000 papers face rejection [1].
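
As a rough back-of-the-envelope check on that workload figure, the order of magnitude is easy to reproduce if the "2 hours" is read as time per review with several reviews per submission. The numbers below are illustrative assumptions, not values taken from [1]:

# Rough reviewer-workload estimate for NeurIPS 2025 (illustrative assumptions only).
submissions = 25_000          # reported submission count [1]
reviews_per_paper = 4         # assumption: typical number of reviews per submission
hours_per_review = 2          # assumption: the "2 hours" above, read as time per review
work_hours_per_year = 2_000   # assumption: ~50 weeks x 40 hours

total_hours = submissions * reviews_per_paper * hours_per_review
person_years = total_hours / work_hours_per_year
print(f"{total_hours:,} review hours = about {person_years:.0f} person-years")
# prints: 200,000 review hours = about 100 person-years, comfortably above 50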

[1] Submission Tsunami at NeurIPS 2025: Is Peer Review About to Collapse?
[2] Beygelzimer et al., 2023. Has the Machine Learning Review Process Become More Arbitrary as the Field Has Grown? The NeurIPS 2021 Consistency Experiment.

Abstract

Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise.

๐Ÿ“ Real Example: How Our System Performs

Side-by-side comparison of novelty assessments on actual ICLR 2025 submissions



Key Insight: Our system consistently aligns better with human expert assessments, correctly identifying incremental contributions and citing specific prior work, while baselines often overstate novelty or miss critical context.

🚀 Our Solution: Evidence-Based Novelty Assessment


Pipeline overview

We built a three-stage pipeline that mimics expert reviewer behavior; an illustrative code sketch follows each stage below:


Document Processing

  • 📄 Extract title, abstract, citations
  • 🔍 Parse introduction with GROBID
  • 📍 Identify citation contexts

Related Work Discovery

  • 🎯 Generate search keywords with LLMs
  • 📚 Retrieve papers via Semantic Scholar
  • 🏆 Rank with SPECTER2 + RankGPT

Novelty Assessment

  • 💡 Extract novelty claims
  • 🔬 Analyze research landscape
  • ⚖️ Compare with cited evidence

📊 How Well Does It Work?


Performance comparison

Evaluated on 182 ICLR 2025 submissions:

  • Alignment with human reasoning: 86.5%
  • Agreement on novelty conclusions: 75.3%
  • 🏆 vs OpenReviewer: 74% wins, 6% losses
  • 🏆 vs DeepReviewer: 39% wins, 26% losses

Category breakdown
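
For concreteness, agreement and pairwise win/loss percentages of this kind can be computed as below; this is an illustrative sketch, and the label set and judging protocol are assumptions rather than the paper's evaluation code:

# Illustrative computation of an agreement rate and pairwise win/loss rates.
def agreement_rate(system_labels, human_labels):
    matches = sum(s == h for s, h in zip(system_labels, human_labels))
    return 100.0 * matches / len(human_labels)

def win_loss(preferences):
    # preferences[i] is "ours", "baseline", or "tie", judged against the human review for paper i
    n = len(preferences)
    return (100.0 * preferences.count("ours") / n,
            100.0 * preferences.count("baseline") / n)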

🎯 Key Insights from Human Analysis


๐Ÿ“ What Reviewers Actually Do

Analyzed 182 human reviews to understand:

  • Claim verification patterns
  • Evidence citation practices
  • Reasoning structures

🔬 Human-Informed Design

Our pipeline incorporates:

  • Structured prompting strategies
  • Targeted content extraction
  • Multi-step verification

💪 Why It Works

Success factors:

  • Literature-aware analysis
  • Evidence-based reasoning
  • Systematic evaluation

🚀 Ready to Transform Peer Review?

Join us in building more rigorous, evidence-based scholarly critique

BibTeX

@misc{afzal2025notnovelenoughenriching,
      title={Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback}, 
      author={Osama Mohammed Afzal and Preslav Nakov and Tom Hope and Iryna Gurevych},
      year={2025},
      eprint={2508.10795},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.10795}, 
}