Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

¹UKP Lab, TU Darmstadt and Hessian Center for AI (hessian.AI)
²MBZUAI, ³The Allen Institute for AI (AI2)
ArXiv 2025

25,000

NeurIPS 2025 submissions
60x increase since 2010 ^[1]

29%

Annual growth rate
since 2017 (CAGR) ^[1]

Projected submissions/year
by 2040 at current rate ^[1]

🔥 NeurIPS 2021 Consistency Experiment: Two independent committees disagreed on 23% of identical papers — revealing deep reliability issues beyond just capacity constraints ^[2].

Why Novelty Assessment is Particularly Broken

With 50+ person-years of work needed just for NeurIPS 2025 reviews (assuming 2 hours per paper) ^[1], overwhelmed reviewers resort to superficial analyses, producing vague feedback like "not novel enough" without clear justification. This knowledge-intensive process demands comprehensive awareness of related work and precise distinction between meaningful innovations and incremental modifications—a task that becomes exponentially more difficult as ~20,000 papers face rejection ^[1].

[1] Submission Tsunami at NeurIPS 2025: Is Peer Review About to Collapse?
[2] Beygelzimer et al., 2023. The Machine Learning Review Process at NeurIPS

Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions—substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise.

We built a three-stage pipeline that mimics expert reviewer behavior:

Document Processing

📄 Extract title, abstract, citations
🔍 Parse introduction with GROBID
📝 Identify citation contexts

Related Work Discovery

🎯 Generate search keywords with LLMs
📚 Retrieve papers via Semantic Scholar
🏆 Rank with SPECTER2 + RankGPT

Novelty Assessment

💡 Extract novelty claims
🔬 Analyze research landscape
⚖️ Compare with cited evidence

Evaluated on 182 ICLR 2025 submissions:

Human Reasoning

86.5%

Agreement Rate

75.3%

🏆 vs OpenReviewer:
74% wins, 6% losses

🏆 vs DeepReviewer:
39% wins, 26% losses

📝 What Reviewers Actually Do

Analyzed 182 human reviews to understand:

Claim verification patterns
Evidence citation practices
Reasoning structures

🔬 Human-Informed Design

Our pipeline incorporates:

Structured prompting strategies
Targeted content extraction
Multi-step verification

💪 Why It Works

Success factors:

Literature-aware analysis
Evidence-based reasoning
Systematic evaluation

BibTeX

@misc{afzal2025notnovelenoughenriching,
      title={Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback}, 
      author={Osama Mohammed Afzal and Preslav Nakov and Tom Hope and Iryna Gurevych},
      year={2025},
      eprint={2508.10795},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.10795}, 
}

Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

🚨 The Peer Review System is Collapsing

Why Novelty Assessment is Particularly Broken

Abstract

📝 Real Example: How Our System Performs

🚀 Our Solution: Evidence-Based Novelty Assessment