PeerQA: A Scientific Question Answering Dataset from Peer Reviews

NAACL 2025 · Outstanding Paper Award
1 UKP Lab, TU Darmstadt & hessian.AI · 2 Mohamed bin Zayed University of AI

TL;DR

1. Peer Review — “Can the authors provide an explanation for the significant performance difference?”
2. Question Processing — Clean → Decontextualize → Decompose → Filter
3. Expert Annotations
   - Quality Check: verify question clarity
   - Answerability: answerable from the paper?
   - Evidence: highlight relevant text
   - Free-Form Answer: write the answer

PeerQA is a scientific QA dataset where questions come from real peer reviews and answers are annotated by the paper authors themselves. It contains 579 QA pairs across 208 papers spanning ML, NLP, Geoscience, and Public Health, supporting three tasks: evidence retrieval, answerability classification, and answer generation.

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper.

The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as subsets from other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence Retrieval, Unanswerable Question Classification, and Answer Generation.

We provide a detailed analysis of the collected data and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures.

At a Glance

- 579 labeled QA pairs
- 208 academic papers
- Additional unlabeled questions
- 4 scientific domains: ML, NLP, Geoscience, Public Health

PeerQA Examples

Reviewer Question
“Are the annotators of the test sets native English speakers?”
Evidence from Paper

“All were annotated by three annotators, each of whom was a native speaker of English and a linguist with experience with AMR.”

“A total of nine annotators participated in this evaluation. Six of the annotators were native English speakers; all were either current or former PhD students or professors at our university.”

Author's Answer
“In the pilot, all 3 annotators were native English speakers. For the main evaluation, 6 out of 9 annotators were native English speakers.”
Reviewer Question
“What is the average length of character-chains and non-character-chains?”
Unanswerable
The paper does not contain sufficient information to answer this question.

Three Tasks

Evidence Retrieval

Given a question and a paper, retrieve the sentences or paragraphs containing evidence relevant to answering the question.

Answerability Classification

Determine whether a question can be answered from the information in the paper, or is unanswerable due to insufficient evidence.

Answer Generation

Generate a natural language answer to the question, grounded in the evidence retrieved from the scientific paper.

Results

Evidence Retrieval

Given a question, retrieve relevant passages from the full paper. We evaluate cross-encoder, dense, multi-vector dense, and sparse retrieval models at paragraph and sentence granularity.

Key finding: Simply prepending the paper title to passages (decontextualization) consistently improves retrieval performance across all model architectures.
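As a minimal illustration of this finding (not the paper's actual pipeline), decontextualization here amounts to prepending the title string to each passage; the toy term-overlap scorer below stands in for a real retriever:

```python
def decontextualize(title, passages):
    """Prepend the paper title to every passage (the +Title setting)."""
    return [f"{title}. {p}" for p in passages]

def overlap_score(query, passage):
    """Toy lexical scorer: fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms)

title = "PeerQA: A Scientific Question Answering Dataset from Peer Reviews"
passages = [
    "Answers have been annotated by the original authors.",
    "We report macro-F1 to account for class imbalance.",
]
query = "peer review question answering"

plain = [overlap_score(query, p) for p in passages]
with_title = [overlap_score(query, p) for p in decontextualize(title, passages)]
print(plain, with_title)  # title tokens now match query terms that the passages alone miss
```

Real systems would use one of the retrievers in the table below instead of lexical overlap, but the mechanism is the same: the title supplies paper-level context that individual passages lack.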

Paragraph-Level Retrieval
Model         | Architecture  | MRR    | MRR (+Title) | Recall@10 | R@10 (+Title)
MiniLM-L12-v2 | Cross-Encoder | 0.4723 | 0.4839       | 0.6467    | 0.6709
Dragon+       | Dense         | 0.4657 | 0.4845       | 0.6563    | 0.6817
SPLADEv3      | Sparse        | 0.4536 | 0.4725       | 0.6661    | 0.6851
ColBERTv2     | Multi-Dense   | 0.4368 | 0.4122       | 0.6287    | 0.6371
BM25          | Sparse        | 0.4288 | —            | 0.6388    | —
Contriever-MS | Dense         | 0.4095 | 0.4408       | 0.6160    | 0.6314
GTR-XL        | Dense         | 0.3955 | 0.4142       | 0.5940    | 0.6122
Contriever    | Dense         | 0.3494 | 0.3624       | 0.5567    | 0.5340

Bold = best, underline = runner-up. +Title = decontextualization by prepending the paper title.

Answerability Classification

Given a question and context, classify whether the question is answerable from the paper. We evaluate instruction-tuned LLMs (Llama-3, Mistral, Command-R, GPT-3.5, GPT-4o) with gold passages, RAG top-k, and full text. We report macro-F1 to account for class imbalance (383 answerable vs. 112 unanswerable).

Key finding: All models exhibit systematic bias toward one class, making answerability classification a challenging and unsolved problem on PeerQA. Command-R and GPT-4o provide the best trade-off, achieving the highest macro-F1.

Bias toward “Unanswerable”

Llama and GPT models tend to predict unanswerable, with high recall on that class but low recall on answerable questions.

Bias toward “Answerable”

Mistral and Command-R tend to predict answerable, with high recall there but low recall on unanswerable questions.
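Macro-F1 averages per-class F1, so a biased model that nearly always predicts one class is penalized on the class it ignores. A minimal sketch with hypothetical predictions mimicking the dataset's imbalance:

```python
def per_class_stats(gold, pred, label):
    """Return (recall, F1) for one class."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return rec, f1

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    labels = set(gold)
    return sum(per_class_stats(gold, pred, l)[1] for l in labels) / len(labels)

# Hypothetical labels: mostly answerable, like PeerQA (383 vs. 112)
gold = ["ans"] * 8 + ["unans"] * 2
always_ans = ["ans"] * 10  # a model biased toward "answerable"
print(macro_f1(gold, always_ans))  # far below the 0.8 plain accuracy would suggest
```

The biased model gets perfect recall on the majority class but zero recall (and zero F1) on the minority class, which halves its macro-F1.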

Answer Generation

Given a question and context, generate a free-form answer. We evaluate Llama-3 (8k, 32k), Mistral (32k), Command-R (128k), GPT-3.5 (16k), and GPT-4o (128k) with gold evidence, RAG top-k passages, and the full paper text. We measure ROUGE-L, AlignScore, and Prometheus Correctness (LLM-as-judge, scale 1–5).
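ROUGE-L scores the longest common subsequence (LCS) between a generated answer and the reference. A self-contained sketch of the F-measure (simplified relative to standard ROUGE implementations, which also tokenize and stem):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

score = rouge_l("all annotators were native speakers",
                "all annotators were native English speakers")
print(score)
```

Because LCS ignores the missing word "English" only in precision or recall terms, the near-complete candidate above still scores high, which is why ROUGE-L is complemented with AlignScore and an LLM judge.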

Key finding: Providing the model with top passages from a retriever (RAG) outperforms providing the full paper text, despite LLMs' large context windows. Gold evidence provides an upper bound, and increased retrieval performance correlates with improved generation.

RAG beats Full-Text

Models perform better with top retrieved passages than with the entire paper as context, showing that a retriever filtering relevant information is more effective than relying on the LLM's internal attention.
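A sketch of how the RAG condition assembles its input (the prompt template is a hypothetical stand-in, not the paper's exact prompt):

```python
def build_rag_prompt(question, ranked_passages, k=3):
    """Concatenate the top-k retrieved passages into a grounded QA prompt."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(ranked_passages[:k]))
    return (
        "Answer the question using only the evidence below. "
        "If the evidence is insufficient, say the question is unanswerable.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Are the annotators native English speakers?",
    ["Six of the nine annotators were native English speakers.",
     "All pilot annotators were native speakers.",
     "We use AMR parses in our evaluation.",
     "An unrelated passage about hyperparameters."],
    k=2,
)
print(prompt)
```

Truncating to the top-k passages keeps the context short and relevant, which is exactly the filtering effect the full-text condition lacks.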

GPT-4o: The Exception

GPT-4o is the only model whose performance remains stable as context length grows, suggesting stronger long-context capabilities; it is also the most recent model in the evaluation.

Get Started

Everything you need to use PeerQA in your own research.

Dataset

Available on Hugging Face — 579 QA pairs from 208 papers, ready to load with one line of Python.

Code

Full codebase for data processing, retrieval, classification, and generation experiments on GitHub.

Paper

The full methodology, dataset analysis, and all baselines are described in the ACL Anthology paper.

from datasets import load_dataset

# Load PeerQA from Hugging Face Hub
dataset = load_dataset("UKPLab/PeerQA")

# Access the labeled QA pairs (579 examples)
example = dataset["train"][0]
print(f"Question: {example['question']}")
print(f"Evidence: {example['answer_evidence_sent']}")
print(f"Answer:   {example['free_form_answer']}")

See the GitHub README for full setup instructions, experiment reproduction, and data format documentation.

Citation

If you use PeerQA in your research, please cite our paper:

@inproceedings{baumgartner-etal-2025-peerqa,
    title = {{PeerQA: A Scientific Question Answering Dataset from Peer Reviews}},
    author = "Baumg{\"a}rtner, Tim  and Briscoe, Ted  and Gurevych, Iryna",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas
                 Chapter of the Association for Computational Linguistics: Human
                 Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.22/",
    pages = "508--544",
}