TL;DR
PeerQA is a scientific QA dataset where questions come from real peer reviews and answers are annotated by the paper authors themselves. It contains 579 QA pairs across 208 papers spanning ML, NLP, Geoscience, and Public Health, supporting three tasks: evidence retrieval, answerability classification, and answer generation.
We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions are sourced from peer reviews, in which reviewers raise questions while thoroughly examining a scientific article. Answers have been annotated by the original authors of each paper.
The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as subsets from other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence Retrieval, Unanswerable Question Classification, and Answer Generation.
We provide a detailed analysis of the collected data and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval: even simple decontextualization approaches consistently improve retrieval performance across architectures.
At a Glance
PeerQA Examples
“All were annotated by three annotators, each of whom was a native speaker of English and a linguist with experience with AMR.”
“A total of nine annotators participated in this evaluation. Six of the annotators were native English speakers; all were either current or former PhD students or professors at our university.”
Three Tasks
Evidence Retrieval
Given a question and a paper, retrieve the sentences or paragraphs containing evidence relevant to answering the question.
Answerability Classification
Determine whether a question can be answered from the information in the paper, or whether it is unanswerable due to insufficient evidence.
Answer Generation
Generate a natural language answer to the question, grounded in the evidence retrieved from the scientific paper.
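As a concrete (toy) illustration of the evidence-retrieval task, the sketch below ranks a paper's paragraphs against a question with a minimal, dependency-free BM25 implementation. This is illustrative only; the paper's baselines use standard BM25 and neural retriever implementations.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(question, paragraphs, k1=1.2, b=0.75):
    """Return paragraph indices sorted by BM25 score against the question."""
    docs = [tokenize(p) for p in paragraphs]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(question):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scored.append((score, i))
    return [i for _, i in sorted(scored, key=lambda x: -x[0])]

paragraphs = [
    "We annotate answers with the original authors of each paper.",
    "Three annotators, all native speakers of English, labeled the data.",
    "Training used a single GPU for ten hours.",
]
ranking = bm25_rank("How many annotators labeled the data?", paragraphs)
print(ranking[0])  # the paragraph about annotators (index 1) ranks first
```

In PeerQA, the candidate pool is all paragraphs (or sentences) of a single paper, so ranking quality at small cutoffs (e.g. Recall@10) is what matters.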
Results
Evidence Retrieval
Given a question, retrieve relevant passages from the full paper. We evaluate cross-encoder, dense, multi-vector dense, and sparse retrieval models at paragraph and sentence granularity.
Key finding: Simply prepending the paper title to passages (decontextualization) improves retrieval performance across nearly all models and architectures.
| Model | Architecture | MRR | MRR (+Title) | Recall@10 | R@10 (+Title) |
|---|---|---|---|---|---|
| MiniLM-L12-v2 | Cross-Encoder | **0.4723** | *0.4839* | 0.6467 | 0.6709 |
| Dragon+ | Dense | *0.4657* | **0.4845** | *0.6563* | *0.6817* |
| SPLADEv3 | Sparse | 0.4536 | 0.4725 | **0.6661** | **0.6851** |
| ColBERTv2 | Multi-Vector Dense | 0.4368 | 0.4122 | 0.6287 | 0.6371 |
| BM25 | Sparse | 0.4288 | — | 0.6388 | — |
| Contriever-MS | Dense | 0.4095 | 0.4408 | 0.6160 | 0.6314 |
| GTR-XL | Dense | 0.3955 | 0.4142 | 0.5940 | 0.6122 |
| Contriever | Dense | 0.3494 | 0.3624 | 0.5567 | 0.5340 |

Bold = best, italics = runner-up per column. +Title = decontextualization by prepending the paper title.
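The "+Title" setting corresponds to a one-line preprocessing step applied before indexing or encoding. A minimal sketch (the function name is ours, not from the released code):

```python
def decontextualize(title, passages):
    """Prepend the paper title to every passage before indexing/encoding.

    Passages from the middle of a paper often lack global context
    ("our method", "the model"); prepending the title restores some
    of that context cheaply.
    """
    return [f"{title} {passage}" for passage in passages]

title = "PeerQA: A Scientific Question Answering Dataset from Peer Reviews"
passages = [
    "We source questions from peer reviews.",
    "Answers are annotated by the original authors.",
]
for augmented in decontextualize(title, passages):
    print(augmented)
```

The augmented passages are then fed to whatever retriever is in use (BM25, dense, sparse, or cross-encoder); only the passage text changes, not the model.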
Answerability Classification
Given a question and context, classify whether the question is answerable from the paper. We evaluate instruction-tuned LLMs (Llama-3, Mistral, Command-R, GPT-3.5, GPT-4o) with gold passages, RAG top-k, and full text. We report macro-F1 to account for class imbalance (383 answerable vs. 112 unanswerable).
Key finding: All models exhibit systematic bias toward one class, making answerability classification a challenging and unsolved problem on PeerQA. Command-R and GPT-4o provide the best trade-off, achieving the highest macro-F1.
Bias toward “Unanswerable”
Llama and GPT models tend to predict unanswerable, with high recall on that class but low recall on answerable questions.
Bias toward “Answerable”
Mistral and Command-R tend to predict answerable, with high recall there but low recall on unanswerable questions.
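Macro-F1 is the right metric precisely because of these biases: a degenerate classifier that always predicts the majority class can look fine on accuracy but scores poorly on macro-F1, since each class contributes equally. A self-contained sketch of the computation:

```python
def macro_f1(y_true, y_pred, labels=("answerable", "unanswerable")):
    """Macro-F1: the unweighted mean of per-class F1 scores, so the
    minority class counts as much as the majority class."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# A classifier that always predicts "answerable" gets 80% accuracy on
# this imbalanced toy set, but only ~0.444 macro-F1:
y_true = ["answerable"] * 8 + ["unanswerable"] * 2
y_pred = ["answerable"] * 10
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
```

`sklearn.metrics.f1_score(..., average="macro")` computes the same quantity; the explicit version above just makes the per-class averaging visible.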
Answer Generation
Given a question and context, generate a free-form answer. We evaluate Llama-3 (8k, 32k), Mistral (32k), Command-R (128k), GPT-3.5 (16k), and GPT-4o (128k) with gold evidence, RAG top-k passages, and the full paper text. We measure ROUGE-L, AlignScore, and Prometheus Correctness (LLM-as-judge, scale 1–5).
Key finding: Providing the model with top passages from a retriever (RAG) outperforms providing the full paper text, despite LLMs' large context windows. Gold evidence provides an upper bound, and increased retrieval performance correlates with improved generation.
RAG beats Full-Text
Models perform better with top retrieved passages than with the entire paper as context, showing that filtering relevant information with a retriever is more effective than relying on the LLM to locate it within a long context.
GPT-4o: The Exception
GPT-4o is the only model whose performance remains stable as context length increases, suggesting stronger long-context capabilities; it is also the most recent and most capable model in the evaluation.
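The RAG-vs-full-text comparison comes down to how the generation context is assembled before prompting the LLM. A hedged sketch of that step (the prompt wording below is ours for illustration, not the paper's):

```python
def build_prompt(question, paragraphs, ranked_indices, k=5, full_text=False):
    """Assemble the generation context from either the full paper or the
    top-k retrieved passages (the RAG setting that performed best here)."""
    if full_text:
        selected = paragraphs
    else:
        selected = [paragraphs[i] for i in ranked_indices[:k]]
    context = "\n\n".join(selected)
    return (
        "Answer the question using only the context below. If the context "
        "is insufficient, state that the question is unanswerable.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

paragraphs = ["Intro paragraph.", "Method details.", "Annotator statistics."]
# ranked_indices would come from a retriever; here it is hard-coded.
prompt = build_prompt("How many annotators?", paragraphs,
                      ranked_indices=[2, 1, 0], k=1)
print(prompt)
```

With `full_text=True` the entire paper is concatenated into the context, which is the setting that underperformed RAG for every model except GPT-4o.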
Get Started
Everything you need to use PeerQA in your own research.
Dataset
Available on Hugging Face — 579 QA pairs from 208 papers, ready to load with one line of Python.
Code
Full codebase for data processing, retrieval, classification, and generation experiments on GitHub.
Paper
The full methodology, dataset analysis, and all baselines are described in the ACL Anthology paper.
    from datasets import load_dataset

    # Load PeerQA from Hugging Face Hub
    dataset = load_dataset("UKPLab/PeerQA")

    # Access the labeled QA pairs (579 examples)
    example = dataset["train"][0]
    print(f"Question: {example['question']}")
    print(f"Evidence: {example['answer_evidence_sent']}")
    print(f"Answer: {example['free_form_answer']}")
See the GitHub README for full setup instructions, experiment reproduction, and data format documentation.
Citation
If you use PeerQA in your research, please cite our paper:
    @inproceedings{baumgartner-etal-2025-peerqa,
        title = {{PeerQA: A Scientific Question Answering Dataset from Peer Reviews}},
        author = "Baumg{\"a}rtner, Tim and Briscoe, Ted and Gurevych, Iryna",
        booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas
                     Chapter of the Association for Computational Linguistics: Human
                     Language Technologies (Volume 1: Long Papers)",
        month = apr,
        year = "2025",
        address = "Albuquerque, New Mexico",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2025.naacl-long.22/",
        pages = "508--544",
    }