TL;DR
We present SciCoQA, a benchmark of 635 paper-code discrepancies (92 real-world, 543 synthetic) across AI, Physics, Biology, and more. We evaluate 22 LLMs on detecting these mismatches—even the best models find only 46.7% of real discrepancies, far from reliable scientific quality assurance.
We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases, supporting quality assurance of faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers and, to scale the dataset, propose a synthetic generation method for constructing paper-code discrepancies. We analyze the collected discrepancies in detail and propose a taxonomy of discrepancy types and categories to better understand the mismatches that occur.
In total, our dataset consists of 635 paper-code discrepancies (92 real, 543 synthetic), covering the AI domain from real-world data and extending to Physics, Quantitative Biology, and other computational sciences through synthetic data. Our evaluation of 22 LLMs demonstrates the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best performing models in our evaluation, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world paper-code discrepancies.
How SciCoQA Was Collected
We exclude engineering artifacts that are not paper-code discrepancies:

× Bugs independent of the paper
× Hyperparameter mismatches
× Trivial implementation details

Example of the kind of mismatch SciCoQA targets, an L2 normalization versus an L1 normalization:

```python
return v / np.sqrt(np.sum(v**2))  # L2 normalization
return v / np.sum(np.abs(v))      # L1 normalization
```
Dataset construction pipeline: For real-world data, we source discrepancy candidates from reproducibility papers (via GPT-5) and GitHub issues (via Qwen3) in parallel, manually filter them, then verify and rephrase with GPT-5 and Gemini 3.1 Pro, yielding 92 discrepancies. For synthetic data, we sample scientific codebases and use GPT-5 to generate code modifications that create controlled discrepancies, yielding 543 examples across 6 domains.
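The synthetic-generation step of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, prompt wording, and codebase-sampling structure are our own assumptions.

```python
# Hypothetical sketch of the synthetic-discrepancy step: sample a codebase
# from a scientific domain and build a prompt asking a generator LLM
# (GPT-5 in the paper) to inject a controlled paper-code discrepancy.
import random

def sample_codebase(codebases: dict[str, list[str]]) -> tuple[str, str]:
    """Pick a domain and one of its codebases at random."""
    domain = random.choice(list(codebases))
    return domain, random.choice(codebases[domain])

def build_injection_prompt(paper_excerpt: str, code_snippet: str) -> str:
    """Prompt asking the generator model to modify the code so that it
    semantically conflicts with the paper while remaining runnable."""
    return (
        "Below is an excerpt from a scientific paper and a snippet from "
        "its codebase.\n\n"
        f"PAPER:\n{paper_excerpt}\n\nCODE:\n{code_snippet}\n\n"
        "Modify the code so it introduces a meaningful semantic conflict "
        "with the paper (e.g. a changed formula or protocol), and describe "
        "the discrepancy as: Paper claim / Code implementation / Difference."
    )
```

The prompt output format mirrors the three-component discrepancy descriptions used throughout the dataset.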
Paper-Code Discrepancies
Publishing code is now standard practice, but the availability of code does not guarantee consistency with the scientific text. Implementation details can diverge from their descriptions—from equations that differ in code, to evaluation metrics computed differently than described—introducing unreported performance variations that undermine scientific comparisons.
Definition: Paper-Code Discrepancy
A meaningful semantic conflict between the scientific method described in the publication and its actual implementation in the codebase—a fundamental alteration to the scientific logic, experimental protocol, or mathematical formulation.
These discrepancies manifest as three types, detailed in the taxonomy below. We distinguish them from engineering artifacts: bugs independent of the paper, hyperparameter mismatches, and trivial implementation details.
SciCoQA Examples
The discrepancies in SciCoQA are sourced from GitHub issues and reproducibility papers or generated synthetically. Each discrepancy description has three components: what the Paper claims, what the Code implements, and what the Difference is.
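As an illustrative (not official) schema, the three components of a discrepancy description could be represented like this; the field names and example values are our own, chosen to echo the normalization example above.

```python
# Illustrative schema for a SciCoQA discrepancy description.
# Field names are assumptions, not the dataset's actual column names.
from dataclasses import dataclass

@dataclass
class Discrepancy:
    paper_claim: str    # What the Paper claims
    code_behavior: str  # What the Code implements
    difference: str     # What the Difference is

example = Discrepancy(
    paper_claim="Vectors are L2-normalized before scoring.",
    code_behavior="The code divides by the sum of absolute values (L1 norm).",
    difference="L2 normalization in the paper vs. L1 normalization in code.",
)
```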
Discrepancy Taxonomy
We classify discrepancies along two dimensions: type (how they manifest) and category (what component is affected).
Discrepancy Types
Discrepancy Categories
Distribution by Type
Distribution by Category
Distribution of discrepancy types and categories in SciCoQA.
Evaluation
We benchmark 22 LLMs on detecting paper-code discrepancies in a zero-shot setting. Each model receives the full paper (as markdown) and the codebase, then generates a list of discrepancies. An LLM judge (GPT-OSS 20B) compares each prediction against the ground truth.
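The protocol above can be sketched as a simple recall computation; `judge_match` stands in for the GPT-OSS 20B judge call, and the function shape is our illustration rather than the authors' actual harness.

```python
# Sketch of the zero-shot evaluation: a model emits a list of candidate
# discrepancies for a paper+codebase pair, and an LLM judge decides whether
# each ground-truth discrepancy is matched by some prediction.
from typing import Callable

def recall(
    predictions: list[str],
    ground_truth: list[str],
    judge_match: Callable[[str, str], bool],
) -> float:
    """Fraction of ground-truth discrepancies matched by any prediction."""
    if not ground_truth:
        return 0.0
    found = sum(
        any(judge_match(pred, gt) for pred in predictions)
        for gt in ground_truth
    )
    return found / len(ground_truth)

# Toy usage with an exact-match judge instead of an LLM:
score = recall(["paper says L2, code uses L1"],
               ["paper says L2, code uses L1", "metric computed differently"],
               lambda pred, gt: pred == gt)
```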
Recall by Model
Recall of the 10 top-performing LLMs. Sorted by real-world + synthetic recall.
Key Finding: Even the best models (Gemini 3.1 Pro and GPT-5 Mini) achieve only 46.7% recall on real-world discrepancies—far too low for reliable quality assurance in scientific publishing.
Analysis
Precision Analysis
Manual annotation of top model predictions on 20 NLP and CV papers (129 pooled discrepancies).
| Metric | GPT-5 | Gemini 2.5 Pro | GPT-OSS 20B |
|---|---|---|---|
| True Positives | 66 | 55 | 72 |
| False Positives | 9 | 3 | 31 |
| Precision | 88.0% | 94.6% | 69.9% |
| Recall | 51.2% | 41.1% | 55.8% |
| F1 | 64.7% | 57.3% | 62.1% |
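The table's metrics follow the standard precision/recall/F1 definitions. As a sanity check, assuming the recall denominator is the 129 pooled discrepancies (our assumption), the GPT-5 column can be reproduced:

```python
def precision_recall_f1(tp: int, fp: int, total_positives: int):
    """Standard detection metrics from true/false positive counts."""
    precision = tp / (tp + fp)
    recall = tp / total_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# GPT-5 column: 66 TP, 9 FP, over 129 pooled discrepancies
p, r, f1 = precision_recall_f1(66, 9, 129)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
# → precision=88.0% recall=51.2% f1=64.7%
```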
Error Analysis
Our evaluation and manual error analysis revealed three main failure modes for discrepancy detection: details omitted from the paper, degradation on long-context inputs, and weaker performance on data outside the models' pre-training corpus.
Conclusions
Publishing
46.7% recall means AI can't reliably check alignment—human-in-the-loop verification remains essential.
AI Scientists
Autonomous AI-generated code can't be reliably verified by LLMs alone—human oversight needed.
Paper Omissions
The hardest discrepancy type to detect—details present in code but missing from the paper remain largely invisible to LLMs.
Future Work
Better automated paper-code consistency methods are crucial as AI scales in scientific discovery.
Getting Started
Everything you need to explore the SciCoQA dataset and work on this important problem.
Dataset
Available on Hugging Face for easy access and integration.
Code
The complete codebase for data collection, inference, and evaluation is on GitHub.
Demo
Try discrepancy detection interactively in our Hugging Face Space demo.
```python
from datasets import load_dataset

# Load SciCoQA from the Hugging Face Hub
dataset = load_dataset("UKPLab/scicoqa")

# Access the different subsets
real_data = dataset["real"]            # 92 real-world discrepancies
synthetic_data = dataset["synthetic"]  # 543 synthetic examples
pooled_data = dataset["pooled"]        # 129 pooled annotations

# Explore a discrepancy
example = real_data[0]
print(f"Paper: {example['paper_url_versioned']}")
print(f"Code: {example['code_url_versioned']}")
print(f"Discrepancy: {example['discrepancy_description_gpt']}")
print(f"Type: {example['discrepancy_type']}")
print(f"Category: {example['discrepancy_category']}")
```
Example code for loading and exploring the SciCoQA dataset from Hugging Face.
Citation
If you find this work useful or use the dataset, please consider citing it.
@article{scicoqa-baumgaertner-etal-2026,
title={{SciCoQA: Quality Assurance for Scientific Paper--Code Alignment}},
author={Tim Baumgärtner and Iryna Gurevych},
year={2026},
eprint={2601.12910},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.12910}
}