Abstract
We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases, i.e., cases where the code does not faithfully implement the described method. We construct SciCoQA from GitHub issues and reproducibility papers, and, to scale the dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the collected paper-code discrepancies in detail and propose discrepancy types and categories to better characterize the mismatches that occur in practice.
In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best-performing model in our evaluation, GPT-5, detects only 45.7% of real-world paper-code discrepancies.
Paper-Code Discrepancies
The "reproducibility crisis" in AI and across science casts doubt on the reliability of research. While publishing code is now standard practice, the availability of code does not guarantee consistency with the scientific text. Implementation details can diverge from their descriptions, introducing performance variations that go unreported—from "mathiness" where equations simulate technical depth while actual gains stem from undocumented tricks, to evaluation metrics that differ in implementation, rendering scientific comparisons invalid.
📖 Definition: Paper-Code Discrepancy
A semantic conflict between the scientific method described in the publication and its actual implementation in the codebase, such that the code does not faithfully reproduce the reported method. This mismatch must be meaningful, implying a fundamental alteration to the scientific logic, experimental protocol, or mathematical formulation described in the text.
These discrepancies manifest as three distinct types:
- ✅ Differences — the code implements logic that conflicts with the paper's description (e.g., L1 vs. L2 normalization; see the sketch after these lists)
- ✅ Paper omissions — the code includes critical components missing from the text
- ✅ Code omissions — a step described in the paper is absent from the repository
We distinguish these from engineering artifacts:
- ❌ Bugs — independent of the paper's scientific description
- ❌ Hyperparameter mismatches — if the code supports the paper's settings via configuration files or CLI arguments
- ❌ Trivial implementation details — standard engineering practices typically omitted from scientific descriptions (e.g., adding noise to a denominator for numerical stability)
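To make the "Difference" type concrete, here is a minimal, hypothetical sketch of the L1 vs. L2 normalization example mentioned above (illustrative only, not taken from any SciCoQA instance):
import numpy as np

def normalize_as_described(x):
    # Paper: scale the feature vector to unit L2 norm.
    return x / np.linalg.norm(x, ord=2)

def normalize_as_implemented(x):
    # Code: scales to unit L1 norm instead -- a "Difference" discrepancy.
    return x / np.abs(x).sum()

x = np.array([3.0, 4.0])
print(normalize_as_described(x))    # [0.6 0.8]
print(normalize_as_implemented(x))  # [0.42857143 0.57142857]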
SciCoQA Examples
Our dataset covers 81 real-world discrepancies from Machine Learning, Computer Vision, and NLP domains, plus 530 synthetic examples spanning additional domains including Physics, Quantitative Biology, Statistics, Math, and EE & Systems Science.
The discrepancies in the SciCoQA dataset are sourced from GitHub issues and reproducibility papers and complemented with synthetically generated discrepancies. Each discrepancy description consists of three components: a summary of what the Paper claims, what the Code implements, and what the Difference is. We show summarized examples of each type and category below.
Type: Paper Omission
Category: Algorithm
FiLM: Visual Reasoning with a General Conditioning Layer
Paper: Describes FiLM conditioning using unconstrained affine transforms σ ⊙ f + μ produced by linear layers.
Code: Applies sigmoid activation to both σ and μ, constraining the coefficients to the (0, 1) range.
Difference: The code restricts FiLM coefficients to positive bounded values, unlike the paper's unconstrained coefficients.
Source: Reproducibility Paper
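For illustration, a simplified, hypothetical PyTorch sketch of the two behaviors (class and variable names are ours, not the original repository's):
import torch
import torch.nn as nn

class FiLMAsDescribed(nn.Module):
    def __init__(self, cond_dim: int, feat_dim: int):
        super().__init__()
        self.to_sigma = nn.Linear(cond_dim, feat_dim)
        self.to_mu = nn.Linear(cond_dim, feat_dim)

    def forward(self, f: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Paper: unconstrained affine transform sigma * f + mu.
        return self.to_sigma(cond) * f + self.to_mu(cond)

class FiLMAsImplemented(FiLMAsDescribed):
    def forward(self, f: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Paper omission: both coefficients are additionally squashed to (0, 1).
        return torch.sigmoid(self.to_sigma(cond)) * f + torch.sigmoid(self.to_mu(cond))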
Type: Code Omission
Category: Training
FedNova: Tackling the Objective Inconsistency Problem in Federated Optimization
Paper: Each worker performs τ local updates per round and accumulates gradients across batches.
Code: The worker loop processes only one minibatch and breaks; the multi-step accumulation code is commented out.
Difference: The code performs a single local update (τ=1) instead of accumulating τ updates as described.
Source: Reproducibility Paper
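A hypothetical sketch of the control flow (gradient-accumulation details omitted; model, optimizer, loader, and loss_fn are placeholder arguments, not FedNova's actual code):
def local_round_as_described(model, optimizer, loader, loss_fn, tau):
    # Paper: run exactly tau local updates per communication round.
    for step, (x, y) in enumerate(loader):
        if step >= tau:
            break
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

def local_round_as_implemented(model, optimizer, loader, loss_fn, tau):
    # Code omission: the loop breaks after the first minibatch, so tau is effectively 1.
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        break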
Type: Difference
Category: Loss
Optimal Transport for Explainable Clustering
Paper: Penalizes negative values of ξ using min(ξ, 0)² to enforce admissibility constraints.
Code: Uses torch.clamp(ξ, min=0)², penalizing positive ξ values instead.
Difference: The code penalizes the opposite sign from the paper's definition, penalizing valid constraints instead of violations.
Source: Reproducibility Paper
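A minimal PyTorch sketch contrasting the two penalties (illustrative values; not the repository's code):
import torch

xi = torch.tensor([-0.5, 0.0, 0.8])

# Paper: penalize constraint violations, i.e. negative xi, via min(xi, 0)^2.
penalty_as_described = torch.clamp(xi, max=0.0) ** 2    # tensor([0.2500, 0.0000, 0.0000])

# Code: torch.clamp(xi, min=0)^2 keeps the positive part, penalizing valid xi instead.
penalty_as_implemented = torch.clamp(xi, min=0.0) ** 2  # tensor([0.0000, 0.0000, 0.6400])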
Discrepancy Taxonomy
We categorize discrepancies by type (how they manifest) and category (what component is affected).
Discrepancy Types
Difference: Paper and code describe/implement conflicting approaches
Paper Omission: Implementation detail missing from the paper description
Code Omission: Paper describes something not present in the code
Discrepancy Categories
Algorithm: Differences in step order, operations, core logic
Model: Architectural or initialization difference
Loss: Alterations to loss definitions or terms
Evaluation: Modifications to evaluation logic or metrics
Data: Dataset usage, preprocessing, augmentation, filtering
Training: Changes to the learning process, schedule, optimization
Distribution of discrepancy types and categories in the SciCoQA dataset.
Evaluation
Can AI reliably detect when scientific papers and code don't match? Our evaluation of 21 LLMs reveals a concerning reality: while models can identify some discrepancies, they miss most real-world cases that threaten scientific integrity.
Inference Pipeline
We tested 21 LLMs in a zero-shot setting: given the full-text paper (preprocessed to markdown) and the codebase (reduced to the relevant code files), the model generates a list of discrepancies it finds. Each generated discrepancy is then compared against the ground-truth discrepancies for that paper by an LLM judge, in our case GPT-OSS 20B. Finally, recall is computed by dividing the number of ground-truth discrepancies matched by at least one prediction by the total number of discrepancies in the dataset.
Overview of the inference and evaluation pipeline for SciCoQA.
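A minimal sketch of this recall computation, assuming the LLM judge is wrapped as a boolean judge(prediction, ground_truth) callable (hypothetical names; not the exact pipeline code):
from typing import Callable, Dict, List

def corpus_recall(
    predictions: Dict[str, List[str]],    # paper id -> discrepancies found by the model
    ground_truths: Dict[str, List[str]],  # paper id -> annotated discrepancies
    judge: Callable[[str, str], bool],    # LLM judge: does a prediction match a ground truth?
) -> float:
    matched, total = 0, 0
    for paper_id, golds in ground_truths.items():
        preds = predictions.get(paper_id, [])
        for gold in golds:
            total += 1
            # A ground-truth discrepancy counts as detected if any prediction matches it.
            if any(judge(pred, gold) for pred in preds):
                matched += 1
    return matched / total if total else 0.0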
Main Results
Executing the above pipeline yields the following results.
Recall scores across the 10 top-performing LLMs on real-world and synthetic discrepancies.
💡 Key Finding
Even the best model (GPT-5) achieves only 45.7% recall on real-world discrepancies — far too low for reliable quality assurance in scientific publishing.
Precision Analysis
To investigate precision, we manually annotated the predictions of the top models on 12 NLP papers, verifying for each prediction whether it constitutes a valid paper-code discrepancy. This yielded a total of 67 pooled discrepancies, on which we computed precision and recall for each model.
| | GPT-5 | Gemini 2.5 Pro | GPT-OSS 20B |
|---|---|---|---|
| True Positives | 38 | 29 | 40 |
| False Positives | 8 | 2 | 20 |
| Precision | 82.6% | 93.5% | 66.7% |
| Recall | 56.7% | 43.3% | 59.7% |
| F1 | 67.3% | 59.2% | 63.0% |
Performance on 67 pooled discrepancies from 12 NLP papers
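The table values follow directly from the annotated counts; a short sketch recomputing them (counts and pooled total taken from the table above):
counts = {"GPT-5": (38, 8), "Gemini 2.5 Pro": (29, 2), "GPT-OSS 20B": (40, 20)}  # (TP, FP)
POOLED_TOTAL = 67  # pooled discrepancies from the 12 NLP papers

for model, (tp, fp) in counts.items():
    precision = tp / (tp + fp)
    recall = tp / POOLED_TOTAL
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{model}: P={precision:.1%}, R={recall:.1%}, F1={f1:.1%}")
# GPT-5: P=82.6%, R=56.7%, F1=67.3%
# Gemini 2.5 Pro: P=93.5%, R=43.3%, F1=59.2%
# GPT-OSS 20B: P=66.7%, R=59.7%, F1=63.0%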
Error Analysis
Our evaluation and manual error analysis revealed the following failure modes:
✂️ Paper Omissions
Models miss discrepancies where critical implementation details are present in code but absent from the paper description. These account for the majority of detection failures.
📏 Long Context
Performance drops significantly when analyzing large codebases (100k+ tokens), where models struggle to maintain coherence across extended inputs.
🆕 Recent Papers
Models perform worse on papers published after their training cutoff, showing degradation on "non-contaminated" data outside their pre-training corpus.
What This Means for Science
For Scientific Publishing: The 45.7% recall rate shows that current AI tools cannot reliably ensure paper-code alignment, highlighting the need for improved quality assurance processes in scientific publishing.
For AI-Assisted Research: Autonomous scientific systems and AI research assistants require specialized training to detect subtle implementation discrepancies that could affect experimental reproducibility.
For Peer Review: Human reviewers should prioritize checking code-paper alignment, especially for complex algorithms, loss functions, and evaluation metrics where discrepancies are most likely to occur.
For Future Research: Developing better methods for automated paper-code consistency checking is crucial as AI systems take on larger roles in scientific discovery and validation.
Getting Started
Here we share resources to get started working on this problem and exploring the SciCoQA dataset.
🤗 Dataset
The SciCoQA dataset is available on Hugging Face for convenient access and integration into any project.
💻 Code
The complete codebase for SciCoQA, including data collection, inference and evaluation scripts, is available on GitHub.
📺 Demo
Try out the SciCoQA discrepancy detection system interactively in our Hugging Face Space demo.
from datasets import load_dataset
# Load SciCoQA from HuggingFace Hub
dataset = load_dataset("UKPLab/scicoqa")
# Access the different subsets
real_data = dataset["real"] # subset of 81 real-world discrepancies
synthetic_data = dataset["synthetic"] # subset of 530 synthetic examples
# Explore a discrepancy
example = real_data[0]
print(f"Paper: {example['paper_url_versioned']}") # static URL to the paper
print(f"Code: {example['code_url_versioned']}") # static URL to the code
print(f"Discrepancy: {example['discrepancy_description']}") # description of the discrepancy
print(f"Type: {example['discrepancy_type']}") # type of the discrepancy
print(f"Category: {example['discrepancy_category']}") # category of the discrepancy
Example code for loading and exploring the SciCoQA dataset from Hugging Face.
BibTeX
If you found this work useful or used the dataset, please consider citing it.
@article{scicoqa-baumgaertner-etal-2026,
title={{SciCoQA: Quality Assurance for Scientific Paper--Code Alignment}},
author={Tim Baumgärtner and Iryna Gurevych},
year={2026},
eprint={2601.12910},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.12910}
}