SciCoQA: Quality Assurance for Scientific Paper-Code Alignment

Ubiquitous Knowledge Processing Lab · Technical University of Darmstadt

TL;DR

We present SciCoQA, a benchmark of 635 paper-code discrepancies (92 real-world, 543 synthetic) across AI, Physics, Biology, and more. We evaluate 22 LLMs on detecting these mismatches—even the best models find only 46.7% of real discrepancies, far from reliable scientific quality assurance.

We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases, aimed at ensuring faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and, to scale the dataset, we propose a synthetic generation method for constructing paper-code discrepancies. We analyze the discrepancies in detail and propose discrepancy types and categories to better characterize the mismatches that occur.

In total, our dataset consists of 635 paper-code discrepancies (92 real, 543 synthetic), covering the AI domain from real-world data and extending to Physics, Quantitative Biology, and other computational sciences through synthetic data. Our evaluation of 22 LLMs demonstrates the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best performing models in our evaluation, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world paper-code discrepancies.

635 Paper-Code Discrepancies
6 Scientific Domains
22 LLMs Evaluated
46.7% Best Real-World Recall

How SciCoQA Was Collected

Real-World Data
Reproducibility Papers
Extraction of paper-code discrepancy descriptions from reproducibility papers
via GPT-5
GitHub Issues
Issue classification: identify which issues from scientific repositories report paper-code discrepancies
via Qwen3
Manual Filter
Verify all candidates against the discrepancy definition: a meaningful semantic conflict between the paper's method and the code's implementation.
Remove:
× Bugs independent of the paper
× Hyperparameter mismatches
× Trivial implementation details
Human
Verify & Rephrase
Verification
Confirm each discrepancy actually exists given the original paper, codebase, and source evidence.
Rephrasing
Generate a standardized description: what the paper claims, what the code implements, and the difference.
via GPT-5 & Gemini 3.1 Pro
92
Real
ML · CV · NLP
+
Synthetic Data
Scientific Codebases
Sample repositories linked to arXiv papers with permissive licenses, balanced across CS and non-CS domains
Generate Code Diffs
Given the paper and code, generate code changes that introduce discrepancies. Sample up to 3 per repo.
via GPT-5
Synthetic Paper-Code Discrepancy
def normalize(v):
-   return v / np.sqrt(np.sum(v**2))
+   return v / np.sum(np.abs(v))
Description: norm type changed from Euclidean to Manhattan distance.
543
Synthetic
ML · CV · NLP · Physics · Bio · EE · Stats · Math

Dataset construction pipeline: For real-world data, we source discrepancy candidates from reproducibility papers (via GPT-5) and GitHub issues (via Qwen3) in parallel, manually filter them, then verify and rephrase with GPT-5 and Gemini 3.1 Pro, yielding 92 discrepancies. For synthetic data, we sample scientific codebases and use GPT-5 to generate code modifications that create controlled discrepancies, yielding 543 examples across 6 domains.
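The synthetic branch of this pipeline can be sketched as a simple loop. The function names, the stubbed LLM call, and the record fields below are illustrative stand-ins, not the released pipeline code:

```python
import random

def llm_generate_diff(paper_text, code):
    # Stand-in for the GPT-5 call described above: given the paper and
    # the code, propose a code change that contradicts the paper, plus
    # a description of the introduced discrepancy.
    return {"patched_code": code, "description": "stub discrepancy"}

def generate_synthetic_discrepancies(repos, max_per_repo=3):
    examples = []
    for repo in repos:
        # Sample up to `max_per_repo` discrepancies per repository.
        for _ in range(random.randint(1, max_per_repo)):
            examples.append(llm_generate_diff(repo["paper"], repo["code"]))
    return examples

repos = [{"paper": "…", "code": "def normalize(v): ..."}]
print(len(generate_synthetic_discrepancies(repos)))  # prints 1, 2, or 3
```

In the real pipeline each generated diff is paired with the original paper and codebase, so the discrepancy is controlled and its ground-truth description is known by construction.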

Paper-Code Discrepancies

Publishing code is now standard practice, but the availability of code does not guarantee consistency with the scientific text. Implementation details can diverge from their descriptions—from equations that differ in code, to evaluation metrics computed differently than described—introducing unreported performance variations that undermine scientific comparisons.

Definition: Paper-Code Discrepancy

A meaningful semantic conflict between the scientific method described in the publication and its actual implementation in the codebase—a fundamental alteration to the scientific logic, experimental protocol, or mathematical formulation.

These discrepancies manifest as three types:

Difference
Code implements a logic distinct from the paper (e.g., L1 vs. L2 normalization)
Paper Omission
Code includes critical components missing from the paper description
Code Omission
A step described in the paper is absent from the repository

We distinguish these from engineering artifacts:

Bugs
Independent of the paper's scientific description
Hyperparameters
Configurable via CLI arguments or config files
Trivial Details
Standard engineering practices (e.g., numerical stability)

SciCoQA Examples

Real-World · 92 discrepancies
Machine Learning · Computer Vision · NLP
Synthetic · 543 discrepancies
ML · CV · NLP · EE & Systems · Physics · Quant. Biology · Statistics · Math

The discrepancies in SciCoQA are either sourced from GitHub issues and reproducibility papers or synthetically generated. Each discrepancy description consists of three components: what the Paper claims, what the Code implements, and what the Difference is.

Paper Omission · Algorithm
Paper: Describes FiLM conditioning using unconstrained affine transforms σ ⊙ f + μ produced by linear layers.
Code: Applies a sigmoid activation to both σ and μ, constraining the coefficients to the (0, 1) range.
Difference: Code restricts FiLM coefficients to positive bounded values, unlike the paper's unconstrained coefficients.

Code Omission · Training
Paper: Each worker performs τ local updates per round and accumulates gradients across batches.
Code: The worker loop processes only one minibatch and breaks; the multi-step accumulation code is commented out.
Difference: Code implements a single local update (τ = 1) instead of accumulating τ updates as described.

Difference · Loss
Paper: Penalizes negative values of ξ using min(ξ, 0)² to enforce admissibility constraints.
Code: Uses torch.clamp(ξ, min=0)², penalizing positive ξ values instead.
Difference: Code penalizes the opposite sign from the paper's definition, punishing valid constraints instead of violations.
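The Loss example above hinges on a one-character sign flip. A minimal NumPy sketch (using np.minimum/np.maximum as stand-ins for the torch calls) makes the effect concrete:

```python
import numpy as np

def paper_penalty(xi):
    # Paper: min(xi, 0)^2 — nonzero only for negative xi (violations)
    return np.minimum(xi, 0) ** 2

def code_penalty(xi):
    # Code: torch.clamp(xi, min=0)^2 — NumPy equivalent np.maximum,
    # nonzero only for positive xi (i.e., valid values get penalized)
    return np.maximum(xi, 0) ** 2

xi = np.array([-2.0, 0.0, 2.0])
print(paper_penalty(xi).tolist())  # [4.0, 0.0, 0.0]
print(code_penalty(xi).tolist())   # [0.0, 0.0, 4.0]
```

The two penalties are zero on exactly opposite halves of the real line, so the code version pushes the optimizer in the wrong direction.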

Discrepancy Taxonomy

We classify discrepancies along two dimensions: type (how they manifest) and category (what component is affected).

Discrepancy Types

Difference
Paper and code describe conflicting approaches
Paper Omission
Code includes components missing from the paper
Code Omission
Paper describes a step absent from the code

Discrepancy Categories

Algorithm
Step order, operations, or core logic differs between paper and code
Model
Architectural choices or initialization methods don't match
Loss
Loss function definition, terms, or weighting differs
Evaluation
Evaluation metrics, logic, or test scripts differ between paper and code
Data
Dataset usage, preprocessing, augmentation, or filtering differs
Training
Learning process, schedule, or optimization strategy changed

Distribution by Type

Type             Real     Synthetic
Difference       55.4%    78.1%
Code Omission    13.0%    10.1%
Paper Omission   31.5%    11.8%

Distribution by Category

Category      Real     Synthetic
Algorithm     25.0%    25.5%
Model         12.0%    20.6%
Loss          23.9%    12.6%
Evaluation     9.8%    16.1%
Data          12.0%    13.6%
Training      17.4%    11.5%

Distribution of discrepancy types and categories in SciCoQA, for the real and synthetic subsets.

Evaluation

We benchmark 22 LLMs on detecting paper-code discrepancies in a zero-shot setting. Each model receives the full paper (as markdown) and the codebase, then generates a list of discrepancies. An LLM judge (GPT-OSS 20B) compares each prediction against the ground truth.

Paper + Code (markdown + source files) → LLM (zero-shot inference) → Discrepancies (generated predictions) → LLM Judge (GPT-OSS 20B, match / no match) → Recall
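The final scoring step can be sketched in a few lines, assuming the judge returns one match/no-match verdict per ground-truth discrepancy (a simplification of the actual evaluation code):

```python
def recall(judge_verdicts):
    # judge_verdicts: one boolean per ground-truth discrepancy, True if
    # the judge matched any model prediction against it
    return sum(judge_verdicts) / len(judge_verdicts)

# e.g. a model whose predictions cover 43 of the 92 real discrepancies:
print(round(recall([True] * 43 + [False] * 49), 3))  # 0.467
```

Note that this is a recall-only protocol: a model is not penalized here for extra predictions, which is why precision is analyzed separately below.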

Recall by Model

Model              Real     Synthetic
GPT-5              41.3%    70.0%
GPT-5 Mini         46.7%    64.3%
Gemini 3.1 Pro     46.7%    56.4%
GPT-OSS 20B        42.4%    47.1%
Gemini 2.5 Pro     39.1%    48.4%
GPT-5 Codex        27.2%    48.6%
GPT-OSS 120B       41.3%    44.0%
Gemini 2.5 Flash   34.8%    41.4%
GPT-5 Nano         19.6%    27.4%
Nemotron 49B       23.9%    23.9%
Recall of the 10 top-performing LLMs. Sorted by real-world + synthetic recall.

Key Finding: Even the best models (Gemini 3.1 Pro and GPT-5 Mini) achieve only 46.7% recall on real-world discrepancies—far too low for reliable quality assurance in scientific publishing.

Analysis

Precision Analysis

Manual annotation of top model predictions on 20 NLP and CV papers (129 pooled discrepancies).

Metric            GPT-5    Gemini 2.5 Pro   GPT-OSS 20B
True Positives    66       55               72
False Positives   9        3                31
Precision         88.0%    94.6%            69.9%
Recall            51.2%    41.1%            55.8%
F1                64.7%    57.3%            62.1%
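These metrics follow the standard definitions, assuming recall is computed against the 129 pooled ground-truth discrepancies; the GPT-5 column, for instance, is reproduced by:

```python
def precision_recall_f1(tp, fp, n_gold=129):
    precision = tp / (tp + fp)   # share of predictions that are correct
    recall = tp / n_gold         # share of pooled discrepancies found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(66, 9)   # GPT-5: 66 TP, 9 FP
print(f"{p:.1%} {r:.1%} {f1:.1%}")      # 88.0% 51.2% 64.7%
```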

Error Analysis

Our evaluation and manual error analysis revealed three main failure modes for discrepancy detection:

Paper Omissions
Models miss details present in code but absent from the paper — no textual anchor to match against. Majority of failures.
Long Context
Performance drops on codebases exceeding 100k tokens — models lose coherence across long papers and multi-file repos.
Recent Papers
Worse on papers after the training cutoff — suggests detection on older papers may partly stem from memorization.

Conclusions

Publishing

46.7% recall means AI can't reliably check alignment—human-in-the-loop verification remains essential.

AI Scientists

Autonomous AI-generated code can't be reliably verified by LLMs alone—human oversight needed.

Paper Omissions

The hardest discrepancy type to detect—details present in code but missing from the paper remain largely invisible to LLMs.

Future Work

Better automated paper-code consistency methods are crucial as AI scales in scientific discovery.

Getting Started

Everything you need to explore the SciCoQA dataset and work on this important problem.

Dataset

Available on Hugging Face for easy access and integration.

Code

The complete codebase for data collection, inference, and evaluation is on GitHub.

Demo

Try discrepancy detection interactively in our Hugging Face Space demo.

from datasets import load_dataset

# Load SciCoQA from HuggingFace Hub
dataset = load_dataset("UKPLab/scicoqa")

# Access the different subsets
real_data = dataset["real"]            # 92 real-world discrepancies
synthetic_data = dataset["synthetic"]  # 543 synthetic examples
pooled_data = dataset["pooled"]        # 129 pooled annotations

# Explore a discrepancy
example = real_data[0]
print(f"Paper: {example['paper_url_versioned']}")
print(f"Code: {example['code_url_versioned']}")
print(f"Discrepancy: {example['discrepancy_description_gpt']}")
print(f"Type: {example['discrepancy_type']}")
print(f"Category: {example['discrepancy_category']}")

Example code for loading and exploring the SciCoQA dataset from Hugging Face.

Citation

If you find this work useful or use the dataset, please consider citing it.

@article{scicoqa-baumgaertner-etal-2026,
  title={{SciCoQA: Quality Assurance for Scientific Paper--Code Alignment}},
  author={Tim Baumgärtner and Iryna Gurevych},
  year={2026},
  eprint={2601.12910},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.12910}
}