SciCoQA: Quality Assurance for Scientific Paper-Code Alignment

Ubiquitous Knowledge Processing Lab
Technical University of Darmstadt
Overview of the SciCoQA dataset creation process

Overview of the creation of SciCoQA: We source real-world data from reproducibility papers and GitHub issues. For the former, paper-code discrepancies are extracted from the paper with GPT-5; for the latter, issues are pre-filtered using Qwen3. Next, all candidates are manually filtered to remove any that do not fit our discrepancy definition. Finally, all paper-code discrepancies are verified with GPT-5. For synthetic data, we generate discrepancies using GPT-5 for AI and other computational domains.

Abstract

We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the mismatches that occur.

In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best-performing model in our evaluation, GPT-5, detects only 45.7% of real-world paper-code discrepancies.

Paper-Code Discrepancies

The "reproducibility crisis" in AI and across science casts doubt on the reliability of research. While publishing code is now standard practice, the availability of code does not guarantee consistency with the scientific text. Implementation details can diverge from their descriptions, introducing performance variations that go unreported—from "mathiness" where equations simulate technical depth while actual gains stem from undocumented tricks, to evaluation metrics that differ in implementation, rendering scientific comparisons invalid.

📖 Definition: Paper-Code Discrepancy

A semantic conflict between the scientific method described in the publication and its actual implementation in the codebase, such that the code does not faithfully reproduce the reported method. This mismatch must be meaningful, implying a fundamental alteration to the scientific logic, experimental protocol, or mathematical formulation described in the text.

These discrepancies manifest as three distinct types:

  • Differences — the code implements a logic distinct from the paper's description (e.g., L1 vs. L2 normalization; see the sketch after this list)
  • Paper omissions — the code includes critical components missing from the text
  • Code omissions — a step described in the paper is absent from the repository
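
As a minimal illustration of the first type (a purely hypothetical sketch, not drawn from any paper in the dataset), the snippet below contrasts a paper that describes L2 normalization with code that divides by the L1 norm instead:

import torch

# Paper: "feature vectors are L2-normalized before computing similarities"
def normalize_as_described(features: torch.Tensor) -> torch.Tensor:
    return features / features.norm(p=2, dim=-1, keepdim=True)

# Code as released: divides by the L1 norm, a Difference-type discrepancy
def normalize_as_implemented(features: torch.Tensor) -> torch.Tensor:
    return features / features.abs().sum(dim=-1, keepdim=True)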

We distinguish these from engineering artifacts:

  • Bugs — independent of the paper's scientific description
  • Hyperparameter mismatches — excluded as long as the code supports the paper's settings via configuration files or CLI arguments
  • Trivial implementation details — standard engineering practices typically omitted from scientific descriptions (e.g., adding noise to a denominator for numerical stability)

SciCoQA Examples

Our dataset covers 81 real-world discrepancies from Machine Learning, Computer Vision, and NLP domains, plus 530 synthetic examples spanning additional domains including Physics, Quantitative Biology, Statistics, Math, and EE & Systems Science.

Real-world (81 discrepancies): Machine Learning, Computer Vision, Natural Language Processing
Synthetic (530 discrepancies): ML, CV, NLP, EE & Systems Science, Physics, Quantitative Biology, Statistics, Math

The discrepancies in the SciCoQA dataset are sourced from GitHub issues and reproducibility papers, or are synthetically generated. A discrepancy description consists of three components: a summary of what the Paper claims, what the Code implements, and what the Difference is. We show summarized examples of different types and categories below.

Type: Paper Omission
Category: Algorithm

FiLM: Visual Reasoning with a General Conditioning Layer

Paper: Describes FiLM conditioning using unconstrained affine transforms σ ⊙ f + μ produced by linear layers.
Code: Applies a sigmoid activation to both σ and μ, constraining the coefficients to the (0, 1) range.
Difference: The code restricts the FiLM coefficients to positive, bounded values, unlike the paper's unconstrained coefficients.

Source: Reproducibility Paper
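
The FiLM discrepancy above can be pictured with a short, hypothetical sketch (this module is not the authors' code; the constrain flag toggles between the paper's unconstrained affine transform and the sigmoid-bounded variant found in the repository):

import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, feat_dim: int, constrain: bool = False):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, feat_dim)
        self.to_shift = nn.Linear(cond_dim, feat_dim)
        self.constrain = constrain

    def forward(self, cond: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale(cond), self.to_shift(cond)
        if self.constrain:
            # As implemented: sigmoid bounds both coefficients to (0, 1)
            scale, shift = torch.sigmoid(scale), torch.sigmoid(shift)
        # Paper: unconstrained affine conditioning, scale ⊙ f + shift
        return scale * features + shift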

Type: Code Omission
Category: Training

FedNova: Tackling the Objective Inconsistency Problem in Federated Optimization

Paper: Each worker performs τ local updates per round and accumulates gradients across batches.
Code: The worker loop processes only one minibatch and breaks; the multi-step accumulation code is commented out.
Difference: The code implements a single local update (τ = 1) instead of accumulating τ updates as described.

Source: Reproducibility Paper
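
A schematic sketch of the local-update loop makes the omission concrete; the model, loss, and optimizer below are stand-ins, not FedNova's actual code:

import torch
import torch.nn as nn

def local_round(model: nn.Module, loader, tau: int, lr: float = 0.01, faithful: bool = True):
    # Schematic local training round for a single worker
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for step, (x, y) in enumerate(loader):
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
        if not faithful:
            break                # released code: stops after one minibatch (tau = 1)
        if step + 1 == tau:
            break                # paper: perform tau local updates per round
    return model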

Type: Difference
Category: Loss

Optimal Transport for Explainable Clustering

Paper: Penalizes negative values of ξ using min(ξ, 0)² to enforce admissibility constraints.
Code: Uses torch.clamp(ξ, min=0)², penalizing positive ξ values instead.
Difference: The code penalizes the opposite sign from the paper's definition, penalizing valid constraints instead of violations.

Source: Reproducibility Paper
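
The sign flip in the last example is easy to reproduce with a minimal sketch (toy values, not the authors' code): min(ξ, 0)² corresponds to torch.clamp(ξ, max=0)², whereas the released code uses torch.clamp(ξ, min=0)²:

import torch

xi = torch.tensor([-0.5, 0.2, 1.0])

# Paper: penalize constraint violations, i.e. negative xi, via min(xi, 0)^2
paper_penalty = torch.clamp(xi, max=0).pow(2)   # -> [0.25, 0.00, 0.00]
# Code: clamp(xi, min=0)^2 penalizes positive xi instead
code_penalty = torch.clamp(xi, min=0).pow(2)    # -> [0.00, 0.04, 1.00]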

Discrepancy Taxonomy

We categorize discrepancies by type (how they manifest) and category (what component is affected).

Discrepancy Types

Difference: Paper and code describe/implement conflicting approaches

Paper Omission: Implementation detail missing from the paper description

Code Omission: Paper describes something not present in the code

Discrepancy Categories

Algorithm: Differences in step order, operations, core logic

Model: Architectural or initialization difference

Loss: Alterations to loss definitions or terms

Evaluation: Modifications to evaluation logic or metrics

Data: Dataset usage, preprocessing, augmentation, filtering

Training: Changes to the learning process, schedule, optimization

Distribution of discrepancy types and categories in the SciCoQA dataset, shown per subset.

Evaluation

Can AI reliably detect when scientific papers and code don't match? Our evaluation of 21 LLMs reveals a concerning reality: while models can identify some discrepancies, they miss most real-world cases that threaten scientific integrity.

Inference Pipeline

We tested 21 LLMs in a zero-shot setting: given the full-text paper (preprocessed to markdown format) and the codebase (containing only the relevant code files), the model generates a list of discrepancies it finds. Each generated discrepancy is then compared against the ground-truth discrepancies for that paper by an LLM judge, GPT-OSS 20B in our case. Finally, recall is computed by dividing the number of ground-truth discrepancies matched by a prediction by the total number of discrepancies in the dataset.

Inference pipeline flowchart showing paper excerpt and code snippet as inputs, model inference, generated discrepancy, LLM judge evaluation, and final match/no match output

Overview of the inference and evaluation pipeline for SciCoQA.
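
The pipeline can be summarized with a short sketch; the function names and data layout below are placeholders, not our released code:

def evaluate_recall(papers, generate_discrepancies, judge_matches):
    # papers: iterable of dicts with "paper_text", "code_files", and "gold"
    #         (the list of ground-truth discrepancies for that paper)
    # generate_discrepancies: wraps the evaluated LLM
    # judge_matches: wraps the LLM judge (GPT-OSS 20B in our setup) and returns
    #                the subset of gold discrepancies matched by the predictions
    matched, total = 0, 0
    for paper in papers:
        predictions = generate_discrepancies(paper["paper_text"], paper["code_files"])
        matched += len(judge_matches(predictions, paper["gold"]))
        total += len(paper["gold"])
    return matched / total if total else 0.0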

Main Results

Executing the above pipeline yields the following results:

Recall scores across the 10 top-performing LLMs on real-world and synthetic discrepancies.

💡 Key Finding

Even the best model (GPT-5) achieves only 45.7% recall on real-world discrepancies — far too low for reliable quality assurance in scientific publishing.

Precision Analysis

To investigate precision, we manually annotated predictions from top models on 12 NLP papers. We verified each prediction and judged whether it constitutes a valid paper-code discrepancy. This yielded a total of 67 pooled discrepancies. We then computed precision and recall on this pooled set for each model.

Metric           GPT-5    Gemini 2.5 Pro    GPT-OSS 20B
True Positives   38       29                40
False Positives  8        2                 20
Precision        82.6%    93.5%             66.7%
Recall           56.7%    43.3%             59.7%
F1               67.3%    59.2%             63.0%

Performance on 67 pooled discrepancies from 12 NLP papers
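
The table's metrics follow directly from the raw counts; the short sketch below recomputes them, using the 67 pooled discrepancies as the recall denominator:

pooled_gold = 67
counts = {                      # (true positives, false positives) per model
    "GPT-5": (38, 8),
    "Gemini 2.5 Pro": (29, 2),
    "GPT-OSS 20B": (40, 20),
}
for model, (tp, fp) in counts.items():
    precision = tp / (tp + fp)
    recall = tp / pooled_gold
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{model}: P={precision:.1%}  R={recall:.1%}  F1={f1:.1%}")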

Error Analysis

Our evaluation and manual error analysis revealed the following failure modes:

✂️ Paper Omissions

Models miss discrepancies where critical implementation details are present in code but absent from the paper description. These account for the majority of detection failures.

📏 Long Context

Performance drops significantly when analyzing large codebases (100k+ tokens), where models struggle to maintain coherence across extended inputs.

🆕 Recent Papers

Models perform worse on papers published after their training cutoff, showing degradation on "non-contaminated" data outside their pre-training corpus.

What This Means for Science

For Scientific Publishing: The 45.7% recall rate shows that current AI tools cannot reliably ensure paper-code alignment, highlighting the need for improved quality assurance processes in scientific publishing.

For AI-Assisted Research: Autonomous scientific systems and AI research assistants require specialized training to detect subtle implementation discrepancies that could affect experimental reproducibility.

For Peer Review: Human reviewers should prioritize checking code-paper alignment, especially for complex algorithms, loss functions, and evaluation metrics where discrepancies are most likely to occur.

For Future Research: Developing better methods for automated paper-code consistency checking is crucial as AI systems take on larger roles in scientific discovery and validation.

Getting Started

Here we share resources for getting started on this problem and exploring the SciCoQA dataset.

🤗 Dataset

The SciCoQA dataset is available on Hugging Face for convenient access and integration into any project.

💻 Code

The complete codebase for SciCoQA, including data collection, inference and evaluation scripts, is available on GitHub.

📺 Demo

Try out the SciCoQA discrepancy detection system interactively in our Hugging Face Space demo.

from datasets import load_dataset

# Load SciCoQA from HuggingFace Hub
dataset = load_dataset("UKPLab/scicoqa")

# Access the different subsets
real_data = dataset["real"]            # subset of 81 real-world discrepancies
synthetic_data = dataset["synthetic"]  # subset of 530 synthetic examples

# Explore a discrepancy
example = real_data[0]
print(f"Paper: {example['paper_url_versioned']}")           # static URL to the paper
print(f"Code: {example['code_url_versioned']}")             # static URL to the code
print(f"Discrepancy: {example['discrepancy_description']}") # description of the discrepancy
print(f"Type: {example['discrepancy_type']}")               # type of the discrepancy
print(f"Category: {example['discrepancy_category']}")       # category of the discrepancy

Example code for loading and exploring the SciCoQA dataset from Hugging Face.
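
Building on the snippet above, the subsets can also be sliced by type or category. Note that the exact label strings below (e.g. "Paper Omission") are assumptions; check the dataset card for the canonical values.

from collections import Counter

# Count real-world discrepancies per category (field names as in the snippet above)
print(Counter(real_data["discrepancy_category"]))

# Keep only one discrepancy type (label string assumed, see the dataset card)
paper_omissions = real_data.filter(lambda ex: ex["discrepancy_type"] == "Paper Omission")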

BibTeX

If you found this work useful or used the dataset, please consider citing it.

@article{scicoqa-baumgaertner-etal-2026,
  title={{SciCoQA: Quality Assurance for Scientific Paper--Code Alignment}}, 
  author={Tim Baumgärtner and Iryna Gurevych},
  year={2026},
  eprint={2601.12910},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.12910}
}