EACL 2026 Main Conference

GRITHopper

Decomposition-Free Multi-Hop Dense Retrieval

1UKP Lab · TU Darmstadt
2Cohere

GRITHopper is a state-of-the-art multi-hop dense retriever and the first decoder-based model to perform multi-hop retrieval in an encoder-only fashion, similar to MDR (Xiong et al., 2021) and BeamRetriever (Zhang et al., 2024). Unlike previous approaches that struggle with longer reasoning chains and out-of-distribution data, GRITHopper achieves robust performance by combining dense retrieval with generative training objectives.
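The decomposition-free loop can be illustrated with a toy sketch: instead of splitting the question into sub-queries, the full context (query plus all documents retrieved so far) is re-encoded each hop and matched against the corpus. The `encode` function below is a hash-seeded stand-in for the actual GRITHopper embedder, not the real model.

```python
import numpy as np

def encode(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the GRITHopper embedder: maps any text to a unit vector.
    (Deterministic toy encoder, NOT the real model.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def multi_hop_retrieve(query: str, corpus: list[str], max_hops: int = 4) -> list[str]:
    """Decomposition-free multi-hop retrieval: re-encode the expanded
    context (query + retrieved docs) at every hop."""
    doc_matrix = np.stack([encode(d) for d in corpus])
    context, chain = query, []
    for _ in range(max_hops):
        q_vec = encode(context)                  # one forward pass per hop
        scores = doc_matrix @ q_vec              # dense similarity search
        scores[[corpus.index(d) for d in chain]] = -np.inf  # skip retrieved docs
        best = corpus[int(np.argmax(scores))]
        chain.append(best)
        context = context + " " + best           # expand context, no decomposition
    return chain
```

The key design point is that each hop costs exactly one embedding forward pass; there is no intermediate query-rewriting step.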

How Decomposition-Free Retrieval Works

GRITHopper retrieves documents recursively, expanding its context with each hop:

1. Input: a multi-hop query, e.g. "Where does the body of water by the city that shares a border with Elizabeth Berg's birthplace and Ohio River meet?"
2. Encoding: GRITHopper-7B, a multi-hop dense embedder, encodes the query together with all documents retrieved so far.
3. Search: the resulting embedding is matched against document vectors in embedding space; the top document is returned, appended to the context, and the loop repeats (up to four hops).
MultiHop-RAG Benchmark

Hits@1 on MultiHop-RAG (Tang et al., 2024), comparing GRITHopper-7B (ours) against GRITLM-7B (Muennighoff et al., 2024), BeamRetriever (Zhang et al., NAACL 2024), and the decomposition-based pipelines GPT-4o + GRITLM and Qwen2.5-32B + GRITLM.

Open Retrieval Performance (Hits@1)

| Model | MuSiQue H1 | H2 | H3 | H4 | Avg | HoVer H1 | H2 | H3 | H4 | Avg | ExFever H1 | H2 | H3 | Avg | MoreHopQA* H1 | H2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRITHopper (ours) | 94.25 | 76.13 | 55.45 | 32.10 | 76.42 | 95.86 | 91.56 | 91.69 | 92.31 | 93.88 | 96.88 | 92.20 | 85.38 | 93.02 | 96.96 | 93.92 | 95.44 |
| GRITLM-7B | 91.15 | 57.51 | 22.32 | 5.43 | 60.51 | 95.81 | 88.09 | 83.95 | 88.46 | 91.81 | 91.13 | 54.88 | 17.28 | 63.83 | 98.75 | 95.53 | 97.14 |
| BeamRetriever | 88.75 | 60.70 | 30.73 | 12.84 | 62.80 | 98.04 | 88.96 | 85.96 | 76.92 | 93.42 | - | - | - | - | 97.85 | 93.02 | 95.44 |
| MDR | 81.75 | 45.18 | - | - | 63.47 | 84.77 | 65.69 | - | - | 77.10 | 92.93 | 77.16 | - | 85.13 | 88.73 | 75.58 | 82.16 |
| Decomposition-based (LLM + retriever) | | | | | | | | | | | | | | | | | |
| Qwen2.5-32B + GRITLM | 82.62 | 45.72 | 13.91 | 1.48 | 51.06 | 75.38 | 61.44 | 50.43 | 46.15 | 67.69 | 63.24 | 29.88 | 11.93 | 40.90 | 96.24 | 55.19 | 75.72 |
| GPT-4o + GRITLM | 81.96 | 48.53 | 13.39 | 1.98 | 51.81 | - | - | - | - | - | - | - | - | - | - | - | - |

*MoreHopQA is a zero-shot (out-of-distribution) benchmark. H1-H4 = Hop depth. MultiHop-RAG results shown in graph above.

Key Strengths

Encoder-Only Efficiency

Each retrieval iteration requires only a single forward pass, rather than multiple autoregressive steps.

OOD Robustness

State-of-the-art performance compared to other decomposition-free methods on multiple out-of-distribution benchmarks.

Unified Training

Combines dense retrieval with generative objectives, exploring how post-retrieval generation loss improves dense retrieval.

Self-Stopping

Utilizes generative capabilities via ReAct to control its own state, stopping itself through causal next-token prediction.
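The self-stopping behavior can be sketched as a control loop in which the model's greedy next-token prediction decides whether to keep retrieving or to stop and answer. The action strings and the stopping rule below are illustrative stand-ins, not GRITHopper's exact ReAct vocabulary or decision mechanism.

```python
# Toy sketch of ReAct-style self-stopping (token names are illustrative,
# not the exact strings GRITHopper emits).
def next_action(context: str) -> str:
    """Stand-in for greedy next-token prediction: decide whether to keep
    retrieving. Here the toy rule stops once two evidence docs are present."""
    return "Answer" if context.count("[DOC]") >= 2 else "Retrieve"

def controlled_retrieve(query: str, docs: list[str], max_hops: int = 4) -> list[str]:
    """Retrieval loop whose termination is controlled by the model itself."""
    context, chain = query, []
    for hop in range(max_hops):
        if next_action(context) == "Answer":   # model stops itself
            break
        doc = docs[hop]                        # placeholder for dense search
        chain.append(doc)
        context += " [DOC] " + doc
    return chain
```

Because the stop decision is a single causal next-token prediction on the same model, no separate controller or query-decomposition LLM is needed.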

Quick Start
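A minimal sketch of how inputs might be prepared for iterative retrieval. The template and the commented model-loading call are assumptions (the model id is a placeholder); consult the official repository for the actual interface and prompt format.

```python
# Hypothetical usage with a GRITLM-style embedding API (package and model id
# are placeholders, not confirmed by this page):
#
#   from gritlm import GritLM
#   model = GritLM("<GRITHopper-7B model id>", mode="embedding")
#   vec = model.encode([build_hop_input(query, retrieved_so_far)])

def build_hop_input(query: str, retrieved: list[str]) -> str:
    """Format the query plus previously retrieved passages as the next
    retrieval input (template illustrative, not the exact training format)."""
    parts = [f"Query: {query}"]
    parts += [f"Document {i}: {doc}" for i, doc in enumerate(retrieved, start=1)]
    return "\n".join(parts)
```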

Training GRITHopper

GRITHopper uses a joint training objective combining contrastive learning for embedding similarity and causal language modeling for next-token prediction:

L = L_rep + L_gen

Post-retrieval language modeling refers to predicting tokens that appear after the retrieval chain (e.g., the final answer). By keeping the retrieval sequence identical for both losses and only appending post-retrieval tokens to the generative objective, we ensure any performance gains come from learning what information is useful, not from extra computation or thinking tokens.
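The two terms of the joint objective can be sketched on toy tensors: an InfoNCE-style contrastive loss for L_rep and a cross-entropy next-token loss over the post-retrieval tokens for L_gen. Temperature, dimensions, and vocabulary size below are arbitrary illustration choices, not the paper's hyperparameters.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.05):
    """Contrastive representation loss L_rep (cosine similarities / temperature)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    sims -= sims.max()                      # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

def causal_lm_loss(logits, targets):
    """Generative loss L_gen: mean cross-entropy over post-retrieval tokens only."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Joint objective L = L_rep + L_gen on random toy tensors
rng = np.random.default_rng(0)
anchor, pos = rng.standard_normal(8), rng.standard_normal(8)
negs = rng.standard_normal((4, 8))
logits = rng.standard_normal((5, 32))       # 5 post-retrieval tokens, vocab 32
targets = rng.integers(0, 32, 5)
total = info_nce(anchor, pos, negs) + causal_lm_loss(logits, targets)
```

Note that L_gen here is computed only on the appended post-retrieval tokens, mirroring the setup described above where the retrieval sequence itself is shared by both losses.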

Contrastive: embedding similarity loss. At each hop, the anchor (the query context Q) pulls the positive document D1 closer in embedding space and pushes the hard negative D1_N (drawn from distractors) away.

No Post-Retrieval LM: same sequence for both losses. Input Q D1 D2; output via next-token prediction. We append 'Eval: Relevant' to match sequence length with the other variants. Since it is always the same token, it provides no discriminative signal, isolating whether gains come from actual post-retrieval information rather than extra compute tokens.

+ Answer: post-retrieval answer tokens. Input Q D1 D2; the final answer is appended as a post-retrieval signal. This teaches the model what information leads to correct answers, improving retrieval.

+ Reward: causal negative observation. Input Q D1 Distractor; hard negatives are observed causally with an 'Irrelevant' label. This improves distractor discrimination but can overfit.
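The difference between the variants reduces to which tokens are appended to the shared retrieval sequence for the generative loss. A small sketch, with template strings that are illustrative rather than the exact training format:

```python
def build_target(query: str, docs: list[str], variant: str, answer: str = "") -> str:
    """Build the generative training target per ablation variant.
    The retrieval sequence (query + docs) is identical across variants;
    only the appended post-retrieval tokens differ."""
    seq = query + " " + " ".join(docs)
    if variant == "no_post":    # constant token: length-matched, no signal
        return seq + " Eval: Relevant"
    if variant == "answer":     # post-retrieval answer tokens
        return seq + " Eval: Relevant Answer: " + answer
    if variant == "reward":     # causal negative observation on a distractor
        return seq + " Eval: Irrelevant"
    raise ValueError(f"unknown variant: {variant}")
```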
Ablation Results (Hits@1)

MuSiQue Distractor Setting

| Variant | Hits@1 |
|---|---|
| Answers + Reward | 82.32 |
| Answers Only | 82.08 |
| No Post-Retrieval LM | 80.78 |
| Contrastive Only | 78.02 |

Open Retrieval (avg. 2 seeds)

| Dataset | Ans+Rew | Ans | No Post |
|---|---|---|---|
| MuSiQue | 76.16 | 75.95 | 75.22 |
| ExFever | 87.10 | 91.81 | 89.69 |
| HoVer | 93.34 | 94.29 | 94.36 |
| MultiHop-RAG | 51.74 | 54.03 | 51.13 |
| MoreHopQA | 96.14 | 95.80 | 94.68 |
Key Findings

Answer prediction always helps: Adding the final answer to the generative loss teaches the model what information is needed to solve the query, improving retrieval quality (+4.06 Hits@1 on MuSiQue).

Reward modeling trade-off: While observing causal negatives improves discrimination on handcrafted distractors (82.32 Hits@1), it overfits to those specific negatives. In open retrieval, reward modeling causes a 7.32% drop versus only 5.09% for answers-only, indicating that learning to reject specific negatives hurts generalization to unseen corpora.

Citation

BibTeX
@inproceedings{erker2026grithopper,
  title={{GRITHopper}: Decomposition-Free Multi-Hop Dense Retrieval},
  author={Erker, Justus-Jonas and Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2026 Conference of the European Chapter
             of the Association for Computational Linguistics (EACL)},
  year={2026},
  url={https://arxiv.org/abs/2503.07519}
}