Expert Preference-based Evaluation of Automated Related Work Generation

Ubiquitous Knowledge Processing Lab (UKP Lab)
Technical University of Darmstadt
www.ukp.tu-darmstadt.de

We propose a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences.

Abstract

Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in large language models (LLMs) show promising potential in automating this process, reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional task-agnostic automatic evaluation metrics and LLM-as-a-judge systems—primarily designed for mainstream NLP tasks—are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support realistic human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single overall score, our framework decomposes the evaluation into smaller fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can theoretically facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the assessment of human experts. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.

Project Summary

The project consists of two stages. We first select the evaluator models via preliminary evaluation experiments with contrastive few-shot examples. Then, we run our pipeline, which employs an iterative algorithm where generation and evaluation are interleaved, simulating multi-turn human-AI interaction.
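A minimal sketch of this interleaved generate-evaluate loop is given below. It assumes generator and evaluator objects with generate/evaluate methods and a structured report object; all names are illustrative, not the actual implementation.

```python
# Illustrative sketch of the interleaved generate-evaluate loop (names are hypothetical).

def run_pipeline(paper_context, cited_papers, generator, evaluator, max_turns=3):
    """Simulate multi-turn human-AI interaction: generate a draft,
    evaluate it against the rubric, and feed the report back."""
    feedback, draft, report = None, None, None
    for turn in range(max_turns):
        # The generator sees the paper context, the cited papers, and
        # any feedback produced in the previous turn.
        draft = generator.generate(paper_context, cited_papers, feedback=feedback)

        # The evaluator scores each fine-grained dimension separately
        # and aggregates the results into a structured report.
        report = evaluator.evaluate(draft, paper_context, cited_papers)

        if report.all_hard_constraints_satisfied():
            break  # a valid RW section was produced; stop early

        # Otherwise, turn the report into natural-language feedback
        # for the next generation round.
        feedback = report.to_feedback()
    return draft, report
```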

Localized Evaluation Based on Expert Preferences

We design GREP (Granular Related-work Evaluation based on Preferences), a fine-grained, multi-turn evaluation system to assess the quality of generated RW sections and the ability of the generator to respond to evaluation feedback (figure on the right). Our evaluation rubric consists of hard constraints (i.e., conditions that must hold for a draft to be considered a valid RW section, e.g., no omitted papers, no hallucinated citations, coherent citations) as well as soft constraints (i.e., human preferences among multiple valid RW sections, e.g., internal structuring, emphasis on certain cited papers).
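The hard/soft split can be pictured as a simple rubric structure; the sketch below is illustrative only, and the constraint names are assumptions derived from the examples above rather than an exhaustive or official list.

```python
from dataclasses import dataclass, field

# Illustrative rubric structure; constraint names follow the examples
# mentioned above and are not an exhaustive list.

@dataclass
class Rubric:
    # Hard constraints: must hold for the draft to count as a valid RW section.
    hard_constraints: list = field(default_factory=lambda: [
        "no_omitted_paper",          # every provided paper is cited
        "no_hallucinated_citation",  # no citation outside the given paper set
        "coherent_citation",         # each citing sentence is supported by the cited paper
    ])
    # Soft constraints: expert preferences among multiple valid drafts.
    soft_constraints: list = field(default_factory=lambda: [
        "internal_structuring",      # paragraph organization and flow
        "citation_emphasis",         # appropriate emphasis on certain cited papers
        "length",                    # adequate section length
        "positioning",               # where the authors' own contribution is placed
    ])
```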

Contrastive Few-Shot Examples

Our preliminary experiments show that a vanilla zero-shot LLM-as-a-judge remains insufficient for such expert domain evaluations. We identify the main reason as the lack of contextual information describing each specific evaluation criterion and what it means to satisfy (or violate) it. (Few-shot examples trivially help such classification; however, they cannot be supplied to an end-to-end judge because of the context-length bottleneck.) For each possible outcome of a specific evaluation, we include an example along with a reasoning component that explains the expected outcome. Since finding failing examples for specific aspects is non-trivial, we generate synthetic examples using LLMs prompted to make deliberate mistakes.
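One way to assemble such a per-dimension judge prompt is sketched below; the prompt wording, field names, and function signature are assumptions for illustration, not the prompts used in the paper.

```python
# Illustrative sketch of a per-dimension judge prompt with contrastive
# few-shot examples (one positive, one negative, each with reasoning).
# The wording is hypothetical, not the actual prompt from the paper.

def build_judge_prompt(dimension, positive_example, negative_example, draft, context):
    return f"""You are evaluating the '{dimension}' of a related work section.

Example that SATISFIES the criterion:
{positive_example['text']}
Reasoning: {positive_example['reasoning']}

Example that VIOLATES the criterion (synthetically generated):
{negative_example['text']}
Reasoning: {negative_example['reasoning']}

Now evaluate the following draft against the same criterion.
Context (target paper and cited papers):
{context}

Draft:
{draft}

Answer with a label and a short justification."""
```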

Paper Information

Example Generated Draft

Recent work on bilingual training has demonstrated the benefits of inducing aligned word embeddings that mitigate data sparsity. For example, [1] employs a multitask framework leveraging co-occurrence statistics from parallel data to generate shared representations, while [2] extends the distributional hypothesis to multilingual settings by inducing joint-space embeddings that capture compositional semantics without explicit word alignments. These approaches establish a solid foundation for our model, which harnesses bilingual training to capture both semantic representations and context-dependent language production in a grounded color reference task.

Additional research has illustrated the effectiveness of sharing representations across languages with minimal architectural changes. In [3], the use of an artificial token to indicate the target language enabled a single neural machine translation model to perform zero-shot translation, illustrating implicit crosslingual transfer. Moreover, [4] demonstrates that a unified multi-task model spanning diverse domains attains competitive performance by learning shared representations. These insights motivate our bilingual strategy, where a shared vocabulary supports the generation of language-specific utterances while benefiting from cross-lingual inductive biases.

In summary, our work draws on the complementary strengths of bilingual representation learning from [1] and [2] and crosslingual sharing techniques from [3] and [4]. We introduce a bilingual model for color reference games that not only exhibits human-like contextual sensitivity and improves pragmatic informativeness but also faithfully captures language-specific semantic distinctions. This contribution extends the current literature on bilingual and multilingual models in grounded communication tasks by effectively integrating semantic understanding with pragmatic language production.

Evaluation Report

Missing Citations: None

Hallucinated Citations: None

Length: Adequate

Citation Emphasis

  • Cited Paper 1: Excessive emphasis
  • Cited Paper 2: Excessive emphasis
  • Cited Paper 3: Excessive emphasis
  • Cited Paper 4: Excessive emphasis

Coherence

Cited Paper: 4

Sentence: In summary, our work draws on the complementary strengths of bilingual representation learning from [1] and [2] and crosslingual sharing techniques from [3] and [4].

Reasoning: The given paper context discusses advancements in multi-task and multi-domain deep learning models, including the MultiModel architecture and its application to translation tasks, but does not explicitly address cross-lingual sharing techniques or bilingual representation learning. Consequently, the citation sentence referring to cross-lingual sharing techniques is not supported or entailed by the paper's content.

Positioning

Positioning Type Result: Matched

Positioning Type Reasoning: The paper lays its foundation by reviewing prior work and proposing its approach within that context, but it only explicitly summarizes its contributions—specifically, the introduction of a bilingual model for color reference games—in the final paragraph.

Positioning Problematic Paragraphs: None
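Viewed as structured output, the report above maps naturally onto a schema like the following sketch. The field names mirror the report headings; the schema itself is an assumption for illustration, not the framework's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative schema mirroring the report headings above (Python 3.9+);
# not the framework's actual data model.

@dataclass
class CoherenceIssue:
    cited_paper: int
    sentence: str
    reasoning: str

@dataclass
class EvaluationReport:
    missing_citations: list[int] = field(default_factory=list)
    hallucinated_citations: list[str] = field(default_factory=list)
    length: str = "Adequate"                       # e.g., "Adequate", "Too short", "Too long"
    citation_emphasis: dict[int, str] = field(default_factory=dict)  # paper id -> emphasis label
    coherence_issues: list[CoherenceIssue] = field(default_factory=list)
    positioning_result: str = "Matched"
    positioning_reasoning: str = ""
    problematic_paragraphs: Optional[list[int]] = None
```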

Expert Evaluation

We conduct an expert evaluation study to validate the automated assessment. Human experts interact with a pair of generator models simultaneously for three iterations. At each iteration, the experts evaluate the generated drafts in terms of coherence, positioning, and feedback (instruction) following, and provide feedback to each model independently; a sketch of this protocol follows below.
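A minimal sketch of this pairwise, three-iteration protocol, assuming expert and model objects with the methods shown (all names are hypothetical):

```python
# Illustrative sketch of the pairwise expert study protocol; the loop
# structure follows the description above, the object APIs are assumptions.

def expert_study(expert, model_a, model_b, paper_context, cited_papers, iterations=3):
    feedback = {"A": None, "B": None}
    drafts, ratings = {}, []
    for _ in range(iterations):
        # Both models generate drafts, each conditioned only on its own feedback.
        drafts["A"] = model_a.generate(paper_context, cited_papers, feedback=feedback["A"])
        drafts["B"] = model_b.generate(paper_context, cited_papers, feedback=feedback["B"])

        # The expert rates coherence, positioning, and instruction following,
        # then writes free-form feedback for each model independently.
        ratings.append(expert.rate(drafts, criteria=("coherence", "positioning", "instruction_following")))
        feedback["A"] = expert.write_feedback(drafts["A"])
        feedback["B"] = expert.write_feedback(drafts["B"])
    return drafts, ratings
```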


Results - Takeaways

Experiments unravel fundamental limitations of SoTA LLMs as RW section generators: they struggle to coherently cite prior work (even the best-performing model), improvement upon explicit feedback is rare, and they struggle to incorporate even simple preference-based instructions, such as adjusting the length of the generated RW section.
While a specialized SoTA LLM judge delivers near-random agreement with expert judgments (e.g., 53% match on citation coherence), automated assessments from PreciseGREP and OpenGREP align closely with the experts, matching 78% and 66% on citation coherence, respectively.

BibTeX

@misc{sahinuc2025expertEval,
    title         = {Expert Preference-based Evaluation of Automated Related Work Generation},
    author        = {Furkan \c{S}ahinu\c{c} and Subhabrata Dutta and Iryna Gurevych},
    year          = {2025},
    eprint        = {2508.07955},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL},
    url           = {https://arxiv.org/abs/2508.07955},
}