Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue

Abstract

Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing.

Motivation

Distribution of classes in our dataset
  • Existing approaches treat meta-reviewing as summarization, despite it being a complex decision-making task requiring deliberation over reviewer arguments.
  • Current LLMs lack grounded, interactive support to help meta-reviewers reason over reviews and justify accept/reject decisions.
  • The absence of realistic dialogue data and task-specific evaluation hinders progress on building effective meta-reviewing assistants.

Contributions

  • We present the first study framing meta-reviewing as a document-grounded dialogue and propose comprehensive measures to develop dialogue agents for this scenario.
  • We address data scarcity for training dialogue agents by generating synthetic meta-reviewing dialogues with LLMs and introduce a self-refinement strategy to improve dialogue quality.
  • We fine-tune dialogue agents on the synthetic data and demonstrate their effectiveness in real-world meta-reviewing, improving both efficiency and quality of meta-review reports and reducing meta-reviewing time by 50%.

Synthetic Data Generation Setup

Our proposed framework, ReMuSE

Our proposed method, ReMuSE (Reward-based Multi-aspect Self-Editing), is a self-refinement method for generating high-quality synthetic meta-reviewing dialogues. It first generates an initial dialogue from the reviews using an LLM, then evaluates the dialogue with multiple quality metrics. Based on these evaluations, it produces natural-language feedback and refines the dialogue to improve groundedness and specificity. By iteratively applying this process, ReMuSE creates a large synthetic dataset that can be used to train and evaluate models for meta-reviewing tasks.
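The generate-evaluate-feedback-refine loop above can be sketched as follows. This is a minimal, hedged illustration: the function names (`score_dialogue`, `make_feedback`, `refine_dialogue`), the scoring rule, and the stopping threshold are assumptions for exposition, and the stubs stand in for the LLM calls and multi-aspect metrics that ReMuSE actually uses.

```python
# Illustrative ReMuSE-style self-refinement loop (not the authors' code).
# Stubs replace the LLM generator/refiner and the real quality metrics.

def score_dialogue(dialogue, reviews):
    """Toy groundedness score: fraction of turns that quote a review snippet."""
    grounded = sum(1 for turn in dialogue if any(r in turn for r in reviews))
    return grounded / max(len(dialogue), 1)

def make_feedback(score):
    """Turn the metric into natural-language feedback (stub)."""
    return "Ground more turns in the reviews." if score < 1.0 else "Looks good."

def refine_dialogue(dialogue, feedback, reviews):
    """Stub refiner: appends a grounded turn; a real system would prompt an LLM
    with the feedback and regenerate the dialogue."""
    return dialogue + [f"Meta-reviewer: the reviews note '{reviews[0]}'."]

def remuse_loop(reviews, initial_dialogue, max_iters=3, target=0.5):
    """Iteratively evaluate and refine until the score reaches the target."""
    dialogue = initial_dialogue
    for _ in range(max_iters):
        score = score_dialogue(dialogue, reviews)
        if score >= target:
            break
        feedback = make_feedback(score)
        dialogue = refine_dialogue(dialogue, feedback, reviews)
    return dialogue, score_dialogue(dialogue, reviews)
```

In this sketch, the feedback string is what distinguishes self-editing from blind resampling: the refiner conditions on an explicit critique of the previous draft rather than starting from scratch.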

Comparison of human vs LLM-generated dialogues

Comparison of LLM- vs. human-generated dialogues on the two tasks
  • LLM-based dialogue agents perform meta-reviewing at a level comparable to humans in certain tasks.
  • LLM meta-reviewers still lack sufficient domain knowledge to fully replace human meta-reviewers.

Training dialogue agents on synthetic data

Comparison of dialogue agents
  • Flan T5 achieves strong K-Prec (68.2), while ChatGPT performs lower (42.1), highlighting the need for high-quality task-specific supervision.
  • ReMuSE maintains high faithfulness (K-Prec 67.6) but has lower BLEU and BERTScore, as it prioritizes faithful and diverse generation over surface-level alignment.
  • Even with only 10% of the data, models trained on ReMuSE outperform zero-shot variants, demonstrating its efficiency for meta-reviewing tasks.
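For intuition on the K-Prec numbers above, here is a minimal token-level sketch of knowledge precision, i.e., the fraction of response tokens that also appear in the grounding reviews. This is a common definition in grounded-dialogue evaluation; the paper's exact computation (tokenization, normalization) may differ.

```python
# Hedged illustration of token-level K-Prec (knowledge precision).
# A real evaluation would use proper tokenization and normalization.

def k_prec(response: str, knowledge: str) -> float:
    """Fraction of response tokens that occur in the knowledge source."""
    resp_tokens = response.lower().split()
    know_tokens = set(knowledge.lower().split())
    if not resp_tokens:
        return 0.0
    return sum(t in know_tokens for t in resp_tokens) / len(resp_tokens)
```

A high K-Prec means the agent's turns stay anchored in the review text, which is why it is reported separately from overlap metrics like BLEU that reward matching a single reference dialogue.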

Bibtex

@misc{purkayastha2026decisionmakingdeliberationmetareviewingdocumentgrounded,
      title={Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue}, 
      author={Sukannya Purkayastha and Nils Dycke and Anne Lauscher and Iryna Gurevych},
      year={2026},
      eprint={2508.05283},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.05283}, 
}