Localizing and Mitigating Errors in Long-form Question Answering

Rachneet Sachdeva¹, Yixiao Song², Mohit Iyyer², Iryna Gurevych¹

¹UKP Lab, TU Darmstadt   ²University of Massachusetts Amherst

Annotating Fine-grained Errors in Long-form Answers

Prior LFQA evaluations with non-expert (Nakano et al., 2021) and expert (Xu et al., 2023a) annotators collect preference judgments over model responses. However, overall preference is not indicative of fine-grained errors in LFQA. Our work addresses this gap by introducing HaluQuestQA, a dataset of long-form answers annotated at the span level with five error types: question misconception, factuality, completeness, relevance, and references. Expert annotators provide these annotations along with overall preference judgments.
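To make the annotation format concrete, the sketch below shows one plausible way a single annotated example could be represented in Python. The field names and example content are purely illustrative and do not reflect the released dataset's actual schema.

# Hypothetical sketch of a single HaluQuestQA-style record
# (illustrative field names only, not the dataset's actual schema).
record = {
    "question": "Why does ice float on water?",
    "answers": {
        "human": "Ice floats because frozen water is less dense ...",
        "gpt4": "When water freezes, its molecules arrange into ...",
    },
    # Span-level error annotations from expert annotators.
    "error_spans": [
        {
            "answer": "gpt4",              # which answer the span belongs to
            "error_type": "completeness",  # question misconception, factuality,
                                           # completeness, relevance, or references
            "span": "its molecules arrange into ...",
            "explanation": "Does not explain why lower density makes ice float.",
        },
    ],
    # Overall preference judgment between the two answers.
    "preference": "human",
}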

HaluQuestQA Data Collection

Overview of our data collection process. Using five fine-grained evaluation criteria, we collect span-level expert human judgments on question-answer pairs from the Reddit platform, as well as on corresponding answers generated by GPT-4.

Abstract

Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, making their faithful evaluation challenging. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans containing incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-informed refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces errors and improves answer quality across multiple models. Furthermore, humans find answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers.

Answers lack comprehensiveness and provide unhelpful references

We score human and model answers on our defined evaluation criteria to understand how experts' answer preferences diverge across different domains. We observe that human-written and model-generated answers score high on factuality and relevance, meaning most of the information provided is verifiable, trustworthy, and relevant to the question. However, the answers score low on the completeness and references aspects: according to expert judgments, they lack important information and provide web references and examples that are not helpful (Liu et al., 2023a). Specifically, GPT-4 hallucinates and provides incorrect or fabricated web links, while human answers digress from the topic and include irrelevant information.

How to improve answer comprehensiveness?

We propose Error-Informed Refinement (EIR), a method that enhances the quality of both human-written and language model-generated answers by leveraging targeted, model-generated feedback. Our approach consists of two key components: an error feedback model, trained on annotated completeness errors from the HaluQuestQA dataset, which evaluates an initial response and generates sentence-level feedback; and a refinement model, which takes the original prompt, the initial response, and the feedback to produce a more complete and polished answer.
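As a rough illustration, the sketch below wires these two components together in Python. It assumes feedback_llm and refiner_llm are generic text-completion callables (e.g., wrappers around LLaMA2-13B-chat or an API); the prompt wording and function names are ours for illustration and are not the exact prompts used in the paper.

# Minimal sketch of Error-Informed Refinement (EIR), assuming `feedback_llm`
# and `refiner_llm` map a prompt string to a completion string.
from typing import Callable

def error_informed_refinement(
    question: str,
    answer: str,
    feedback_llm: Callable[[str], str],
    refiner_llm: Callable[[str], str],
) -> str:
    # Step 1: the feedback model flags sentences with completeness errors
    # and briefly explains what information is missing.
    feedback_prompt = (
        "Review the answer sentence by sentence and point out any sentences "
        "that leave parts of the question unaddressed or omit key information. "
        "Briefly explain what is missing.\n\n"
        f"Question: {question}\nAnswer: {answer}\nFeedback:"
    )
    feedback = feedback_llm(feedback_prompt)

    # Step 2: the refinement model rewrites the answer, conditioned on the
    # original question, the initial answer, and the targeted feedback.
    refine_prompt = (
        "Rewrite the answer so that it fully addresses the question, fixing "
        "the issues raised in the feedback while keeping correct content.\n\n"
        f"Question: {question}\nInitial answer: {answer}\n"
        f"Feedback: {feedback}\nRevised answer:"
    )
    return refiner_llm(refine_prompt)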

EIR improves answer comprehensiveness

Our results show that inadequate feedback can degrade generation quality. While directly prompting the refinement model (LLaMA2-13B-chat) to generate answers (ZERO-SHOT) or to improve answers without detailed feedback (IMPROVE) performs better than the baseline, more targeted feedback, such as asking the model to complete the answer (GENERIC), consistently leads to higher-quality LFQA answers. Moreover, fine-grained feedback from our error detection model (EIR) outperforms both coarse-grained feedback and fine-grained human feedback (on HQ2A), reducing the share of erroneous samples by ~3% and error scores by ~38%, and improving F1 scores by ~5% on average.

Experts prefer EIR refined answers

To evaluate completeness, we adopt a comparative comprehensiveness metric: annotators judge which answer more fully addresses all parts of the question, based on our defined criteria for identifying completeness errors. To assess overall answer quality, annotators consider broader factors, such as factual precision and relevance, when selecting their preferred answer. We observe that refined answers are considered more comprehensive in ~60% of cases and preferred overall in ~84% of comparisons on average across all evaluated datasets, demonstrating improved completeness and quality over the baseline answers.

BibTeX

@misc{sachdeva2025localizingmitigatingerrorslongform,
      title={Localizing and Mitigating Errors in Long-form Question Answering},
      author={Rachneet Sachdeva and Yixiao Song and Mohit Iyyer and Iryna Gurevych},
      year={2025},
      eprint={2407.11930},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.11930},
}