HaRiM^+: Evaluating Summary Quality with Hallucination Risk


AACL Bioflux | Pub Date: 2022-11-22 | DOI: 10.48550/arXiv.2211.12118

Seonil Son, Junsoo Park, J. Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee

Citations: 1

Abstract

One challenge in developing summarization models is the difficulty of measuring factual inconsistency in generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective proposed by Miao et al. (2021) as a measure of hallucination risk, and use it to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which requires only an off-the-shelf summarization model to compute hallucination risk from token likelihoods. Deploying it requires no additional model training and no ad-hoc modules, which usually need to be aligned with human judgments. For summary-quality estimation, HaRiM+ achieves state-of-the-art correlation with human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which demonstrates the merit of using summarization models for evaluation, facilitates progress in both the automated evaluation and the generation of summaries.
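The abstract does not state the formula, so the following is only a minimal sketch of how a likelihood-based hallucination risk could be computed. It assumes a margin-based form in the spirit of Miao et al. (2021): for each summary token, `p_s2s` is the probability the summarization model assigns given the source, and `p_lm` is the probability from an auxiliary language model that does not see the source. The function names, the exact risk expression, and the weight `lam` are illustrative assumptions, not values taken from this page.

```python
import math
from typing import Sequence


def harim_risk(p_s2s: Sequence[float], p_lm: Sequence[float]) -> float:
    """Average per-token hallucination risk (assumed margin-based form).

    Risk is high when the source-conditioned model is unsure (low p_s2s)
    and the source-free language model is comparatively confident
    (small or negative margin p_s2s - p_lm).
    """
    if len(p_s2s) != len(p_lm) or len(p_s2s) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for ps, pl in zip(p_s2s, p_lm):
        margin = ps - pl
        total += (1.0 - ps) * (1.0 - margin)
    return total / len(p_s2s)


def harim_plus(p_s2s: Sequence[float], p_lm: Sequence[float],
               lam: float = 1.0) -> float:
    """Average log-likelihood penalized by the risk term.

    `lam` is a hypothetical placeholder weight; higher scores mean
    the summary is judged better under this sketch.
    """
    avg_log_likelihood = sum(math.log(ps) for ps in p_s2s) / len(p_s2s)
    return avg_log_likelihood - lam * harim_risk(p_s2s, p_lm)


# Toy per-token probabilities (made up for illustration only).
grounded = harim_plus([0.9, 0.8], [0.1, 0.2])
ungrounded = harim_plus([0.3, 0.2], [0.6, 0.7])
print(grounded > ungrounded)
```

In practice the per-token probabilities would come from an off-the-shelf summarization model scoring the candidate summary with and without the source document; that scoring step is omitted here because it depends on the model and library used.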

