HaRiM^+: Evaluating Summary Quality with Hallucination Risk
Seonil Son, Junsoo Park, J. Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee
Journal: AACL Bioflux, vol. 52 (1), pp. 895-924. Published 2022-11-22.
DOI: 10.48550/arXiv.2211.12118 (https://doi.org/10.48550/arXiv.2211.12118)
Citations: 1
Abstract
One challenge in developing a summarization model is the difficulty of measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective proposed by Miao et al. (2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which requires only an off-the-shelf summarization model to compute the hallucination risk from token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation with human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which shows the merit of using summarization models, facilitates progress in both the automated evaluation and the generation of summaries.
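The abstract says the risk is computed from token likelihoods but does not give the formula. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each summary token has a probability under the summarization model (`p_s2s`) and a second probability for the same token computed without the source article (`p_lm`), and it assumes a per-token risk of the form (1 - p_s2s)(1 - (p_s2s - p_lm)). The function names and the weight `lam` are hypothetical, chosen only for illustration.

```python
import math
from typing import Sequence


def hallucination_risk(p_s2s: Sequence[float], p_lm: Sequence[float]) -> float:
    """Mean per-token risk.

    ASSUMED form (not stated in the abstract):
        (1 - p_s2s) * (1 - (p_s2s - p_lm))
    The risk is high when the token is unlikely given the source, and when
    the source adds little over a source-free language model.
    """
    if len(p_s2s) != len(p_lm) or not p_s2s:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for ps, pl in zip(p_s2s, p_lm):
        margin = ps - pl  # how much the source raises the token probability
        total += (1.0 - ps) * (1.0 - margin)
    return total / len(p_s2s)


def risk_adjusted_score(
    p_s2s: Sequence[float], p_lm: Sequence[float], lam: float = 1.0
) -> float:
    """Mean token log-likelihood minus a weighted risk penalty.

    `lam` is a hypothetical weight; probabilities must be strictly positive.
    """
    mean_log_lik = sum(math.log(p) for p in p_s2s) / len(p_s2s)
    return mean_log_lik - lam * hallucination_risk(p_s2s, p_lm)


# Toy probabilities (made up for illustration, not model outputs).
grounded = risk_adjusted_score([0.9, 0.8], [0.1, 0.2])
ungrounded = risk_adjusted_score([0.3, 0.4], [0.3, 0.4])
print(grounded > ungrounded)
```

In practice the two probability sequences would come from scoring the candidate summary with an off-the-shelf summarization model; how `p_lm` is obtained is left open here because the abstract does not specify it.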