{"title":"A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization.","authors":"Griffin Adams, Jason Zucker, Noémie Elhadad","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The factual consistency of summaries-their faithfulness-is critical to their safe usage in clinical settings. To better understand the limitations of state-of-the-art natural language processing (NLP) systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like \"following up\") into one of three categories: \"Incorrect,\" \"Missing,\" and \"Not in Notes.\" We meta-evaluate a broad set of faithfulness metrics-proposed for the general NLP domain-by measuring the correlation of metric scores to clinician ratings. Across metrics, we explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble. We find that off-the-shelf metrics with no exposure to clinical text correlate well to clinician ratings yet overly rely on copy-and-pasted text. As a practical guide, we observe that most metrics correlate best to clinicians when provided with one summary sentence at a time and a minimal set of supporting sentences from the notes before discharge.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"219 ","pages":"2-30"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11441639/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The factual consistency of summaries, i.e., their faithfulness, is critical to their safe use in clinical settings. To better understand the limitations of state-of-the-art natural language processing (NLP) systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations of model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with a complex medical history. Annotators are presented with summaries and source notes and asked to categorize manually highlighted summary elements (clinical entities such as conditions and medications, as well as actions such as "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of faithfulness metrics, proposed for the general NLP domain, by measuring the correlation of metric scores with clinician ratings. Across metrics, we explore the importance of domain adaptation (e.g., the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble. We find that off-the-shelf metrics with no exposure to clinical text correlate well with clinician ratings yet rely too heavily on copy-and-pasted text. As a practical guide, we observe that most metrics correlate best with clinician ratings when given one summary sentence at a time along with a minimal set of supporting sentences from the notes written before discharge.
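Since the core of the meta-evaluation described here is correlating automatic faithfulness-metric scores with clinician annotations, a minimal sketch of that correlation step may help make it concrete. The metric names ("entailment", "qa_overlap"), the toy records, and the way per-sentence annotations are collapsed into a single human faithfulness score are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a metric meta-evaluation: correlate automatic
# faithfulness-metric scores with clinician ratings at the sentence level.
# Metric names, records, and the error-to-score mapping are hypothetical.
from scipy.stats import spearmanr, kendalltau

# Each record is one model-generated summary sentence, scored by several
# (hypothetical) automatic metrics and annotated by clinicians.
records = [
    {"metric_scores": {"entailment": 0.91, "qa_overlap": 0.80}, "num_elements": 5, "num_errors": 0},
    {"metric_scores": {"entailment": 0.42, "qa_overlap": 0.55}, "num_elements": 4, "num_errors": 2},
    {"metric_scores": {"entailment": 0.75, "qa_overlap": 0.60}, "num_elements": 6, "num_errors": 1},
]

# Assumed human target: fraction of highlighted summary elements
# (conditions, medications, actions) not flagged as faithfulness errors
# (e.g., "Incorrect" or "Not in Notes").
human = [1 - r["num_errors"] / r["num_elements"] for r in records]

for metric in ["entailment", "qa_overlap"]:
    scores = [r["metric_scores"][metric] for r in records]
    rho, _ = spearmanr(scores, human)
    tau, _ = kendalltau(scores, human)
    print(f"{metric}: Spearman rho={rho:.2f}, Kendall tau={tau:.2f}")
```

Under this setup, a higher rank correlation simply means the metric orders sentences more like the clinicians do; the paper's actual choice of correlation statistic and annotation aggregation may differ.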