Mahshad Koohi Habibi Dehkordi, Yehoshua Perl, Fadi P Deek, Zhe He, Vipina K Keloth, Hao Liu, Gai Elhanan, Andrew J Einstein
Improving Large Language Models' Summarization Accuracy by Adding Highlights to Discharge Notes: Comparative Evaluation
JMIR Medical Informatics, vol. 13, e66476. Published 2025-07-24. DOI: 10.2196/66476
Citations: 0
Abstract
Background: The American Medical Association recommends that electronic health record (EHR) notes, often dense and written in nuanced language, be made readable for patients and laypeople, a practice we refer to as the simplification of discharge notes. Our approach to achieving the simplification of discharge notes involves a process of incremental simplification steps to achieve the ideal note. In this paper, we present the first step of this process. Large language models (LLMs) have demonstrated considerable success in text summarization. Such LLM summaries represent the content of EHR notes in an easier-to-read language. However, LLM summaries can also introduce inaccuracies.
Objective: This study aims to test the hypothesis that summaries generated by LLMs from discharge notes in which detailed information is highlighted are more accurate than summaries generated from the original, unhighlighted notes.
Methods: To test our hypothesis, we randomly sampled 15 discharge notes from the MIMIC III database and highlighted their detailed information using an interface terminology we previously developed with machine learning. This interface terminology was curated to encompass detailed information from the discharge notes. The highlighted discharge notes distinguished detailed information, specifically the concepts present in the aforementioned interface terminology, by applying a blue background. To calibrate the LLMs' summaries for our simplification goal, we chose GPT-4o and used prompt engineering to ensure high-quality prompts and address issues of output inconsistency and prompt sensitivity. We provided both highlighted and unhighlighted versions of each EHR note along with their corresponding prompts to GPT-4o. Each generated summary was manually evaluated to assess its quality using the following evaluation metrics: completeness, correctness, and structural integrity.
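The highlighting step can be approximated as a terminology lookup that wraps each matched concept in a span with a blue background. The concept list and helper below are hypothetical illustrations for the sketch only; the study's actual interface terminology was curated with machine learning and is not reproduced here:

```python
import re

# Hypothetical sample of concepts from an interface terminology;
# the actual terminology in the study was built with machine learning.
TERMINOLOGY = ["atrial fibrillation", "metoprolol", "echocardiogram"]

def highlight_note(note: str, concepts=TERMINOLOGY) -> str:
    """Wrap each terminology concept in a span with a blue background,
    mirroring the paper's visual-highlighting step."""
    # Match longer concepts first so nested terms are not split.
    for concept in sorted(concepts, key=len, reverse=True):
        pattern = re.compile(re.escape(concept), flags=re.IGNORECASE)
        note = pattern.sub(
            lambda m: f'<span style="background-color:#cce5ff">{m.group(0)}</span>',
            note,
        )
    return note

note = "Patient with atrial fibrillation was started on metoprolol."
print(highlight_note(note))
```

The highlighted text can then be passed to the model alongside the prompt, while the unhighlighted original serves as the control condition.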
Results: Across the study sample of 15 discharge notes, summaries of highlighted notes (H-summaries) achieved 96% completeness on average, 8 percentage points higher than summaries of unhighlighted notes (U-summaries). H-summaries had higher completeness for 13 notes, and U-summaries had higher or equal completeness for 2 notes (P=.01, statistically significant). H-summaries also demonstrated better correctness, with fewer instances of erroneous information (2 vs 3 errors). H-summaries had fewer improper headers for 11 notes, versus 4 notes for U-summaries (P=.03, statistically significant). In addition, we identified 8 instances of misplaced information in the U-summaries and only 2 in the H-summaries. These findings support the hypothesis that summarizing highlighted discharge notes improves the accuracy of the summaries.
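The per-note completeness comparison (13 of 15 notes favoring H-summaries) is consistent with an exact two-sided sign test, which reproduces the reported P=.01. The abstract does not state which test was used, so the sketch below is an assumption, not the authors' confirmed method:

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """Exact two-sided sign test P value under a 50/50 null hypothesis."""
    # Probability of observing `wins` or more successes out of n coin flips.
    tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test (symmetric null).
    return min(1.0, 2 * tail)

# 13 of 15 notes favored H-summaries for completeness.
p = sign_test_p(13, 15)
print(round(p, 2))  # rounds to 0.01, matching the reported P=.01
```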
Conclusions: Feeding LLMs with highlighted discharge notes, combined with prompt engineering, results in higher-quality summaries in terms of correctness, completeness, and structural integrity compared to unhighlighted discharge notes.
Journal Introduction:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It emphasizes applied, translational research and has a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals.
JMIR Med Inform is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (2016 impact factor: 5.175). It has a slightly different scope, emphasizing applications for clinicians and health professionals rather than consumers/citizens (the focus of JMIR); it publishes even faster and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.