Development and evaluation of a clinical note summarization system using large language models.

Impact factor: 5.4; JCR Q1 (Medicine, Research & Experimental)

Communications medicine; Pub Date: 2025-08-28; DOI: 10.1038/s43856-025-01091-3

Juliana Damasio Oliveira, Henrique D P Santos, Ana Helena D P S Ulbrich, Julia Colleoni Couto, Marcelo Arocha, Joaquim Santos, Manuela Martins Costa, Daniela Faccio, Fabio O Tabalipa, Rodrigo F Nogueira

Communications medicine, volume 5, issue 1, article 376. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12394402/pdf/


Abstract

Background: Clinical notes are a vital and detailed source of information about patient hospitalizations. However, the sheer volume and complexity of these notes make evaluation and summarization challenging. Nonetheless, summarizing clinical notes is essential for accurate and efficient clinical decision-making in patient care. Generative language models, particularly large language models such as GPT-4, offer a promising solution by creating coherent, contextually relevant text based on patterns learned from large datasets.

Methods: This study describes the development of a discharge summary system using large language models. By conducting an online survey and interviews, we gather feedback from end users, including physicians and patients, to ensure the system meets their practical needs and fits their experiences. Additionally, we develop a rating system to evaluate prompt effectiveness by comparing model-generated outputs with human assessments, which serve as benchmarks to evaluate the performance of the automated model.

Results: Here we show that the model's ability to interpret diagnoses borders on human-level accuracy, demonstrating its potential to assist healthcare professionals in routine tasks such as generating discharge summaries.

Conclusions: This advancement underscores the potential of large language models in clinical settings and opens up possibilities for broader applications in healthcare documentation and decision-making support.

