Yiming Li, Fang Li, Na Hong, Manqi Li, Kirk Roberts, Licong Cui, Cui Tao, Hua Xu
Journal of Biomedical Informatics, Volume 168, Article 104867. DOI: 10.1016/j.jbi.2025.104867. Published 2025-06-20.
A comparative study of recent large language models on generating hospital discharge summaries for lung cancer patients
Objective
Generating discharge summaries is a crucial yet time-consuming task in clinical practice, essential for conveying pertinent patient information and facilitating continuity of care. Recent advancements in large language models (LLMs) have significantly enhanced their capability to understand and summarize complex medical texts. This research aims to explore how LLMs can alleviate the burden of manual summarization, streamline clinical workflows, and support informed decision-making in healthcare settings.
Materials and methods
Clinical notes from a cohort of 1,099 lung cancer patients were utilized, of which 50 patients were used for testing and 102 for model fine-tuning. This study evaluates the performance of multiple LLMs, including GPT-3.5, GPT-4, GPT-4o, and LLaMA 3 8B, in generating discharge summaries. Evaluation metrics included token-level analysis (BLEU, ROUGE-1, ROUGE-2, ROUGE-L), semantic similarity scores, and manual evaluation of clinical relevance, factual faithfulness, and completeness. An iterative method was further tested on LLaMA 3 8B using clinical notes of varying lengths to examine the stability of its performance.
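The token-level metrics named above can be illustrated with a minimal sketch of ROUGE-1 F1 (unigram overlap between a reference and a generated summary). This is an assumption-laden simplification for illustration only: it uses plain whitespace tokenization with no stemming, and the study's actual evaluation tooling and preprocessing are not specified in the abstract.

```python
from collections import Counter

def rouge_1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a reference and a candidate summary.

    Illustrative sketch only: lowercased whitespace tokens, no stemming or
    stopword handling; the paper's exact evaluation pipeline may differ.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example strings, not drawn from the study's clinical notes:
reference = "patient discharged home in stable condition"
candidate = "patient discharged in stable condition"
score = rouge_1_f1(reference, candidate)  # precision 1.0, recall 5/6
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern over bigrams and longest common subsequences, respectively; in practice a library such as `rouge-score` would be used rather than a hand-rolled implementation.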
Results
The study found notable variations in summarization capabilities among LLMs. GPT-4o and fine-tuned LLaMA 3 demonstrated superior token-level evaluation metrics, while manual evaluation further revealed that GPT-4 achieved the highest scores in relevance (4.95 ± 0.22) and factual faithfulness (4.40 ± 0.50), whereas GPT-4o performed best in completeness (4.55 ± 0.69); both models showed comparable overall quality. Semantic similarity scores indicated that GPT-4o and LLaMA 3 were the leading models in capturing the underlying meaning and context of clinical narratives.
Conclusion
This study contributes insights into the efficacy of LLMs for generating discharge summaries, highlighting the potential of automated summarization tools to enhance documentation precision and efficiency, ultimately improving patient care and operational capability in healthcare settings.
Journal Introduction
The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.