Evaluating large language models in echocardiography reporting: opportunities and challenges.

IF 4.4 Q1 CARDIAC & CARDIOVASCULAR SYSTEMS

European heart journal. Digital health. Pub Date: 2025-03-31; eCollection Date: 2025-05-01; DOI: 10.1093/ehjdh/ztae086

Chieh-Ju Chao, Imon Banerjee, Reza Arsanjani, Chadi Ayoub, Andrew Tseng, Jean-Benoit Delbrouck, Garvan C Kane, Francisco Lopez-Jimenez, Zachi Attia, Jae K Oh, Bradley Erickson, Li Fei-Fei, Ehsan Adeli, Curtis Langlotz

Volume 6(3), pages 326-339. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12088711/pdf/

Citations: 0

Abstract


Evaluating large language models in echocardiography reporting: opportunities and challenges.

Aims: The increasing need for diagnostic echocardiography tests presents challenges in preserving the quality and promptness of reports. While Large Language Models (LLMs) have proven effective in summarizing clinical texts, their application in echo remains underexplored.

Methods and results: Adult echocardiography studies, conducted at the Mayo Clinic from 1 January 2017 to 31 December 2017, were categorized into two groups: development (all Mayo locations except Arizona) and Arizona validation sets. We adapted open-source LLMs (Llama-2, MedAlpaca, Zephyr, and Flan-T5) using In-Context Learning and Quantized Low-Rank Adaptation fine-tuning (FT) for echo report summarization from 'Findings' to 'Impressions.' Against cardiologist-generated Impressions, the models' performance was assessed both quantitatively with automatic metrics and qualitatively by cardiologists. The development dataset included 97 506 reports from 71 717 unique patients, predominantly male (55.4%), with an average age of 64.3 ± 15.8 years. EchoGPT, a fine-tuned Llama-2 model, outperformed other models with win rates ranging from 87% to 99% in various automatic metrics, and produced reports comparable to those of cardiologists in qualitative review (significantly preferred in conciseness (P < 0.001), with no significant preference in completeness, correctness, and clinical utility). Correlations between automatic and human metrics were fair to modest, with the best being RadGraph F1 scores vs. clinical utility (r = 0.42), and automatic metrics showed insensitivity (0-5% drop) to changes in measurement numbers.
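The two evaluation ideas mentioned above, pairwise win rates and the insensitivity of automatic metrics to changed measurement numbers, can be illustrated with a small sketch. This is not the authors' code: the abstract does not define how win rates were computed, so `win_rate` assumes "fraction of paired reports on which one model scores strictly higher", and `token_f1` is a simple unigram-overlap stand-in for the automatic metrics, not RadGraph F1. The example sentences are hypothetical.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings (illustrative stand-in metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def win_rate(scores_a, scores_b) -> float:
    """Fraction of paired reports where model A scores strictly higher than model B (assumed definition; ties are not wins)."""
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)


# Hypothetical impression sentence and a copy with a clinically important number changed.
reference = "Left ventricular ejection fraction is 55 percent with normal wall motion"
perturbed = "Left ventricular ejection fraction is 25 percent with normal wall motion"

# Only one of eleven tokens differs, so the overlap score stays high
# even though the clinical meaning changes substantially.
score_unchanged = token_f1(reference, reference)
score_perturbed = token_f1(perturbed, reference)
```

In this toy case a single changed number removes one of eleven matching tokens, so the score falls only slightly. Longer reports would dilute the change further, which is one plausible reading of why overlap-style metrics react weakly to measurement errors.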

Conclusion: EchoGPT can generate draft reports for human review and approval, helping to streamline the workflow. However, scalable evaluation approaches dedicated to echo reports remain necessary.


Source journal

European heart journal. Digital health

CiteScore

5.00

Self-citation rate

0.00%

Publication count