A comparative analysis of privacy-preserving large language models for automated echocardiography report analysis.

IF 4.6 · CAS Tier 2 (Medicine) · JCR Q1 (Computer Science, Information Systems)
Elham Mahmoudi, Sanaz Vahdati, Chieh-Ju Chao, Bardia Khosravi, Ajay Misra, Francisco Lopez-Jimenez, Bradley J Erickson
{"title":"A comparative analysis of privacy-preserving large language models for automated echocardiography report analysis.","authors":"Elham Mahmoudi, Sanaz Vahdati, Chieh-Ju Chao, Bardia Khosravi, Ajay Misra, Francisco Lopez-Jimenez, Bradley J Erickson","doi":"10.1093/jamia/ocaf056","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Automated data extraction from echocardiography reports could facilitate large-scale registry creation and clinical surveillance of valvular heart diseases (VHD). We evaluated the performance of open-source large language models (LLMs) guided by prompt instructions and chain of thought (CoT) for this task.</p><p><strong>Methods: </strong>From consecutive transthoracic echocardiographies performed in our center, we utilized 200 random reports from 2019 for prompt optimization and 1000 from 2023 for evaluation. Five instruction-tuned LLMs (Qwen2.0-72B, Llama3.0-70B, Mixtral8-46.7B, Llama3.0-8B, and Phi3.0-3.8B) were guided by prompt instructions with and without CoT to classify prosthetic valve presence and VHD severity. Performance was evaluated using classification metrics against expert-labeled ground truth. Mean squared error (MSE) was also calculated for predicted severity's deviation from actual severity.</p><p><strong>Results: </strong>With CoT prompting, Llama3.0-70B and Qwen2.0 achieved the highest performance (accuracy: 99.1% and 98.9% for VHD severity; 100% and 99.9% for prosthetic valve; MSE: 0.02 and 0.05, respectively). Smaller models showed lower accuracy for VHD severity (54.1%-85.9%) but maintained high accuracy for prosthetic valve detection (>96%). Chain of thought reasoning yielded higher accuracy for larger models while increasing processing time from 2-25 to 67-154 seconds per report. Based on CoT reasonings, the wrong predictions were mainly due to model outputs being influenced by irrelevant information in the text or failure to follow the prompt instructions.</p><p><strong>Conclusions: </strong>Our study demonstrates the near-perfect performance of open-source LLMs for automated echocardiography report interpretation with the purpose of registry formation and disease surveillance. While larger models achieved exceptional accuracy through prompt optimization, practical implementation requires balancing performance with computational efficiency.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"1120-1129"},"PeriodicalIF":4.6000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12257941/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocaf056","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Automated data extraction from echocardiography reports could facilitate large-scale registry creation and clinical surveillance of valvular heart diseases (VHD). We evaluated the performance of open-source large language models (LLMs) guided by prompt instructions and chain of thought (CoT) for this task.

Methods: From consecutive transthoracic echocardiograms performed at our center, we used 200 randomly selected reports from 2019 for prompt optimization and 1000 from 2023 for evaluation. Five instruction-tuned LLMs (Qwen2.0-72B, Llama3.0-70B, Mixtral8-46.7B, Llama3.0-8B, and Phi3.0-3.8B) were guided by prompt instructions, with and without CoT, to classify prosthetic valve presence and VHD severity. Performance was evaluated using classification metrics against expert-labeled ground truth. Mean squared error (MSE) was also calculated for the deviation of predicted severity from actual severity.
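
The abstract does not disclose the exact prompts or serving stack, so the following is only a minimal sketch of what CoT-guided classification of a single report might look like, assuming a locally hosted, OpenAI-compatible inference endpoint (e.g., vLLM or Ollama serving Llama3.0-70B). The prompt template, endpoint URL, and `classify_report` helper are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: prompt wording, endpoint, and model name are assumptions,
# not taken from the paper. Assumes an OpenAI-compatible local server (vLLM/Ollama).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local server, no real key

COT_PROMPT = """You are reviewing a transthoracic echocardiography report.
Task 1: State whether a prosthetic valve is present (yes/no).
Task 2: Grade aortic stenosis severity as one of: none, mild, moderate, severe.
Think step by step, quoting the relevant sentences from the report, then give your
final answers on the last line as: prosthetic=<yes/no>; severity=<label>.

Report:
{report_text}
"""

def classify_report(report_text: str, model: str = "llama3-70b-instruct") -> str:
    """Send one report through the CoT prompt and return the raw model output."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": COT_PROMPT.format(report_text=report_text)}],
        temperature=0.0,  # deterministic decoding for reproducible extraction
    )
    return response.choices[0].message.content
```

In practice the final answer line would be parsed with a small regular expression, and the free-text CoT portion retained for the kind of error analysis the authors describe.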

Results: With CoT prompting, Llama3.0-70B and Qwen2.0 achieved the highest performance (accuracy: 99.1% and 98.9% for VHD severity; 100% and 99.9% for prosthetic valve detection; MSE: 0.02 and 0.05, respectively). Smaller models showed lower accuracy for VHD severity (54.1%-85.9%) but maintained high accuracy for prosthetic valve detection (>96%). CoT reasoning yielded higher accuracy for the larger models while increasing processing time from 2-25 to 67-154 seconds per report. Based on the CoT reasoning traces, incorrect predictions were mainly due to model outputs being influenced by irrelevant information in the report text or by failure to follow the prompt instructions.
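
The reported MSE values imply a numeric encoding of the ordinal severity grades, which the abstract does not specify. The sketch below assumes a none=0, mild=1, moderate=2, severe=3 mapping purely for illustration; the function name and example data are hypothetical.

```python
# Minimal sketch of an ordinal MSE over severity grades; the none=0..severe=3
# encoding is an assumption, not stated in the abstract.
SEVERITY_TO_SCORE = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}

def severity_mse(predicted: list[str], actual: list[str]) -> float:
    """Mean squared error between predicted and expert-labeled severity grades."""
    squared_errors = [
        (SEVERITY_TO_SCORE[p] - SEVERITY_TO_SCORE[a]) ** 2
        for p, a in zip(predicted, actual)
    ]
    return sum(squared_errors) / len(squared_errors)

# One moderate-vs-severe miss among 10 otherwise correct reports gives MSE = 0.1.
print(severity_mse(["mild"] * 9 + ["moderate"], ["mild"] * 9 + ["severe"]))
```

Under such an encoding, an MSE of 0.02-0.05 corresponds to only a handful of one-grade misses per thousand reports, consistent with the near-perfect accuracies reported.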

Conclusions: Our study demonstrates the near-perfect performance of open-source LLMs for automated echocardiography report interpretation for the purposes of registry formation and disease surveillance. While larger models achieved exceptional accuracy through prompt optimization, practical implementation requires balancing performance with computational efficiency.

Source journal

Journal of the American Medical Informatics Association
Category: Medicine - Computer Science, Interdisciplinary Applications
CiteScore: 14.50
Self-citation rate: 7.80%
Articles published per year: 230
Review turnaround: 3-8 weeks
Journal description: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.