Large Language Models in Cardiology: Systematic Review.

IF 2.2 Q2 Medicine
JMIR Cardio | Pub Date: 2026-04-16 | DOI: 10.2196/76734
Moran Gendler, Girish N Nadkarni, Karin Sudri, Michal Cohen-Shelly, Benjamin S Glicksberg, Orly Efros, Shelly Soffer, Eyal Klang

Abstract


Background: Large language models (LLMs) are increasingly used in health care, but their role in cardiology has not yet been systematically evaluated.

Objective: This review aimed to assess the applications, performance, and limitations of LLMs across diverse cardiology tasks, including chronic and progressive conditions, acute events, education, and diagnostic testing.

Methods: A systematic search was conducted in PubMed and Scopus for studies published up to April 14, 2024, using keywords related to LLMs and cardiology. Studies evaluating LLM outputs in cardiology-related tasks were included. Data were extracted across 5 predefined domains, and the risk of bias was assessed using an adapted QUADAS-2 tool (developed by Whiting et al at the University of Bristol). The review protocol was registered in PROSPERO (CRD42024556397).

Results: A total of 33 studies contributed quantitative outcome data to a descriptive synthesis. Across chronic conditions, ChatGPT-3.5 (OpenAI) answered 91% (43/47) of heart failure questions accurately, although readability often required college-level comprehension. In acute scenarios, Bing Chat omitted key myocardial infarction first aid steps in 25% (5/20) to 45% (9/20) of cases, while cardiac arrest information was rated highly (mean 4.3/5, SD 0.7) but written above recommended reading levels. In physician education tasks, ChatGPT-4 (OpenAI) demonstrated higher accuracy than ChatGPT-3.5, improving from 38% (33/88) to 66% (58/88). In patient education studies, ChatGPT-4 provided scientifically adequate explanations (5.0-6.0/7) comparable to hospital materials but at higher reading levels (11th vs 7th grade). In diagnostic testing, ChatGPT-4 interpreted 91% (36/40) of electrocardiogram vignettes correctly, significantly better than emergency physicians (31/40, 77%; P<.001), but showed lower performance in echocardiography.

Conclusions: LLMs show meaningful potential in cardiology, especially for education and electrocardiogram interpretation, but performance varies across clinical tasks. Limitations in emergency guidance and readability, as well as small in silico study designs, highlight the need for multimodal models and prospective validation.

Source journal: JMIR Cardio (Computer Science: Computer Science Applications)
CiteScore: 3.50 | Self-citation rate: 0.00% | Articles per year: 25 | Review time: 12 weeks