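Evaluating Large Language Models in Cardiovascular Antithrombotic Care: Performance, Accuracy, and Implications for Clinical Practice.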

IF 5.8, Zone 2 (Medicine), Q1 CARDIAC & CARDIOVASCULAR SYSTEMS

Canadian Journal of Cardiology Pub Date : 2025-04-15 DOI:10.1016/j.cjca.2025.04.008

Pavel Antiperovitch, Iris Liu, Ahmed T Mokhtar, Anthony Tang


Citations: 0

Abstract

Evaluating Large Language Models in Cardiovascular Antithrombotic Care: Performance, Accuracy, and Implications for Clinical Practice.

Background: Large language models (LLMs) are increasingly accessible for medical decision making and are often used by medical practitioners and patients. However, previous studies have raised concerns about the accuracy of individual LLMs in assisting with patient management.

Methods: This study assessed the performance of 7 publicly available LLMs on validated cardiovascular antithrombotic care scenarios, evaluated by 3 independent clinicians for accuracy and reasoning. The results were compared with the performance of volunteer clinicians, based on a survey conducted at the Canadian Cardiovascular Congress in 2023. Statistical analyses ensured interobserver reliability and evaluated performance differences among models.
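The abstract does not name the statistic used for interobserver reliability. As an illustration only, the sketch below computes Fleiss' kappa, one common agreement measure for three or more raters, on made-up ratings. The function name `fleiss_kappa`, the labels, and the ratings are all hypothetical and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each rated by the same number of raters.

    ratings: list of per-item lists of labels (one label per rater).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # For each item, count how many raters chose each category.
    counts = [Counter(item) for item in ratings]

    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance, from the overall category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cat)

    # Undefined (division by zero) if every rating falls in a single category.
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings: 4 model answers, each graded by 3 clinicians.
example_ratings = [
    ["correct", "correct", "correct"],
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "incorrect"],
    ["correct", "incorrect", "incorrect"],
]
print(round(fleiss_kappa(example_ratings, ["correct", "incorrect"]), 3))  # prints 0.333
```

Values near 1 indicate strong agreement among graders and values near 0 indicate agreement no better than chance.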

Results: Claude 3 Opus correctly answered 85% of clinical scenarios, significantly outperforming both the other LLMs (P < 0.001) and all clinician groups. Among clinicians, cardiologists and senior residents achieved the highest accuracy rates: 43% (95% confidence interval [CI], 32%-52%) and 47% (95% CI, 39%-56%), respectively, comparable with GPT-4o (55%) and Claude 3.5 Sonnet (44%). General practitioners performed similarly to Claude 3 Sonnet and Gemini 1.5: 22% (95% CI, 11%-33%) vs 26% vs 30%, whereas medical students achieved 8.3% (95% CI, 2%-15%), closely aligning with GPT-3.5 (10%).
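The abstract reports 95% confidence intervals for clinician accuracy but does not say how they were computed or give the underlying counts. As a rough sketch of how such an interval can be obtained for a proportion, the code below uses the Wilson score interval, which is an assumption here and not necessarily the authors' method. The function name `wilson_interval` and the counts in the example are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    successes: number of correct answers.
    n: total number of answers (must be positive).
    z: normal quantile; 1.959964 corresponds to a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Made-up counts for illustration only: 45 correct answers out of 100.
low, high = wilson_interval(45, 100)
print(f"accuracy 45%, 95% CI {low:.1%} to {high:.1%}")
```

Unlike the simple normal approximation, the Wilson interval stays within 0% to 100% even for small samples or proportions near the extremes, which matters for groups with low accuracy.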

Conclusions: The performance of LLMs in cardiovascular clinical scenarios varied widely, with some models outperforming clinicians, and some free-tier models providing inappropriate medical advice to clinicians. However, all tested models demonstrated acceptable performance for delivering patient advice regarding lifestyle and dietary recommendations. Clinicians and patients should exercise caution when using LLMs, select the best LLM for the task, and crosscheck provided references to ensure safe use of LLMs in practice.

Clinical trial registration: NCT05923658.

Source journal

Canadian Journal of Cardiology (Medicine: Cardiovascular Systems)

CiteScore: 9.20
Self-citation rate: 8.10%
Articles published: 546
Review time: 32 days

About the journal: The Canadian Journal of Cardiology (CJC) is the official journal of the Canadian Cardiovascular Society (CCS). The CJC is a vehicle for the international dissemination of new knowledge in cardiology and cardiovascular science, particularly serving as the major venue for Canadian cardiovascular medicine.