Pavel Antiperovitch, Iris Liu, Ahmed T Mokhtar, Anthony Tang
Journal: Canadian Journal of Cardiology (JCR Q1, Cardiac & Cardiovascular Systems; impact factor 5.8)
Published: 2025-04-15
DOI: https://doi.org/10.1016/j.cjca.2025.04.008
Clinical trial registration: NCT05923658
Evaluating Large Language Models in Cardiovascular Antithrombotic Care: Performance, Accuracy, and Implications for Clinical Practice.
Background: Large language models (LLMs) are increasingly accessible for medical decision making and are often used by medical practitioners and patients. However, previous studies have raised concerns about the accuracy of individual LLMs in assisting with patient management.
Methods: This study assessed the performance of 7 publicly available LLMs on validated cardiovascular antithrombotic care scenarios, evaluated by 3 independent clinicians for accuracy and reasoning. The results were compared with the performance of volunteer clinicians, based on a survey conducted at the Canadian Cardiovascular Congress in 2023. Statistical analyses ensured interobserver reliability and evaluated performance differences among models.
Results: Claude 3 Opus correctly answered 85% of clinical scenarios, significantly outperforming both the other LLMs (P < 0.001) and all clinician groups. Among clinicians, cardiologists and senior residents achieved the highest accuracy rates: 43% (95% confidence interval [CI], 32%-52%) and 47% (95% CI, 39%-56%), respectively, comparable with GPT-4o (55%) and Claude 3.5 Sonnet (44%). General practitioners performed similarly to Claude 3 Sonnet and Gemini 1.5: 22% (95% CI, 11%-33%) vs 26% vs 30%, whereas medical students achieved 8.3% (95% CI, 2%-15%), closely aligning with GPT-3.5 (10%).
Conclusions: The performance of LLMs in cardiovascular clinical scenarios varied widely, with some models outperforming clinicians and some free-tier models providing inappropriate medical advice to clinicians. However, all tested models demonstrated acceptable performance for delivering patient advice regarding lifestyle and dietary recommendations. Clinicians and patients should exercise caution when using LLMs, select the best LLM for the task, and cross-check the references provided to ensure safe use of LLMs in practice.
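Illustrative note: the Results report each clinician group's accuracy as a proportion with a 95% CI. The abstract does not state which interval method was used or how many responses each group contributed, so the following is only a minimal sketch of one common choice, the Wilson score interval. The function name `wilson_interval` and the counts in the usage line are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    This is an assumed method; the source does not name the one it used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")

    p = successes / total                      # observed proportion
    denom = 1 + z ** 2 / total                 # shared denominator
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / total + z ** 2 / (4 * total ** 2)
    ) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 12 correct answers out of 40.
    low, high = wilson_interval(12, 40)
    print(f"accuracy 30.0% (95% CI, {low:.1%}-{high:.1%})")
```

The Wilson interval is chosen here because it remains sensible for small groups and for proportions near 0% or 100%, where the simpler normal approximation can produce bounds outside the 0-100% range.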
About the journal:
The Canadian Journal of Cardiology (CJC) is the official journal of the Canadian Cardiovascular Society (CCS). The CJC is a vehicle for the international dissemination of new knowledge in cardiology and cardiovascular science, particularly serving as the major venue for Canadian cardiovascular medicine.