Performance of Chat Generative Pre-trained Transformer-4o in the Adult Clinical Cardiology Self-Assessment Program.

Impact factor: 3.9 (JCR Q1, Cardiac & Cardiovascular Systems)

European heart journal. Digital health. Pub Date: 2024-10-21. eCollection Date: 2025-01-01. DOI: 10.1093/ehjdh/ztae077

Abdulaziz Malik, Christopher Madias, Benjamin S Wessler

Volume 6, issue 1, pages 155-158. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11750186/pdf/

Citation count: 0

Abstract

Aims: This study evaluates the performance of OpenAI's latest large language model (LLM), Chat Generative Pre-trained Transformer-4o, on the Adult Clinical Cardiology Self-Assessment Program (ACCSAP).

Methods and results: Chat Generative Pre-trained Transformer-4o was tested on 639 ACCSAP questions. After 45 questions containing video clips were excluded, 594 questions remained for analysis. The questions included a mix of text-based and static image-based [electrocardiogram (ECG), angiogram, computed tomography (CT) scan, and echocardiogram] formats. The model was allowed one attempt per question. Image-only performance was further evaluated on 25 questions from the database. Chat Generative Pre-trained Transformer-4o correctly answered 69.2% (411/594) of the questions. Performance was higher for text-only questions (73.9%) than for those requiring image interpretation (55.3%, P < 0.001). The model performed worse on questions involving ECGs, answering 56.5% correctly compared with 73.3% of non-ECG questions (P < 0.001). Although the model could interpret medical images presented within a text-based question, its accuracy varied, showing both strengths and notable gaps in diagnostic accuracy. It was inaccurate when reading images (ECGs, echocardiography, and angiograms) presented without context.

Conclusion: Chat Generative Pre-trained Transformer-4o performed moderately well on ACCSAP questions. However, the model's performance remains inconsistent, especially in interpreting ECGs. These findings highlight the potential and current limitations of using LLMs in medical education and clinical decision-making.
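The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the analysed question count and the overall accuracy from the reported numbers. It also estimates the share of text-only and ECG questions implied by the reported subgroup percentages. Those shares are not stated in the abstract: they are back-calculated here under the assumption that the overall accuracy is a weighted average of two subgroups, and they are only approximate because the inputs are rounded. The function names are illustrative.

```python
def accuracy(correct: int, total: int) -> float:
    """Percentage of questions answered correctly, rounded to one decimal."""
    return round(100 * correct / total, 1)


def implied_share(overall: float, group_a: float, group_b: float) -> float:
    """Fraction of questions in group A implied by a two-group weighted average.

    Solves overall = s * group_a + (1 - s) * group_b for s.
    This is an estimate from rounded percentages, not a reported figure.
    """
    return (overall - group_b) / (group_a - group_b)


# Figures reported in the abstract.
total_questions = 639
video_questions = 45
analysed = total_questions - video_questions  # 594 questions analysed

overall = accuracy(411, analysed)  # 69.2

# Back-calculated estimates (assumption: two-group weighted average).
text_share = implied_share(overall, 73.9, 55.3)  # text-only vs image-based
ecg_share = implied_share(overall, 56.5, 73.3)   # ECG vs non-ECG

print(f"Analysed questions: {analysed}")
print(f"Overall accuracy: {overall}%")
print(f"Implied text-only share: {text_share:.3f}")
print(f"Implied ECG share: {ecg_share:.3f}")
```

Under these assumptions, roughly three quarters of the analysed questions would be text-only and roughly one quarter would involve ECGs. The full paper should be consulted for the actual subgroup counts.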

Source journal

European heart journal. Digital health

CiteScore

5.00

Self-citation rate

0.00%

Articles published