Valerie Builoff BS, Aakash Shanbhag MSc, Robert J.H. Miller MD, Damini Dey PhD, Joanna X. Liang MPH, Kathleen Flood BS, Jamieson M. Bourque MD, Panithaya Chareonthaitawee MD, Lawrence M. Phillips MD, Piotr J. Slomka PhD
{"title":"评估人工智能在核心脏病学中的熟练程度:大型语言模型参加董事会准备考试。","authors":"Valerie Builoff BS , Aakash Shanbhag MSc , Robert JH. Miller MD , Damini Dey PhD , Joanna X. Liang MPH , Kathleen Flood BS , Jamieson M. Bourque MD , Panithaya Chareonthaitawee MD , Lawrence M. Phillips MD , Piotr J. Slomka PhD","doi":"10.1016/j.nuclcard.2024.102089","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs—GPT-4, GPT-4 Turbo, GPT-4omni (GPT-4o) (Open AI), and Gemini (Google Inc.)—in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.</div></div><div><h3>Methods</h3><div>We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.</div></div><div><h3>Results</h3><div>GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4% - 58.0%), 40.5% (39.9% - 42.9%), 60.7% (59.5% - 61.3%), and 63.1% (62.5%–64.3%) of questions, respectively. GPT-4o significantly outperformed other models (<em>P</em> = .007 vs GPT-4 Turbo, <em>P</em> < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (<em>P</em> < .001, <em>P</em> < .001, and <em>P</em> = .001), while Gemini performed worse on image-based questions (<em>P</em> < .001 for all).</div></div><div><h3>Conclusion</h3><div>GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.</div></div>","PeriodicalId":16476,"journal":{"name":"Journal of Nuclear Cardiology","volume":"45 ","pages":"Article 102089"},"PeriodicalIF":3.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam\",\"authors\":\"Valerie Builoff BS , Aakash Shanbhag MSc , Robert JH. Miller MD , Damini Dey PhD , Joanna X. Liang MPH , Kathleen Flood BS , Jamieson M. Bourque MD , Panithaya Chareonthaitawee MD , Lawrence M. Phillips MD , Piotr J. Slomka PhD\",\"doi\":\"10.1016/j.nuclcard.2024.102089\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Background</h3><div>Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. 
This study assesses four LLMs—GPT-4, GPT-4 Turbo, GPT-4omni (GPT-4o) (Open AI), and Gemini (Google Inc.)—in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.</div></div><div><h3>Methods</h3><div>We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.</div></div><div><h3>Results</h3><div>GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4% - 58.0%), 40.5% (39.9% - 42.9%), 60.7% (59.5% - 61.3%), and 63.1% (62.5%–64.3%) of questions, respectively. GPT-4o significantly outperformed other models (<em>P</em> = .007 vs GPT-4 Turbo, <em>P</em> < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (<em>P</em> < .001, <em>P</em> < .001, and <em>P</em> = .001), while Gemini performed worse on image-based questions (<em>P</em> < .001 for all).</div></div><div><h3>Conclusion</h3><div>GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.</div></div>\",\"PeriodicalId\":16476,\"journal\":{\"name\":\"Journal of Nuclear Cardiology\",\"volume\":\"45 \",\"pages\":\"Article 102089\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2025-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Nuclear Cardiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1071358124007918\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CARDIAC & CARDIOVASCULAR SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Nuclear Cardiology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1071358124007918","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam
Background
Previous studies have evaluated the performance of large language models (LLMs) across medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs: GPT-4, GPT-4 Turbo, and GPT-4 Omni (GPT-4o) from OpenAI, and Gemini from Google. The models responded to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, which reflects the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods
We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM received the same standardized prompt and was applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared the proportions of correct responses between models.
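As a rough illustration of this querying protocol (not the authors' actual code), the sketch below shows how a standardized prompt could be sent repeatedly to one of the LLM APIs and the answers collected. The prompt wording, question format, and helper names are assumptions for illustration only; image-based questions, which would require image attachments, are omitted.

```python
# Hypothetical sketch of the repeated-querying protocol: each model answers
# every question in a section, and the whole section is repeated 30 times
# to capture stochastic variation in the responses.
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STANDARD_PROMPT = (
    "You are taking a nuclear cardiology board preparation exam. "
    "Answer with the single best option (A, B, C, or D)."
)  # placeholder wording; the study's exact prompt is not reproduced here


def run_section(model: str, questions: list[dict], repeats: int = 30) -> list[list[str]]:
    """Return the model's answers for each of `repeats` passes over one section."""
    all_runs = []
    for _ in range(repeats):
        answers = []
        for q in questions:  # each q is a dict like {"stem": ..., "options": ...}
            response = client.chat.completions.create(
                model=model,  # e.g. "gpt-4o"; Gemini would use Google's separate API
                messages=[
                    {"role": "system", "content": STANDARD_PROMPT},
                    {"role": "user", "content": q["stem"] + "\n" + q["options"]},
                ],
            )
            answers.append(response.choices[0].message.content.strip())
        all_runs.append(answers)
    return all_runs
```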
Results
GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.5%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo; P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001), while Gemini performed worse on image-based questions (P < .001 for all).
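The pairwise comparisons reported above rely on McNemar's test for paired proportions of correct responses. Below is a minimal sketch of that calculation with statsmodels; the per-question correctness vectors are toy data invented for the example, not the study's results.

```python
# Toy illustration of McNemar's test on paired per-question correctness
# (1 = correct, 0 = incorrect) for two models answering the same questions.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

model_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # toy data, not study results
model_b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])  # toy data, not study results

# 2x2 table of agreement/disagreement between the paired outcomes:
# rows = model A correct/incorrect, columns = model B correct/incorrect.
table = np.array([
    [np.sum((model_a == 1) & (model_b == 1)), np.sum((model_a == 1) & (model_b == 0))],
    [np.sum((model_a == 0) & (model_b == 1)), np.sum((model_a == 0) & (model_b == 0))],
])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"McNemar p-value: {result.pvalue:.3f}")
```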
Conclusion
GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
Journal overview
Journal of Nuclear Cardiology is the only journal in the world devoted to this dynamic and growing subspecialty. Physicians and technologists value the Journal not only for its peer-reviewed articles, but also for its timely discussions about the current and future role of nuclear cardiology. Original articles address all aspects of nuclear cardiology, including interpretation, diagnosis, imaging equipment, and use of radiopharmaceuticals. As the official publication of the American Society of Nuclear Cardiology, the Journal also brings readers the latest information emerging from the Society's task forces and publishes guidelines and position papers as they are adopted.