Comparing closed and open large language models on pediatric cardiology board exam performance
Nino Nikolovski, Conall T. Morgan, Michael N. Gritti
Annals of Pediatric Cardiology, 18(6):590-593, November 2025. DOI: 10.4103/apc.apc_301_25
Citations: 0
Abstract
Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4.0o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and discuss their educational and clinical utility. ChatGPT-4.0o and DeepSeek-R1 were used to answer 88 text-based multiple-choice questions across 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1's processing time per question was also measured. ChatGPT-4.0o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (p = 0.53). Subtopic accuracy was equal in 5 of 11 chapters, with each model outperforming its counterpart in 3 of 11. DeepSeek-R1's processing time was negatively correlated with accuracy (r = -0.68, p = 0.02). ChatGPT-4.0o and DeepSeek-R1 were comparable in accuracy and approached the passing threshold on a pediatric cardiology board examination. While further development of LLMs is required for clinical integration into pediatric cardiology, these findings suggest the potential utility of these models as educational aids.
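The abstract does not name the statistical test behind p = 0.53; a minimal sketch of one plausible approach, a two-sided two-proportion z-test on the reported counts (62/88 vs. 60/88), is shown below. The function name and the choice of test are illustrative assumptions, and the resulting p-value need not match the paper's exact figure.

```python
import math

# Reported results from the study: correct answers out of 88 questions each.
gpt_correct, deepseek_correct, n = 62, 60, 88

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided two-proportion z-test using the pooled standard error.

    Illustrative helper; the paper's actual test is not specified.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_ztest(gpt_correct, n, deepseek_correct, n)
print(f"ChatGPT-4.0o accuracy: {gpt_correct / n:.1%}")
print(f"DeepSeek-R1 accuracy:  {deepseek_correct / n:.1%}")
print(f"z = {z:.3f}, p = {p:.3f}")  # well above alpha = 0.05, consistent with "comparable accuracy"
```

With only a 2-question difference out of 88, any reasonable proportion test yields a non-significant result, consistent with the abstract's conclusion that the two models performed comparably.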