Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.

IF 0.9 | CAS Tier 4 (Medicine) | JCR Q3 SURGERY
American Surgeon · Pub Date: 2025-11-01 · Epub Date: 2025-05-12 · DOI: 10.1177/00031348251341956
Ahmet Necati Sanli, Deniz Esin Tekcan Sanli, Ali Karabulut
{"title":"Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.","authors":"Ahmet Necati Sanli, Deniz Esin Tekcan Sanli, Ali Karabulut","doi":"10.1177/00031348251341956","DOIUrl":null,"url":null,"abstract":"<p><p>ObjectiveThis study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).MethodsMultiple choice ABSITE Quiz was entered into the most popular LLMs as prompts. ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google) were used in the study. The research comprised 170 questions from 2017 to 2022, which were divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were queried in LLMs, between October 1, 2024, and October 5, 2024. Correct answer rates of LLMs were evaluated.ResultsThe correct response rates for all questions were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both LLMs (<i>P</i> < 0.001). In the definition category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower (<i>P</i> = 0.005 and <i>P</i> = 0.015, respectively). In the Biochemistry/Pharmaceutical question category, the correct response rates were equal in all three groups (83.3%). In the Case Scenario category, the correct response rates were 76.3% in ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (<i>P</i> < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini. Although Gemini had the lowest accuracy, there was no statistically significant difference (<i>P</i> = 0.236).ConclusionIn the ABSITE Quiz, ChatGPT and Copilot had similar success, whereas Gemini was significantly behind.</p>","PeriodicalId":7782,"journal":{"name":"American Surgeon","volume":" ","pages":"1923-1929"},"PeriodicalIF":0.9000,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Surgeon","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/00031348251341956","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/5/12 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"SURGERY","Score":null,"Total":0}
Citations: 0

Abstract

Objective: This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).

Methods: Multiple-choice ABSITE quiz questions were entered as prompts into the most popular LLMs: ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google). The study comprised 170 questions from 2017 to 2022, divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were queried between October 1 and October 5, 2024, and the correct answer rates of the LLMs were evaluated.

Results: The correct response rates across all questions were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower (P = 0.005 and P = 0.015, respectively). In the Biochemistry/Pharmaceutical category, the correct response rates were equal across all three models (83.3%). In the Case Scenario category, the correct response rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini; although Gemini had the lowest accuracy, the difference was not statistically significant (P = 0.236).

Conclusion: On the ABSITE quiz, ChatGPT and Copilot performed similarly, whereas Gemini lagged significantly behind.
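The Results section amounts to per-model accuracy counts and pairwise significance tests on correct/incorrect proportions. The article does not state which statistical test was used, so the sketch below is illustrative only: it assumes a chi-squared test on a 2×2 contingency table, and the `compare_models` helper and the rounded counts derived from the reported percentages are assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code): computing per-model accuracy
# and comparing two models' correct/incorrect counts with a chi-squared test.
from scipy.stats import chi2_contingency

def accuracy(graded):
    """Fraction of questions answered correctly; `graded` is a list of booleans."""
    return sum(graded) / len(graded)

def compare_models(correct_a, total_a, correct_b, total_b):
    """Chi-squared test of independence on a 2x2 correct/incorrect table."""
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

# Counts implied by the abstract's overall percentages (170 questions total):
# ChatGPT 79.4% ~ 135/170, Copilot 77.6% ~ 132/170, Gemini 52.9% ~ 90/170.
print(f"ChatGPT accuracy: {accuracy([True] * 135 + [False] * 35):.1%}")
print(f"ChatGPT vs Gemini: P = {compare_models(135, 170, 90, 170):.4f}")
print(f"Copilot vs Gemini: P = {compare_models(132, 170, 90, 170):.4f}")
```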

Source Journal

American Surgeon (Medicine - Surgery)
CiteScore: 1.40
Self-citation rate: 0.00%
Articles published: 623
Journal description: The American Surgeon is a monthly peer-reviewed publication published by the Southeastern Surgical Congress. Its area of concentration is clinical general surgery, as defined by the content areas of the American Board of Surgery: alimentary tract (including bariatric surgery), abdomen and its contents, breast, skin and soft tissue, endocrine system, solid organ transplantation, pediatric surgery, surgical critical care, surgical oncology (including head and neck surgery), trauma and emergency surgery, and vascular surgery.