Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT.
Ahmet Necati Sanli, Deniz Esin Tekcan Sanli, Ali Karabulut
American Surgeon, pages 1923-1929. Published 2025-11-01 (Epub 2025-05-12). DOI: 10.1177/00031348251341956
Citations: 0
Abstract
Objective: This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).
Methods: Multiple-choice ABSITE quiz questions were entered as prompts into three widely used LLMs: ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google). The study comprised 170 questions from 2017 to 2022, divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were queried between October 1 and October 5, 2024, and the correct-answer rates of the LLMs were compared.
Results: Across all questions, the correct response rates were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini again significantly lower (P = 0.005 and P = 0.015, respectively). In the Biochemistry/Pharmaceutical category, the correct response rates were equal across all three models (83.3%). In the Case Scenario category, the rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini; although Gemini had the lowest accuracy, the difference was not statistically significant (P = 0.236).
Conclusion: On the ABSITE quiz, ChatGPT and Copilot performed similarly, whereas Gemini lagged significantly behind.
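The abstract does not include code or name the statistical test behind the reported P values. As a minimal illustrative sketch, the Python snippet below tallies a model's accuracy and compares two models' correct-response proportions with a Fisher exact test on the 2x2 correct/incorrect table; the test choice and the counts (back-calculated from the reported overall percentages) are assumptions for illustration only, not the authors' method.

# Minimal sketch (not from the paper) of scoring LLM answers and comparing two
# models' correct-response proportions. Assumes answers are recorded as dicts
# mapping question ID -> chosen option letter.
from scipy.stats import fisher_exact

def accuracy(responses, answer_key):
    """Return (number correct, number answered) for one model."""
    correct = sum(1 for qid, choice in responses.items() if choice == answer_key[qid])
    return correct, len(responses)

def compare_models(correct_a, n_a, correct_b, n_b):
    """Two-sided Fisher exact test on the 2x2 correct/incorrect table of two models."""
    table = [[correct_a, n_a - correct_a],
             [correct_b, n_b - correct_b]]
    _, p_value = fisher_exact(table)
    return p_value

# Illustration with counts back-calculated from the abstract's overall results
# (170 questions): ChatGPT 79.4% is roughly 135/170, Gemini 52.9% roughly 90/170.
print(compare_models(135, 170, 90, 170))  # yields P < 0.001, consistent with the abstract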
Journal introduction:
The American Surgeon is a monthly peer-reviewed publication published by the Southeastern Surgical Congress. Its area of concentration is clinical general surgery, as defined by the content areas of the American Board of Surgery: alimentary tract (including bariatric surgery), abdomen and its contents, breast, skin and soft tissue, endocrine system, solid organ transplantation, pediatric surgery, surgical critical care, surgical oncology (including head and neck surgery), trauma and emergency surgery, and vascular surgery.