Comparative Performance Analysis of AI Engines in Answering American Board of Surgery In-Training Examination Questions: A Multi-Subspecialty Evaluation.

IF 1.6 | CAS Zone 4 (Medicine) | JCR Q3 (Surgery)
Nawaf AlShahwan, Ibrahim Majed Fetyani, Mohammed Basem Beyari, Saleh Husam Aldeligan, Maram Basem Beyari, Rayan Saleh Alshehri, Ahmed Alburakan, Hassan Mashbari, Abdulaziz AlKanhal, Thamer Nouh
{"title":"Comparative Performance Analysis of AI Engines in Answering American Board of Surgery In-Training Examination Questions: A Multi-Subspecialty Evaluation.","authors":"Nawaf AlShahwan, Ibrahim Majed Fetyani, Mohammed Basem Beyari, Saleh Husam Aldeligan, Maram Basem Beyari, Rayan Saleh Alshehri, Ahmed Alburakan, Hassan Mashbari, Abdulaziz AlKanhal, Thamer Nouh","doi":"10.1177/15533506251361664","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundThe rapid advancement of artificial intelligence (AI) has led to its increasing application in the medical field, particularly in providing accurate and reliable information for complex medical queries. PurposeThis study evaluates the performance of four AI engines-Perplexity, Chat GPT, DeepSeek, and Gemini in answering 100 multiple-choice questions derived from the American Board of Surgery In-Training Examination (ABSITE). A set of questions focused on five surgical subspecialties including colorectal surgery, acute care and trauma surgery (ACS), upper GI Surgery, breast and endocrine surgery, and hepatopancreatobiliary surgery (HPB).Data collectionWe evaluated these AI engines' ability to provide accurate and focused medical knowledge as the main objective. The research study consisting of a two-month duration was conducted from January 1, 2025, to March 28, 2025. All AI engines received identical questions through then a comparison between correct and wrong responses was performed relative to the ABSITE key answers. Each question was entered manually into the chatbots, ensuring no memory retention bias.Statistical analysisThe researchers conducted their statistical analysis with JASP software for performance evaluation between different subspecialties and AI engines through univariate and multivariate investigations.ResultsAmong the available AI tools, DeepSeek produced the most accurate responses at 74% while Chat GPT delivered 70% accuracy Gemini achieved 69% and Perplexity attained 65%. Results showed that Chat GPT achieved 83.3% accuracy in colorectal surgery yet DeepSeek scored the best at 84.6% and 67.6% for HPB Surgery and ACS respectively. Perplexity achieved a 100% accuracy rate in breast and endocrine surgery which proved to be the highest score recorded throughout the study. The analysis showed that Chat GPT exhibited performance variability between different Surgical subspecialties since it registered significant variations (<i>P</i> < .05), especially in acute care and trauma Surgery. The results of logistic regression indicated that Gemini along with Perplexity scored the most consistent answers among AI systems with a significant odds ratio of 2.5 (<i>P</i> < .01). 
AI engines show different combinations of precision and reliability when solving medical questions about surgery yet DeepSeek stands out by remaining the most reliable overall.ConclusionsMedical application AI models need additional development because performance results show major differences between medical specialties.</p>","PeriodicalId":22095,"journal":{"name":"Surgical Innovation","volume":" ","pages":"15533506251361664"},"PeriodicalIF":1.6000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Surgical Innovation","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/15533506251361664","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"SURGERY","Score":null,"Total":0}
Citations: 0

Abstract

Background: The rapid advancement of artificial intelligence (AI) has led to its increasing application in medicine, particularly in providing accurate and reliable information for complex medical queries.

Purpose: This study evaluates the performance of four AI engines (Perplexity, ChatGPT, DeepSeek, and Gemini) in answering 100 multiple-choice questions derived from the American Board of Surgery In-Training Examination (ABSITE). The questions covered five surgical subspecialties: colorectal surgery, acute care and trauma surgery (ACS), upper GI surgery, breast and endocrine surgery, and hepatopancreatobiliary (HPB) surgery.

Data collection: The main objective was to evaluate these AI engines' ability to provide accurate and focused medical knowledge. The study was conducted from January 1, 2025, to March 28, 2025. All AI engines received identical questions, and their responses were scored as correct or incorrect against the ABSITE answer key. Each question was entered manually into the chatbots to avoid memory-retention bias.

Statistical analysis: Statistical analysis was performed in JASP, comparing performance across subspecialties and AI engines with univariate and multivariate analyses.

Results: DeepSeek produced the most accurate responses at 74%, followed by ChatGPT at 70%, Gemini at 69%, and Perplexity at 65%. ChatGPT achieved 83.3% accuracy in colorectal surgery, while DeepSeek scored highest in HPB surgery and ACS at 84.6% and 67.6%, respectively. Perplexity achieved 100% accuracy in breast and endocrine surgery, the highest score recorded in the study. ChatGPT showed significant performance variability across surgical subspecialties (P < .05), especially in acute care and trauma surgery. Logistic regression indicated that Gemini and Perplexity gave the most consistent answers among the AI systems, with a significant odds ratio of 2.5 (P < .01). AI engines show different combinations of precision and reliability when answering surgical questions, with DeepSeek remaining the most reliable overall.

Conclusions: AI models for medical applications need further development, as performance differs substantially across specialties.
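The methods describe scoring each engine's answers against the ABSITE key and fitting a logistic regression in JASP to obtain odds ratios. As a rough illustration of that kind of analysis (not the authors' actual workflow, data, or software), the Python sketch below tabulates accuracy per engine and per subspecialty and fits a logistic regression of correctness on engine; the dataset, column names, and values are hypothetical.

```python
# Illustrative sketch only: the study used JASP and its per-question data are
# not public, so this hypothetical table just shows the shape of the analysis
# the abstract describes (per-engine accuracy and logistic-regression odds ratios).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per (engine, question) attempt;
# 'correct' is 1 if the engine's answer matched the ABSITE key, else 0.
df = pd.DataFrame({
    "engine": ["DeepSeek", "ChatGPT", "Gemini", "Perplexity"] * 3,
    "subspecialty": ["colorectal"] * 4 + ["HPB"] * 4 + ["ACS"] * 4,
    "correct": [1, 1, 1, 0,   1, 0, 0, 1,   0, 1, 0, 0],
})

# Overall accuracy per engine (the study reports 74%, 70%, 69%, 65%).
print(df.groupby("engine")["correct"].mean())

# Accuracy broken down by engine and subspecialty.
print(df.pivot_table(index="engine", columns="subspecialty",
                     values="correct", aggfunc="mean"))

# Logistic regression of correctness on engine; exponentiating the
# coefficients gives odds ratios relative to the reference engine.
fit = smf.logit("correct ~ C(engine)", data=df).fit(disp=False)
print(np.exp(fit.params).round(2))
```

With real per-question data (one row per engine-question pair across all 100 items), the same grouping and regression would yield the accuracy figures and odds ratios of the kind reported above.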

Source journal
Surgical Innovation (Medicine - Surgery)
CiteScore: 2.90
Self-citation rate: 0.00%
Articles published: 72
Review time: 6-12 weeks
About the journal: Surgical Innovation (SRI) is a peer-reviewed bi-monthly journal focusing on minimally invasive surgical techniques, new instruments such as laparoscopes and endoscopes, and new technologies. SRI prepares surgeons to think and work in "the operating room of the future" through learning new techniques, understanding and adapting to new technologies, maintaining surgical competencies, and applying surgical outcomes data to their practices. This journal is a member of the Committee on Publication Ethics (COPE).