Nawaf AlShahwan, Ibrahim Majed Fetyani, Mohammed Basem Beyari, Saleh Husam Aldeligan, Maram Basem Beyari, Rayan Saleh Alshehri, Ahmed Alburakan, Hassan Mashbari, Abdulaziz AlKanhal, Thamer Nouh
{"title":"人工智能引擎在回答美国外科培训考试问题中的比较性能分析:一个多亚专业评估。","authors":"Nawaf AlShahwan, Ibrahim Majed Fetyani, Mohammed Basem Beyari, Saleh Husam Aldeligan, Maram Basem Beyari, Rayan Saleh Alshehri, Ahmed Alburakan, Hassan Mashbari, Abdulaziz AlKanhal, Thamer Nouh","doi":"10.1177/15533506251361664","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundThe rapid advancement of artificial intelligence (AI) has led to its increasing application in the medical field, particularly in providing accurate and reliable information for complex medical queries. PurposeThis study evaluates the performance of four AI engines-Perplexity, Chat GPT, DeepSeek, and Gemini in answering 100 multiple-choice questions derived from the American Board of Surgery In-Training Examination (ABSITE). A set of questions focused on five surgical subspecialties including colorectal surgery, acute care and trauma surgery (ACS), upper GI Surgery, breast and endocrine surgery, and hepatopancreatobiliary surgery (HPB).Data collectionWe evaluated these AI engines' ability to provide accurate and focused medical knowledge as the main objective. The research study consisting of a two-month duration was conducted from January 1, 2025, to March 28, 2025. All AI engines received identical questions through then a comparison between correct and wrong responses was performed relative to the ABSITE key answers. Each question was entered manually into the chatbots, ensuring no memory retention bias.Statistical analysisThe researchers conducted their statistical analysis with JASP software for performance evaluation between different subspecialties and AI engines through univariate and multivariate investigations.ResultsAmong the available AI tools, DeepSeek produced the most accurate responses at 74% while Chat GPT delivered 70% accuracy Gemini achieved 69% and Perplexity attained 65%. Results showed that Chat GPT achieved 83.3% accuracy in colorectal surgery yet DeepSeek scored the best at 84.6% and 67.6% for HPB Surgery and ACS respectively. Perplexity achieved a 100% accuracy rate in breast and endocrine surgery which proved to be the highest score recorded throughout the study. The analysis showed that Chat GPT exhibited performance variability between different Surgical subspecialties since it registered significant variations (<i>P</i> < .05), especially in acute care and trauma Surgery. The results of logistic regression indicated that Gemini along with Perplexity scored the most consistent answers among AI systems with a significant odds ratio of 2.5 (<i>P</i> < .01). 
AI engines show different combinations of precision and reliability when solving medical questions about surgery yet DeepSeek stands out by remaining the most reliable overall.ConclusionsMedical application AI models need additional development because performance results show major differences between medical specialties.</p>","PeriodicalId":22095,"journal":{"name":"Surgical Innovation","volume":" ","pages":"15533506251361664"},"PeriodicalIF":1.6000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparative Performance Analysis of AI Engines in Answering American Board of Surgery In-Training Examination Questions: A Multi-Subspecialty Evaluation.\",\"authors\":\"Nawaf AlShahwan, Ibrahim Majed Fetyani, Mohammed Basem Beyari, Saleh Husam Aldeligan, Maram Basem Beyari, Rayan Saleh Alshehri, Ahmed Alburakan, Hassan Mashbari, Abdulaziz AlKanhal, Thamer Nouh\",\"doi\":\"10.1177/15533506251361664\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>BackgroundThe rapid advancement of artificial intelligence (AI) has led to its increasing application in the medical field, particularly in providing accurate and reliable information for complex medical queries. PurposeThis study evaluates the performance of four AI engines-Perplexity, Chat GPT, DeepSeek, and Gemini in answering 100 multiple-choice questions derived from the American Board of Surgery In-Training Examination (ABSITE). A set of questions focused on five surgical subspecialties including colorectal surgery, acute care and trauma surgery (ACS), upper GI Surgery, breast and endocrine surgery, and hepatopancreatobiliary surgery (HPB).Data collectionWe evaluated these AI engines' ability to provide accurate and focused medical knowledge as the main objective. The research study consisting of a two-month duration was conducted from January 1, 2025, to March 28, 2025. All AI engines received identical questions through then a comparison between correct and wrong responses was performed relative to the ABSITE key answers. Each question was entered manually into the chatbots, ensuring no memory retention bias.Statistical analysisThe researchers conducted their statistical analysis with JASP software for performance evaluation between different subspecialties and AI engines through univariate and multivariate investigations.ResultsAmong the available AI tools, DeepSeek produced the most accurate responses at 74% while Chat GPT delivered 70% accuracy Gemini achieved 69% and Perplexity attained 65%. Results showed that Chat GPT achieved 83.3% accuracy in colorectal surgery yet DeepSeek scored the best at 84.6% and 67.6% for HPB Surgery and ACS respectively. Perplexity achieved a 100% accuracy rate in breast and endocrine surgery which proved to be the highest score recorded throughout the study. The analysis showed that Chat GPT exhibited performance variability between different Surgical subspecialties since it registered significant variations (<i>P</i> < .05), especially in acute care and trauma Surgery. The results of logistic regression indicated that Gemini along with Perplexity scored the most consistent answers among AI systems with a significant odds ratio of 2.5 (<i>P</i> < .01). 
AI engines show different combinations of precision and reliability when solving medical questions about surgery yet DeepSeek stands out by remaining the most reliable overall.ConclusionsMedical application AI models need additional development because performance results show major differences between medical specialties.</p>\",\"PeriodicalId\":22095,\"journal\":{\"name\":\"Surgical Innovation\",\"volume\":\" \",\"pages\":\"15533506251361664\"},\"PeriodicalIF\":1.6000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Surgical Innovation\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/15533506251361664\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"SURGERY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Surgical Innovation","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/15533506251361664","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"SURGERY","Score":null,"Total":0}
Comparative Performance Analysis of AI Engines in Answering American Board of Surgery In-Training Examination Questions: A Multi-Subspecialty Evaluation.
Background: The rapid advancement of artificial intelligence (AI) has led to its increasing application in medicine, particularly in providing accurate and reliable answers to complex medical queries.

Purpose: This study evaluates the performance of four AI engines (Perplexity, ChatGPT, DeepSeek, and Gemini) in answering 100 multiple-choice questions derived from the American Board of Surgery In-Training Examination (ABSITE). The questions covered five surgical subspecialties: colorectal surgery, acute care and trauma surgery (ACS), upper GI surgery, breast and endocrine surgery, and hepatopancreatobiliary (HPB) surgery.

Data collection: The main objective was to evaluate the engines' ability to provide accurate and focused medical knowledge. The study was conducted from January 1, 2025, to March 28, 2025. All AI engines received identical questions, and responses were scored as correct or incorrect against the ABSITE answer key. Each question was entered manually into the chatbots to avoid memory-retention bias.

Statistical analysis: Statistical analysis was performed in JASP, comparing performance across subspecialties and AI engines through univariate and multivariate analyses.

Results: DeepSeek produced the most accurate responses (74%), followed by ChatGPT (70%), Gemini (69%), and Perplexity (65%). ChatGPT achieved 83.3% accuracy in colorectal surgery, while DeepSeek scored highest in HPB surgery (84.6%) and ACS (67.6%). Perplexity achieved 100% accuracy in breast and endocrine surgery, the highest score recorded in the study. ChatGPT showed significant performance variability across subspecialties (P < .05), especially in acute care and trauma surgery. Logistic regression indicated that Gemini and Perplexity gave the most consistent answers among the AI systems, with a significant odds ratio of 2.5 (P < .01). Overall, the engines showed differing combinations of precision and reliability on surgical board questions, with DeepSeek the most reliable.

Conclusions: AI models for medical applications need further development, as performance differs substantially across medical specialties.
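The abstract describes the analysis only at a high level (accuracy comparisons plus univariate/multivariate tests in JASP). As a rough illustration of that workflow, the sketch below simulates per-engine grading data at the reported headline accuracies and fits a logistic regression of correctness on engine, whose exponentiated coefficients are odds ratios. This is not the authors' JASP analysis; the per-question data are hypothetical placeholders, and only the engine names and overall accuracies come from the abstract.

```python
# Minimal sketch of an accuracy comparison and engine-level logistic
# regression, assuming simulated 0/1 grading data (not the study's data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Reported overall accuracies from the abstract.
engines = {"DeepSeek": 0.74, "ChatGPT": 0.70, "Gemini": 0.69, "Perplexity": 0.65}

# Simulate 100 graded responses per engine at those accuracies.
rows = [
    {"engine": name, "correct": int(rng.random() < acc)}
    for name, acc in engines.items()
    for _ in range(100)
]
df = pd.DataFrame(rows)

# Per-engine accuracy, mirroring the headline percentages.
print(df.groupby("engine")["correct"].mean().round(2))

# Logistic regression of correctness on engine; exponentiated
# coefficients are odds ratios relative to the baseline engine.
res = smf.logit("correct ~ C(engine)", data=df).fit(disp=False)
print(np.exp(res.params).round(2))  # odds ratios
print(res.pvalues.round(3))         # per-coefficient P-values
```

With real per-question grading data in place of the simulation, the same model would yield the kind of engine-consistency odds ratios the abstract reports.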
About the journal:
Surgical Innovation (SRI) is a peer-reviewed bi-monthly journal focusing on minimally invasive surgical techniques, new instruments such as laparoscopes and endoscopes, and new technologies. SRI prepares surgeons to think and work in "the operating room of the future" through learning new techniques, understanding and adapting to new technologies, maintaining surgical competencies, and applying surgical outcomes data to their practices. This journal is a member of the Committee on Publication Ethics (COPE).