Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies

IF 1.9 · Zone 3 (Medicine) · Q3 · CLINICAL NEUROLOGY
Mahmut Çamlar, Umut Tan Sevgi, Gökberk Erol, Furkan Karakaş, Yücel Doğruel, Abuzer Güngör
DOI: 10.1007/s00701-025-06628-y
Journal: Acta Neurochirurgica, Volume 167, Issue 1
Published: 2025-09-09 (Journal Article)
Full text: https://link.springer.com/article/10.1007/s00701-025-06628-y
PDF: https://link.springer.com/content/pdf/10.1007/s00701-025-06628-y.pdf
Citations: 0

Abstract

Background

Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.

Methods

A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen’s kappa.
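The two statistics named above can be computed directly from per-question results. The following is a minimal sketch, not the authors' analysis code: it assumes each model's answers are available as a list of per-question labels, and implements the two-proportion z-test for comparing correctness rates and Cohen's kappa for inter-model agreement.

```python
from math import sqrt

def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """Two-proportion z-test statistic for comparing correctness rates
    of two models (pooled standard error)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two models'
    answer labels on the same question set."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Expected agreement under independence of the two label distributions.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

In practice one would use an off-the-shelf implementation such as `statsmodels.stats.proportion.proportions_ztest` and `sklearn.metrics.cohen_kappa_score`, which handle edge cases (e.g. perfect expected agreement) more robustly.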

Results

ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer–Recall of Knowledge (SBA-R), Single Best Answer–Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items.

Conclusion

Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.

Source journal: Acta Neurochirurgica (Medicine – Clinical Neurology)
CiteScore: 4.40
Self-citation rate: 4.20%
Articles per year: 342
Review time: 1 month
Journal description: The journal Acta Neurochirurgica publishes only original papers useful to both research and clinical work. Papers should deal with clinical neurosurgery – diagnosis and diagnostic techniques, operative surgery and results, postoperative treatment – or with research work in neuroscience if the underlying questions or the results are of neurosurgical interest. Reports on congresses are given in brief accounts. As the official organ of the European Association of Neurosurgical Societies, the journal publishes all announcements of the E.A.N.S. and reports on the activities of its member societies. Only contributions written in English will be accepted.