Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies
Mahmut Çamlar, Umut Tan Sevgi, Gökberk Erol, Furkan Karakaş, Yücel Doğruel, Abuzer Güngör
{"title":"Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies","authors":"Mahmut Çamlar, Umut Tan Sevgi, Gökberk Erol, Furkan Karakaş, Yücel Doğruel, Abuzer Güngör","doi":"10.1007/s00701-025-06628-y","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><p>Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.</p><h3>Methods</h3><p>A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen’s kappa.</p><h3>Results</h3><p>ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer–Recall of Knowledge (SBA-R), Single Best Answer–Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items.</p><h3>Conclusion</h3><p>Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.</p></div>","PeriodicalId":7370,"journal":{"name":"Acta Neurochirurgica","volume":"167 1","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s00701-025-06628-y.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Acta Neurochirurgica","FirstCategoryId":"3","ListUrlMain":"https://link.springer.com/article/10.1007/s00701-025-06628-y","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background
Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.
Methods
A total of 599 multiple-choice questions drawn from Turkish Board examinations and an English neurosurgery question bank were presented to six LLMs (ChatGPT-o1pro, ChatGPT-4, AtlasGPT, Gemini, Copilot, and ChatGPT-3.5). Correctness rates were compared using the proportion z-test, and inter-model consistency was examined using Cohen’s kappa.
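As a rough illustration of the statistical approach described here (not the authors' actual analysis code), the sketch below compares the correctness rates of two hypothetical models with a two-proportion z-test and measures their per-question agreement with Cohen's kappa. The answer vectors, accuracy levels, and model labels are assumptions made only for the example.

```python
# Illustrative sketch, assuming hypothetical per-question results for two models.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_questions = 599  # total number of items reported in the study

# Hypothetical per-question correctness (1 = correct, 0 = incorrect)
model_a = rng.binomial(1, 0.80, n_questions)
model_b = rng.binomial(1, 0.72, n_questions)

# Two-proportion z-test comparing overall correctness rates
counts = np.array([model_a.sum(), model_b.sum()])
nobs = np.array([n_questions, n_questions])
z_stat, p_value = proportions_ztest(counts, nobs)
print(f"accuracy A = {counts[0] / n_questions:.2%}, "
      f"accuracy B = {counts[1] / n_questions:.2%}, "
      f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Cohen's kappa: agreement between the two models' per-question outcomes,
# corrected for the agreement expected by chance
kappa = cohen_kappa_score(model_a, model_b)
print(f"Cohen's kappa = {kappa:.2f}")
```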
Results
ChatGPT-o1pro, ChatGPT-4, and AtlasGPT demonstrated relatively high accuracy for Single Best Answer–Recall of Knowledge (SBA-R), Single Best Answer–Interpretative Application of Knowledge (SBA-I), and True/False question types; however, performance notably decreased for questions with images, with some models leaving many unanswered items.
Conclusion
Our findings suggest that GPT-4-based models and AtlasGPT can handle specialized neurosurgery questions at a near-expert level for SBA-R, SBA-I, and True/False formats. Nevertheless, all models exhibit notable limitations in questions with images, indicating that these tools remain supplementary rather than definitive solutions for neurosurgical training and decision-making.
Journal description
The journal "Acta Neurochirurgica" publishes only original papers useful both to research and clinical work. Papers should deal with clinical neurosurgery - diagnosis and diagnostic techniques, operative surgery and results, postoperative treatment - or with research work in neuroscience if the underlying questions or the results are of neurosurgical interest. Reports on congresses are given in brief accounts. As official organ of the European Association of Neurosurgical Societies the journal publishes all announcements of the E.A.N.S. and reports on the activities of its member societies. Only contributions written in English will be accepted.