Rohaid Ali, Hael F Abdulrazeq, Advait Patil, Morgan Cheatham, Ian D Connolly, Oliver Y Tang, Cody A Doberstein, Tori Riccelli, Kevin T Huang, Ganesh M Shankar, Theresa Williamson, John H Shin, Bob Carter, Radmehr Torabi, Christine K Lee, Deus Cielo, Albert E Telfeian, Ziya L Gokaslan, Aaron A Cohen-Gadol, James Zou, Wael F Asaad
AtlasGPT: a language model grounded in neurosurgery with domain-specific data and document retrieval.
Journal of Neurosurgery, pages 1-8. Published 2025-04-18. Publication type: Journal Article.
DOI: 10.3171/2024.12.JNS241607 (https://doi.org/10.3171/2024.12.JNS241607)
Citation count: 0
Abstract
Objective: Large language models (LLMs) have shown promising performance on medical licensing examinations, but their ability to excel in subspecialty domains and their robustness under adversarial conditions remain unclear. Herein, the authors present AtlasGPT, a subspecialty-focused LLM for neurosurgery, and evaluate its performance on a benchmark multiple-choice question bank and under adversarial testing, as well as its ability to generate high-quality explanations.
Methods: AtlasGPT was built by combining fine-tuning of the GPT-4 architecture with retrieval-augmented generation from neurosurgical knowledge sources. Its performance was compared with that of GPT-4 and Gemini Advanced on a 149-question neurosurgery examination. Adversarial testing assessed robustness to misinformation. Answer explanations were rated by 15 independent neurosurgeons and compared with the question bank's explanations.
Results: Across all 149 questions and on text-only questions, AtlasGPT (96%) outperformed Gemini Advanced (93%) and GPT-4 (88%) in accuracy. In adversarial testing, under which AtlasGPT was tasked with identifying medical misinformation, it was fooled 14% of the time, compared with 44% for GPT-4 and 68% for Gemini Advanced. Neurosurgeons rated AtlasGPT's answer explanations as significantly more comprehensive, relevant, and better referenced than the question bank's explanations of the responses (p < 0.001). AtlasGPT did not demonstrate any evidence of hallucination or other content that would be harmful for patient care or the surgeon's clinical decision.
Conclusions: AtlasGPT demonstrates the potential of subspecialty-focused LLMs to outperform general models, exhibit robustness to misinformation, and generate high-quality explanations. Domain-specific LLMs may improve medical knowledge, decision-making, and educational materials in complex fields like neurosurgery.
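The Methods mention retrieval-augmented generation without describing how it was implemented. The sketch below is only an illustration of the general idea, not the authors' pipeline: it ranks a few passages against a question with a simple bag-of-words cosine similarity and places the best matches in a prompt. The passages, the question, and the function names (`retrieve`, `build_prompt`) are hypothetical placeholders, and the similarity measure is an assumption chosen for simplicity; a real system would likely use learned embeddings and would then send the prompt to a language model.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder passages; not taken from any real knowledge source.
PASSAGES = [
    "placeholder note on aneurysm clipping and temporary vessel occlusion",
    "placeholder note on lumbar disc herniation and microdiscectomy",
    "placeholder note on glioma resection and awake mapping",
]
QUESTION = "Which steps matter during aneurysm clipping?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    # Stable sort: passages with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that grounds the answer in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt(QUESTION, PASSAGES, k=1))
```

The point of the design is that the model sees the retrieved source text alongside the question, which is one plausible reason a retrieval-grounded model could produce better-referenced explanations than a general model.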
About the journal:
The Journal of Neurosurgery, Journal of Neurosurgery: Spine, Journal of Neurosurgery: Pediatrics, and Neurosurgical Focus are devoted to the publication of original works relating primarily to neurosurgery, including studies in clinical neurophysiology, organic neurology, ophthalmology, radiology, pathology, and molecular biology. The Editors and Editorial Boards encourage submission of clinical and laboratory studies. Other manuscripts accepted for review include technical notes on instruments or equipment that are innovative or useful to clinicians and researchers in the field of neuroscience; papers describing unusual cases; manuscripts on historical persons or events related to neurosurgery; and in Neurosurgical Focus, occasional reviews. Letters to the Editor commenting on articles recently published in the Journal of Neurosurgery, Journal of Neurosurgery: Spine, and Journal of Neurosurgery: Pediatrics are welcome.