Large language models in neurosurgery: a systematic review and meta-analysis

Impact factor 1.9 · CAS Zone 3 (Medicine) · JCR Q3 (Clinical Neurology)
Advait Patil, Paul Serrato, Nathan Chisvo, Omar Arnaout, Pokmeng Alfred See, Kevin T. Huang
Acta Neurochirurgica, 166(1). Journal Article. Published 2024-11-23.
DOI: 10.1007/s00701-024-06372-9
https://link.springer.com/article/10.1007/s00701-024-06372-9
Citation count: 0

Abstract

Background

Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature.

Methods

We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery (“large language model” OR “LLM” OR “ChatGPT” OR “GPT-3” OR “GPT3” OR “GPT-3.5” OR “GPT3.5” OR “GPT-4” OR “GPT4” OR “LLAMA” OR “MISTRAL” OR “BARD”) AND “neurosurgery”. The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures.
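
The Boolean search described above can be assembled programmatically, which keeps the term list in one place and makes the query easy to report verbatim. The following is a minimal sketch only: the term list is copied from the Methods text, but the function name `build_query` is hypothetical, and the abstract does not state whether field tags, date limits, or other filters were applied, so none are included here.

```python
# LLM-related search terms exactly as listed in the Methods paragraph.
LLM_TERMS = [
    "large language model", "LLM", "ChatGPT",
    "GPT-3", "GPT3", "GPT-3.5", "GPT3.5",
    "GPT-4", "GPT4", "LLAMA", "MISTRAL", "BARD",
]
DOMAIN_TERM = "neurosurgery"


def build_query(llm_terms, domain_term):
    """Hypothetical helper: OR the LLM terms together, then AND with the domain term."""
    or_group = " OR ".join(f'"{term}"' for term in llm_terms)
    return f'({or_group}) AND "{domain_term}"'


query = build_query(LLM_TERMS, DOMAIN_TERM)
print(query)
```

The resulting string could be pasted into the PubMed or Google Scholar search box; how the authors actually submitted the query is not specified in the abstract.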

Results

Fifty-one articles met the inclusion criteria and were categorized into six application areas, the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning.
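
The reported percentages can be checked against the counts with a few lines of arithmetic. This is a sketch under the assumption that every percentage uses the 51 included articles as its denominator; the dictionary labels and the helper name `percent` are illustrative, not taken from the paper.

```python
TOTAL_ARTICLES = 51  # articles meeting inclusion criteria

# Counts as reported in the Results paragraph.
counts = {
    "Generation of Text for Direct Clinical Use": 14,
    "Answering Standardized Exam Questions": 12,
    "Clinical Judgement and Decision-Making Support": 11,
    "GPT-3.5": 30,
    "GPT-4": 20,
    "Bard": 9,
    "Bing": 6,
    "Out-of-the-box use": 43,
    "Pre-training or fine-tuning": 8,
}


def percent(n, total=TOTAL_ARTICLES):
    """Share of the included articles, rounded to one decimal place."""
    return round(100 * n / total, 1)


for label, n in counts.items():
    print(f"{label}: n = {n}, {percent(n)}%")
```

Each recomputed value matches the figure given in the abstract. The four model counts sum to more than 51, which suggests that some studies evaluated more than one model, while the out-of-the-box and fine-tuning counts (43 + 8) account for all 51 articles.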

Conclusions

Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research to date typically addresses basic applications, overlooks methods for enhancing LLM performance, and faces reproducibility issues. Standardizing detailed reporting, considering LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.

Source journal

Acta Neurochirurgica (Medicine, Clinical Neurology). CiteScore: 4.40. Self-citation rate: 4.20%. Publication volume: 342. Review time: 1 month.

Journal description

The journal "Acta Neurochirurgica" publishes only original papers useful to both research and clinical work. Papers should deal with clinical neurosurgery (diagnosis and diagnostic techniques, operative surgery and results, postoperative treatment) or with research work in neuroscience if the underlying questions or the results are of neurosurgical interest. Reports on congresses are given in brief accounts. As the official organ of the European Association of Neurosurgical Societies, the journal publishes all announcements of the E.A.N.S. and reports on the activities of its member societies. Only contributions written in English will be accepted.