Large language models in neurosurgery: a systematic review and meta-analysis

IF 1.9 | Zone 3, Medicine | JCR Q3, Clinical Neurology

Acta Neurochirurgica | Pub Date: 2024-11-23 | DOI: 10.1007/s00701-024-06372-9

Advait Patil, Paul Serrato, Nathan Chisvo, Omar Arnaout, Pokmeng Alfred See, Kevin T. Huang


Citation count: 0



Large language models in neurosurgery: a systematic review and meta-analysis

Background

Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature.

Methods

We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery (“large language model” OR “LLM” OR “ChatGPT” OR “GPT-3” OR “GPT3” OR “GPT-3.5” OR “GPT3.5” OR “GPT-4” OR “GPT4” OR “LLAMA” OR “MISTRAL” OR “BARD”) AND “neurosurgery”. The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures.
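The Boolean search string described above can be assembled programmatically, which makes it easier to report and rerun exactly. The following is a minimal sketch only: the term list is copied from the Methods paragraph, while the helper name `build_query` is hypothetical and the source does not say how the authors actually submitted the query to PubMed or Google Scholar.

```python
# Search terms exactly as listed in the Methods paragraph.
LLM_TERMS = [
    "large language model", "LLM", "ChatGPT",
    "GPT-3", "GPT3", "GPT-3.5", "GPT3.5",
    "GPT-4", "GPT4", "LLAMA", "MISTRAL", "BARD",
]
DOMAIN_TERM = "neurosurgery"


def build_query(llm_terms, domain_term):
    """Join the LLM terms with OR, then AND the group with the domain term.

    Hypothetical helper for illustration; not taken from the article.
    """
    or_group = " OR ".join(f'"{term}"' for term in llm_terms)
    return f'({or_group}) AND "{domain_term}"'


query = build_query(LLM_TERMS, DOMAIN_TERM)
print(query)
```

Keeping the terms in a list rather than a hand-typed string makes it simple to log the exact query alongside the search date, which is in the spirit of the reproducibility concerns raised in the abstract.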

Results

Fifty-one articles met the inclusion criteria and were categorized into six application areas, with the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning.
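The percentages reported above all follow from dividing each count by the 51 included articles. As a quick arithmetic check, the sketch below recomputes them; the counts and percentages are taken from the Results paragraph, while the dictionary layout and the `pct` helper are illustrative assumptions. Note that the per-model counts sum to more than 51, presumably because a single study can evaluate several LLMs.

```python
TOTAL_ARTICLES = 51  # articles meeting the inclusion criteria

# (count, reported percentage) pairs copied from the Results paragraph.
reported = {
    "Generation of Text for Direct Clinical Use": (14, 27.5),
    "Answering Standardized Exam Questions": (12, 23.5),
    "Clinical Judgement and Decision-Making Support": (11, 21.6),
    "GPT-3.5": (30, 58.8),
    "GPT-4": (20, 39.2),
    "Bard": (9, 17.6),
    "Bing": (6, 11.8),
    "Out-of-the-box use": (43, 84.3),
    "Pre-training or fine-tuning": (8, 15.7),
}


def pct(count, total=TOTAL_ARTICLES):
    """Share of the included articles, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


# Compare each recomputed percentage with the reported one.
mismatches = {
    name: (pct(count), stated)
    for name, (count, stated) in reported.items()
    if pct(count) != stated
}
print(mismatches or "all reported percentages match n / 51")
```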

Conclusions

Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research to date typically addresses basic applications, overlooks methods for enhancing LLM performance, and faces reproducibility issues. Standardizing detailed reporting, accounting for LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.


Source journal

Acta Neurochirurgica (Medicine: Clinical Neurology)

CiteScore: 4.40

Self-citation rate: 4.20%

Articles published: 342

Review time: 1 month

Journal description: The journal "Acta Neurochirurgica" publishes only original papers useful both to research and clinical work. Papers should deal with clinical neurosurgery (diagnosis and diagnostic techniques, operative surgery and results, postoperative treatment) or with research work in neuroscience if the underlying questions or the results are of neurosurgical interest. Reports on congresses are given in brief accounts. As the official organ of the European Association of Neurosurgical Societies, the journal publishes all announcements of the E.A.N.S. and reports on the activities of its member societies. Only contributions written in English will be accepted.