The role of generative artificial intelligence in deciding fusion treatment of lumbar degeneration: a comparative analysis and narrative review.

Impact Factor 2.6 · CAS Region 3 (Medicine) · JCR Q2 (Clinical Neurology)
Taha M Taka, Christopher E Collins, Andrew Miner, Isaac Overfield, David Shin, Lauren Seo, Olumide Danisa
European Spine Journal · DOI: 10.1007/s00586-025-09052-z · Published 2025-06-25 · Journal Article · Citations: 0

Abstract

Purpose: This study analyzed responses and readability of generative artificial intelligence (AI) models to questions and recommendations from the 2014 Journal of Neurosurgery: Spine (JNS) guidelines for fusion procedures in the treatment of degenerative lumbar spine disease.

Methods: Twenty-four questions were generated from the JNS guidelines and posed to ChatGPT 4o, Perplexity, Microsoft Copilot, and Gemini. Answers were classified as "concordant" if the response highlighted all points from the JNS guidelines; otherwise, answers were considered "non-concordant" and further sub-categorized as either "insufficient" or "over-conclusive." Responses were evaluated for readability via the Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease test.

Results: ChatGPT 4o had the highest concordance rate at 66.67%, with non-concordant responses distributed at 16.67% for both insufficient and over-conclusive classifications. Perplexity displayed a 58.33% concordance rate, with 25% insufficient and 16.67% over-conclusive responses. Copilot showed 50% concordance, with 37.5% over-conclusive and 16.67% insufficient responses. Gemini demonstrated 54.17% concordance, with 20.83% insufficient and 25% over-conclusive responses. The Flesch-Kincaid Grade Level scores ranged from 14.03 (Copilot) to 15.66 (Perplexity). The Gunning Fog Index scores varied between 15.15 (Copilot) and 18.13 (Perplexity). The SMOG Index scores ranged from 14.69 (Copilot) to 16.49 (Perplexity). The Flesch Reading Ease scores were low across all models, with Copilot showing the highest score of 20.71.

Conclusions: ChatGPT 4o emerged as the best-performing model in terms of concordance, while Perplexity displayed the highest complexity in text readability. AI can be a valuable adjunct in clinical decision-making but cannot replace clinician judgment.
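The four readability measures reported in the Results all have standard published formulas. A minimal Python sketch of how such scores are computed is below; the syllable counter is a rough vowel-group heuristic (an assumption of this sketch, not the study's method), so exact values will differ from the dedicated tools the authors presumably used.

```python
import re
import math

def count_syllables(word):
    # Heuristic: count groups of consecutive vowels; drop one for a trailing
    # silent 'e'. Real readability tools use dictionaries or better rules.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" / polysyllabic words: three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    W, S = len(words), len(sentences)
    return {
        # Flesch-Kincaid Grade Level: U.S. school grade of the text.
        "fkgl": 0.39 * W / S + 11.8 * syllables / W - 15.59,
        # Gunning Fog Index: years of education needed on first reading.
        "fog": 0.4 * (W / S + 100 * complex_words / W),
        # SMOG Index (formally defined for samples of >= 30 sentences).
        "smog": 1.0430 * math.sqrt(polys := complex_words * 30 / S) + 3.1291
                if False else 1.0430 * math.sqrt(complex_words * 30 / S) + 3.1291,
        # Flesch Reading Ease: lower scores mean harder text.
        "fre": 206.835 - 1.015 * W / S - 84.6 * syllables / W,
    }
```

Grade-level scores of 14-18, as reported for all four models, correspond to college- and graduate-level text, consistent with the uniformly low Flesch Reading Ease scores.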

Source journal
European Spine Journal (Medicine – Clinical Neurology)
CiteScore: 4.80
Self-citation rate: 10.70%
Articles per year: 373
Review time: 2-4 weeks
Journal description: "European Spine Journal" is a publication founded in response to the increasing trend toward specialization in spinal surgery and spinal pathology in general. The Journal is devoted to all spine-related disciplines, including functional and surgical anatomy of the spine, biomechanics and pathophysiology, diagnostic procedures, and neurology, surgery and outcomes. The aim of "European Spine Journal" is to support the further development of highly innovative spine treatments, including but not restricted to surgery, and to provide an integrated and balanced view of diagnostic, research and treatment procedures, as well as outcomes, that will enhance effective collaboration among specialists worldwide. The "European Spine Journal" also participates in education by means of videos, interactive meetings and the endorsement of educative efforts. Official publication of EUROSPINE, the Spine Society of Europe.