The role of generative artificial intelligence in deciding fusion treatment of lumbar degeneration: a comparative analysis and narrative review.

Impact Factor 2.6 · CAS Region 3 (Medicine) · JCR Q2 (Clinical Neurology)
Taha M Taka, Christopher E Collins, Andrew Miner, Isaac Overfield, David Shin, Lauren Seo, Olumide Danisa
European Spine Journal · DOI: 10.1007/s00586-025-09052-z · Published 2025-06-25 · Journal Article · Citations: 0

Abstract

Purpose: This study analyzed responses and readability of generative artificial intelligence (AI) models to questions and recommendations from the 2014 Journal of Neurosurgery: Spine (JNS) guidelines for fusion procedures in the treatment of degenerative lumbar spine disease.

Methods: Twenty-four questions were generated from the JNS guidelines and posed to ChatGPT 4o, Perplexity, Microsoft Copilot, and Gemini. Answers were classified as "concordant" if the response highlighted all points from the JNS guidelines; otherwise, answers were considered "non-concordant" and further sub-categorized as either "insufficient" or "over-conclusive." Responses were evaluated for readability via the Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease test.

Results: ChatGPT 4o had the highest concordance rate at 66.67%, with non-concordant responses distributed at 16.67% for both insufficient and over-conclusive classifications. Perplexity displayed a 58.33% concordance rate, with 25% insufficient and 16.67% over-conclusive responses. Copilot showed 50% concordance, with 37.5% over-conclusive and 16.67% insufficient responses. Gemini demonstrated 54.17% concordance, with 20.83% insufficient and 25% over-conclusive responses. The Flesch-Kincaid Grade Level scores ranged from 14.03 (Copilot) to 15.66 (Perplexity). The Gunning Fog Index scores varied between 15.15 (Copilot) and 18.13 (Perplexity). The SMOG Index scores ranged from 14.69 (Copilot) to 16.49 (Perplexity). The Flesch Reading Ease scores were low across all models, with Copilot showing the highest score of 20.71.

Conclusions: ChatGPT 4o emerged as the best-performing model in terms of concordance, while Perplexity displayed the highest complexity in text readability. AI can be a valuable adjunct in clinical decision-making but cannot replace clinician judgment.
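The four readability measures reported in the Results all have standard published formulas. A minimal Python sketch of how such scores are computed is below; the syllable counter is a rough vowel-group heuristic (an assumption of this sketch, not the study's method), so exact values will differ from the dedicated tools the authors presumably used.

```python
import re
import math

def count_syllables(word):
    # Heuristic: count groups of consecutive vowels; drop one for a trailing
    # silent 'e'. Real readability tools use dictionaries or better rules.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" / polysyllabic words: three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    W, S = len(words), len(sentences)
    return {
        # Flesch-Kincaid Grade Level: U.S. school grade of the text.
        "fkgl": 0.39 * W / S + 11.8 * syllables / W - 15.59,
        # Gunning Fog Index: years of education needed on first reading.
        "fog": 0.4 * (W / S + 100 * complex_words / W),
        # SMOG Index (formally defined for samples of >= 30 sentences).
        "smog": 1.0430 * math.sqrt(polys := complex_words * 30 / S) + 3.1291
                if False else 1.0430 * math.sqrt(complex_words * 30 / S) + 3.1291,
        # Flesch Reading Ease: lower scores mean harder text.
        "fre": 206.835 - 1.015 * W / S - 84.6 * syllables / W,
    }
```

Grade-level scores of 14-18, as reported for all four models, correspond to college- and graduate-level text, consistent with the uniformly low Flesch Reading Ease scores.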

Source journal
European Spine Journal (Medicine – Clinical Neurology)
CiteScore: 4.80
Self-citation rate: 10.70%
Articles per year: 373
Review time: 2-4 weeks
Journal description: "European Spine Journal" is a publication founded in response to the increasing trend toward specialization in spinal surgery and spinal pathology in general. The Journal is devoted to all spine-related disciplines, including functional and surgical anatomy of the spine, biomechanics and pathophysiology, diagnostic procedures, and neurology, surgery and outcomes. The aim of "European Spine Journal" is to support the further development of highly innovative spine treatments, including but not restricted to surgery, and to provide an integrated and balanced view of diagnostic, research and treatment procedures, as well as outcomes, that will enhance effective collaboration among specialists worldwide. The "European Spine Journal" also participates in education by means of videos, interactive meetings and the endorsement of educative efforts. Official publication of EUROSPINE, the Spine Society of Europe.