Taha M Taka, Christopher E Collins, Andrew Miner, Isaac Overfield, David Shin, Lauren Seo, Olumide Danisa
{"title":"The role of generative artificial intelligence in deciding fusion treatment of lumbar degeneration: a comparative analysis and narrative review.","authors":"Taha M Taka, Christopher E Collins, Andrew Miner, Isaac Overfield, David Shin, Lauren Seo, Olumide Danisa","doi":"10.1007/s00586-025-09052-z","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study analyzed responses and readability of generative artificial intelligence (AI) models to questions and recommendations from the 2014 Journal of Neurosurgery: Spine (JNS) guidelines for fusion procedures in the treatment of degenerative lumbar spine disease.</p><p><strong>Methods: </strong>Twenty-four questions were generated from JNS guidelines and asked to ChatGPT 4o, Perplexity, Microsoft Copilot, and Gemini. Answers were \"concordant\" if the response highlighted all points from the JNS guidelines; otherwise, answers were considered \"non-concordant\" and further sub-categorized as either \"insufficient\" or \"overconclusive.\" Responses were evaluated for readability via the Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease test.</p><p><strong>Results: </strong>ChatGPT 4o had the highest concordance rate at 66.67%, with non-concordant responses distributed at 16.67% for both insufficient and over-conclusive classifications. Perplexity displayed a 58.33% concordance rate, with 25% insufficient and 16.67% over-conclusive responses. Copilot showed 50% concordance, with 37.5% over-conclusive and 16.67% insufficient responses. Gemini demonstrated 54.17% concordance, with 20.83% insufficient and 25% over-conclusive responses. The Flesch-Kincaid Grade Level scores ranged from 14.03 (Copilot) to 15.66 (Perplexity). The Gunning Fog Index scores varied between 15.15 (Copilot) and 18.13 (Perplexity). The SMOG Index scores ranged from 14.69 (Copilot) to 16.49 (Perplexity). The Flesch Reading Ease scores were low across all models, with Copilot showing the highest score of 20.71.</p><p><strong>Conclusions: </strong>ChatGPT 4.0 emerged as the best-performing model in terms of concordance, while Perplexity displayed the highest complexity in text readability. AI can be a valuable adjunct in clinical decision-making but cannot replace clinician judgment.</p>","PeriodicalId":12323,"journal":{"name":"European Spine Journal","volume":" ","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Spine Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00586-025-09052-z","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Purpose: This study analyzed the responses of generative artificial intelligence (AI) models to questions derived from the recommendations of the 2014 Journal of Neurosurgery: Spine (JNS) guidelines for fusion procedures in the treatment of degenerative lumbar spine disease, and assessed the readability of those responses.
Methods: Twenty-four questions were generated from the JNS guidelines and posed to ChatGPT 4o, Perplexity, Microsoft Copilot, and Gemini. Answers were classified as "concordant" if the response highlighted all points from the JNS guidelines; otherwise, answers were considered "non-concordant" and further sub-categorized as either "insufficient" or "over-conclusive." Responses were evaluated for readability using the Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease test.
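The four readability formulas used here are standard and widely implemented. As a rough illustration only (not the authors' actual pipeline), the sketch below shows how such scores could be computed for a single AI-generated response using the open-source textstat Python package; the sample response text is hypothetical.

```python
# Minimal sketch: computing the four readability metrics reported in this study
# for one AI-generated response. Assumes the open-source `textstat` package
# (pip install textstat); illustrative only, not the authors' actual pipeline.
import textstat

# Hypothetical response text, for illustration only.
response_text = (
    "Posterior lumbar interbody fusion may be considered for patients with "
    "degenerative spondylolisthesis and symptomatic spinal stenosis when "
    "conservative management has failed."
)

scores = {
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(response_text),
    "Gunning Fog Index": textstat.gunning_fog(response_text),
    "SMOG Index": textstat.smog_index(response_text),  # most reliable on passages of 3+ sentences
    "Flesch Reading Ease": textstat.flesch_reading_ease(response_text),
}

for metric, value in scores.items():
    print(f"{metric}: {value:.2f}")
```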
Results: ChatGPT 4o had the highest concordance rate at 66.67%, with non-concordant responses distributed at 16.67% for both insufficient and over-conclusive classifications. Perplexity displayed a 58.33% concordance rate, with 25% insufficient and 16.67% over-conclusive responses. Copilot showed 50% concordance, with 37.5% over-conclusive and 16.67% insufficient responses. Gemini demonstrated 54.17% concordance, with 20.83% insufficient and 25% over-conclusive responses. The Flesch-Kincaid Grade Level scores ranged from 14.03 (Copilot) to 15.66 (Perplexity). The Gunning Fog Index scores varied between 15.15 (Copilot) and 18.13 (Perplexity). The SMOG Index scores ranged from 14.69 (Copilot) to 16.49 (Perplexity). The Flesch Reading Ease scores were low across all models, with Copilot showing the highest score of 20.71.
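With 24 questions per model, each reported concordance rate corresponds to a whole-number count of concordant answers (e.g., 66.67% ≈ 16/24). The short sketch below shows that conversion; the counts are inferred from the published percentages, not taken from the raw study data.

```python
# Back-of-the-envelope check: concordance percentages as counts out of 24 questions.
# Counts are inferred from the reported percentages, not taken from the raw data.
TOTAL_QUESTIONS = 24
concordant_counts = {"ChatGPT 4o": 16, "Perplexity": 14, "Gemini": 13, "Copilot": 12}

for model, count in concordant_counts.items():
    rate = 100 * count / TOTAL_QUESTIONS
    print(f"{model}: {count}/{TOTAL_QUESTIONS} = {rate:.2f}% concordant")
# ChatGPT 4o: 16/24 = 66.67%, Perplexity: 14/24 = 58.33%,
# Gemini: 13/24 = 54.17%, Copilot: 12/24 = 50.00%
```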
Conclusions: ChatGPT 4o emerged as the best-performing model in terms of concordance, while Perplexity produced the least readable (most complex) text. AI can be a valuable adjunct in clinical decision-making but cannot replace clinician judgment.
Journal Introduction:
"European Spine Journal" is a publication founded in response to the increasing trend toward specialization in spinal surgery and spinal pathology in general. The Journal is devoted to all spine related disciplines, including functional and surgical anatomy of the spine, biomechanics and pathophysiology, diagnostic procedures, and neurology, surgery and outcomes. The aim of "European Spine Journal" is to support the further development of highly innovative spine treatments including but not restricted to surgery and to provide an integrated and balanced view of diagnostic, research and treatment procedures as well as outcomes that will enhance effective collaboration among specialists worldwide. The “European Spine Journal” also participates in education by means of videos, interactive meetings and the endorsement of educative efforts.
Official publication of EUROSPINE, The Spine Society of Europe