{"title":"Comparative diagnostic accuracy of ChatGPT-4 and machine learning in differentiating spinal tuberculosis and spinal tumors.","authors":"Xiaojiang Hu, Dongcheng Xu, Hongqi Zhang, Mingxing Tang, Qile Gao","doi":"10.1016/j.spinee.2024.12.035","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>In clinical practice, distinguishing between spinal tuberculosis (STB) and spinal tumors (ST) poses a significant diagnostic challenge. The application of AI-driven large language models (LLMs) shows great potential for improving the accuracy of this differential diagnosis.</p><p><strong>Purpose: </strong>To evaluate the performance of various machine learning models and ChatGPT-4 in distinguishing between STB and ST.</p><p><strong>Study design: </strong>A retrospective cohort study.</p><p><strong>Patient sample: </strong>143 STB cases and 153 ST cases admitted to Xiangya Hospital Central South University, from January 2016 to June 2023 were collected.</p><p><strong>Outcome measures: </strong>This study incorporates basic patient information, standard laboratory results, serum tumor markers, and comprehensive imaging records, including Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), for individuals diagnosed with STB and ST. Machine learning techniques and ChatGPT-4 were utilized to distinguish between STB and ST separately.</p><p><strong>Method: </strong>Six distinct machine learning models, along with ChatGPT-4, were employed to evaluate their differential diagnostic effectiveness.</p><p><strong>Result: </strong>Among the 6 machine learning models, the Gradient Boosting Machine (GBM) algorithm model demonstrated the highest differential diagnostic efficiency. In the training cohort, the GBM model achieved a sensitivity of 98.84% and a specificity of 100.00% in distinguishing STB from ST. In the testing cohort, its sensitivity was 98.25%, and specificity was 91.80%. ChatGPT-4 exhibited a sensitivity of 70.37% and a specificity of 90.65% for differential diagnosis. In single-question cases, ChatGPT-4's sensitivity and specificity were 71.67% and 92.55%, respectively, while in re-questioning cases, they were 44.44% and 76.92%.</p><p><strong>Conclusion: </strong>The GBM model demonstrates significant value in the differential diagnosis of STB and ST, whereas the diagnostic performance of ChatGPT-4 remains suboptimal.</p>","PeriodicalId":49484,"journal":{"name":"Spine Journal","volume":" ","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2025-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Spine Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.spinee.2024.12.035","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background: In clinical practice, distinguishing between spinal tuberculosis (STB) and spinal tumors (ST) poses a significant diagnostic challenge. The application of AI-driven large language models (LLMs) shows great potential for improving the accuracy of this differential diagnosis.
Purpose: To evaluate the performance of various machine learning models and ChatGPT-4 in distinguishing between STB and ST.
Study design: A retrospective cohort study.
Patient sample: 143 STB cases and 153 ST cases admitted to Xiangya Hospital Central South University, from January 2016 to June 2023 were collected.
Outcome measures: This study incorporates basic patient information, standard laboratory results, serum tumor markers, and comprehensive imaging records, including Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), for individuals diagnosed with STB and ST. Machine learning techniques and ChatGPT-4 were utilized to distinguish between STB and ST separately.
Method: Six distinct machine learning models, along with ChatGPT-4, were employed to evaluate their differential diagnostic effectiveness.
Result: Among the 6 machine learning models, the Gradient Boosting Machine (GBM) algorithm model demonstrated the highest differential diagnostic efficiency. In the training cohort, the GBM model achieved a sensitivity of 98.84% and a specificity of 100.00% in distinguishing STB from ST. In the testing cohort, its sensitivity was 98.25%, and specificity was 91.80%. ChatGPT-4 exhibited a sensitivity of 70.37% and a specificity of 90.65% for differential diagnosis. In single-question cases, ChatGPT-4's sensitivity and specificity were 71.67% and 92.55%, respectively, while in re-questioning cases, they were 44.44% and 76.92%.
Conclusion: The GBM model demonstrates significant value in the differential diagnosis of STB and ST, whereas the diagnostic performance of ChatGPT-4 remains suboptimal.
期刊介绍:
The Spine Journal, the official journal of the North American Spine Society, is an international and multidisciplinary journal that publishes original, peer-reviewed articles on research and treatment related to the spine and spine care, including basic science and clinical investigations. It is a condition of publication that manuscripts submitted to The Spine Journal have not been published, and will not be simultaneously submitted or published elsewhere. The Spine Journal also publishes major reviews of specific topics by acknowledged authorities, technical notes, teaching editorials, and other special features, Letters to the Editor-in-Chief are encouraged.