Assessing the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in managing lumbar disc herniation.

IF 2.8 | Zone 3, Medicine | JCR Q2: Medicine, Research & Experimental

European Journal of Medical Research Pub Date : 2025-01-22 DOI:10.1186/s40001-025-02296-x

Suning Wang, Ying Wang, Linlin Jiang, Yong Chang, Shiji Zhang, Kun Zhao, Lu Chen, Chunzheng Gao

European Journal of Medical Research, vol. 30, no. 1, p. 45. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11753088/pdf/

Citations: 0

Abstract

Purpose: This study evaluated and compared the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in diagnosing and treating lumbar disc herniation (LDH) with radiculopathy.

Methods: Twenty-one questions (across 5 categories) from the NASS Clinical Guidelines were input into ChatGPT 4o and ChatGPT 4o mini. Five orthopedic surgeons rated the responses on a 5-point Likert scale for accuracy and completeness and on a 7-point scale for reliability. Flesch Reading Ease scores were calculated to assess readability. Additionally, ChatGPT 4o analyzed lumbar images from 53 patients, and its agreement with the orthopedic surgeons' readings was measured using Kappa values.
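As an illustration of the readability measure named in the Methods, the sketch below computes a Flesch Reading Ease score using the standard formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The abstract does not say which tool the authors used; the function names and the vowel-group syllable heuristic here are assumptions made for the example, so scores will differ slightly from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the study's method):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up sample sentences, not taken from the study's model outputs.
plain = "The cat sat on the mat."
dense = "Radiculopathy necessitates multidisciplinary evaluation."
plain_score = flesch_reading_ease(plain)
dense_score = flesch_reading_ease(dense)
```

Short sentences of short words score high, while long polysyllabic clinical vocabulary pushes the score down, which is consistent with a "very difficult to read" rating for guideline-style answers.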

Results: Both models demonstrated strong clinical support capabilities with no significant differences in accuracy or reliability. However, ChatGPT 4o provided more comprehensive and consistent responses. The Flesch Reading Ease scores for both models indicated that their generated content was "very difficult to read," potentially limiting patient accessibility. In evaluating lumbar disc herniation images, ChatGPT 4o achieved an overall accuracy of 0.81, with LDH recognition precision, recall, and F1 scores exceeding 0.80. The AUC was 0.80, and the Kappa value was 0.61, indicating moderate agreement between the model's predictions and actual diagnoses, though with room for improvement.
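The image-recognition metrics reported above (precision, recall, F1, and Kappa) can all be derived from paired labels. The following is a minimal sketch, assuming binary labels where 1 means "LDH present"; the function names and the eight-case toy data are hypothetical and are not the study's 53-patient results.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (observed - expected) / (1.0 - expected)


def precision_recall_f1(truth, pred, positive=1):
    """Precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy labels (NOT study data): surgeon reading vs. model reading.
surgeon = [1, 1, 1, 1, 0, 0, 0, 0]
model = [1, 1, 1, 0, 0, 0, 0, 1]

kappa = cohen_kappa(surgeon, model)
precision, recall, f1 = precision_recall_f1(surgeon, model)
```

In this toy case the raters agree on 6 of 8 cases (0.75), but half of that agreement is expected by chance, so Kappa is 0.5. This is why Kappa is typically lower than raw accuracy.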

Conclusion: While both models are effective, ChatGPT 4o offers more comprehensive clinical responses, making it more suitable for high-integrity medical tasks. However, the difficulty in reading AI-generated content and occasional use of misleading terms, such as "tumor," indicate a need for further improvements to reduce patient anxiety.


Source journal: European Journal of Medical Research (Medicine: Research & Experimental)

CiteScore: 3.20
Self-citation rate: 0.00%
Articles published: 247
Review time: >12 weeks

About the journal: European Journal of Medical Research publishes translational and clinical research of international interest across all medical disciplines, enabling clinicians and other researchers to learn about developments and innovations within these disciplines and across the boundaries between disciplines. The journal publishes high-quality research and reviews and aims to ensure that the results of all well-conducted research are published, regardless of their outcome.