Diagnostic accuracy of ChatGPT-4 in orthopedic oncology: a comparative study with residents

Hayden P. Baker, Sarthak Aggarwal, Senthooran Kalidoss, Matthew Hess, Rex Haydon, Jason A. Strelzow

The Knee, Volume 55, Pages 153–160, May 2025. DOI: 10.1016/j.knee.2025.04.004
Citations: 0
Abstract
Background
Artificial intelligence (AI) is increasingly being explored for its potential role in medical diagnostics. ChatGPT-4, a large language model (LLM) with image analysis capabilities, may assist in histopathological interpretation, but its accuracy in musculoskeletal oncology remains untested. This study evaluates ChatGPT-4's diagnostic accuracy in identifying musculoskeletal tumors from histology slides compared to orthopedic surgery residents.
Methods
A comparative study was conducted using 24 histology slides randomly selected from an orthopedic oncology registry. Five teams of orthopedic surgery residents (PGY-1 to PGY-5) participated in a diagnostic competition, providing their best diagnosis for each slide. ChatGPT-4 was tested separately using identical histology images and clinical vignettes, with two independent attempts. Statistical analyses, including one-way ANOVA and independent t-tests, were performed to compare diagnostic accuracy.
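The abstract does not report the raw per-slide outcomes, so as a rough illustration of the comparison described above, here is a stdlib-only Python sketch that computes a one-way ANOVA F statistic and a pooled two-sample t statistic on hypothetical binary correctness data. All data values, counts, and variable names below are invented for illustration and only loosely match the reported accuracies; they are not the study's data.

```python
import math
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of observation lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def independent_t(a, b):
    """Two-sample t statistic, pooled-variance (equal-variance) form."""
    va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
    pooled = ((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2)
    se = math.sqrt(pooled * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se

# Hypothetical per-slide correctness over 24 slides (1 = correct diagnosis),
# chosen to roughly echo the reported accuracies (residents ~55%, 25%, 33%).
residents  = [1] * 13 + [0] * 11   # ~54% correct
gpt_first  = [1] * 6  + [0] * 18   # 25% correct
gpt_second = [1] * 8  + [0] * 16   # ~33% correct

f_stat  = one_way_anova_f([residents, gpt_first, gpt_second])
t_first = independent_t(residents, gpt_first)
```

Converting these statistics to p-values requires the F and t distributions (e.g., `scipy.stats.f.sf` / `scipy.stats.t.sf`), which are omitted here to keep the sketch dependency-free.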
Results
Orthopedic residents significantly outperformed ChatGPT-4 in diagnosing musculoskeletal tumors. The mean diagnostic accuracy among resident teams was 55%, while ChatGPT-4 achieved 25% on its first attempt and 33% on its second attempt. One-way ANOVA revealed a significant difference in accuracy across groups (F = 8.51, p = 0.033). Independent t-tests confirmed that residents performed significantly better than ChatGPT-4 (t = 5.80, p = 0.0004 for first attempt; t = 4.25, p = 0.0028 for second attempt). Both residents and ChatGPT-4 struggled with specific cases, particularly soft tissue sarcomas.
Conclusions
ChatGPT-4 demonstrated limited accuracy in interpreting histopathological slides compared to orthopedic residents. While AI holds promise for medical diagnostics, its current capabilities in musculoskeletal oncology remain insufficient for independent clinical use. These findings should be viewed as exploratory rather than confirmatory, and further research with larger, more diverse datasets is needed to assess AI’s role in histopathology. Future studies should investigate AI-assisted workflows, refine prompt engineering, and explore AI models specifically trained for histopathological diagnosis.
Journal overview:
The Knee is an international journal publishing studies on the clinical treatment and fundamental biomechanical characteristics of this joint. The aim of the journal is to provide a vehicle relevant to surgeons, biomedical engineers, imaging specialists, materials scientists, rehabilitation personnel and all those with an interest in the knee.
The topics covered include, but are not limited to:
• Anatomy, physiology, morphology and biochemistry;
• Biomechanical studies;
• Advances in the development of prosthetic, orthotic and augmentation devices;
• Imaging and diagnostic techniques;
• Pathology;
• Trauma;
• Surgery;
• Rehabilitation.