Diagnostic accuracy of ChatGPT-4 in orthopedic oncology: a comparative study with residents

The Knee · IF 1.6 · CAS Tier 4 (Medicine) · JCR Q3 (Orthopedics)
Publication date: 2025-05-01 · DOI: 10.1016/j.knee.2025.04.004
Hayden P. Baker, Sarthak Aggarwal, Senthooran Kalidoss, Matthew Hess, Rex Haydon, Jason A. Strelzow
{"title":"Diagnostic accuracy of ChatGPT-4 in orthopedic oncology: a comparative study with residents","authors":"Hayden P. Baker,&nbsp;Sarthak Aggarwal,&nbsp;Senthooran Kalidoss,&nbsp;Matthew Hess,&nbsp;Rex Haydon,&nbsp;Jason A. Strelzow","doi":"10.1016/j.knee.2025.04.004","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>Artificial intelligence (AI) is increasingly being explored for its potential role in medical diagnostics. ChatGPT-4, a large language model (LLM) with image analysis capabilities, may assist in histopathological interpretation, but its accuracy in musculoskeletal oncology remains untested. This study evaluates ChatGPT-4′s diagnostic accuracy in identifying musculoskeletal tumors from histology slides compared to orthopedic surgery residents.</div></div><div><h3>Methods</h3><div>A comparative study was conducted using 24 histology slides randomly selected from an orthopedic oncology registry. Five teams of orthopedic surgery residents (PGY-1 to PGY-5) participated in a diagnostic competition, providing their best diagnosis for each slide. ChatGPT-4 was tested separately using identical histology images and clinical vignettes, with two independent attempts. Statistical analyses, including one-way ANOVA and independent t-tests were performed to compare diagnostic accuracy.</div></div><div><h3>Results</h3><div>Orthopedic residents significantly outperformed ChatGPT-4 in diagnosing musculoskeletal tumors. The mean diagnostic accuracy among resident teams was 55%, while ChatGPT-4 achieved 25% on its first attempt and 33% on its second attempt. One-way ANOVA revealed a significant difference in accuracy across groups (<em>F</em> = 8.51, <em>p</em> = 0.033). Independent t-tests confirmed that residents performed significantly better than ChatGPT-4 (<em>t</em> = 5.80, <em>p</em> = 0.0004 for first attempt; <em>t</em> = 4.25, <em>p</em> = 0.0028 for second attempt). Both residents and ChatGPT-4 struggled with specific cases, particularly soft tissue sarcomas.</div></div><div><h3>Conclusions</h3><div>ChatGPT-4 demonstrated limited accuracy in interpreting histopathological slides compared to orthopedic residents. While AI holds promise for medical diagnostics, its current capabilities in musculoskeletal oncology remain insufficient for independent clinical use. These findings should be viewed as exploratory rather than confirmatory, and further research with larger, more diverse datasets is needed to assess AI’s role in histopathology. Future studies should investigate AI-assisted workflows, refine prompt engineering, and explore AI models specifically trained for histopathological diagnosis.</div></div>","PeriodicalId":56110,"journal":{"name":"Knee","volume":"55 ","pages":"Pages 153-160"},"PeriodicalIF":1.6000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knee","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0968016025000766","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
引用次数: 0

Abstract

Background

Artificial intelligence (AI) is increasingly being explored for its potential role in medical diagnostics. ChatGPT-4, a large language model (LLM) with image analysis capabilities, may assist in histopathological interpretation, but its accuracy in musculoskeletal oncology remains untested. This study evaluates ChatGPT-4's diagnostic accuracy in identifying musculoskeletal tumors from histology slides compared with that of orthopedic surgery residents.

Methods

A comparative study was conducted using 24 histology slides randomly selected from an orthopedic oncology registry. Five teams of orthopedic surgery residents (PGY-1 to PGY-5) participated in a diagnostic competition, providing their best diagnosis for each slide. ChatGPT-4 was tested separately using identical histology images and clinical vignettes, with two independent attempts. Statistical analyses, including one-way ANOVA and independent t-tests, were performed to compare diagnostic accuracy.
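
For readers who want to see the shape of this analysis, the following is a minimal sketch of how a one-way ANOVA and independent t-tests on diagnostic-accuracy data might be run. It is not the study's analysis code: the per-slide correctness arrays are random placeholders, and the grouping (per-slide resident accuracy versus binary ChatGPT-4 correctness) is an assumption, since the abstract does not specify how observations were structured.

```python
# Minimal sketch (not the study's code): one-way ANOVA and independent
# t-tests on diagnostic-accuracy data using scipy.
# The data below are random placeholders, NOT the study's results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_slides = 24

# Assumed structure: per-slide accuracy for the five resident teams
# (fraction of teams correct on each slide) and binary per-slide
# correctness for each ChatGPT-4 attempt.
resident_per_slide = rng.binomial(5, 0.55, n_slides) / 5
gpt_attempt1 = rng.binomial(1, 0.25, n_slides)
gpt_attempt2 = rng.binomial(1, 0.33, n_slides)

# One-way ANOVA across the three groups
f_stat, p_anova = stats.f_oneway(resident_per_slide, gpt_attempt1, gpt_attempt2)

# Independent t-tests: residents vs. each ChatGPT-4 attempt
t1, p1 = stats.ttest_ind(resident_per_slide, gpt_attempt1)
t2, p2 = stats.ttest_ind(resident_per_slide, gpt_attempt2)

print(f"ANOVA:     F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Attempt 1: t = {t1:.2f}, p = {p1:.4f}")
print(f"Attempt 2: t = {t2:.2f}, p = {p2:.4f}")
```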

Results

Orthopedic residents significantly outperformed ChatGPT-4 in diagnosing musculoskeletal tumors. The mean diagnostic accuracy among resident teams was 55%, while ChatGPT-4 achieved 25% on its first attempt and 33% on its second attempt. One-way ANOVA revealed a significant difference in accuracy across groups (F = 8.51, p = 0.033). Independent t-tests confirmed that residents performed significantly better than ChatGPT-4 (t = 5.80, p = 0.0004 for first attempt; t = 4.25, p = 0.0028 for second attempt). Both residents and ChatGPT-4 struggled with specific cases, particularly soft tissue sarcomas.
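
For scale, if accuracy is computed over all 24 slides (an assumption; the abstract does not state the denominator), ChatGPT-4's 25% and 33% correspond to roughly 6 and 8 correct diagnoses, while the residents' 55% mean corresponds to about 13 of 24.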

Conclusions

ChatGPT-4 demonstrated limited accuracy in interpreting histopathological slides compared to orthopedic residents. While AI holds promise for medical diagnostics, its current capabilities in musculoskeletal oncology remain insufficient for independent clinical use. These findings should be viewed as exploratory rather than confirmatory, and further research with larger, more diverse datasets is needed to assess AI’s role in histopathology. Future studies should investigate AI-assisted workflows, refine prompt engineering, and explore AI models specifically trained for histopathological diagnosis.
Source journal
The Knee (Medicine – Surgery)
CiteScore: 3.80
Self-citation rate: 5.30%
Articles per year: 171
Review time: 6 months
Journal description: The Knee is an international journal publishing studies on the clinical treatment and fundamental biomechanical characteristics of this joint. The aim of the journal is to provide a vehicle relevant to surgeons, biomedical engineers, imaging specialists, materials scientists, rehabilitation personnel and all those with an interest in the knee. The topics covered include, but are not limited to:
• Anatomy, physiology, morphology and biochemistry;
• Biomechanical studies;
• Advances in the development of prosthetic, orthotic and augmentation devices;
• Imaging and diagnostic techniques;
• Pathology;
• Trauma;
• Surgery;
• Rehabilitation.