ChatGPT-4 Responses on Ankle Cartilage Surgery Often Diverge from Expert Consensus: A Comparative Analysis.

Foot & Ankle Orthopaedics Pub Date : 2025-08-13 eCollection Date: 2025-07-01 DOI:10.1177/24730114251352494
Takuji Yokoe, Giulia Roversi, Nuno Sevivas, Naosuke Kamei, Pedro Diniz, Hélder Pereira

Abstract

Background: Few studies have evaluated whether large language models, such as ChatGPT, can provide accurate guidance to clinicians in the field of foot and ankle surgery. This study aimed to assess the accuracy of ChatGPT's responses regarding ankle cartilage repair, using the consensus statements of foot and ankle experts as the reference standard.

Methods: The OpenAI artificial intelligence (AI) model ChatGPT-4 was asked to answer a total of 14 questions on debridement, curettage, and bone marrow stimulation for ankle cartilage lesions, selected from the 2017 International Consensus Meeting on Cartilage Repair of the Ankle. The ChatGPT responses were compared with the consensus statements developed at this international meeting. A Likert scale (scores 1-5) was used to evaluate the similarity of ChatGPT's answers to the consensus statements. Four scoring categories (Accuracy, Overconclusiveness, Supplementary, and Incompleteness), drawn from previous studies, were also used to evaluate the quality of ChatGPT's answers.

Results: The mean Likert scale score for the similarity of ChatGPT's answers to the consensus statements was 3.1 ± 0.8. Across the 4 scoring categories, the percentages of answers rated "yes" for Accuracy, Overconclusiveness, Supplementary, and Incompleteness were 71.4% (10/14), 35.7% (5/14), 78.6% (11/14), and 14.3% (2/14), respectively.

Conclusion: This study showed that ChatGPT-4 often provides responses that diverge from expert consensus regarding surgical treatment of ankle cartilage lesions.

Level of evidence: Level V, expert opinion.
