{"title":"ChatGPT-4 Responses on Ankle Cartilage Surgery Often Diverge from Expert Consensus: A Comparative Analysis.","authors":"Takuji Yokoe, Giulia Roversi, Nuno Sevivas, Naosuke Kamei, Pedro Diniz, Hélder Pereira","doi":"10.1177/24730114251352494","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>There are few studies that have evaluated whether large language models, such as ChatGPT, can provide accurate guidance to clinicians in the field of foot and ankle surgery. This study aimed to assess the accuracy of ChatGPT's responses regarding ankle cartilage repair by comparing them with the consensus statements from foot and ankle experts as a standard reference.</p><p><strong>Methods: </strong>The open artificial intelligence (AI) model ChatGPT-4 was asked to answer a total of 14 questions on debridement, curettage, and bone marrow stimulation for ankle cartilage lesions that were selected at the 2017 International Consensus Meeting on Cartilage Repair of the Ankle. The ChatGPT responses were compared with the consensus statements developed in this international meeting. A Likert scale (scores, 1-5) was used to evaluate the similarity of the answers by ChatGPT to the consensus statements. The 4 scoring categories (Accuracy, Overconclusiveness, Supplementary, and Incompleteness) were also used to evaluate the quality of ChatGPT answers, according to previous studies.</p><p><strong>Results: </strong>The mean Likert scale score regarding the similarity of ChatGPT's answers to the consensus statements was 3.1 ± 0.8. Regarding the results of 4 scoring categories of the ChatGPT answers, the percentages of answers that were considered \"yes\" in the Accuracy, Overconclusiveness, Supplementary, and Incompleteness were 71.4% (10/14), 35.7% (5/14), 78.6% (11/14), and 14.3% (2/14), respectively.</p><p><strong>Conclusion: </strong>This study showed that ChatGPT-4 often provides responses that diverge from expert consensus regarding surgical treatment of ankle cartilage lesions.</p><p><strong>Level of evidence: </strong>Level V, expert opinion.</p>","PeriodicalId":12429,"journal":{"name":"Foot & Ankle Orthopaedics","volume":"10 3","pages":"24730114251352494"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12351097/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Foot & Ankle Orthopaedics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/24730114251352494","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Background: Few studies have evaluated whether large language models (LLMs) such as ChatGPT can provide accurate guidance to clinicians in foot and ankle surgery. This study aimed to assess the accuracy of ChatGPT's responses regarding ankle cartilage repair by comparing them against consensus statements from foot and ankle experts as the reference standard.
Methods: The OpenAI large language model ChatGPT-4 was asked to answer 14 questions on debridement, curettage, and bone marrow stimulation for ankle cartilage lesions, selected from the 2017 International Consensus Meeting on Cartilage Repair of the Ankle. The ChatGPT responses were compared with the consensus statements developed at that meeting. A 5-point Likert scale (scores 1-5) was used to rate how closely each ChatGPT answer matched the corresponding consensus statement. Following previous studies, 4 binary scoring categories (Accuracy, Overconclusiveness, Supplementary, and Incompleteness) were also used to evaluate the quality of the ChatGPT answers.
Results: The mean Likert score for the similarity of ChatGPT's answers to the consensus statements was 3.1 ± 0.8. Across the 4 scoring categories, the percentages of answers rated "yes" for Accuracy, Overconclusiveness, Supplementary, and Incompleteness were 71.4% (10/14), 35.7% (5/14), 78.6% (11/14), and 14.3% (2/14), respectively.
Conclusion: This study showed that ChatGPT-4 often provides responses that diverge from expert consensus regarding the surgical treatment of ankle cartilage lesions.

Level of evidence: Level V, expert opinion.
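For readers who want to see how the summary statistics in the Results aggregate, the Python sketch below reproduces the arithmetic (mean ± SD of the Likert scores and the per-category "yes" percentages). The per-question Likert scores and ratings are hypothetical placeholders invented for illustration; only the category counts (10/14, 5/14, 11/14, 2/14) come from the paper, so the computed mean and SD will only roughly match the reported 3.1 ± 0.8.

```python
# Illustrative aggregation sketch (not the study's actual data or code).
from statistics import mean, stdev

# Hypothetical per-question Likert scores (1-5), n = 14, chosen only to
# land near the reported mean of 3.1; the true scores are not published.
likert_scores = [3, 4, 2, 3, 3, 4, 2, 3, 4, 3, 2, 4, 3, 3]

# Binary "yes" ratings per scoring category. The yes/no counts match the
# paper's reported fractions; the per-question assignment is invented.
categories = {
    "Accuracy":           [True] * 10 + [False] * 4,   # 10/14
    "Overconclusiveness": [True] * 5  + [False] * 9,   #  5/14
    "Supplementary":      [True] * 11 + [False] * 3,   # 11/14
    "Incompleteness":     [True] * 2  + [False] * 12,  #  2/14
}

print(f"Likert similarity: {mean(likert_scores):.1f} ± {stdev(likert_scores):.1f}")
for name, ratings in categories.items():
    pct = 100 * sum(ratings) / len(ratings)
    print(f"{name}: {pct:.1f}% ({sum(ratings)}/{len(ratings)})")
```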