Haodong Wu, Shuxin Yao, Huanli Bao, Yishun Guo, Chao Xu, Jianbing Ma
{"title":"ChatGPT-4.0和DeepSeek-R1尚未为膝关节骨关节炎提供临床支持的答案","authors":"Haodong Wu , Shuxin Yao , Huanli Bao , Yishun Guo , Chao Xu , Jianbing Ma","doi":"10.1016/j.knee.2025.06.007","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>Large Language Models (LLMs) such as ChatGPT-4.0 and DeepSeek-R1 provide advanced natural language capabilities, but they also raise concerns regarding accuracy in medical applications. There is a lack of systematic evaluation of their performance against orthopedic guidelines, particularly for knee osteoarthritis (KOA). This study assessed the accuracy and consistency of these LLMs in relation to the most recent Chinese clinical practice guidelines for KOA.</div></div><div><h3>Methods</h3><div>Queries regarding 17 guideline-recommended KOA therapeutic strategies were posed to ChatGPT-4.0 and DeepSeek-R1. Two independent reviewers evaluated response concordance (Concordance, Discordance, or No Concordance) with guidelines. Inter-rater reliability was assessed using Cohen’s kappa coefficient. A chi-square test was employed to compare the response patterns between the two models.</div></div><div><h3>Results</h3><div>ChatGPT-4.0 showed 59 % concordance; DeepSeek-R1 achieved 71 %. Both models gave inconsistent recommendations for ozone therapy and arthroscopy. ChatGPT-4.0 had five inconsistent responses; DeepSeek-R1 had three. Inter-rater agreement was high (κ = 0.90 and 0.86). No significant difference was found in concordance rates (P = 0.7; P = 1). Only DeepSeek-R1 provided references (38 in total), but just 8 were fully verifiable.</div></div><div><h3>Conclusion</h3><div>Neither ChatGPT-4.0 nor DeepSeek-R1 consistently produced responses aligned with evidence-based clinical guidelines. These findings highlight the need for cautious interpretation of medical advice generated by current AI platforms, both by clinicians and patients.</div></div>","PeriodicalId":56110,"journal":{"name":"Knee","volume":"56 ","pages":"Pages 386-396"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ChatGPT-4.0 and DeepSeek-R1 does not yet provide clinically supported answers for knee osteoarthritis\",\"authors\":\"Haodong Wu , Shuxin Yao , Huanli Bao , Yishun Guo , Chao Xu , Jianbing Ma\",\"doi\":\"10.1016/j.knee.2025.06.007\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Background</h3><div>Large Language Models (LLMs) such as ChatGPT-4.0 and DeepSeek-R1 provide advanced natural language capabilities, but they also raise concerns regarding accuracy in medical applications. There is a lack of systematic evaluation of their performance against orthopedic guidelines, particularly for knee osteoarthritis (KOA). This study assessed the accuracy and consistency of these LLMs in relation to the most recent Chinese clinical practice guidelines for KOA.</div></div><div><h3>Methods</h3><div>Queries regarding 17 guideline-recommended KOA therapeutic strategies were posed to ChatGPT-4.0 and DeepSeek-R1. Two independent reviewers evaluated response concordance (Concordance, Discordance, or No Concordance) with guidelines. Inter-rater reliability was assessed using Cohen’s kappa coefficient. A chi-square test was employed to compare the response patterns between the two models.</div></div><div><h3>Results</h3><div>ChatGPT-4.0 showed 59 % concordance; DeepSeek-R1 achieved 71 %. 
Both models gave inconsistent recommendations for ozone therapy and arthroscopy. ChatGPT-4.0 had five inconsistent responses; DeepSeek-R1 had three. Inter-rater agreement was high (κ = 0.90 and 0.86). No significant difference was found in concordance rates (P = 0.7; P = 1). Only DeepSeek-R1 provided references (38 in total), but just 8 were fully verifiable.</div></div><div><h3>Conclusion</h3><div>Neither ChatGPT-4.0 nor DeepSeek-R1 consistently produced responses aligned with evidence-based clinical guidelines. These findings highlight the need for cautious interpretation of medical advice generated by current AI platforms, both by clinicians and patients.</div></div>\",\"PeriodicalId\":56110,\"journal\":{\"name\":\"Knee\",\"volume\":\"56 \",\"pages\":\"Pages 386-396\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-07-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knee\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0968016025001620\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knee","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0968016025001620","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
ChatGPT-4.0 and DeepSeek-R1 do not yet provide clinically supported answers for knee osteoarthritis
Background
Large language models (LLMs) such as ChatGPT-4.0 and DeepSeek-R1 offer advanced natural language capabilities, but they also raise concerns about accuracy in medical applications. Their performance against orthopedic guidelines, particularly for knee osteoarthritis (KOA), has not been systematically evaluated. This study assessed the accuracy and consistency of these LLMs against the most recent Chinese clinical practice guidelines for KOA.
Methods
Queries about 17 guideline-recommended KOA therapeutic strategies were posed to ChatGPT-4.0 and DeepSeek-R1. Two independent reviewers rated each response's agreement with the guidelines as Concordance, Discordance, or No Concordance. Inter-rater reliability was assessed with Cohen's kappa coefficient, and a chi-square test was used to compare the response patterns of the two models.
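To make the two analyses concrete, the following minimal Python sketch shows how Cohen's kappa and the chi-square comparison can be computed; the rating labels and counts below are hypothetical placeholders for illustration, not the study's data.

# Minimal sketch of the statistical analyses (hypothetical data, not the study's).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency

# Two reviewers' ratings of the same set of responses:
# "C" = Concordance, "D" = Discordance, "N" = No Concordance
reviewer_1 = ["C", "C", "D", "N", "C", "C", "D"]
reviewer_2 = ["C", "C", "D", "N", "C", "D", "D"]

# Inter-rater reliability via Cohen's kappa
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")

# Chi-square test comparing response patterns between the two models:
# rows = models, columns = counts of Concordance / Discordance / No Concordance
contingency_table = [
    [10, 4, 3],  # hypothetical counts for model A
    [12, 3, 2],  # hypothetical counts for model B
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"chi-square = {chi2:.2f}, P = {p_value:.2f}")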
Results
ChatGPT-4.0 showed 59% concordance; DeepSeek-R1 achieved 71%. Both models gave inconsistent recommendations for ozone therapy and arthroscopy. ChatGPT-4.0 had five inconsistent responses; DeepSeek-R1 had three. Inter-rater agreement was high (κ = 0.90 and 0.86). No significant difference was found in concordance rates (P = 0.7; P = 1). Only DeepSeek-R1 provided references (38 in total), but just 8 were fully verifiable.
Conclusion
Neither ChatGPT-4.0 nor DeepSeek-R1 consistently produced responses aligned with evidence-based clinical guidelines. These findings highlight the need for both clinicians and patients to interpret medical advice generated by current AI platforms with caution.
About the journal
The Knee is an international journal publishing studies on the clinical treatment and fundamental biomechanical characteristics of this joint. The aim of the journal is to provide a vehicle relevant to surgeons, biomedical engineers, imaging specialists, materials scientists, rehabilitation personnel and all those with an interest in the knee.
The topics covered include, but are not limited to:
• Anatomy, physiology, morphology and biochemistry;
• Biomechanical studies;
• Advances in the development of prosthetic, orthotic and augmentation devices;
• Imaging and diagnostic techniques;
• Pathology;
• Trauma;
• Surgery;
• Rehabilitation.