Performance of Large Language Models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence

IF 2.1 Q3 RHEUMATOLOGY
Yannick Laurent Tchenadoyo Bayala, Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo, Dieu-Donné Ouedraogo, Fulgence Kaboré, Charles Sougué, Aristide Relwendé Yameogo, Wendlassida Martin Nacanabo, Ismael Ayouba Tinni, Aboubakar Ouedraogo, Yamyellé Enselme Zongo
Citations: 0

Abstract


Background: Artificial intelligence (AI) tools, particularly Large Language Models (LLMs), are revolutionizing medical practice, including rheumatology. However, their diagnostic capabilities remain underexplored in the African context. This study aimed to assess the diagnostic accuracy of ChatGPT-4, Gemini, Copilot, and Claude AI in rheumatology within an African population.

Methods: This was a cross-sectional analytical study with retrospective data collection, conducted at the Rheumatology Department of Bogodogo University Hospital Center (Burkina Faso) from January 1 to June 30, 2024. Standardized clinical and paraclinical data from 103 patients were submitted to the four AI models. The diagnoses proposed by the AIs were compared to expert-confirmed diagnoses established by a panel of senior rheumatologists. Diagnostic accuracy, sensitivity, specificity, and predictive values were calculated for each AI model.
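The study's patient data are not reproduced here, but the per-model scoring described above follows the standard confusion-matrix definitions. The sketch below, using invented toy labels (not data from the study), shows how diagnostic accuracy, and per-condition sensitivity and specificity, are conventionally computed when AI-proposed diagnoses are compared against an expert reference standard:

```python
def diagnostic_metrics(ai_dx, expert_dx, condition):
    """Overall accuracy, plus sensitivity and specificity for one
    diagnostic category, treating `condition` as the positive class."""
    pairs = list(zip(ai_dx, expert_dx))
    tp = sum(1 for a, e in pairs if a == condition and e == condition)
    tn = sum(1 for a, e in pairs if a != condition and e != condition)
    fp = sum(1 for a, e in pairs if a == condition and e != condition)
    fn = sum(1 for a, e in pairs if a != condition and e == condition)
    accuracy = sum(1 for a, e in pairs if a == e) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return accuracy, sensitivity, specificity

# Illustrative reference (expert panel) and AI-proposed diagnoses:
expert = ["infectious", "infectious", "neoplastic", "degenerative", "infectious"]
ai     = ["infectious", "degenerative", "neoplastic", "degenerative", "infectious"]
acc, sens, spec = diagnostic_metrics(ai, expert, "infectious")
```

Positive and negative predictive values follow analogously as tp/(tp+fp) and tn/(tn+fn).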

Results: Among the patients enrolled in the study period, infectious diseases constituted the most common diagnostic category, representing 47.57% (n = 49). ChatGPT-4 achieved the highest diagnostic accuracy (86.41%), followed by Claude AI (85.44%), Copilot (75.73%), and Gemini (71.84%). The inter-model agreement was moderate, with Cohen's kappa coefficients ranging from 0.43 to 0.59. ChatGPT-4 and Claude AI demonstrated high sensitivity (> 90%) for most conditions but had lower performance for neoplastic diseases (sensitivity < 67%). Patients under 50 years old had a significantly higher probability of receiving a correct diagnosis with Copilot (OR = 3.36; 95% CI [1.16-9.71]; p = 0.025).
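The inter-model agreement reported above uses Cohen's kappa, which corrects observed agreement between two raters (here, two AI models) for the agreement expected by chance. A minimal sketch with invented toy labels, not the study's data:

```python
from collections import Counter

def cohens_kappa(dx_a, dx_b):
    """Cohen's kappa for two raters' categorical diagnoses:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(dx_a)
    p_o = sum(1 for a, b in zip(dx_a, dx_b) if a == b) / n
    ca, cb = Counter(dx_a), Counter(dx_b)
    # Chance agreement: product of each rater's marginal category frequencies.
    p_e = sum(ca[c] * cb.get(c, 0) for c in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative diagnoses from two hypothetical models:
model_1 = ["inf", "inf", "neo", "deg", "inf", "deg"]
model_2 = ["inf", "deg", "neo", "deg", "inf", "inf"]
kappa = cohens_kappa(model_1, model_2)  # moderate agreement on this toy data
```

Values between 0.41 and 0.60 are conventionally interpreted as "moderate" agreement, which matches the study's reported range of 0.43 to 0.59.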

Conclusion: LLMs, particularly ChatGPT-4 and Claude AI, show high diagnostic capabilities in rheumatology, despite some limitations in specific disease categories.

Clinical trial number: Not applicable.

Source journal: BMC Rheumatology (Medicine - Rheumatology)
CiteScore: 3.80
Self-citation rate: 0.00%
Articles per year: 73
Review time: 15 weeks