{"title":"大型语言模型在非洲风湿病学中的表现:ChatGPT-4、Gemini、Copilot和Claude人工智能的诊断测试准确性研究","authors":"Yannick Laurent Tchenadoyo Bayala, Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo, Dieu-Donné Ouedraogo, Fulgence Kaboré, Charles Sougué, Aristide Relwendé Yameogo, Wendlassida Martin Nacanabo, Ismael Ayouba Tinni, Aboubakar Ouedraogo, Yamyellé Enselme Zongo","doi":"10.1186/s41927-025-00512-z","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) tools, particularly Large Language Models (LLMs), are revolutionizing medical practice, including rheumatology. However, their diagnostic capabilities remain underexplored in the African context. To assess the diagnostic accuracy of ChatGPT-4, Gemini, Copilot, and Claude AI in rheumatology within an African population.</p><p><strong>Methods: </strong>This was a cross-sectional analytical study with retrospective data collection, conducted at the Rheumatology Department of Bogodogo University Hospital Center (Burkina Faso) from January 1 to June 30, 2024. Standardized clinical and paraclinical data from 103 patients were submitted to the four AI models. The diagnoses proposed by the AIs were compared to expert-confirmed diagnoses established by a panel of senior rheumatologists. Diagnostic accuracy, sensitivity, specificity, and predictive values were calculated for each AI model.</p><p><strong>Results: </strong>Among the patients enrolled in the study period, infectious diseases constituted the most common diagnostic category, representing 47.57% (n = 49). ChatGPT-4 achieved the highest diagnostic accuracy (86.41%), followed by Claude AI (85.44%), Copilot (75.73%), and Gemini (71.84%). The inter-model agreement was moderate, with Cohen's kappa coefficients ranging from 0.43 to 0.59. ChatGPT-4 and Claude AI demonstrated high sensitivity (> 90%) for most conditions but had lower performance for neoplastic diseases (sensitivity < 67%). 
Patients under 50 years old had a significantly higher probability of receiving a correct diagnosis with Copilot (OR = 3.36; 95% CI [1.16-9.71]; p = 0.025).</p><p><strong>Conclusion: </strong>LLMs, particularly ChatGPT-4 and Claude AI, show high diagnostic capabilities in rheumatology, despite some limitations in specific disease categories.</p><p><strong>Clinical trial number: </strong>Not applicable.</p>","PeriodicalId":9150,"journal":{"name":"BMC Rheumatology","volume":"9 1","pages":"54"},"PeriodicalIF":2.1000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083132/pdf/","citationCount":"0","resultStr":"{\"title\":\"Performance of the Large Language Models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence.\",\"authors\":\"Yannick Laurent Tchenadoyo Bayala, Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo, Dieu-Donné Ouedraogo, Fulgence Kaboré, Charles Sougué, Aristide Relwendé Yameogo, Wendlassida Martin Nacanabo, Ismael Ayouba Tinni, Aboubakar Ouedraogo, Yamyellé Enselme Zongo\",\"doi\":\"10.1186/s41927-025-00512-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Artificial intelligence (AI) tools, particularly Large Language Models (LLMs), are revolutionizing medical practice, including rheumatology. However, their diagnostic capabilities remain underexplored in the African context. To assess the diagnostic accuracy of ChatGPT-4, Gemini, Copilot, and Claude AI in rheumatology within an African population.</p><p><strong>Methods: </strong>This was a cross-sectional analytical study with retrospective data collection, conducted at the Rheumatology Department of Bogodogo University Hospital Center (Burkina Faso) from January 1 to June 30, 2024. 
Standardized clinical and paraclinical data from 103 patients were submitted to the four AI models. The diagnoses proposed by the AIs were compared to expert-confirmed diagnoses established by a panel of senior rheumatologists. Diagnostic accuracy, sensitivity, specificity, and predictive values were calculated for each AI model.</p><p><strong>Results: </strong>Among the patients enrolled in the study period, infectious diseases constituted the most common diagnostic category, representing 47.57% (n = 49). ChatGPT-4 achieved the highest diagnostic accuracy (86.41%), followed by Claude AI (85.44%), Copilot (75.73%), and Gemini (71.84%). The inter-model agreement was moderate, with Cohen's kappa coefficients ranging from 0.43 to 0.59. ChatGPT-4 and Claude AI demonstrated high sensitivity (> 90%) for most conditions but had lower performance for neoplastic diseases (sensitivity < 67%). Patients under 50 years old had a significantly higher probability of receiving a correct diagnosis with Copilot (OR = 3.36; 95% CI [1.16-9.71]; p = 0.025).</p><p><strong>Conclusion: </strong>LLMs, particularly ChatGPT-4 and Claude AI, show high diagnostic capabilities in rheumatology, despite some limitations in specific disease categories.</p><p><strong>Clinical trial number: </strong>Not applicable.</p>\",\"PeriodicalId\":9150,\"journal\":{\"name\":\"BMC Rheumatology\",\"volume\":\"9 1\",\"pages\":\"54\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083132/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC 
Rheumatology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s41927-025-00512-z\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"RHEUMATOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Rheumatology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41927-025-00512-z","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
Performance of the Large Language Models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence.
Background: Artificial intelligence (AI) tools, particularly Large Language Models (LLMs), are revolutionizing medical practice, including rheumatology. However, their diagnostic capabilities remain underexplored in the African context. This study aimed to assess the diagnostic accuracy of ChatGPT-4, Gemini, Copilot, and Claude AI in rheumatology within an African population.
Methods: This was a cross-sectional analytical study with retrospective data collection, conducted at the Rheumatology Department of Bogodogo University Hospital Center (Burkina Faso) from January 1 to June 30, 2024. Standardized clinical and paraclinical data from 103 patients were submitted to the four AI models. The diagnoses proposed by the AIs were compared to expert-confirmed diagnoses established by a panel of senior rheumatologists. Diagnostic accuracy, sensitivity, specificity, and predictive values were calculated for each AI model.
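The metrics named above (diagnostic accuracy, sensitivity, specificity, and predictive values) all derive from a per-diagnosis 2×2 confusion matrix. A minimal Python sketch of those standard definitions, using illustrative counts rather than the study's data:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic test metrics from 2x2 confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # proportion of all cases classified correctly
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Illustrative counts only; not the study's data.
m = diagnostic_metrics(tp=45, fp=5, fn=4, tn=49)
print({k: round(v, 4) for k, v in m.items()})
```

Each metric is computed per AI model (and per diagnostic category), which is how per-model figures such as the accuracies reported in the Results can be compared.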
Results: Among the patients enrolled in the study period, infectious diseases constituted the most common diagnostic category, representing 47.57% (n = 49). ChatGPT-4 achieved the highest diagnostic accuracy (86.41%), followed by Claude AI (85.44%), Copilot (75.73%), and Gemini (71.84%). The inter-model agreement was moderate, with Cohen's kappa coefficients ranging from 0.43 to 0.59. ChatGPT-4 and Claude AI demonstrated high sensitivity (> 90%) for most conditions but had lower performance for neoplastic diseases (sensitivity < 67%). Patients under 50 years old had a significantly higher probability of receiving a correct diagnosis with Copilot (OR = 3.36; 95% CI [1.16-9.71]; p = 0.025).
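Cohen's kappa, used above to quantify inter-model agreement, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch with hypothetical diagnosis labels (not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same cases."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of cases where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of p_a(category) * p_b(category).
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two models over five cases.
a = ["infectious", "infectious", "neoplastic", "inflammatory", "infectious"]
b = ["infectious", "inflammatory", "neoplastic", "inflammatory", "infectious"]
print(round(cohens_kappa(a, b), 4))  # → 0.6875
```

On the usual Landis–Koch interpretation scale, kappa values between 0.41 and 0.60, as reported here (0.43 to 0.59), indicate moderate agreement.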
Conclusion: LLMs, particularly ChatGPT-4 and Claude AI, show high diagnostic capabilities in rheumatology, despite some limitations in specific disease categories.