Diagnostic Performance of ChatGPT-4o and DeepSeek-3 Differential Diagnosis of Complex Oral Lesions: A Multimodal Imaging and Case Difficulty Analysis.
Fatma E A Hassanein, Ahmed El Barbary, Radwa R Hussein, Yousra Ahmed, Jylan El-Guindy, Susan Sarhan, Asmaa Abou-Bakr
Journal: Oral Diseases (Journal Article; JCR Q1, Dentistry, Oral Surgery & Medicine; impact factor 2.9)
Publication date: 2025-07-01
DOI: https://doi.org/10.1111/odi.70007
Citations: 0
Abstract
Background: AI models like ChatGPT-4o and DeepSeek-3 show diagnostic promise, but their reliability in complex, image-based oral lesions remains unclear. This study aimed to evaluate and compare the diagnostic accuracy of the multimodal ChatGPT-4o and the text-only DeepSeek-3 against oral medicine (OM) experts across varied lesion types and case difficulty levels.
Methods: Eighty standardized clinical vignettes derived from real-world oral disease cases, including clinical images/radiographs, were evaluated. Differential diagnoses were generated by ChatGPT-4o, DeepSeek-3, and four board-certified OM specialists, with accuracy assessed at Top-1, Top-3, and Top-5 levels.
Results: OM specialists consistently achieved the highest diagnostic accuracy. However, DeepSeek-3 significantly outperformed ChatGPT-4o at the Top-3 level (p = 0.0153) and showed greater robustness in high-difficulty and inflammatory cases despite its text-only modality. Multimodal imaging enhanced diagnostic accuracy. Regression analysis indicated lesion type and imaging modality as positive predictors, while diagnostic difficulty negatively impacted Top-1 performance.
Conclusions: Remarkably, the text-only DeepSeek-3 model exceeded the diagnostic performance of the multimodal ChatGPT-4o model for complex oral lesions, highlighting its structured reasoning capabilities and reduced hallucination rate. These findings underscore the potential of non-vision LLMs in diagnostic support while emphasizing the critical need for expert oversight in complex scenarios.
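The Top-1, Top-3, and Top-5 accuracy levels named in the Methods can be illustrated with a short sketch. It assumes each rater (model or specialist) returns a ranked list of differential diagnoses and that a case counts as correct at level k when the reference diagnosis appears among the first k entries. The function name `top_k_accuracy`, the case-insensitive string matching, and the toy cases are all hypothetical and for illustration only; the abstract does not state how matches were adjudicated, and none of the values below come from the study.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], reference: Sequence[str], k: int
) -> float:
    """Fraction of cases whose reference diagnosis is within the first k ranked entries."""
    if len(ranked) != len(reference):
        raise ValueError("ranked and reference must have the same number of cases")
    if not reference:
        raise ValueError("at least one case is required")
    hits = 0
    for differentials, truth in zip(ranked, reference):
        # Assumption: a match is an exact, case-insensitive string comparison.
        top_k = {d.casefold() for d in differentials[:k]}
        if truth.casefold() in top_k:
            hits += 1
    return hits / len(reference)


# Hypothetical toy data (not from the study): four cases, ranked differentials.
differentials = [
    ["oral lichen planus", "leukoplakia", "candidiasis"],
    ["pyogenic granuloma", "peripheral ossifying fibroma", "peripheral giant cell granuloma"],
    ["aphthous ulcer", "traumatic ulcer", "squamous cell carcinoma"],
    ["fibroma", "mucocele", "lipoma"],
]
reference = [
    "oral lichen planus",
    "peripheral ossifying fibroma",
    "squamous cell carcinoma",
    "hemangioma",
]

for k in (1, 3, 5):
    print(f"Top-{k} accuracy: {top_k_accuracy(differentials, reference, k):.2f}")
```

By construction, Top-k accuracy can only stay the same or rise as k grows, which is why reporting the three levels together shows how often the correct diagnosis is listed but not ranked first.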
About the journal:
Oral Diseases is a multidisciplinary and international journal with a focus on head and neck disorders, edited by leaders in the field: Professor Giovanni Lodi (Editor-in-Chief, Milan, Italy), Professor Stefano Petti (Deputy Editor, Rome, Italy) and Associate Professor Gulshan Sunavala-Dossabhoy (Deputy Editor, Shreveport, LA, USA). The journal is pre-eminent in oral medicine. Oral Diseases specifically strives to link often-isolated areas of dentistry and medicine through broad-based scholarship that includes well-designed and controlled clinical research, analytical epidemiology, and the translation of basic science in pre-clinical studies. The journal typically publishes articles relevant to many related medical specialties, in particular dermatology, gastroenterology, hematology, immunology, infectious diseases, neuropsychiatry, oncology and otolaryngology. The essential requirement is that all submitted research is hypothesis-driven, with significant positive and negative results both welcomed. Equal publication emphasis is placed on etiology, pathogenesis, diagnosis, prevention and treatment.