{"title":"Image-Based Diagnostic Performance of LLMs vs CNNs for Oral Lichen Planus: Example-Guided and Differential Diagnosis","authors":"Paak Rewthamrongsris , Jirayu Burapacheep , Ekarat Phattarataratip , Promphakkon Kulthanaamondhita , Antonin Tichy , Falk Schwendicke , Thanaphum Osathanon , Kraisorn Sappayatosok","doi":"10.1016/j.identj.2025.100848","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction and aims</h3><div>The overlapping characteristics of oral lichen planus (OLP), a chronic oral mucosal inflammatory condition, with those of other oral lesions, present diagnostic challenges. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural networks (CNNs) constitute an alternative diagnostic modality. We evaluated the ability of seven LLMs, including both proprietary and open-source models, to detect OLP from intraoral images and generate differential diagnoses.</div></div><div><h3>Methods</h3><div>Using a dataset with 1,142 clinical photographs of histopathologically confirmed OLP, non-OLP lesions, and normal mucosa. The LLMs were tested using three experimental designs: zero-shot recognition, example-guided recognition, and differential diagnosis. Performance was measured using accuracy, precision, recall, F1-score, and discounted cumulative gain (DCG). Furthermore, the performance of LLMs was compared with three previously published CNN-based models for OLP detection on a subset of 110 photographs, which were previously used to test the CNN models.</div></div><div><h3>Results</h3><div>Gemini 1.5 Pro and Flash demonstrated the highest accuracy (69.69%) in zero-shot recognition, whereas GPT-4o ranked first in the F1 score (76.10%). With example-guided prompts, which improved consistency and reduced refusal rates, Gemini 1.5 Flash achieved the highest accuracy (80.53%) and F1-score (84.54%); however, Claude 3.5 Sonnet achieved the highest DCG score of 0.63. Although the proprietary models generally excelled, the open-source Llama model demonstrated notable strengths in ranking relevant diagnoses despite moderate performance in detection tasks. All LLMs were outperformed by the CNN models.</div></div><div><h3>Conclusion</h3><div>The seven evaluated LLMs lack sufficient performance for clinical use. CNNs trained to detect OLP outperformed the LLMs tested in this study.</div></div>","PeriodicalId":13785,"journal":{"name":"International dental journal","volume":"75 4","pages":"Article 100848"},"PeriodicalIF":3.2000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International dental journal","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0020653925001376","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Abstract
Introduction and aims
The clinical features of oral lichen planus (OLP), a chronic inflammatory condition of the oral mucosa, overlap with those of other oral lesions, which makes diagnosis challenging. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural networks (CNNs) offer an alternative diagnostic modality. We evaluated the ability of seven LLMs, both proprietary and open-source, to detect OLP in intraoral images and to generate differential diagnoses.
Methods
Seven LLMs were tested on a dataset of 1,142 clinical photographs of histopathologically confirmed OLP, non-OLP lesions, and normal mucosa, using three experimental designs: zero-shot recognition, example-guided recognition, and differential diagnosis. Performance was measured using accuracy, precision, recall, F1-score, and discounted cumulative gain (DCG). In addition, LLM performance was compared with that of three previously published CNN-based OLP-detection models on a subset of 110 photographs that had previously been used to test those CNN models.
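For readers less familiar with these metrics, the sketch below shows how they can be computed on hypothetical labels. scikit-learn and the standard log2-discounted DCG formulation are assumptions; the abstract does not state which implementations the authors used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical binary labels: 1 = OLP, 0 = non-OLP lesion or normal mucosa.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

print(f"accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"precision: {precision_score(y_true, y_pred):.4f}")
print(f"recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.4f}")

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance scores:
    DCG = sum_i rel_i / log2(i + 1), with 1-based rank i."""
    return sum(rel / np.log2(i + 1) for i, rel in enumerate(relevances, start=1))

# Hypothetical differential-diagnosis ranking with binary relevance: the
# correct diagnosis (rel = 1) appears at rank 2 of a 5-item list.
ranked_relevances = [0, 1, 0, 0, 0]
print(f"DCG: {dcg(ranked_relevances):.4f}")  # 1 / log2(3) ≈ 0.6309
```

Under this binary-relevance formulation, a DCG near 0.63 corresponds roughly to the correct diagnosis sitting at the second rank on average; the exact relevance grading used in the study is not specified in the abstract.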
Results
Gemini 1.5 Pro and Flash demonstrated the highest accuracy (69.69%) in zero-shot recognition, whereas GPT-4o ranked first in F1-score (76.10%). With example-guided prompts, which improved consistency and reduced refusal rates, Gemini 1.5 Flash achieved the highest accuracy (80.53%) and F1-score (84.54%), whereas Claude 3.5 Sonnet achieved the highest DCG (0.63). Although the proprietary models generally excelled, the open-source Llama model, despite moderate performance in the detection tasks, showed notable strength in ranking relevant diagnoses. All LLMs were outperformed by the CNN models.
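The abstract does not reproduce the prompts, but an example-guided (few-shot) image prompt of the kind described could look like the following minimal sketch. The OpenAI-style chat API, model name, image URLs, and prompt wording are all illustrative assumptions, not the study's actual protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical few-shot prompt: two labelled reference images precede the
# unlabelled test image. URLs and wording are illustrative only.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are an oral pathology assistant. Answer 'OLP' or 'non-OLP'."},
        {"role": "user", "content": [
            {"type": "text", "text": "Example 1 (label: OLP):"},
            {"type": "image_url", "image_url": {"url": "https://example.com/olp_example.jpg"}},
            {"type": "text", "text": "Example 2 (label: non-OLP):"},
            {"type": "image_url", "image_url": {"url": "https://example.com/non_olp_example.jpg"}},
            {"type": "text", "text": "Classify this intraoral photograph:"},
            {"type": "image_url", "image_url": {"url": "https://example.com/test_case.jpg"}},
        ]},
    ],
)
print(response.choices[0].message.content)
```

Providing labelled reference images in context gives the model a concrete answer format to imitate, which is one plausible explanation for the improved consistency and lower refusal rates reported above.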
Conclusion
None of the seven evaluated LLMs achieved sufficient performance for clinical use. CNNs trained to detect OLP outperformed all LLMs tested in this study.
Journal description
The International Dental Journal features peer-reviewed scientific articles relevant to international oral health issues, as well as practical, informative articles aimed at clinicians.