Can Multimodal Large Language Models Diagnose Diabetic Retinopathy from Fundus Photos? A Quantitative Evaluation

Impact Factor 4.6 | Q1 (Ophthalmology)
Jesse A. Most, BA; Evan H. Walker, MS; Nehal N. Mehta, MD; Ines D. Nagel, MD; Jimmy S. Chen, MD; Jonathan F. Russell, MD, PhD; Nathan L. Scott, MD, MPP; Shyamanga Borooah, MBBS, PhD
Ophthalmology Science, Volume 6, Issue 1, Article 100911. Published 12 August 2025. DOI: 10.1016/j.xops.2025.100911. Available at: https://www.sciencedirect.com/science/article/pii/S266691452500209X

Abstract

Objective

To evaluate the diagnostic accuracy of 4 multimodal large language models (MLLMs) in detecting and grading diabetic retinopathy (DR) using their new image analysis features.

Design

A single-center retrospective study.

Subjects

Patients diagnosed with prediabetes and diabetes.

Methods

Ultra-widefield fundus images from patients seen at the University of California, San Diego, were graded for DR severity by 3 retina specialists using the ETDRS classification system to establish ground truth. Four MLLMs (ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini 1.5 Pro, and Perplexity Llama 3.1 Sonar/Default) were tested using 4 distinct prompts. These assessed multiple-choice disease diagnosis, binary disease classification, and disease severity. Multimodal large language models were assessed for accuracy, sensitivity, and specificity in identifying the presence or absence of DR and relative disease severity.

Main Outcome Measures

Accuracy, sensitivity, and specificity of diagnosis.
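These three outcome measures follow the standard confusion-matrix definitions; a minimal sketch for readers unfamiliar with them (the counts below are illustrative only, not taken from the study):

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int):
    """Standard confusion-matrix definitions of the three outcome measures.

    tp/fp/tn/fn = true-positive, false-positive, true-negative,
    false-negative counts for a binary diagnosis (DR vs. no DR).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true-positive rate: DR eyes correctly flagged
    specificity = tn / (tn + fp)  # true-negative rate: healthy eyes correctly cleared
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration only
acc, sens, spec = diagnostic_metrics(tp=50, fp=10, tn=80, fn=20)
# acc = 0.8125
```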

Results

A total of 309 eyes from 188 patients were included in the study. The mean patient age was 58.7 years (56.7–60.7), and 55.3% of patients were female. After specialist grading, 70.2% of eyes had DR of varying severity, and 29.8% had no DR. For disease identification with multiple choices provided, Claude and ChatGPT scored significantly higher than the other MLLMs (P < 0.0006 after Bonferroni correction) for accuracy (0.608 and 0.566) and sensitivity (0.618 and 0.641). In binary DR versus no DR classification, accuracy was highest for ChatGPT (0.644) and Perplexity (0.602). Sensitivity varied widely (ChatGPT, 0.539; Perplexity, 0.488; Claude, 0.179; Gemini, 0.042), whereas specificity was relatively high for all models (range: 0.870–0.989). For the DR severity prompt with the best overall results (Prompt 3.1), no significant differences between models were found in accuracy (Perplexity, 0.411; ChatGPT, 0.395; Gemini, 0.392; Claude, 0.314). All models demonstrated low sensitivity (Perplexity, 0.247; ChatGPT, 0.229; Gemini, 0.224; Claude, 0.184). Specificity ranged from 0.840 to 0.866.
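The corrected significance threshold above reflects the Bonferroni procedure, which divides the family-wise alpha by the number of comparisons so the chance of any false positive stays bounded. A minimal sketch (the comparison count of 10 below is illustrative, not the study's actual count):

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction.

    A raw P value counts as significant only if it falls below
    alpha / n_comparisons, keeping the family-wise error rate <= alpha.
    """
    return alpha / n_comparisons

# Illustrative only: family-wise alpha of 0.05 split across 10 pairwise tests
threshold = bonferroni_threshold(0.05, 10)  # 0.005
```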

Conclusions

Multimodal large language models are powerful tools that may eventually assist retinal image analysis. Currently, however, image analysis accuracy varies across models, and diagnostic performance falls short of the clinical standards required for safe implementation in DR diagnosis and grading. Further training, together with targeted correction of common error patterns, may enhance their clinical utility.

Financial Disclosure(s)

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.