Assessing the Diagnostic Capabilities of ChatGPT-4 Omni in Grading Diabetic Retinopathy Fundoscopy Using Color Fundus Photographs.

Clinical ophthalmology (Auckland, N.Z.) Pub Date: 2025-08-31 eCollection Date: 2025-01-01 DOI: 10.2147/OPTH.S517238
Nitin Chetla, Sai S Samayamanthula, Joseph He Chang, Arnold Y Leigh, Sinan Akosman, Mihir Tandon, Tamer R Hage, Michael Cusick
{"title":"利用彩色眼底照片评估ChatGPT-4 Omni在糖尿病视网膜病变眼底镜分级中的诊断能力。","authors":"Nitin Chetla, Sai S Samayamanthula, Joseph He Chang, Arnold Y Leigh, Sinan Akosman, Mihir Tandon, Tamer R Hage, Michael Cusick","doi":"10.2147/OPTH.S517238","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Despite the importance of early DR detection, only 60% of patients with diabetes receive recommended annual screenings due to limited eye care provider capacity. FDA-approved AI systems were developed to meet the growing demand for DR screening; however, high costs and specialized equipment limit accessibility. More accessible and equally as accurate AI systems need to be evaluated to combat this disparity. This study evaluated the diagnostic accuracy of ChatGPT-4 Omni (GPT-4o) in classifying DR from color fundus photographs (CFPs) to assess its potential as a low-cost alternative screening tool.</p><p><strong>Methods: </strong>We utilized the publicly available EyePACS DR detection competition dataset from Kaggle, which includes 2,500 CFPs representing no DR, mild DR, moderate DR, severe DR, and proliferative DR. Each image was presented to GPT-4o with 1 of 8 prompts designed to enhance the model's accuracy. The results were analyzed through confusion matrices, and metrics such as accuracy, precision, sensitivity, specificity, and F1 scores were calculated to evaluate performance.</p><p><strong>Results: </strong>In prompts 1-3, GPT-4o showed a strong bias towards classifying images as no DR, with an average accuracy of 51.0%, while accuracy for other stages ranged from 70% to 80%. GPT-4o struggled with misclassifications, particularly between adjacent DR levels. It performed best in detecting proliferative DR (Level 4), achieving an F1 score above 0.3 and accuracy exceeding 80%. In binary classification tasks (Prompts 4.1-4.4), GPT-4o's performance improved, though it still had difficulty distinguishing mild DR (49.8% accuracy). When compared to FDA-approved AI systems, GPT-4o's sensitivity (47.7%) and specificity (73.8%) were significantly lower.</p><p><strong>Conclusion: </strong>While GPT-4o shows promise identifying severe DR, limitations in distinguishing early stages exist and highlight the need for further refinement before clinical usage in DR screening. Unlike traditional CNN-based tools like IDx-DR, GPT-4o is a multimodal foundation model with a fundamentally different architecture and training process, which may contribute to its diagnostic limitations. 
GPT-4o and other LLMs are not designed to learn about important DR features like microaneurysms or hemorrhages using pixel data which is why they may struggle to detect DR compared to CNN models.</p>","PeriodicalId":93945,"journal":{"name":"Clinical ophthalmology (Auckland, N.Z.)","volume":"19 ","pages":"3103-3112"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12411675/pdf/","citationCount":"0","resultStr":"{\"title\":\"Assessing the Diagnostic Capabilities of ChatGPT-4 Omni in Grading Diabetic Retinopathy Fundoscopy Using Color Fundus Photographs.\",\"authors\":\"Nitin Chetla, Sai S Samayamanthula, Joseph He Chang, Arnold Y Leigh, Sinan Akosman, Mihir Tandon, Tamer R Hage, Michael Cusick\",\"doi\":\"10.2147/OPTH.S517238\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Despite the importance of early DR detection, only 60% of patients with diabetes receive recommended annual screenings due to limited eye care provider capacity. FDA-approved AI systems were developed to meet the growing demand for DR screening; however, high costs and specialized equipment limit accessibility. More accessible and equally as accurate AI systems need to be evaluated to combat this disparity. This study evaluated the diagnostic accuracy of ChatGPT-4 Omni (GPT-4o) in classifying DR from color fundus photographs (CFPs) to assess its potential as a low-cost alternative screening tool.</p><p><strong>Methods: </strong>We utilized the publicly available EyePACS DR detection competition dataset from Kaggle, which includes 2,500 CFPs representing no DR, mild DR, moderate DR, severe DR, and proliferative DR. Each image was presented to GPT-4o with 1 of 8 prompts designed to enhance the model's accuracy. The results were analyzed through confusion matrices, and metrics such as accuracy, precision, sensitivity, specificity, and F1 scores were calculated to evaluate performance.</p><p><strong>Results: </strong>In prompts 1-3, GPT-4o showed a strong bias towards classifying images as no DR, with an average accuracy of 51.0%, while accuracy for other stages ranged from 70% to 80%. GPT-4o struggled with misclassifications, particularly between adjacent DR levels. It performed best in detecting proliferative DR (Level 4), achieving an F1 score above 0.3 and accuracy exceeding 80%. In binary classification tasks (Prompts 4.1-4.4), GPT-4o's performance improved, though it still had difficulty distinguishing mild DR (49.8% accuracy). When compared to FDA-approved AI systems, GPT-4o's sensitivity (47.7%) and specificity (73.8%) were significantly lower.</p><p><strong>Conclusion: </strong>While GPT-4o shows promise identifying severe DR, limitations in distinguishing early stages exist and highlight the need for further refinement before clinical usage in DR screening. Unlike traditional CNN-based tools like IDx-DR, GPT-4o is a multimodal foundation model with a fundamentally different architecture and training process, which may contribute to its diagnostic limitations. 
GPT-4o and other LLMs are not designed to learn about important DR features like microaneurysms or hemorrhages using pixel data which is why they may struggle to detect DR compared to CNN models.</p>\",\"PeriodicalId\":93945,\"journal\":{\"name\":\"Clinical ophthalmology (Auckland, N.Z.)\",\"volume\":\"19 \",\"pages\":\"3103-3112\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-08-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12411675/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Clinical ophthalmology (Auckland, N.Z.)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2147/OPTH.S517238\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical ophthalmology (Auckland, N.Z.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2147/OPTH.S517238","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Despite the importance of early DR detection, only 60% of patients with diabetes receive the recommended annual screening, owing to limited eye care provider capacity. FDA-approved AI systems were developed to meet the growing demand for DR screening; however, high costs and specialized equipment limit accessibility. More accessible and equally accurate AI systems need to be evaluated to address this disparity. This study evaluated the diagnostic accuracy of ChatGPT-4 Omni (GPT-4o) in classifying DR from color fundus photographs (CFPs) to assess its potential as a low-cost alternative screening tool.

Methods: We utilized the publicly available EyePACS DR detection competition dataset from Kaggle, which includes 2,500 CFPs representing no DR, mild DR, moderate DR, severe DR, and proliferative DR. Each image was presented to GPT-4o with 1 of 8 prompts designed to enhance the model's accuracy. The results were analyzed through confusion matrices, and metrics such as accuracy, precision, sensitivity, specificity, and F1 scores were calculated to evaluate performance.
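The abstract does not reproduce the eight prompts or the API settings used, but the evaluation loop it describes (one image plus one prompt per request) can be sketched as follows. This is a minimal illustration assuming the official OpenAI Python SDK; the prompt wording, folder name, and temperature are hypothetical and are not the authors' protocol.

```python
import base64
from pathlib import Path

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical grading prompt; the study's eight prompts are not reproduced here.
GRADING_PROMPT = (
    "You are grading a color fundus photograph for diabetic retinopathy. "
    "Reply with a single integer: 0 = no DR, 1 = mild, 2 = moderate, "
    "3 = severe, 4 = proliferative."
)

def grade_fundus_image(image_path: Path) -> str:
    """Send one CFP to GPT-4o and return the model's raw text reply."""
    b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # assumption: deterministic replies preferred for grading
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": GRADING_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Example: grade every image in a local folder of EyePACS CFPs (path is hypothetical).
predictions = {p.name: grade_fundus_image(p)
               for p in sorted(Path("eyepacs_cfps").glob("*.jpeg"))}
```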

Results: In prompts 1-3, GPT-4o showed a strong bias towards classifying images as no DR, with an average accuracy of 51.0%, while accuracy for other stages ranged from 70% to 80%. GPT-4o struggled with misclassifications, particularly between adjacent DR levels. It performed best in detecting proliferative DR (Level 4), achieving an F1 score above 0.3 and accuracy exceeding 80%. In binary classification tasks (Prompts 4.1-4.4), GPT-4o's performance improved, though it still had difficulty distinguishing mild DR (49.8% accuracy). When compared to FDA-approved AI systems, GPT-4o's sensitivity (47.7%) and specificity (73.8%) were significantly lower.
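The per-stage sensitivity, specificity, and F1 values above follow directly from a multi-class confusion matrix. A minimal sketch, assuming scikit-learn and using made-up labels rather than the study's data, shows the one-vs-rest calculation the abstract refers to:

```python
from sklearn.metrics import confusion_matrix

# Illustrative ground-truth and predicted DR grades only; not the study's data.
y_true = [0, 0, 1, 2, 2, 3, 4, 4, 1, 0]
y_pred = [0, 1, 1, 2, 1, 3, 4, 3, 0, 0]
labels = [0, 1, 2, 3, 4]  # 0 = no DR ... 4 = proliferative DR

cm = confusion_matrix(y_true, y_pred, labels=labels)

for i, level in enumerate(labels):
    tp = cm[i, i]              # correctly graded as this level
    fn = cm[i, :].sum() - tp   # this level graded as something else
    fp = cm[:, i].sum() - tp   # other levels graded as this level
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    print(f"DR level {level}: sensitivity={sensitivity:.2f} "
          f"specificity={specificity:.2f} F1={f1:.2f}")
```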

Conclusion: While GPT-4o shows promise in identifying severe DR, its limitations in distinguishing early stages highlight the need for further refinement before clinical use in DR screening. Unlike traditional CNN-based tools such as IDx-DR, GPT-4o is a multimodal foundation model with a fundamentally different architecture and training process, which may contribute to its diagnostic limitations. GPT-4o and other LLMs are not designed to learn important DR features such as microaneurysms or hemorrhages from pixel data, which may explain why they struggle to detect DR compared with CNN models.
