Ophthalmology Science, Volume 4, Issue 6, Article 100552. Published 2024-05-17. DOI: 10.1016/j.xops.2024.100552. https://www.sciencedirect.com/science/article/pii/S2666914524000885
Comparative Analysis of Vision Transformers and Conventional Convolutional Neural Networks in Detecting Referable Diabetic Retinopathy
Objective
Vision transformers (ViTs) have shown promising performance in various classification tasks previously dominated by convolutional neural networks (CNNs). However, their performance in detecting referable diabetic retinopathy (DR) remains relatively underexplored. In this study, we compared the performance of ViTs and CNNs in detecting referable DR from retinal photographs.
Design
Retrospective study.
Participants
A total of 48 269 retinal images from the open-source Kaggle DR detection dataset, the Messidor-1 dataset and the Singapore Epidemiology of Eye Diseases (SEED) study were included.
Methods
Using 41 614 retinal photographs from the Kaggle dataset, we developed 5 CNN models (Visual Geometry Group 19, ResNet50, InceptionV3, DenseNet201, and EfficientNetV2S) and 4 ViT models (VAN_small, CrossViT_small, ViT_small, and Hierarchical Vision transformer using Shifted Windows [SWIN]_tiny) for the detection of referable DR. We defined referable DR as moderate or worse DR in an eye. The performance of all 9 models was compared on the Kaggle internal test dataset (1045 study eyes) and on 2 external test sets: the SEED study (5455 study eyes) and Messidor-1 (1200 study eyes).
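The referable DR definition above amounts to thresholding an ordinal severity grade. A minimal sketch, assuming the five-level 0 to 4 grading used by the public Kaggle DR labels (0 = no DR, 1 = mild, 2 = moderate, 3 = severe, 4 = proliferative). The abstract does not state the grading scale, and the names `MODERATE_GRADE` and `is_referable` are illustrative, not taken from the study's code:

```python
# Assumed 0-4 severity scale (Kaggle-style labels); not confirmed by the abstract.
MODERATE_GRADE = 2  # hypothetical constant: lowest grade counted as referable


def is_referable(grade: int) -> bool:
    """Return True when the DR grade is moderate or worse."""
    if not 0 <= grade <= 4:
        raise ValueError(f"grade must be in 0..4, got {grade}")
    return grade >= MODERATE_GRADE


grades = [0, 1, 2, 3, 4]
labels = [int(is_referable(g)) for g in grades]
print(labels)  # [0, 0, 1, 1, 1]
```

Binarizing this way turns the five-class grading problem into the binary classification task that all 9 models were evaluated on.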
Main Outcome Measures
Area under the receiver operating characteristic curve (AUC), specificity, and sensitivity.
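For readers less familiar with these measures, the sketch below computes them for a toy set of model scores. It uses the rank-based (Mann-Whitney) formulation of AUC and assumes that a score at or above the threshold is called referable. The data and function names are invented for illustration and are not the study's evaluation code:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive.

    Assumes both classes are present in `labels`.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 = referable DR, 0 = non-referable.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc(scores, labels))                           # 0.75
print(sensitivity_specificity(scores, labels, 0.5))  # (0.5, 1.0)
```

AUC summarizes ranking quality over all thresholds, whereas sensitivity and specificity describe a single operating point.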
Results
Among all models, the SWIN transformer achieved the highest AUC on the internal test set (95.7%), significantly outperforming the CNN models (all P < 0.001). This finding held in the external test sets, where the SWIN transformer achieved an AUC of 97.3% in SEED and 96.3% in Messidor-1. With specificity fixed at 80% on the internal test set, the SWIN transformer achieved the highest sensitivity (94.4%), significantly better than all the CNN models (sensitivities ranging from 76.3% to 83.8%; all P < 0.001). This trend was also observed consistently in both external test sets.
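Reporting sensitivity at a fixed specificity means choosing the score cut-off from the non-referable eyes and then reading off the sensitivity at that cut-off. The abstract does not say how the operating point was selected (for example, whether the ROC curve was interpolated), so the following is only one common convention: among observed cut-offs whose specificity is at least the target, take the one with the highest sensitivity. The function name and toy data are hypothetical:

```python
import math


def sensitivity_at_specificity(scores, labels, target_spec=0.80):
    """Best sensitivity over cut-offs whose specificity is >= target_spec.

    A score >= cut-off is called positive (referable).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    best = 0.0
    # Candidate cut-offs: every observed score, plus +inf (nothing called positive).
    for cutoff in sorted(set(scores)) + [math.inf]:
        spec = sum(n < cutoff for n in neg) / len(neg)
        if spec >= target_spec:
            sens = sum(p >= cutoff for p in pos) / len(pos)
            best = max(best, sens)
    return best


# Toy data: 5 non-referable eyes followed by 4 referable eyes.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(sensitivity_at_specificity(scores, labels, 0.80))  # 0.75
```

Fixing specificity puts every model at the same false-positive rate, so the sensitivities can be compared directly.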
Conclusions
Our findings demonstrate that ViTs provide superior performance over CNNs in detecting referable DR from retinal photographs. These results point to the potential of utilizing ViT models to improve and optimize retinal photo-based deep learning for referable DR detection.
Financial Disclosure(s)
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.