{"title":"Diabetic Retinopathy Detection using CNN, Transformer and MLP based Architectures","authors":"N. S. Kumar, Badri Karthikeyan","doi":"10.1109/ISPACS51563.2021.9651024","DOIUrl":null,"url":null,"abstract":"Diabetic retinopathy is a chronic disease caused due to a long term accumulation of insulin in the retinal blood vessels. 2.6% of global blindness is a result of diabetic retinopathy (DR) with more than 150 million people affected. Early detection of DR plays an important role in preventing blindness. Use of deep learning is a long term solution to screen, diagnose and monitor patients within primary health centers. Attention based networks (Transformers), Convolutional neural networks (CNN) and multi-layered perceptrons (MPLs) are the current state-of-the-art architectures for addressing computer vision based problem statements. In this paper, we evaluate these three different architectures for the detection of DR. Model convegence time (training time), accuracy, model size are few of the metrics that have been used for this evaluation. State-of-the-art pre-trained models belonging to each of these architectures have been chosen for these experiments. The models include EfficientNet, ResNet, Swin-Transformer, Vision-Transformer (ViT) and MLP-Mixer. These models have been trained using Kaggle dataset, which contains more than 3600 annotated images with a resolution of 2416*1736. For fair comparisons, no augmentation techniques have been used to improve the performance. Results of the experiments indicate that the models based on Transformer based architecture are the most accurate and also have comparative model-convergence times compared to CNN and MLP architectures. Among all the state-of-the-art pre-trained models Swin-Transformer yields the best accuracy of 86.4% on test dataset and it takes around 12 minutes for training the model on a Tesla K80 GPU.","PeriodicalId":359822,"journal":{"name":"2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS)","volume":"168 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPACS51563.2021.9651024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7
Abstract
Diabetic retinopathy is a chronic disease caused by long-term damage to the retinal blood vessels from elevated blood glucose. Diabetic retinopathy (DR) accounts for 2.6% of global blindness, with more than 150 million people affected. Early detection of DR plays an important role in preventing blindness, and deep learning offers a long-term solution for screening, diagnosing and monitoring patients within primary health centers. Attention-based networks (Transformers), convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs) are the current state-of-the-art architectures for computer vision problems. In this paper, we evaluate these three architectures for the detection of DR. Model convergence time (training time), accuracy and model size are among the metrics used for this evaluation. State-of-the-art pre-trained models belonging to each of these architectures have been chosen for the experiments: EfficientNet, ResNet, Swin-Transformer, Vision Transformer (ViT) and MLP-Mixer. These models have been trained on the Kaggle dataset, which contains more than 3600 annotated images with a resolution of 2416×1736. For a fair comparison, no augmentation techniques were used to improve performance. The experimental results indicate that the Transformer-based models are the most accurate and also have model-convergence times comparable to the CNN and MLP architectures. Among all the state-of-the-art pre-trained models, the Swin-Transformer yields the best accuracy of 86.4% on the test dataset and takes around 12 minutes to train on a Tesla K80 GPU.
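The abstract does not specify the implementation details of the comparison. The sketch below is a hypothetical illustration, assuming PyTorch and the timm library, of how the named pre-trained backbones from the three architecture families could be instantiated with a DR classification head and compared on the model-size metric; the model identifiers and the 5-class DR grading are assumptions, not taken from the paper.

```python
# Hypothetical sketch (not the authors' code): load ImageNet pre-trained backbones
# from the three architecture families via timm and report a simple model-size metric.
import time

import timm
import torch

# DR severity is commonly graded into 5 classes (0 = no DR ... 4 = proliferative DR);
# the paper does not state its label scheme, so this is an assumption.
NUM_CLASSES = 5

# Representative pre-trained models for each architecture family named in the abstract.
BACKBONES = {
    "CNN / EfficientNet":    "efficientnet_b0",
    "CNN / ResNet":          "resnet50",
    "Transformer / Swin":    "swin_base_patch4_window7_224",
    "Transformer / ViT":     "vit_base_patch16_224",
    "MLP / MLP-Mixer":       "mixer_b16_224",
}


def build_model(name: str) -> torch.nn.Module:
    """Load a pre-trained backbone and replace its head for DR classification."""
    return timm.create_model(name, pretrained=True, num_classes=NUM_CLASSES)


def model_size_mb(model: torch.nn.Module) -> float:
    """Model size: total parameter count in megabytes, assuming float32 weights."""
    return sum(p.numel() for p in model.parameters()) * 4 / 1e6


if __name__ == "__main__":
    for family, name in BACKBONES.items():
        start = time.time()
        model = build_model(name)
        print(f"{family:22s} {name:30s} "
              f"{model_size_mb(model):8.1f} MB  (load time: {time.time() - start:.1f} s)")
```

Fine-tuning each model on the fundus images and timing the training loop would then give the convergence-time and accuracy comparison the abstract describes; those steps are omitted here since the paper's training hyperparameters are not given.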