{"title":"Visual explainability of 250 skin diseases viewed through the eyes of an AI-based, self-supervised vision transformer—A clinical perspective","authors":"Ramy Abdel Mawgoud, Christian Posch","doi":"10.1002/jvc2.580","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>Conventional supervised deep-learning approaches mostly focus on a small range of skin disease images. Recently, self-supervised (SS) Vision Transformers have emerged, capturing complex visual patterns in hundreds of classes without any need for tedious image annotation.</p>\n </section>\n \n <section>\n \n <h3> Objectives</h3>\n \n <p>This study aimed to form the basis for an inexpensive and explainable AI system, targeted at the vastness of clinical skin diagnoses by comparing so-called ‘self-attention maps’ of an SS and a supervised ViT on 250 skin diseases—visualizations showing areas of interest for each skin disease.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Using a public data set containing images of 250 different skin diseases, one small ViT was pretrained S) for 300 epochs (=ViT-SS), and two were fine-tuned supervised from ImageNet-weights for 300 epochs (=ViT-300) and for 78 epochs due to heavier regularization (=ViT-78), respectively. The models generated 250 self-attention maps each. These maps were analyzed in a blinded manner using a ‘DermAttention’ score, and the models were primarily compared based on their ability to focus on disease-relevant features.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>Visual analysis revealed that ViT-SS delivered superior self-attention-maps. It scored a significantly higher accuracy of focusing on disease-defining lesions (88%; confidence interval [CI] 95%: 0.840–0.920) compared to ViT-300 (78.4%; CI 95%: 0.733–0.835; <i>p</i> < 0.05) and ViT-78 (51.2%; CI 95%: 0.450–0.574; <i>p</i> < 0.05). It also exceeded in other subcategories of ‘DermAttention’.</p>\n </section>\n \n <section>\n \n <h3> Conclusions</h3>\n \n <p>SS pretraining did not translate to better diagnostic performance when compared to conventional supervision. However, it led to more accurate visual representations of varying skin disease images. These findings may pave the way for large-scale, explainable computer-aided skin diagnostic in an unfiltered clinical setting. Further research is needed to improve clinical outcomes using these visual tools.</p>\n </section>\n </div>","PeriodicalId":94325,"journal":{"name":"JEADV clinical practice","volume":"4 1","pages":"145-155"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/jvc2.580","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JEADV clinical practice","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/jvc2.580","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Visual explainability of 250 skin diseases viewed through the eyes of an AI-based, self-supervised vision transformer—A clinical perspective
Background
Conventional supervised deep-learning approaches are mostly restricted to a narrow range of skin disease images. Recently, self-supervised (SS) Vision Transformers (ViTs) have emerged, capturing complex visual patterns across hundreds of classes without any need for tedious image annotation.
Objectives
This study aimed to lay the groundwork for an inexpensive and explainable AI system covering the broad spectrum of clinical skin diagnoses by comparing so-called ‘self-attention maps’, visualizations showing the image regions a model attends to for each skin disease, of an SS and a supervised ViT across 250 skin diseases.
Methods
Using a public data set containing images of 250 different skin diseases, one small ViT was pretrained self-supervised (SS) for 300 epochs (=ViT-SS), and two were fine-tuned in a supervised manner from ImageNet weights, one for 300 epochs (=ViT-300) and one for 78 epochs owing to heavier regularization (=ViT-78). The models generated 250 self-attention maps each. These maps were analyzed in a blinded manner using a ‘DermAttention’ score, and the models were primarily compared based on their ability to focus on disease-relevant features.
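For readers interested in how such self-attention maps can be produced, the sketch below shows one common way of extracting the [CLS]-token attention of a ViT's final block. It assumes timm's standard ViT layout; the model name, checkpoint and image file are illustrative placeholders, not the study's exact configuration.

```python
# Minimal sketch (illustrative, not the study's exact setup): extract the
# [CLS]-token self-attention of the last transformer block of a small ViT.
# Assumes timm's standard ViT layout; model name, checkpoint and the file
# "lesion.jpg" are placeholders.
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

model = timm.create_model("vit_small_patch16_224", pretrained=True).eval()
transform = create_transform(**resolve_data_config({}, model=model))

captured = {}

def grab_qkv(module, inputs, output):
    # Output of the fused qkv projection: (batch, tokens, 3 * embed_dim)
    captured["qkv"] = output.detach()

last_attn = model.blocks[-1].attn
last_attn.qkv.register_forward_hook(grab_qkv)

img = transform(Image.open("lesion.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(img)

# Recompute the attention matrix from the captured q and k tensors.
num_heads = last_attn.num_heads
B, N, C3 = captured["qkv"].shape
head_dim = C3 // (3 * num_heads)
qkv = captured["qkv"].reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
q, k = qkv[0], qkv[1]
attn = ((q @ k.transpose(-2, -1)) * head_dim ** -0.5).softmax(dim=-1)

# [CLS] attention to the 196 image patches, averaged over heads and reshaped
# to the 14 x 14 patch grid of a 224-pixel, patch-16 model.
cls_attn = attn[0, :, 0, 1:].mean(0).reshape(14, 14)
```

Overlaying `cls_attn` (upsampled to the input resolution) on the original image yields a heat map of the regions driving the model's prediction, which is the kind of visualization scored here with 'DermAttention'.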
Results
Visual analysis revealed that ViT-SS delivered superior self-attention maps. It achieved a significantly higher accuracy in focusing on disease-defining lesions (88%; 95% confidence interval [CI]: 0.840–0.920) compared to ViT-300 (78.4%; 95% CI: 0.733–0.835; p < 0.05) and ViT-78 (51.2%; 95% CI: 0.450–0.574; p < 0.05). It also outperformed both supervised models in the other ‘DermAttention’ subcategories.
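For context, the reported intervals are consistent with a simple normal-approximation (Wald) confidence interval for a proportion over 250 maps per model. The sketch below reproduces them from hit counts back-calculated from the percentages; the CI method and counts are assumptions, since the abstract does not state them.

```python
# Sketch: reproduce the reported 95% CIs with a normal-approximation (Wald)
# interval for a proportion over 250 maps per model. Hit counts are
# back-calculated from the reported percentages, not taken from raw data.
from statsmodels.stats.proportion import proportion_confint

n_maps = 250
for label, hits in [("ViT-SS", 220), ("ViT-300", 196), ("ViT-78", 128)]:
    low, high = proportion_confint(hits, n_maps, alpha=0.05, method="normal")
    print(f"{label}: {hits / n_maps:.1%} (95% CI {low:.3f} to {high:.3f})")
```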
Conclusions
SS pretraining did not translate to better diagnostic performance when compared to conventional supervision. However, it led to more accurate visual representations across a wide variety of skin disease images. These findings may pave the way for large-scale, explainable computer-aided skin diagnostics in an unfiltered clinical setting. Further research is needed to improve clinical outcomes using these visual tools.