On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality

Impact Factor: 1.7 · CAS Zone 4, Computer Science · JCR Q3, Computer Science, Information Systems
Sara El-Ateif, Ali Idri, José Luis Fernández-Alemán
{"title":"On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality","authors":"Sara El-Ateif, Ali Idri, José Luis Fernández-Alemán","doi":"10.1108/dta-01-2023-0005","DOIUrl":null,"url":null,"abstract":"<h3>Purpose</h3>\n<p>COVID-19 continues to spread, and cause increasing deaths. Physicians diagnose COVID-19 using not only real-time polymerase chain reaction but also the computed tomography (CT) and chest x-ray (CXR) modalities, depending on the stage of infection. However, with so many patients and so few doctors, it has become difficult to keep abreast of the disease. Deep learning models have been developed in order to assist in this respect, and vision transformers are currently state-of-the-art methods, but most techniques currently focus only on one modality (CXR).</p><!--/ Abstract__block -->\n<h3>Design/methodology/approach</h3>\n<p>This work aims to leverage the benefits of both CT and CXR to improve COVID-19 diagnosis. This paper studies the differences between using convolutional MobileNetV2, ViT DeiT and Swin Transformer models when training from scratch and pretraining on the MedNIST medical dataset rather than the ImageNet dataset of natural images. The comparison is made by reporting six performance metrics, the Scott–Knott Effect Size Difference, Wilcoxon statistical test and the Borda Count method. We also use the Grad-CAM algorithm to study the model's interpretability. Finally, the model's robustness is tested by evaluating it on Gaussian noised images.</p><!--/ Abstract__block -->\n<h3>Findings</h3>\n<p>Although pretrained MobileNetV2 was the best model in terms of performance, the best model in terms of performance, interpretability, and robustness to noise is the trained from scratch Swin Transformer using the CXR (accuracy = 93.21 per cent) and CT (accuracy = 94.14 per cent) modalities.</p><!--/ Abstract__block -->\n<h3>Originality/value</h3>\n<p>Models compared are pretrained on MedNIST and leverage both the CT and CXR modalities.</p><!--/ Abstract__block -->","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"54 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Technologies and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1108/dta-01-2023-0005","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

COVID-19 continues to spread and to cause increasing numbers of deaths. Physicians diagnose COVID-19 using not only real-time polymerase chain reaction but also the computed tomography (CT) and chest x-ray (CXR) modalities, depending on the stage of infection. However, with so many patients and so few doctors, it has become difficult to keep up with the disease. Deep learning models have been developed to assist in this respect, and vision transformers are currently the state-of-the-art approach, but most current techniques focus on only one modality (CXR).

Design/methodology/approach

This work aims to leverage the benefits of both CT and CXR to improve COVID-19 diagnosis. This paper studies the differences between the convolutional MobileNetV2, the ViT DeiT and the Swin Transformer when trained from scratch versus pretrained on the MedNIST medical dataset rather than on the ImageNet dataset of natural images. The comparison is made by reporting six performance metrics, the Scott–Knott Effect Size Difference (ESD) test, the Wilcoxon statistical test and the Borda Count method. We also use the Grad-CAM algorithm to study the models' interpretability. Finally, the models' robustness is tested by evaluating them on Gaussian-noised images.
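The architectures compared can be instantiated in a few lines; the sketch below uses PyTorch with the timm library (an assumption; the abstract does not name an implementation), with an illustrative three-class head and an ImageNet-versus-scratch toggle. Loading MedNIST-pretrained weights, as the paper does, would require a separate custom checkpoint.

```python
# Minimal model-construction sketch, assuming PyTorch + timm.
# Model names and the 3-class head are illustrative, not the authors' exact setup.
import timm
import torch

NUM_CLASSES = 3  # hypothetical classes, e.g. COVID-19 / normal / other pneumonia

def build_model(name: str, from_scratch: bool) -> torch.nn.Module:
    # pretrained=True loads ImageNet weights; the paper's MedNIST pretraining
    # would instead load a custom checkpoint via model.load_state_dict(...).
    return timm.create_model(name, pretrained=not from_scratch,
                             num_classes=NUM_CLASSES)

models = {
    "mobilenetv2": build_model("mobilenetv2_100", from_scratch=False),
    "deit":        build_model("deit_base_patch16_224", from_scratch=False),
    "swin":        build_model("swin_base_patch4_window7_224", from_scratch=True),
}
```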
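For the comparison step, a hedged sketch of the Wilcoxon signed-rank test and a simple Borda count follows (SciPy's wilcoxon on paired per-fold accuracies; all numbers are made up for illustration, and the Scott–Knott ESD clustering is a separate procedure not reproduced here).

```python
# Wilcoxon signed-rank test plus Borda-count aggregation (illustrative data).
from scipy.stats import wilcoxon

# Hypothetical paired per-fold accuracies for two models on the same folds.
acc_swin = [0.94, 0.93, 0.95, 0.92, 0.94]
acc_mnet = [0.95, 0.94, 0.96, 0.93, 0.95]
stat, p = wilcoxon(acc_swin, acc_mnet)
print(f"Wilcoxon p-value: {p:.3f}")  # a small p suggests a real difference

def borda(rankings):
    """Each metric contributes one ranking (best first); a model earns
    (n_models - 1 - rank) points per metric, and the highest total wins."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for rank, model in enumerate(ranking):
            scores[model] = scores.get(model, 0) + (n - 1 - rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical rankings produced by six metrics.
rankings = [["swin", "mnet", "deit"]] * 4 + [["mnet", "swin", "deit"]] * 2
print(borda(rankings))  # [('swin', 10), ('mnet', 8), ('deit', 0)]
```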
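Interpretability via Grad-CAM could look like the following sketch. It uses the third-party pytorch-grad-cam package (an assumption; the abstract names the Grad-CAM algorithm, not a specific implementation) with the MobileNetV2 branch and a random tensor standing in for a preprocessed CXR/CT image.

```python
# Grad-CAM sketch, assuming the pytorch-grad-cam package (pip install grad-cam).
import timm
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = timm.create_model("mobilenetv2_100", pretrained=True, num_classes=3)
target_layers = [model.blocks[-1]]          # last conv stage of timm's MobileNetV2
input_tensor = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed image

cam = GradCAM(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(0)])[0]  # class 0: hypothetical COVID-19
print(heatmap.shape)  # (224, 224) saliency map to overlay on the input image
```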
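The robustness check amounts to re-evaluating a trained model on noise-corrupted copies of the test images. A minimal sketch with additive Gaussian noise follows; the sigma value is an assumption, since the abstract does not give the noise parameters.

```python
# Robustness sketch: accuracy on Gaussian-noised inputs (sigma is assumed).
import torch

@torch.no_grad()
def accuracy_under_noise(model, loader, sigma=0.1, device="cpu"):
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        noisy = images + sigma * torch.randn_like(images)  # additive Gaussian noise
        preds = model(noisy.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total  # compare against clean-input accuracy
```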

Findings

Although the pretrained MobileNetV2 was the best model in terms of raw performance, the best model overall, in terms of performance, interpretability and robustness to noise, is the Swin Transformer trained from scratch, using the CXR (accuracy = 93.21 per cent) and CT (accuracy = 94.14 per cent) modalities.

Originality/value

The models compared are pretrained on MedNIST and leverage both the CT and CXR modalities.

Source journal

Data Technologies and Applications (previously published as: Program; online from 2018)
Subject area: Information & Knowledge Management, Library Studies; Social Sciences-Library and Information Sciences
CiteScore: 3.80 · Self-citation rate: 6.20% · Articles per year: 29