{"title":"HyCoViT: Hybrid Convolution Vision Transformer With Dynamic Dropout for Enhanced Medical Chest X-Ray Classification","authors":"Omid Almasi Naghash;Nam Ling;Xiang Li","doi":"10.1109/ACCESS.2025.3584065","DOIUrl":null,"url":null,"abstract":"Medical chest X-ray (CXR) classification necessitates balancing detailed local feature extraction with capturing broader, long-range dependencies, especially when working with limited and heterogeneous datasets. In this paper, we propose HyCoViT, a hybrid model that integrates a custom Convolutional Neural Network (CNN) block with Vision Transformers (ViTs). This approach combines the locality of CNN-based latent space representations with the global attention mechanisms of ViTs. To address overfitting in data-scarce scenarios, we introduce a Dynamic Dropout (DD) algorithm that adaptively adjusts the dropout rate during training. Additionally, we enhance model generalization using a combination of traditional data augmentation and MixUp techniques. We evaluate HyCoViT on a multi-class classification task involving COVID-19, pneumonia, lung opacity, and normal CXR images. While COVID-19 serves as a case study, the model’s design is generalizable to various medical imaging applications. Experimental results show that HyCoViT achieves state-of-the-art (SOTA) performance, with 98.81% accuracy for three-class surpassing the existing CNN-based model by average +4.90%., and SOTA transformer-based average by 2.05%. In four-class classification, HyCoViT achieves the highest accuracy at 96.56%, which is 8.32% higher than the average accuracy of SOTA CNN-based models and 4.96% higher than the average accuracy of other SOTA transformer-based models. These results surpass many existing CNN-based and transformer-based models, demonstrating the robust generalization capabilities of our method. Furthermore, we provide interpretable, attention-based visualizations that highlight crucial lung regions to support context-aware decisions and ultimately improve patient outcomes.","PeriodicalId":13079,"journal":{"name":"IEEE Access","volume":"13 ","pages":"112623-112641"},"PeriodicalIF":3.4000,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11059244","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Access","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11059244/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Medical chest X-ray (CXR) classification necessitates balancing detailed local feature extraction with capturing broader, long-range dependencies, especially when working with limited and heterogeneous datasets. In this paper, we propose HyCoViT, a hybrid model that integrates a custom Convolutional Neural Network (CNN) block with Vision Transformers (ViTs). This approach combines the locality of CNN-based latent space representations with the global attention mechanisms of ViTs. To address overfitting in data-scarce scenarios, we introduce a Dynamic Dropout (DD) algorithm that adaptively adjusts the dropout rate during training. Additionally, we enhance model generalization using a combination of traditional data augmentation and MixUp techniques. We evaluate HyCoViT on a multi-class classification task involving COVID-19, pneumonia, lung opacity, and normal CXR images. While COVID-19 serves as a case study, the model’s design is generalizable to various medical imaging applications. Experimental results show that HyCoViT achieves state-of-the-art (SOTA) performance, with 98.81% accuracy in three-class classification, surpassing the average accuracy of existing CNN-based models by 4.90% and that of SOTA transformer-based models by 2.05%. In four-class classification, HyCoViT achieves the highest accuracy at 96.56%, which is 8.32% higher than the average accuracy of SOTA CNN-based models and 4.96% higher than the average accuracy of other SOTA transformer-based models. These results surpass many existing CNN-based and transformer-based models, demonstrating the robust generalization capabilities of our method. Furthermore, we provide interpretable, attention-based visualizations that highlight crucial lung regions to support context-aware decisions and ultimately improve patient outcomes.
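The abstract describes the hybrid design only at a high level: a custom CNN block supplies local latent features, which a ViT-style encoder then relates globally through self-attention. The following minimal PyTorch sketch illustrates that general pattern; the layer widths, depth, and mean-pooled classification head are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of a CNN-stem + Transformer-encoder hybrid (illustrative only).
import torch
import torch.nn as nn

class HybridCNNViT(nn.Module):
    def __init__(self, num_classes=4, embed_dim=256, depth=4, heads=8):
        super().__init__()
        # CNN stem: local feature extraction (hypothetical layout).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # ViT-style encoder: global self-attention over the CNN feature grid.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = self.cnn(x)                        # (B, C, H', W') local features
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C) token sequence
        encoded = self.encoder(tokens)             # global attention across tokens
        return self.head(encoded.mean(dim=1))      # pooled logits per class
```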
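The Dynamic Dropout update rule is not given in the abstract, only that the dropout rate is adapted during training to curb overfitting on scarce data. Below is a minimal sketch of one plausible heuristic, assuming the rate is raised when the train/validation loss gap widens (an overfitting signal) and lowered otherwise; the paper's actual DD algorithm may differ.

```python
# Hypothetical dynamic-dropout heuristic: not the paper's exact rule.
import torch.nn as nn

def adjust_dropout(model: nn.Module, train_loss: float, val_loss: float,
                   step: float = 0.05, p_min: float = 0.1, p_max: float = 0.6) -> None:
    """Nudge the rate of every nn.Dropout in the model; call once per epoch."""
    overfitting = (val_loss - train_loss) > 0.0  # widening generalization gap
    delta = step if overfitting else -step
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = min(p_max, max(p_min, module.p + delta))
```

Called after each validation pass, a rule like this increases regularization pressure precisely when the model starts memorizing a small training set, which matches the motivation the abstract gives for DD.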
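MixUp, by contrast, is a standard published technique (Zhang et al., 2018), so its usual formulation can be shown directly: each batch is replaced by convex combinations of image pairs, and the loss mixes the two labels with the same weight. The alpha default below is a common generic choice, not a value reported in the abstract.

```python
# Standard MixUp augmentation (Zhang et al., 2018); alpha is a generic default.
import numpy as np
import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Return mixed inputs, both label sets, and the mixing weight lam."""
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[index]
    return mixed_x, y, y[index], lam

# Training loss becomes: lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b).
```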
IEEE Access · COMPUTER SCIENCE, INFORMATION SYSTEMS · ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Annual publication volume: 6673 articles
Review turnaround: 6 weeks
Journal Introduction:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, and interesting solutions to engineering problems.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.