{"title":"HyCoViT: Hybrid Convolution Vision Transformer With Dynamic Dropout for Enhanced Medical Chest X-Ray Classification","authors":"Omid Almasi Naghash;Nam Ling;Xiang Li","doi":"10.1109/ACCESS.2025.3584065","DOIUrl":null,"url":null,"abstract":"Medical chest X-ray (CXR) classification necessitates balancing detailed local feature extraction with capturing broader, long-range dependencies, especially when working with limited and heterogeneous datasets. In this paper, we propose HyCoViT, a hybrid model that integrates a custom Convolutional Neural Network (CNN) block with Vision Transformers (ViTs). This approach combines the locality of CNN-based latent space representations with the global attention mechanisms of ViTs. To address overfitting in data-scarce scenarios, we introduce a Dynamic Dropout (DD) algorithm that adaptively adjusts the dropout rate during training. Additionally, we enhance model generalization using a combination of traditional data augmentation and MixUp techniques. We evaluate HyCoViT on a multi-class classification task involving COVID-19, pneumonia, lung opacity, and normal CXR images. While COVID-19 serves as a case study, the model’s design is generalizable to various medical imaging applications. Experimental results show that HyCoViT achieves state-of-the-art (SOTA) performance, with 98.81% accuracy for three-class surpassing the existing CNN-based model by average +4.90%., and SOTA transformer-based average by 2.05%. In four-class classification, HyCoViT achieves the highest accuracy at 96.56%, which is 8.32% higher than the average accuracy of SOTA CNN-based models and 4.96% higher than the average accuracy of other SOTA transformer-based models. These results surpass many existing CNN-based and transformer-based models, demonstrating the robust generalization capabilities of our method. Furthermore, we provide interpretable, attention-based visualizations that highlight crucial lung regions to support context-aware decisions and ultimately improve patient outcomes.","PeriodicalId":13079,"journal":{"name":"IEEE Access","volume":"13 ","pages":"112623-112641"},"PeriodicalIF":3.4000,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11059244","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Access","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11059244/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Medical chest X-ray (CXR) classification necessitates balancing detailed local feature extraction with capturing broader, long-range dependencies, especially when working with limited and heterogeneous datasets. In this paper, we propose HyCoViT, a hybrid model that integrates a custom Convolutional Neural Network (CNN) block with Vision Transformers (ViTs). This approach combines the locality of CNN-based latent space representations with the global attention mechanisms of ViTs. To address overfitting in data-scarce scenarios, we introduce a Dynamic Dropout (DD) algorithm that adaptively adjusts the dropout rate during training. Additionally, we enhance model generalization using a combination of traditional data augmentation and MixUp techniques. We evaluate HyCoViT on a multi-class classification task involving COVID-19, pneumonia, lung opacity, and normal CXR images. While COVID-19 serves as a case study, the model’s design is generalizable to various medical imaging applications. Experimental results show that HyCoViT achieves state-of-the-art (SOTA) performance, with 98.81% accuracy in three-class classification, surpassing the average accuracy of existing CNN-based models by 4.90% and that of SOTA transformer-based models by 2.05%. In four-class classification, HyCoViT achieves the highest accuracy at 96.56%, which is 8.32% higher than the average accuracy of SOTA CNN-based models and 4.96% higher than the average accuracy of other SOTA transformer-based models. These results surpass many existing CNN-based and transformer-based models, demonstrating the robust generalization capabilities of our method. Furthermore, we provide interpretable, attention-based visualizations that highlight crucial lung regions to support context-aware decisions and ultimately improve patient outcomes.
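The abstract describes the hybrid design only at a high level: a custom CNN block supplies local latent features, which a ViT-style encoder then relates globally through self-attention. The following minimal PyTorch sketch illustrates that general pattern; the layer widths, depth, and mean-pooled classification head are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of a CNN-stem + Transformer-encoder hybrid (illustrative only).
import torch
import torch.nn as nn

class HybridCNNViT(nn.Module):
    def __init__(self, num_classes=4, embed_dim=256, depth=4, heads=8):
        super().__init__()
        # CNN stem: local feature extraction (hypothetical layout).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # ViT-style encoder: global self-attention over the CNN feature grid.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = self.cnn(x)                        # (B, C, H', W') local features
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C) token sequence
        encoded = self.encoder(tokens)             # global attention across tokens
        return self.head(encoded.mean(dim=1))      # pooled logits per class
```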
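The Dynamic Dropout update rule is not given in the abstract, only that the dropout rate is adapted during training to curb overfitting on scarce data. Below is a minimal sketch of one plausible heuristic, assuming the rate is raised when the train/validation loss gap widens (an overfitting signal) and lowered otherwise; the paper's actual DD algorithm may differ.

```python
# Hypothetical dynamic-dropout heuristic: not the paper's exact rule.
import torch.nn as nn

def adjust_dropout(model: nn.Module, train_loss: float, val_loss: float,
                   step: float = 0.05, p_min: float = 0.1, p_max: float = 0.6) -> None:
    """Nudge the rate of every nn.Dropout in the model; call once per epoch."""
    overfitting = (val_loss - train_loss) > 0.0  # widening generalization gap
    delta = step if overfitting else -step
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = min(p_max, max(p_min, module.p + delta))
```

Called after each validation pass, a rule like this increases regularization pressure precisely when the model starts memorizing a small training set, which matches the motivation the abstract gives for DD.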
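MixUp, by contrast, is a standard published technique (Zhang et al., 2018), so its usual formulation can be shown directly: each batch is replaced by convex combinations of image pairs, and the loss mixes the two labels with the same weight. The alpha default below is a common generic choice, not a value reported in the abstract.

```python
# Standard MixUp augmentation (Zhang et al., 2018); alpha is a generic default.
import numpy as np
import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Return mixed inputs, both label sets, and the mixing weight lam."""
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[index]
    return mixed_x, y, y[index], lam

# Training loss becomes: lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b).
```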
IEEE Access · COMPUTER SCIENCE, INFORMATION SYSTEMS · ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Annual publication volume: 6673 articles
Review turnaround: 6 weeks
Journal Introduction:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, and interesting solutions to engineering problems.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.