LMVT: A hybrid vision transformer with attention mechanisms for efficient and explainable lung cancer diagnosis

Jesika Debnath, Al Shahriar Uddin Khondakar Pranta, Amira Hossain, Anamul Sakib, Hamdadur Rahman, Rezaul Haque, Md. Redwan Ahmed, Ahmed Wasif Reza, S M Masfequier Rahman Swapno, Abhishek Appaji

Informatics in Medicine Unlocked, Volume 57, Article 101669, published 2025-01-01. DOI: 10.1016/j.imu.2025.101669
https://www.sciencedirect.com/science/article/pii/S2352914825000577
Citations: 0
Abstract
Lung cancer continues to be a leading cause of cancer-related deaths worldwide due to its high mortality rate and the complexities involved in diagnosis. Traditional diagnostic approaches often face issues such as subjectivity, class imbalance, and limited applicability across different imaging modalities. To tackle these problems, we introduce Lung MobileVIT (LMVT), a lightweight hybrid model that combines a Convolutional Neural Network (CNN) and a Transformer for multiclass lung cancer classification. LMVT utilizes depthwise separable convolutions for local texture extraction while employing multi-head self-attention (MHSA) to capture long-range global dependencies. Furthermore, we integrate attention mechanisms based on the Convolutional Block Attention Module (CBAM) and feature selection techniques derived from the Simple Gray Level Difference Method (SGLDM) to improve discriminative focus and minimize redundancy. LMVT utilizes attention recalibration to enhance the saliency of the minority class, while also incorporating curriculum augmentation strategies that balance representation across underrepresented classes. The model has been trained and validated using two public datasets (IQ-OTH/NCCD and LC25000) and evaluated for both 3-class and 5-class classification tasks. LMVT achieved an impressive 99.61 % accuracy and 99.22 % F1-score for the 3-class classification, along with 99.75 % accuracy and 99.44 % specificity for the 5-class classification. This performance surpasses that of several recent Vision Transformer (ViT) architectures. Statistical significance tests and confidence intervals confirm the reliability of these performance metrics, while an analysis of model complexity supports its capability for potential deployment. To enhance clinical interpretability, the model is integrated with explainable AI (XAI) and is implemented within a web-based diagnostic application for analyzing CT and histopathology images. 
This study highlights the potential of hybrid ViT architectures in creating scalable and interpretable data-driven tools for practical use in lung cancer diagnostics.
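The depthwise separable convolutions the abstract credits with local texture extraction factor a standard convolution into a per-channel spatial filter (depthwise) followed by a 1×1 channel mixer (pointwise), which is what makes MobileViT-style models lightweight. This is a minimal NumPy sketch of that factorization for illustration only, not the authors' implementation; the function name and kernel shapes are assumptions:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Depthwise separable convolution (stride 1, 'same' zero padding).

    x           : input feature map, shape (C, H, W)
    depthwise_k : one spatial kernel per input channel, shape (C, kh, kw)
    pointwise_w : 1x1 channel-mixing weights, shape (C_out, C)
    returns     : output feature map, shape (C_out, H, W)
    """
    C, H, W = x.shape
    kh, kw = depthwise_k.shape[1:]
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))

    # Depthwise step: each channel is filtered with its own kernel,
    # so no information mixes across channels here.
    dw = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                dw[c, i, j] = np.sum(padded[c, i:i + kh, j:j + kw] * depthwise_k[c])

    # Pointwise step: a 1x1 convolution mixes channels at each pixel.
    return np.tensordot(pointwise_w, dw, axes=([1], [0]))
```

With a centered delta depthwise kernel and an identity pointwise matrix, the block reduces to the identity map, which is a quick sanity check that the two stages compose correctly. The parameter count is C·kh·kw + C_out·C instead of C_out·C·kh·kw for a full convolution, which is the efficiency argument behind architectures like LMVT.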
Journal description:
Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.