InceptionWTMNet: A hybrid network for Alzheimer’s Disease detection using wavelet transform convolution and Mixed Local Channel Attention on finely fused multimodal images
IF 4.2 · CAS Tier 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Zenan Xu, Zhengyao Bai, Han Ma, Mingqiang Xu, Qiqin Huang, Tao Lin
{"title":"InceptionWTMNet: A hybrid network for Alzheimer’s Disease detection using wavelet transform convolution and Mixed Local Channel Attention on finely fused multimodal images","authors":"Zenan Xu, Zhengyao Bai, Han Ma, Mingqiang Xu, Qiqin Huang, Tao Lin","doi":"10.1016/j.imavis.2025.105693","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal fusion has emerged as a critical technique for the diagnosis of Alzheimer’s Disease (AD), with the aim of effectively extracting and utilising complementary information from diverse modalities. Current fusion methods frequently cause the precise alignment of source images and do not adequately address parallax issues. This oversight can result in artifacts during the fusion process when images are misaligned. In response to this challenge, we propose a refined registration fusion technique, termed MURF, which integrates multimodal image registration and fusion within a cohesive framework. The Vision Transformer (ViT) has inspired the application of large-kernel convolutions in the diagnosis of Alzheimer’s disease (AD) because of its ability to model long-range dependencies. This approach aims to expand the receptive field and enhance the performance of diagnostic models. Despite requiring a minimal number of floating-point operations (FLOPs), these deep operators encounter challenges associated with over-parameterisation because of high memory access costs, which ultimately compromises computational efficiency. By utilising wavelet transform convolutions (WTConv), we decompose large-kernel depth-wise convolutions into four parallel branches. One branch employs a wavelet-transform convolution with square kernels, while the other two branches incorporate orthogonal wavelet-transform kernels with an identity mapping. This innovative method, with a Mixed Local Channel Attention mechanism, has facilitated the development of the InceptionWTConvolutions network. This network maintains a receptive field comparable to that of large-kernel convolutions, while concurrently minimising over-parameterisation and enhancing computational efficiency. InceptionWTMNet classified AD, MCI, and NC using MRI and PET data from ADNI dataset with 98.69% accuracy, 98.65% recall, 98.70% F1-score, and 98.98% AUC. and provide Graphical abstract in correct format.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105693"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625002811","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Multimodal fusion has emerged as a critical technique for the diagnosis of Alzheimer’s Disease (AD), with the aim of effectively extracting and utilising complementary information from diverse modalities. Current fusion methods frequently assume precise alignment of the source images and do not adequately address parallax, an oversight that produces artifacts when the inputs are misaligned. In response to this challenge, we propose a refined registration–fusion technique, termed MURF, which integrates multimodal image registration and fusion within a cohesive framework. The Vision Transformer (ViT), with its ability to model long-range dependencies, has inspired the use of large-kernel convolutions in AD diagnosis, aiming to expand the receptive field and enhance the performance of diagnostic models. Although they require relatively few floating-point operations (FLOPs), these large depth-wise operators suffer from over-parameterisation and high memory access costs, which ultimately compromise computational efficiency. Using wavelet transform convolutions (WTConv), we decompose the large-kernel depth-wise convolution into four parallel branches: one branch applies a wavelet-transform convolution with a square kernel, two branches use orthogonal band-shaped wavelet-transform kernels, and the remaining branch is an identity mapping. Combining this decomposition with a Mixed Local Channel Attention (MLCA) mechanism yields the InceptionWTConvolutions network, which retains a receptive field comparable to that of large-kernel convolutions while reducing over-parameterisation and improving computational efficiency. On MRI and PET data from the ADNI dataset, InceptionWTMNet classifies AD, mild cognitive impairment (MCI), and normal controls (NC) with 98.69% accuracy, 98.65% recall, 98.70% F1-score, and 98.98% AUC.
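To make the decomposition concrete, below is a minimal, self-contained PyTorch sketch of a single-level Haar wavelet-transform depthwise convolution and the four-branch Inception-style channel split the abstract describes. The module names (HaarWTConv, InceptionWTBlock), the channel-split ratio, the band-kernel size, and the use of plain depth-wise convolutions in the two band branches are illustrative assumptions, not the paper's implementation; the MURF registration stage and the MLCA attention module are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HaarWTConv(nn.Module):
    """Depthwise convolution in the Haar wavelet domain:
    fixed Haar DWT -> small depthwise conv on the sub-bands -> inverse DWT.
    Single decomposition level; input height/width must be even."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        haar = 0.5 * torch.tensor([
            [[1., 1.], [1., 1.]],    # LL (approximation)
            [[1., 1.], [-1., -1.]],  # LH (horizontal detail)
            [[1., -1.], [1., -1.]],  # HL (vertical detail)
            [[1., -1.], [-1., 1.]],  # HH (diagonal detail)
        ])
        # One copy of the four orthonormal filters per channel: (4C, 1, 2, 2).
        self.register_buffer("filt", haar.repeat(channels, 1, 1).unsqueeze(1))
        self.conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                              padding=kernel_size // 2, groups=4 * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward Haar DWT as a fixed stride-2 depthwise conv: (C,H,W) -> (4C,H/2,W/2).
        y = F.conv2d(x, self.filt, stride=2, groups=self.channels)
        y = self.conv(y)  # small learnable depthwise conv on the sub-bands
        # Orthonormal filters, so the transposed conv is the exact inverse DWT.
        return F.conv_transpose2d(y, self.filt, stride=2, groups=self.channels)


class InceptionWTBlock(nn.Module):
    """Four parallel depthwise branches over a channel split:
    identity | wavelet-domain square kernel | 1xk band | kx1 band."""

    def __init__(self, channels: int, band_kernel: int = 11,
                 branch_ratio: float = 0.125):
        super().__init__()
        gc = int(channels * branch_ratio)  # channels per convolutional branch
        self.split = (channels - 3 * gc, gc, gc, gc)
        self.wt_square = HaarWTConv(gc)
        self.band_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                padding=(0, band_kernel // 2), groups=gc)
        self.band_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                padding=(band_kernel // 2, 0), groups=gc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_id, x_sq, x_w, x_h = torch.split(x, self.split, dim=1)
        return torch.cat((x_id, self.wt_square(x_sq),
                          self.band_w(x_w), self.band_h(x_h)), dim=1)


if __name__ == "__main__":
    block = InceptionWTBlock(channels=64)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

Because the four Haar filters form an orthonormal basis of each 2x2 block, the transposed convolution in HaarWTConv inverts the transform exactly, so the branch preserves spatial resolution while a small kernel applied on the half-resolution sub-bands covers a larger effective receptive field than the same kernel in the pixel domain.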
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.