{"title":"Systematic Review of Hybrid Vision Transformer Architectures for Radiological Image Analysis","authors":"Ji Woong Kim, Aisha Urooj Khan, Imon Banerjee","doi":"10.1101/2024.06.21.24309265","DOIUrl":null,"url":null,"abstract":"Background: Vision Transformer (ViT) and Convolutional Neural Networks (CNNs) each possess distinct strengths in medical imaging: ViT excels in capturing long-range dependencies through self-attention, while CNNs are adept at extracting local features via spatial convolution filters. While ViTs might struggle with capturing detailed local spatial information critical for tasks like anomaly detection in medical imaging, shallow CNNs often fail to effectively abstract global context.\nObjective: This study aims to explore and evaluate hybrid architectures that integrate ViT and CNN to lever- age their complementary strengths for enhanced performance in medical vision tasks, such as segmentation, classification, and prediction.\nMethods: Following PRISMA guideline, a systematic review was conducted on 28 articles published between 2020 and 2023. These articles proposed hybrid ViT-CNN architectures specifically for medical imaging tasks in radiology. The review focused on analyzing architectural variations, merging strategies between ViT and CNN, innovative applications of ViT, and efficiency metrics including parameters, inference time (GFlops), and performance benchmarks.\nResults: The review identified that integrating ViT and CNN can help mitigate the limitations of each architecture, offering comprehensive solutions that combine global context understanding with precise local feature extraction. We benchmarked the articles based on architectural variations, merging strategies, innovative uses of ViT, and efficiency metrics (number of parameters, inference time (GFlops), performance).\nConclusion: By synthesizing current literature, this review defines fundamental concepts of hybrid vision transformers and highlights emerging trends in the field. It provides a clear direction for future research aimed at optimizing the integration of ViT and CNN for effective utilization in medical imaging, contributing to advancements in diagnostic accuracy and image analysis.","PeriodicalId":501358,"journal":{"name":"medRxiv - Radiology and Imaging","volume":"81 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Radiology and Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.06.21.24309265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Vision Transformer (ViT) and Convolutional Neural Networks (CNNs) each possess distinct strengths in medical imaging: ViT excels in capturing long-range dependencies through self-attention, while CNNs are adept at extracting local features via spatial convolution filters. While ViTs might struggle with capturing detailed local spatial information critical for tasks like anomaly detection in medical imaging, shallow CNNs often fail to effectively abstract global context.
Objective: This study aims to explore and evaluate hybrid architectures that integrate ViT and CNN to lever- age their complementary strengths for enhanced performance in medical vision tasks, such as segmentation, classification, and prediction.
Methods: Following PRISMA guideline, a systematic review was conducted on 28 articles published between 2020 and 2023. These articles proposed hybrid ViT-CNN architectures specifically for medical imaging tasks in radiology. The review focused on analyzing architectural variations, merging strategies between ViT and CNN, innovative applications of ViT, and efficiency metrics including parameters, inference time (GFlops), and performance benchmarks.
Results: The review identified that integrating ViT and CNN can help mitigate the limitations of each architecture, offering comprehensive solutions that combine global context understanding with precise local feature extraction. We benchmarked the articles based on architectural variations, merging strategies, innovative uses of ViT, and efficiency metrics (number of parameters, inference time (GFlops), performance).
Conclusion: By synthesizing current literature, this review defines fundamental concepts of hybrid vision transformers and highlights emerging trends in the field. It provides a clear direction for future research aimed at optimizing the integration of ViT and CNN for effective utilization in medical imaging, contributing to advancements in diagnostic accuracy and image analysis.