Systematic Review of Hybrid Vision Transformer Architectures for Radiological Image Analysis

Ji Woong Kim, Aisha Urooj Khan, Imon Banerjee
medRxiv - Radiology and Imaging · Published: 2024-06-22 · DOI: 10.1101/2024.06.21.24309265

Abstract

Background: Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) each possess distinct strengths in medical imaging: ViTs excel at capturing long-range dependencies through self-attention, while CNNs are adept at extracting local features via spatial convolution filters. While ViTs may struggle to capture the detailed local spatial information critical for tasks like anomaly detection in medical imaging, shallow CNNs often fail to effectively abstract global context.

Objective: This study aims to explore and evaluate hybrid architectures that integrate ViT and CNN components, leveraging their complementary strengths for enhanced performance in medical vision tasks such as segmentation, classification, and prediction.

Methods: Following the PRISMA guidelines, a systematic review was conducted on 28 articles published between 2020 and 2023. These articles proposed hybrid ViT-CNN architectures specifically for medical imaging tasks in radiology. The review focused on analyzing architectural variations, merging strategies between ViT and CNN, innovative applications of ViT, and efficiency metrics including parameter counts, computational cost (GFLOPs), and performance benchmarks.

Results: The review identified that integrating ViT and CNN can help mitigate the limitations of each architecture, offering comprehensive solutions that combine global context understanding with precise local feature extraction. We benchmarked the articles based on architectural variations, merging strategies, innovative uses of ViT, and efficiency metrics (number of parameters, computational cost in GFLOPs, and performance).

Conclusion: By synthesizing the current literature, this review defines fundamental concepts of hybrid vision transformers and highlights emerging trends in the field. It provides a clear direction for future research aimed at optimizing the integration of ViT and CNN for effective use in medical imaging, contributing to advancements in diagnostic accuracy and image analysis.
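The complementarity the abstract describes can be made concrete with a toy forward pass: a convolution aggregates only a small local neighborhood, while self-attention lets every token attend to every other token. The sketch below is not drawn from any reviewed architecture; it is a minimal NumPy illustration of a hypothetical hybrid block, where a single convolutional stem produces a local feature map that is then split into patch tokens and mixed globally by one self-attention step (identity Q/K/V projections assumed for simplicity).

```python
import numpy as np

def conv2d_single(x, kernel):
    """Valid 2-D convolution of one feature map with one kernel (local features)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def self_attention(tokens):
    """Single-head self-attention over (N, d) token vectors (global context).

    Identity projections stand in for learned Q/K/V weights in this toy."""
    d = tokens.shape[1]
    q, k, v = tokens, tokens, tokens
    scores = q @ k.T / np.sqrt(d)                # every token scores every other token
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability before exp
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax over keys
    return attn @ v                              # globally mixed token features

# Toy hybrid forward pass: conv stem -> patch tokens -> attention mixing.
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
feat = conv2d_single(image, np.ones((3, 3)) / 9.0)  # 6x6 map of local averages
# Split the 6x6 map into four non-overlapping 3x3 patches, flattened to tokens.
patches = feat.reshape(2, 3, 2, 3).transpose(0, 2, 1, 3).reshape(4, 9)
out = self_attention(patches)
print(out.shape)  # -> (4, 9)
```

In this sketch the convolution's receptive field is only 3x3 pixels, whereas after one attention step each of the four output tokens is a weighted combination of all patches, i.e. the whole image. The merging strategy shown (serial: conv stem feeding a transformer stage) is only one of the integration patterns the review categorizes.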