Systematic Review of Hybrid Vision Transformer Architectures for Radiological Image Analysis

medRxiv - Radiology and Imaging. Pub Date: 2024-06-22. DOI: 10.1101/2024.06.21.24309265

Ji Woong Kim, Aisha Urooj Khan, Imon Banerjee


Citations: 0

Abstract

Background: Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) each have distinct strengths in medical imaging: ViTs excel at capturing long-range dependencies through self-attention, while CNNs are adept at extracting local features via spatial convolution filters. However, ViTs can struggle to capture the detailed local spatial information critical for tasks such as anomaly detection in medical imaging, and shallow CNNs often fail to abstract global context effectively.

Objective: This study explores and evaluates hybrid architectures that integrate ViT and CNN to leverage their complementary strengths for enhanced performance in medical vision tasks such as segmentation, classification, and prediction.

Methods: Following the PRISMA guidelines, a systematic review was conducted on 28 articles published between 2020 and 2023. These articles proposed hybrid ViT-CNN architectures specifically for medical imaging tasks in radiology. The review analyzed architectural variations, strategies for merging ViT and CNN, innovative applications of ViT, and efficiency metrics including number of parameters, inference time (GFLOPs), and performance benchmarks.

Results: The review found that integrating ViT and CNN can help mitigate the limitations of each architecture, offering solutions that combine global context understanding with precise local feature extraction. We benchmarked the articles on architectural variations, merging strategies, innovative uses of ViT, and efficiency metrics (number of parameters, inference time (GFLOPs), and performance).

Conclusion: By synthesizing the current literature, this review defines the fundamental concepts of hybrid vision transformers and highlights emerging trends in the field. It provides a clear direction for future research on optimizing the integration of ViT and CNN for effective use in medical imaging, contributing to advances in diagnostic accuracy and image analysis.


