MoViT: Memorizing Vision Transformers for Medical Image Analysis.

Machine learning in medical imaging. MLMI (Workshop). Pub Date: 2024-01-01; Epub Date: 2023-10-15; DOI: 10.1007/978-3-031-45676-3_21

Yiqing Shen, Pengfei Guo, Jingpu Wu, Qianqi Huang, Nhat Le, Jinyuan Zhou, Shanshan Jiang, Mathias Unberath

Volume 14349, pages 205-213. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11008051/pdf/

Citation count: 0

Abstract

Combining the long-range dependencies captured by transformers with the local representations of image content captured by convolutional neural networks (CNNs) has led to advanced architectures and improved performance on various medical image analysis tasks, because the two are complementary. However, compared with CNNs, transformers require considerably more training data, owing to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets remains problematic, particularly in medical imaging, where both annotation effort and data protection limit data availability. In this work, inspired by the human decision-making process of correlating new "evidence" with previously memorized "experience", we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets when training and deploying transformer-based architectures. MoViT uses an external memory structure to cache historical attention snapshots during the training stage. To prevent overfitting, we incorporate a memory update scheme, attention temporal moving average, which updates the stored external memories with their historical moving average. To speed up inference, we design a prototypical attention learning method that distills the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT, applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially when only a small amount of annotated data is available. More importantly, MoViT can reach performance competitive with ViT using only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures that may help reduce the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.
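The abstract names three mechanisms without giving their equations: an external memory of attention snapshots, a temporal moving-average update, and a distillation of that memory into prototypes. The following is a minimal NumPy sketch of how such pieces could fit together; it is not the authors' implementation. The class `ExternalMemory`, the functions `attend_with_memory` and `distill_prototypes`, the `momentum` parameter, the choice to store key/value pairs, and the use of plain k-means as a stand-in for prototypical attention learning are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ExternalMemory:
    """Hypothetical key/value cache of attention snapshots (an assumption)."""

    def __init__(self, num_slots, dim, momentum=0.9):
        self.keys = np.zeros((num_slots, dim))
        self.values = np.zeros((num_slots, dim))
        self.filled = np.zeros(num_slots, dtype=bool)
        self.momentum = momentum

    def update(self, slot_ids, keys, values):
        """Moving-average update: new = m * old + (1 - m) * current.
        The first write to a slot simply copies the snapshot."""
        m = self.momentum
        for i, k, v in zip(slot_ids, keys, values):
            if self.filled[i]:
                self.keys[i] = m * self.keys[i] + (1 - m) * k
                self.values[i] = m * self.values[i] + (1 - m) * v
            else:
                self.keys[i] = k
                self.values[i] = v
                self.filled[i] = True

def attend_with_memory(q, k, v, mem_k, mem_v):
    """Scaled dot-product attention over local plus memorized keys/values."""
    all_k = np.concatenate([k, mem_k], axis=0)
    all_v = np.concatenate([v, mem_v], axis=0)
    scores = q @ all_k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ all_v

def distill_prototypes(keys, values, num_prototypes, iters=10, seed=0):
    """Compress the memory with k-means on the keys (a stand-in technique,
    not necessarily the paper's prototypical attention learning)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(keys), num_prototypes, replace=False)
    centers = keys[idx].copy()
    proto_values = values[idx].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every key to every center.
        dist = ((keys[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for c in range(num_prototypes):
            members = assign == c
            if members.any():  # empty clusters keep their previous prototype
                centers[c] = keys[members].mean(axis=0)
                proto_values[c] = values[members].mean(axis=0)
    return centers, proto_values
```

In this sketch, training would call `update` after each step and pass `mem.keys` and `mem.values` to `attend_with_memory`; at inference, the full memory would be replaced by the much smaller output of `distill_prototypes`, which shrinks the attention matrix and hence the cost per query.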

