Adaptive feature mixing with Vision Transformers for clinical image analysis

IF 7.2 | Zone 1, Computer Science | Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing | Pub Date: 2025-06-10 | DOI: 10.1016/j.asoc.2025.113259

Susmita Ghosh, Swagatam Das

Volume 181, Article 113259. Full text: https://www.sciencedirect.com/science/article/pii/S1568494625005708

Citations: 0

Abstract

The Vision Transformer (ViT) is an adaptation of the Transformer architecture that shows promise in image classification. However, limited training samples and the complex attributes of clinical images hinder its performance in identifying medical conditions from them. To address this challenge, we propose a modified ViT architecture called ReMixViT, which incorporates an efficient MLP-Mixer layer and reorders the residual blocks within the encoder block. This modification improves feature mixing and enhances the model’s generalization ability. Additionally, we design two hybrid architectures, Res-ReMixViT and Res-ReMixViT+, by integrating a Convolutional Neural Network (ResNet50) with ReMixViT encoder blocks, using feature maps at a single scale and at multiple scales, respectively. We evaluate the proposed architectures on six diverse medical imaging datasets covering varying modalities and medical conditions. Our comparative study shows that ReMixViT and the hybrid models outperform the vanilla ViT models and hybrid models with ViT encoder blocks, respectively, on widely accepted performance measures. Specifically, we observe improvements of 4.62% and 3.08% in the F1-score. Moreover, when combined with data augmentation algorithms, the proposed hybrid architectures surpass other state-of-the-art hybrid networks. In addition to performance evaluation, we provide visual explanations through attention maps and the gradient flow of our model. These visual explanations contribute to the interpretability of the Artificial Intelligence (AI) system, assisting medical practitioners in drawing inferences from an explainable AI perspective. Finally, an extended study demonstrates that the proposed modifications can be adapted to other vision transformer architectures, resulting in enhanced performance.
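The abstract reports its gains in terms of the F1-score. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The abstract does not say which averaging scheme the authors used, so the macro average below is an assumption, and the function names `f1_per_class` and `macro_f1` are hypothetical helpers, not part of the paper.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one sample")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the truth but was missed

    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    truth = [0, 0, 1, 1]
    preds = [0, 1, 1, 1]
    print(f1_per_class(truth, preds))
    print(macro_f1(truth, preds))
```

Macro averaging weights every class equally, which is a common choice when medical datasets are class-imbalanced; a weighted or micro average would give different numbers on the same predictions.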

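Source journal

Applied Soft Computing (Engineering and Technology - Computer Science: Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 6.90%

Articles published: 874

Review time: 10.9 months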

Journal description: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore updated continuously with new articles, and the publication time is short.