A novel vision transformer with selective residual in multihead self-attention for pattern recognition

IF 7.6 · Zone 1 Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition Pub Date : 2025-09-26 DOI:10.1016/j.patcog.2025.112497

Arun Kumar Sharma, Nishchal K. Verma

Pattern Recognition, Volume 172, Article 112497 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0031320325011604

Citations: 0

Abstract

Intelligent fault diagnosis requires robustly capturing the specific features that represent fault patterns in time-series vibration signals. Most existing solutions require complex preprocessing steps to make the signal suitable for training a deep learning model. This article presents a novel vision transformer with a selective residual in the multihead self-attention network, called the Selective Residual Vision Transformer (SeReViT), for improved robustness in capturing the fault signature from the vibration signal. The novel attention mechanism incorporates cumulative attention by passing the best attention through residual connections in each block of multihead attention. The best attention term is defined using the highest L1-norm of the attention values (the scaled dot product of key and query) across the heads. This enables the model to focus on the selected best attention when learning long-range dependencies among sequential input image patches, resulting in better classification performance. The proposed framework is validated for fault diagnosis on the Case Western Reserve University bearing fault diagnosis dataset and the Paderborn University dataset. Because these datasets are already clean, noisy vibration data are created by adding white noise to demonstrate the robustness of the proposed framework. The noisy (raw) vibration signals from rotating machines are first converted to spectrum images using the short-time Fourier transform with a fixed window size. The generated images are used to train and validate the proposed SeReViT. The proposed model outperforms state-of-the-art convolution-based fault diagnosis models on both the clean and the noisy datasets.
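The pipeline described in the abstract can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the SNR-based noise level, the Hann window, the window and hop sizes, the point at which the residual is added, and every function name below are assumptions made for illustration only.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target SNR in dB (the SNR knob is an assumption)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)


def stft_magnitude(signal, window_size=64, hop=32):
    """Magnitude spectrogram (frequency x time) using a fixed-size Hann window."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + window_size] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def selective_residual_attention(q, k, v, prev_best=None):
    """One multihead attention block with a selective residual (assumed reading).

    q, k, v: arrays of shape (heads, tokens, dim).
    prev_best: (tokens, tokens) best attention scores from the previous block.
    Returns the per-head outputs and this block's best attention scores.
    """
    dim = q.shape[-1]
    # Scaled dot product of query and key for every head: (heads, tokens, tokens).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)
    if prev_best is not None:
        # Assumption: the previous block's best scores are added to every head.
        scores = scores + prev_best
    # Pick the head whose score matrix has the largest L1-norm (sum of |entries|).
    l1_norms = np.abs(scores).sum(axis=(1, 2))
    best = scores[np.argmax(l1_norms)]
    out = softmax(scores, axis=-1) @ v
    return out, best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(1024) / 1024.0
    clean = np.sin(2 * np.pi * 50 * t)            # stand-in for a vibration signal
    image = stft_magnitude(add_white_noise(clean, snr_db=0.0, rng=rng))
    print("spectrogram shape:", image.shape)

    q, k, v = rng.normal(size=(3, 4, 5, 8))       # 4 heads, 5 patches, dim 8
    out1, best1 = selective_residual_attention(q, k, v)
    out2, best2 = selective_residual_attention(q, k, v, prev_best=best1)
    print("attention output shape:", out2.shape)
```

Chaining the blocks by passing `best` from one call into the next is what makes the attention cumulative in this sketch. Whether the paper selects the best head before or after the softmax, or adds it to all heads or only some, cannot be determined from the abstract alone.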


Source journal

Pattern Recognition (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

14.40

Self-citation rate

16.20%

Articles published

683

Review time

5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.