A novel vision transformer with selective residual in multihead self-attention for pattern recognition
Authors: Arun Kumar Sharma, Nishchal K. Verma
Journal: Pattern Recognition, volume 172, Article 112497 (published 2025-09-26)
DOI: 10.1016/j.patcog.2025.112497
URL: https://www.sciencedirect.com/science/article/pii/S0031320325011604
Impact factor: 7.6; JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Intelligent fault diagnosis requires robustly capturing the specific features that represent fault patterns in time-series vibration signals. Most existing solutions need complex preprocessing steps to make the signal suitable for training a deep learning model. This article presents a novel vision transformer with a selective residual in the multihead self-attention network, called Selective Residual Vision Transformer (SeReViT), for improved robustness in capturing the fault signature from the vibration signal. The novel attention mechanism incorporates cumulative attention by using the best attention through residual connections in each multihead attention block. The best attention term is defined using the highest L1-norm of the attention values (the scaled dot product of key and query) across the heads. This lets the model focus on the selected best attention while learning the long-range dependencies among sequential input image patches, resulting in better classification performance. The proposed framework is validated for fault diagnosis on the Case Western Reserve University bearing fault diagnosis dataset and the Paderborn University dataset. Because these datasets are already clean, noisy vibration data are created by adding white noise to demonstrate the robustness of the proposed framework. The noisy (raw) vibration signals from rotating machines are first converted to spectrum images using the short-time Fourier transform with a fixed window size. The generated images are used to train and validate the proposed SeReViT. The results outperform state-of-the-art convolution-based fault diagnosis models on both the clean and the noisy datasets.
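The steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the Hann window, the window and hop sizes, the SNR-based noise scaling, and the use of an entrywise L1 norm to score each head are all assumptions made for illustration. The abstract does not specify how the selected attention enters the residual connection, so that step is left out.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian white noise at a target SNR in dB (assumed noise model)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def stft_magnitude(signal, window_size=256, hop=128):
    """Magnitude STFT with a fixed-size Hann window.

    Returns an array of shape (window_size // 2 + 1, n_frames) that can be
    rendered as a spectrum image. Window and hop sizes are hypothetical.
    """
    if len(signal) < window_size:
        raise ValueError("signal is shorter than one window")
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + window_size] * window for i in range(n_frames)]
    )
    # rfft over each frame gives the one-sided spectrum; transpose to (freq, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T


def select_best_head(queries, keys):
    """Pick the head whose scaled dot-product score matrix has the largest L1 norm.

    queries, keys: arrays of shape (n_heads, n_tokens, d_head).
    The entrywise L1 norm (sum of absolute values) is an assumption here.
    Returns (best_head_index, score_matrix_of_that_head).
    """
    d_head = queries.shape[-1]
    scores = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(d_head)
    l1_norms = np.abs(scores).sum(axis=(1, 2))
    best = int(np.argmax(l1_norms))
    return best, scores[best]
```

In this sketch, a clean signal would pass through `add_white_noise` and then `stft_magnitude` to produce the image fed to the transformer, while `select_best_head` shows only the head-selection criterion described in the abstract.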
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.