S$^\text{3}$Attention: Improving Long Sequence Attention With Smoothed Skeleton Sketching

IF 8.7 1区工程技术 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal of Selected Topics in Signal Processing Pub Date : 2024-08-22 DOI:10.1109/JSTSP.2024.3446173

Xue Wang;Tian Zhou;Jianqing Zhu;Jialin Liu;Kun Yuan;Tao Yao;Wotao Yin;Rong Jin;HanQin Cai

{"title":"S$^\\text{3}$Attention: Improving Long Sequence Attention With Smoothed Skeleton Sketching","authors":"Xue Wang;Tian Zhou;Jianqing Zhu;Jialin Liu;Kun Yuan;Tao Yao;Wotao Yin;Rong Jin;HanQin Cai","doi":"10.1109/JSTSP.2024.3446173","DOIUrl":null,"url":null,"abstract":"Attention based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes the vanilla Attention based models hard to apply to long sequence tasks. Various improved Attention structures are proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational costs. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>Attention, which significantly improves upon the previous attempts to negotiate this trade-off. S<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>Attention has two mechanisms to effectively minimize the impact of noise while keeping the linear complexity to the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting show that S<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 6","pages":"985-996"},"PeriodicalIF":8.7000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Selected Topics in Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10643652/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

Abstract

Attention based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes the vanilla Attention based models hard to apply to long sequence tasks. Various improved Attention structures are proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational costs. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S

$^{3}$

Attention, which significantly improves upon the previous attempts to negotiate this trade-off. S

$^{3}$

Attention has two mechanisms to effectively minimize the impact of noise while keeping the linear complexity to the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S

$^{3}$

Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting show that S

$^{3}$

Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.

查看原文本刊更多论文

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Journal of Selected Topics in Signal Processing 工程技术-工程：电子与电气

CiteScore

19.00

自引率

1.30%

发文量

135

审稿时长

3 months

期刊介绍： The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.