{"title":"Gated Slot Attention for Efficient Linear-Time Sequence Modeling","authors":"Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu","doi":"arxiv-2409.07146","DOIUrl":null,"url":null,"abstract":"Linear attention Transformers and their gated variants, celebrated for\nenabling parallel training and efficient recurrent inference, still fall short\nin recall-intensive tasks compared to traditional Transformers and demand\nsignificant resources for training from scratch. This paper introduces Gated\nSlot Attention (GSA), which enhances Attention with Bounded-memory-Control\n(ABC) by incorporating a gating mechanism inspired by Gated Linear Attention\n(GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing\ncontext-aware memory reading and adaptive forgetting to improve memory capacity\nwhile maintaining compact recurrent state size. This design greatly enhances\nboth training and inference efficiency through GLA's hardware-efficient\ntraining algorithm and reduced state size. Additionally, retaining the softmax\noperation is particularly beneficial in \"finetuning pretrained Transformers to\nRNNs\" (T2R) settings, reducing the need for extensive training from scratch.\nExtensive experiments confirm GSA's superior performance in scenarios requiring\nin-context recall and in T2R settings.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"34 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07146","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short of traditional Transformers on recall-intensive tasks and demand significant resources to train from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, using context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining a compact recurrent state. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and the reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
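
For intuition, the sketch below illustrates the recurrent view the abstract describes: two GLA-style gated slot memories (keys and values) whose per-slot forget gates implement adaptive forgetting, linked at read time by a softmax over the slots. This is a minimal illustration under assumptions; the shapes, the gate parameterization, and the names gsa_step, K_mem, V_mem, and alpha_t are illustrative, not the authors' reference implementation (which the abstract notes trains with GLA's hardware-efficient algorithm rather than this step-by-step loop).

```python
# Hypothetical recurrent-form sketch of a gated-slot update with a softmax read.
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn.functional as F

def gsa_step(q_t, k_t, v_t, alpha_t, K_mem, V_mem):
    """One recurrent step over m bounded memory slots.

    q_t, k_t : (d_k,)   query / key at step t
    v_t      : (d_v,)   value at step t
    alpha_t  : (m,)     per-slot forget gate in (0, 1)
    K_mem    : (m, d_k) bounded key memory (recurrent state)
    V_mem    : (m, d_v) bounded value memory (recurrent state)
    """
    # Gated, GLA-style state updates: each slot decays by its gate and
    # absorbs the new key/value in proportion to (1 - gate).
    K_mem = alpha_t[:, None] * K_mem + (1.0 - alpha_t)[:, None] * k_t
    V_mem = alpha_t[:, None] * V_mem + (1.0 - alpha_t)[:, None] * v_t

    # Context-aware memory read: a softmax over the m slots links the two
    # gated memories ("two-layer GLA linked via softmax").
    slot_scores = F.softmax(K_mem @ q_t, dim=-1)   # (m,)
    o_t = V_mem.transpose(0, 1) @ slot_scores      # (d_v,)
    return o_t, K_mem, V_mem

# Toy usage: the recurrent state stays at (m, d_k) + (m, d_v) floats
# no matter how long the sequence is, which is the compact state size
# the abstract refers to.
d_k, d_v, m = 64, 64, 16
K_mem, V_mem = torch.zeros(m, d_k), torch.zeros(m, d_v)
for t in range(128):
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha_t = torch.sigmoid(torch.randn(m))  # in practice the gate is input-dependent
    o_t, K_mem, V_mem = gsa_step(q_t, k_t, v_t, alpha_t, K_mem, V_mem)
```

The softmax read is also what makes the T2R setting natural: the slot scores play the role that attention weights play in a pretrained Transformer, so finetuning can reuse that inductive bias instead of training from scratch.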