CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers

IF 3.9 | CAS Tier 3, Computer Science | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Adjorn van Engelenhoven, Nicola Strisciuglio, Estefanía Talavera
{"title":"CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers","authors":"Adjorn van Engelenhoven,&nbsp;Nicola Strisciuglio,&nbsp;Estefanía Talavera","doi":"10.1016/j.patrec.2024.08.024","DOIUrl":null,"url":null,"abstract":"<div><p>The Transformer architecture has shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, which is an inherently computationally expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequences, thus limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, used to cluster the input sequence and generate novel cluster summaries. The self-attention from within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>α</mi><mi>N</mi><mo>)</mo></mrow></mrow></math></span> where <span><math><mi>N</mi></math></span> is the sequence length, and <span><math><mi>α</mi></math></span> is constant according to the number of clusters and samples per cluster. We show that CAST performs better than or comparable to the baseline Transformers on long-range sequence modeling tasks, while also achieving higher results on time and memory efficiency than other efficient transformers.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 30-36"},"PeriodicalIF":3.9000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167865524002563/pdfft?md5=41d75a76c8436c27473bdc1f0c0144be&pid=1-s2.0-S0167865524002563-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865524002563","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The Transformer architecture has been shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, an inherently expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequence, thus limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST) to optimize the attention computation and achieve efficient Transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, which is used to cluster the input sequence and generate cluster summaries. The self-attention within each cluster is then combined with the cluster summaries of the other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from O(N²) to O(αN), where N is the sequence length and α is a constant determined by the number of clusters and the number of samples per cluster. We show that CAST performs better than or comparably to baseline Transformers on long-range sequence modeling tasks, while also achieving better time and memory efficiency than other efficient Transformers.
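The mechanism described in the abstract can be pictured with a short sketch. The code below is not the authors' implementation: it is a minimal single-head Python/PyTorch illustration in which the function name cast_attention_sketch, the fixed cluster size n_per_cluster, the hard top-k cluster assignment, and the way summaries are mixed back into the output are all assumptions made for illustration. It only shows how surrogate tokens can produce a cluster affinity matrix, how attention is restricted to within clusters, and how cluster summaries let information flow globally at roughly O(N·C) cost.

```python
# Illustrative sketch only; shapes, the top-k assignment, and the summary
# mixing step are assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def cast_attention_sketch(x, surrogate, n_per_cluster):
    """
    x:             (N, d) input token embeddings
    surrogate:     (C, d) learnable surrogate tokens (one per cluster)
    n_per_cluster: number of tokens assigned to each cluster (<= N)
    """
    N, d = x.shape
    C = surrogate.shape[0]

    # 1) Cluster affinity matrix between tokens and surrogate tokens: (N, C)
    affinity = x @ surrogate.T / d**0.5

    # 2) Hard assignment: each cluster takes its n_per_cluster most affine
    #    tokens. Under this simplification a token can end up in several
    #    clusters or in none; the paper may balance clusters differently.
    topk = affinity.T.topk(n_per_cluster, dim=-1).indices        # (C, k)

    out = torch.zeros_like(x)
    summaries = []
    for c in range(C):
        idx = topk[c]                                            # (k,)
        xc = x[idx]                                              # (k, d)
        # 3) Ordinary self-attention inside the cluster: O(k^2), not O(N^2)
        attn = F.softmax(xc @ xc.T / d**0.5, dim=-1)
        out[idx] = attn @ xc
        # 4) Cluster summary: affinity-weighted average of the cluster's tokens
        w = F.softmax(affinity[idx, c], dim=0)                   # (k,)
        summaries.append(w @ xc)                                 # (d,)
    summaries = torch.stack(summaries)                           # (C, d)

    # 5) Every token also attends to the summaries of all clusters, so
    #    information flows across the whole sequence in O(N * C).
    mix = F.softmax(affinity, dim=-1) @ summaries                # (N, d)
    return out + mix

# Usage: 8 clusters of 16 tokens each, model width 64.
x = torch.randn(128, 64)
surrogate = torch.nn.Parameter(torch.randn(8, 64))
y = cast_attention_sketch(x, surrogate, n_per_cluster=16)
print(y.shape)  # torch.Size([128, 64])
```

Because each token only ever attends to a fixed number of in-cluster tokens plus C summaries, the per-token cost is constant in N, which is where the O(αN) overall complexity in the abstract comes from.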

Source journal
Pattern Recognition Letters (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles per year: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.