Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences

Impact Factor 3.6 | Zone 2, Computer Science | JCR Q2: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers Pub Date : 2024-04-16 DOI:10.1109/TC.2024.3389507

Hulin Wang;Donglin Yang;Yaqi Xia;Zheng Zhang;Qigang Wang;Jianping Fan;Xiaobo Zhou;Dazhao Cheng

Volume 73, Issue 7, Pages 1852-1865 | Journal Article | https://ieeexplore.ieee.org/document/10500743/

Citations: 0

Abstract

Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the $O(n^{2})$ complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
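The two costs the abstract names, quadratic attention and padding for variable-length inputs, can be made concrete with a small counting sketch. This is only an illustration under stated assumptions and is not Raptor-T's implementation: the abstract does not specify the sparsity pattern, so a simple sliding window is assumed here, and the function names (`dense_pairs`, `windowed_pairs`, `padded_tokens`, `pack`) are hypothetical.

```python
from itertools import accumulate


def dense_pairs(n: int) -> int:
    """Query-key pairs scored by full self-attention on n tokens: n * n."""
    return n * n


def windowed_pairs(n: int, window: int) -> int:
    """Query-key pairs when each query attends only to keys at most
    `window` positions away (an assumed sliding-window sparsity pattern).
    The count grows roughly as n * (2 * window + 1), i.e. linearly in n."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total


def padded_tokens(lengths: list[int]) -> int:
    """Tokens processed when every sequence is padded to the longest one."""
    return len(lengths) * max(lengths)


def pack(seqs: list[list[int]]) -> tuple[list[int], list[int]]:
    """Concatenate variable-length sequences without padding.
    Returns the flat token list and cumulative offsets, so that
    sequence k occupies flat[offsets[k]:offsets[k + 1]]."""
    flat = [tok for seq in seqs for tok in seq]
    offsets = [0, *accumulate(len(seq) for seq in seqs)]
    return flat, offsets


if __name__ == "__main__":
    n, w = 4096, 64
    print("dense pairs:   ", dense_pairs(n))
    print("windowed pairs:", windowed_pairs(n, w))

    batch = [[1, 2, 3], [4], [5, 6]]
    flat, offsets = pack(batch)
    print("padded tokens:", padded_tokens([len(s) for s in batch]))
    print("packed tokens:", len(flat), "offsets:", offsets)
```

For a fixed window, the windowed count scales with the sequence length rather than its square, and the packed layout processes only real tokens instead of `batch_size * max_length`. How Raptor-T actually organizes its sparse blocks and balances work across the GPU is described in the paper itself, not in this sketch.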


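Source journal

IEEE Transactions on Computers (Engineering & Technology: Electrical & Electronic Engineering)

CiteScore: 6.60

Self-citation rate: 5.40%

Articles published: 199

Review time: 6.0 months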

Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.