Title: DESA: Dataflow Efficient Systolic Array for Acceleration of Transformers
Authors: Zhican Wang; Hongxiang Fan; Guanghui He
DOI: 10.1109/TC.2025.3549621
Journal: IEEE Transactions on Computers, vol. 74, no. 6, pp. 2058-2072 (Q2, Computer Science, Hardware & Architecture; Impact Factor 3.6)
Publication date: 2025-03-10
URL: https://ieeexplore.ieee.org/document/10918723/
Citations: 0
Abstract
Transformers have become prevalent in various Artificial Intelligence (AI) applications, spanning natural language processing to computer vision. Owing to their suboptimal performance on general-purpose platforms, various domain-specific accelerators that explore and exploit model sparsity have been developed. Instead, we conduct a quantitative analysis of Transformers (which fall into three types: Encoder-Only, Decoder-Only, and Encoder-Decoder; this paper focuses on Encoder-Only Transformers) to identify key inefficiencies and adopt dataflow optimization to address them. These inefficiencies arise from 1) diverse matrix multiplications, 2) multi-phase non-linear operations and their dependencies, and 3) heavy memory requirements. We introduce a novel dataflow design that supports decoupling with latency hiding, effectively reducing the dependencies and addressing the performance bottlenecks of non-linear operations. To enable fully fused attention computation, we propose practical tiling and mapping strategies that sustain high throughput and notably decrease memory requirements from $O(N^{2}H)$ to $O(N)$. A hybrid buffer-level reuse strategy is also introduced to enhance utilization and diminish off-chip access. Based on these optimizations, we propose a novel systolic array design, named DESA, with three innovations: 1) a reconfigurable vector processing unit (VPU) and immediate processing units (IPUs) that can be seamlessly fused within the systolic array to support various normalization, post-processing, and transposition operations with efficient latency hiding; 2) a hybrid stationary systolic array that improves compute and memory efficiency for matrix multiplications with diverse operational intensity and characteristics; and 3) a novel tile fusion scheme that efficiently addresses the low-utilization issue of conventional systolic arrays during data setup and offloading.
Across various benchmarks, extensive experiments demonstrate that DESA achieves $5.0\times$–$8.3\times$ energy saving over an NVIDIA RTX 3090 GPU and $25.6\times$–$88.4\times$ over an Intel Xeon 6226R CPU. Compared to state-of-the-art (SOTA) accelerators, DESA achieves $11.6\times$–$15.0\times$ speedup and up to $2.3\times$ energy saving.
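The abstract's claim of reducing attention memory from $O(N^{2}H)$ to $O(N)$ rests on fully fused, tiled attention: the $N \times N$ score matrix is never materialized, and only per-row running statistics are kept. The paper does not give its hardware dataflow here, so the sketch below is a generic software illustration of that class of technique (online-softmax tiling, as popularized by FlashAttention), with hypothetical function and tile-size names, not DESA's actual design:

```python
# Illustrative sketch only (NOT DESA's hardware dataflow): tiled attention that
# streams key/value tiles and keeps O(N) running state per query row, instead of
# materializing the full N x N score matrix. Tile size and names are hypothetical.
import numpy as np

def fused_attention(Q, K, V, tile=64):
    """Compute softmax(Q K^T / sqrt(d)) @ V one key/value tile at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of scores      (O(N) state)
    l = np.zeros(N)           # running softmax denominator per row (O(N) state)
    for j in range(0, K.shape[0], tile):
        S = Q @ K[j:j + tile].T * scale        # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated row max
        p = np.exp(S - m_new[:, None])         # tile-local exponentials
        corr = np.exp(m - m_new)               # rescale previously accumulated sums
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ V[j:j + tile]
        m = m_new
    return out / l[:, None]
```

The running-max correction (`corr`) is what makes the tile-by-tile accumulation numerically equivalent to a single softmax over the full row, which is the property that lets the score matrix stay off-chip-free; a hardware pipeline can overlap this rescaling with the next tile's matrix multiply, in the spirit of the latency hiding the abstract describes.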
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.