Title: DESA: Dataflow Efficient Systolic Array for Acceleration of Transformers
Authors: Zhican Wang; Hongxiang Fan; Guanghui He
DOI: 10.1109/TC.2025.3549621
Journal: IEEE Transactions on Computers, vol. 74, no. 6, pp. 2058-2072 (Q2, Computer Science, Hardware & Architecture; Impact Factor 3.6)
Publication date: 2025-03-10
URL: https://ieeexplore.ieee.org/document/10918723/
Citations: 0
Abstract
Transformers have become prevalent in various Artificial Intelligence (AI) applications, spanning natural language processing to computer vision. Owing to their suboptimal performance on general-purpose platforms, various domain-specific accelerators that explore and exploit model sparsity have been developed. Instead, we conduct a quantitative analysis of Transformers (which fall into three types: Encoder-Only, Decoder-Only, and Encoder-Decoder; this paper focuses on Encoder-Only Transformers) to identify key inefficiencies and adopt dataflow optimization to address them. These inefficiencies arise from 1) diverse matrix multiplications, 2) multi-phase non-linear operations and their dependencies, and 3) heavy memory requirements. We introduce a novel dataflow design that supports decoupling with latency hiding, effectively reducing the dependencies and addressing the performance bottlenecks of non-linear operations. To enable fully fused attention computation, we propose practical tiling and mapping strategies that sustain high throughput and notably decrease memory requirements from $O(N^{2}H)$ to $O(N)$. A hybrid buffer-level reuse strategy is also introduced to enhance utilization and diminish off-chip access. Based on these optimizations, we propose a novel systolic array design, named DESA, with three innovations: 1) a reconfigurable vector processing unit (VPU) and immediate processing units (IPUs) that can be seamlessly fused within the systolic array to support various normalization, post-processing, and transposition operations with efficient latency hiding; 2) a hybrid stationary systolic array that improves compute and memory efficiency for matrix multiplications with diverse operational intensity and characteristics; and 3) a novel tile fusion scheme that efficiently addresses the low-utilization issue of conventional systolic arrays during data setup and offloading.
Across various benchmarks, extensive experiments demonstrate that DESA achieves $5.0\times$–$8.3\times$ energy saving over an NVIDIA RTX 3090 GPU and $25.6\times$–$88.4\times$ over an Intel Xeon 6226R CPU. Compared to state-of-the-art (SOTA) accelerators, DESA achieves $11.6\times$–$15.0\times$ speedup and up to $2.3\times$ energy saving.
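The abstract's claim of reducing attention memory from $O(N^{2}H)$ to $O(N)$ rests on fully fused, tiled attention: the $N \times N$ score matrix is never materialized, and only per-row running statistics are kept. The paper does not give its hardware dataflow here, so the sketch below is a generic software illustration of that class of technique (online-softmax tiling, as popularized by FlashAttention), with hypothetical function and tile-size names, not DESA's actual design:

```python
# Illustrative sketch only (NOT DESA's hardware dataflow): tiled attention that
# streams key/value tiles and keeps O(N) running state per query row, instead of
# materializing the full N x N score matrix. Tile size and names are hypothetical.
import numpy as np

def fused_attention(Q, K, V, tile=64):
    """Compute softmax(Q K^T / sqrt(d)) @ V one key/value tile at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of scores      (O(N) state)
    l = np.zeros(N)           # running softmax denominator per row (O(N) state)
    for j in range(0, K.shape[0], tile):
        S = Q @ K[j:j + tile].T * scale        # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated row max
        p = np.exp(S - m_new[:, None])         # tile-local exponentials
        corr = np.exp(m - m_new)               # rescale previously accumulated sums
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ V[j:j + tile]
        m = m_new
    return out / l[:, None]
```

The running-max correction (`corr`) is what makes the tile-by-tile accumulation numerically equivalent to a single softmax over the full row, which is the property that lets the score matrix stay off-chip-free; a hardware pipeline can overlap this rescaling with the next tile's matrix multiply, in the spirit of the latency hiding the abstract describes.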
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.