Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow

Impact factor 5.6; Region 1 (Engineering Technology); JCR Q1 (Engineering, Electrical & Electronic)
Yubin Qin;Yang Wang;Dazheng Deng;Xiaolong Yang;Zhiren Zhao;Yang Zhou;Yuanqi Fan;Jingchuan Wei;Tianbao Chen;Leibo Liu;Shaojun Wei;Yang Hu;Shouyi Yin
IEEE Journal of Solid-State Circuits, vol. 59, no. 10, pp. 3342-3356. Publication date: 2024-03-14.
DOI: 10.1109/JSSC.2024.3397189
https://ieeexplore.ieee.org/document/10530252/
Citations: 0

Abstract

The transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, this performance comes at the cost of substantial computational complexity, and power and throughput constraints limit the deployment of transformers from cloud to edge. There are two main challenges in designing a transformer accelerator for practical tasks. First, a transformer's bottleneck shifts as the input length changes: for short inputs, such as a vision transformer (ViT) on ImageNet or bidirectional encoder representations from transformers (BERT) on the general language understanding evaluation (GLUE) benchmark, the linear layers of the model are the computational bottleneck. In contrast, for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit different computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of attention computation and linear layers and thereby improving throughput whichever bottleneck a given input length exposes. Furthermore, Ayaka optimizes the sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design that dynamically adjusts the stationary matrix operands based on the current computation to maximize on-chip data reuse and reduce memory footprint. With these features, Ayaka is, to date, the first accelerator that accelerates the whole attention layer. Evaluation on 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is $1.20$–$258.9\times$ higher than state-of-the-art works.
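The abstract names two ideas that are easy to illustrate in software: predicting which attention scores matter from a random projection (RP), and the translation invariance of softmax, i.e. softmax(x - c) = softmax(x) for any constant c. The abstract does not describe Ayaka's actual prediction algorithm or datapath, so the following is only a minimal sketch under stated assumptions: a random matrix with entries of +1/sqrt(k) or -1/sqrt(k), top-k key selection, and exact recomputation of the selected scores. All function names and parameters are hypothetical.

```python
import math
import random


def random_projection(dim, k, rng):
    """Build a k x dim matrix with entries +-1/sqrt(k) (assumed scheme).

    With this scaling, the expected value of (Rq).(Rv) equals q.v,
    so low-dimensional dot products estimate the full-size scores.
    """
    scale = 1.0 / math.sqrt(k)
    return [[rng.choice((-scale, scale)) for _ in range(dim)] for _ in range(k)]


def project(R, v):
    """Multiply matrix R by vector v."""
    return [sum(r * x for r, x in zip(row, v)) for row in R]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs, shift=None):
    """Softmax with an arbitrary shift constant.

    Subtracting any constant from every input leaves the result unchanged,
    so the shift can be the true maximum (default) or any other value,
    such as an estimate, as long as exp() does not overflow.
    """
    c = max(xs) if shift is None else shift
    exps = [math.exp(x - c) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention_weights(q, keys, R, top_k):
    """Pick top_k keys by estimated score, then run softmax on exact scores.

    Hypothetical flow: cheap estimate -> select -> exact compute on survivors.
    Returns the kept key indices (ascending) and their attention weights.
    """
    q_low = project(R, q)
    estimates = [dot(q_low, project(R, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: estimates[i], reverse=True)[:top_k]
    keep.sort()
    exact = [dot(q, keys[i]) for i in keep]
    return keep, softmax(exact)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, low_dim, n_keys = 64, 16, 32
    R = random_projection(dim, low_dim, rng)
    q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_keys)]
    kept, weights = sparse_attention_weights(q, keys, R, top_k=8)
    print("kept keys:", kept)
    print("weights sum:", sum(weights))
```

In this sketch the estimate costs k multiplications per key instead of dim, and only the selected keys pay for an exact dot product. The shift argument of softmax shows why the invariance is useful in practice: the constant subtracted before exponentiation does not have to be the exact maximum, which removes a dependency on seeing every score first.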
Source journal
IEEE Journal of Solid-State Circuits (Engineering Technology: Engineering, Electrical & Electronic)
CiteScore: 11.00
Self-citation rate: 20.40%
Articles published: 351
Review time: 3-6 weeks
About the journal: The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits with particular emphasis on transistor-level design of integrated circuits. It also provides coverage of topics such as circuits modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.