Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer

Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang
{"title":"变压器中多头注意和位置前馈的硬件加速器","authors":"Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524802","DOIUrl":null,"url":null,"abstract":"Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data flow being involved, the hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. Firstly, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Secondly, the computation flow is well designed to ensure the high hardware utilization of the systolic array, which is the biggest module in our design. Thirdly, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and also the latency of the entire system. Our design is coded using hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with the implementation on GPU with the same setting, the proposed design demonstrates a speed-up of 14.6 x in the MHA ResBlock, and 3.4 x in the FFN ResBlock, respectively. Therefore, this work lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":"{\"title\":\"Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer\",\"authors\":\"Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang\",\"doi\":\"10.1109/socc49529.2020.9524802\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data flow being involved, the hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. Firstly, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Secondly, the computation flow is well designed to ensure the high hardware utilization of the systolic array, which is the biggest module in our design. 
Thirdly, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and also the latency of the entire system. Our design is coded using hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with the implementation on GPU with the same setting, the proposed design demonstrates a speed-up of 14.6 x in the MHA ResBlock, and 3.4 x in the FFN ResBlock, respectively. Therefore, this work lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.\",\"PeriodicalId\":114740,\"journal\":{\"name\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"volume\":\"62 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"41\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/socc49529.2020.9524802\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/socc49529.2020.9524802","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 41

Abstract

There is strong demand for hardware accelerators for deep neural networks (DNNs), yet most existing accelerators target either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model has been replacing RNNs in the natural language processing (NLP) area. However, because it involves intensive matrix computations and a complicated data flow, no hardware design for the Transformer model had previously been reported. In this paper, we propose the first hardware accelerator for two key components, the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. First, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Second, the computation flow is carefully designed to keep the systolic array, the largest module in our design, highly utilized. Third, the complicated nonlinear functions are highly optimized to further reduce both the hardware complexity and the latency of the entire system. Our design is coded in a hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with a GPU implementation under the same settings, the proposed design achieves speed-ups of 14.6× in the MHA ResBlock and 3.4× in the FFN ResBlock, respectively. This work thus lays a solid foundation for building efficient hardware accelerators for various Transformer networks.
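For context, the two ResBlocks the accelerator targets compute the standard Transformer operations. The NumPy sketch below follows the original Transformer formulation rather than anything specific to this accelerator; the layer-norm placement, the ReLU inner activation, and the absence of learned normalization parameters are simplifying assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Simplified layer norm (no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mha_resblock(x, Wq, Wk, Wv, Wo, h):
    # x: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h attention heads.
    n, d = x.shape
    dk = d // h
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):
        s = slice(i * dk, (i + 1) * dk)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(dk))  # (n, n) scores
        heads.append(att @ v[:, s])
    out = np.concatenate(heads, axis=-1) @ Wo
    return layer_norm(x + out)  # residual connection -> "ResBlock"

def ffn_resblock(x, W1, b1, W2, b2):
    # Position-wise FFN: the same two dense layers applied at every position.
    out = np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # assumed ReLU activation
    return layer_norm(x + out)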
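The abstract does not spell out the partitioning method, but the general idea of blocking large matrix products into fixed-size tiles, so that the differently shaped MHA and FFN matrices can all be streamed through one compute fabric, can be sketched as follows. The tile size and loop order here are illustrative assumptions only.

```python
import numpy as np

TILE = 64  # assumed systolic-array dimension (TILE x TILE processing elements)

def tiled_matmul(A, B, tile=TILE):
    # Block C = A @ B into tile x tile sub-products, mimicking how a
    # fixed-size systolic array would be fed. Dimensions are assumed to
    # be padded to multiples of `tile`.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # One pass through the array: a tile x tile MAC burst.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

Because every large matrix product in both ResBlocks reduces to a stream of identical tile-sized sub-products, a single systolic array can serve the MHA and FFN paths alike, which is the resource-sharing property the abstract highlights.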
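Likewise, the abstract does not say how the nonlinear functions are optimized. One common hardware technique, shown here purely as an illustration and not as the paper's method, is to replace the exponential in softmax with a small lookup table, trading a complex floating-point unit for one memory read per element. The table size and input range below are hypothetical.

```python
import numpy as np

# Hypothetical LUT-based exp for softmax: 1024 entries over [-8, 0].
# After max-subtraction, all softmax inputs are <= 0, so this range
# covers everything that contributes meaningfully to the sum.
LUT_SIZE, LO = 1024, -8.0
LUT = np.exp(np.linspace(LO, 0.0, LUT_SIZE))

def exp_lut(x):
    # Clamp to the table range and index; values below LO saturate to
    # exp(LO) ~ 3.4e-4, a negligible error for softmax.
    idx = np.clip(((x - LO) / -LO * (LUT_SIZE - 1)).astype(int), 0, LUT_SIZE - 1)
    return LUT[idx]

def softmax_lut(x):
    x = x - x.max(axis=-1, keepdims=True)  # inputs now lie in (-inf, 0]
    e = exp_lut(x)
    return e / e.sum(axis=-1, keepdims=True)
```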