Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer

Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang
{"title":"变压器中多头注意和位置前馈的硬件加速器","authors":"Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524802","DOIUrl":null,"url":null,"abstract":"Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data flow being involved, the hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. Firstly, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Secondly, the computation flow is well designed to ensure the high hardware utilization of the systolic array, which is the biggest module in our design. Thirdly, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and also the latency of the entire system. Our design is coded using hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with the implementation on GPU with the same setting, the proposed design demonstrates a speed-up of 14.6 x in the MHA ResBlock, and 3.4 x in the FFN ResBlock, respectively. Therefore, this work lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":"{\"title\":\"Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer\",\"authors\":\"Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, Zhongfeng Wang\",\"doi\":\"10.1109/socc49529.2020.9524802\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data flow being involved, the hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. Firstly, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Secondly, the computation flow is well designed to ensure the high hardware utilization of the systolic array, which is the biggest module in our design. 
Thirdly, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and also the latency of the entire system. Our design is coded using hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with the implementation on GPU with the same setting, the proposed design demonstrates a speed-up of 14.6 x in the MHA ResBlock, and 3.4 x in the FFN ResBlock, respectively. Therefore, this work lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.\",\"PeriodicalId\":114740,\"journal\":{\"name\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"volume\":\"62 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"41\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/socc49529.2020.9524802\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/socc49529.2020.9524802","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 41

Abstract

There is strong demand for hardware accelerators for deep neural networks (DNNs), yet most existing accelerators target either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model has been replacing RNNs in the natural language processing (NLP) area. However, because it involves intensive matrix computations and a complicated data flow, no hardware design for the Transformer model had previously been reported. In this paper, we propose the first hardware accelerator for two key components, the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. First, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Second, the computation flow is carefully designed to keep the systolic array, the largest module in our design, highly utilized. Third, the complicated nonlinear functions are highly optimized to further reduce both the hardware complexity and the latency of the entire system. Our design is coded in a hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with a GPU implementation under the same settings, the proposed design achieves speed-ups of 14.6× in the MHA ResBlock and 3.4× in the FFN ResBlock, respectively. This work thus lays a solid foundation for building efficient hardware accelerators for various Transformer networks.
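For context, the two ResBlocks the accelerator targets compute the standard Transformer operations. The NumPy sketch below follows the original Transformer formulation rather than anything specific to this accelerator; the layer-norm placement, the ReLU inner activation, and the absence of learned normalization parameters are simplifying assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Simplified layer norm (no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mha_resblock(x, Wq, Wk, Wv, Wo, h):
    # x: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h attention heads.
    n, d = x.shape
    dk = d // h
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):
        s = slice(i * dk, (i + 1) * dk)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(dk))  # (n, n) scores
        heads.append(att @ v[:, s])
    out = np.concatenate(heads, axis=-1) @ Wo
    return layer_norm(x + out)  # residual connection -> "ResBlock"

def ffn_resblock(x, W1, b1, W2, b2):
    # Position-wise FFN: the same two dense layers applied at every position.
    out = np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # assumed ReLU activation
    return layer_norm(x + out)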
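The abstract does not spell out the partitioning method, but the general idea of blocking large matrix products into fixed-size tiles, so that the differently shaped MHA and FFN matrices can all be streamed through one compute fabric, can be sketched as follows. The tile size and loop order here are illustrative assumptions only.

```python
import numpy as np

TILE = 64  # assumed systolic-array dimension (TILE x TILE processing elements)

def tiled_matmul(A, B, tile=TILE):
    # Block C = A @ B into tile x tile sub-products, mimicking how a
    # fixed-size systolic array would be fed. Dimensions are assumed to
    # be padded to multiples of `tile`.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # One pass through the array: a tile x tile MAC burst.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

Because every large matrix product in both ResBlocks reduces to a stream of identical tile-sized sub-products, a single systolic array can serve the MHA and FFN paths alike, which is the resource-sharing property the abstract highlights.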
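Likewise, the abstract does not say how the nonlinear functions are optimized. One common hardware technique, shown here purely as an illustration and not as the paper's method, is to replace the exponential in softmax with a small lookup table, trading a complex floating-point unit for one memory read per element. The table size and input range below are hypothetical.

```python
import numpy as np

# Hypothetical LUT-based exp for softmax: 1024 entries over [-8, 0].
# After max-subtraction, all softmax inputs are <= 0, so this range
# covers everything that contributes meaningfully to the sum.
LUT_SIZE, LO = 1024, -8.0
LUT = np.exp(np.linspace(LO, 0.0, LUT_SIZE))

def exp_lut(x):
    # Clamp to the table range and index; values below LO saturate to
    # exp(LO) ~ 3.4e-4, a negligible error for softmax.
    idx = np.clip(((x - LO) / -LO * (LUT_SIZE - 1)).astype(int), 0, LUT_SIZE - 1)
    return LUT[idx]

def softmax_lut(x):
    x = x - x.max(axis=-1, keepdims=True)  # inputs now lie in (-inf, 0]
    e = exp_lut(x)
    return e / e.sum(axis=-1, keepdims=True)
```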