PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2019-04-01 DOI:10.1109/FCCM.2019.00015

Seyedramin Rasoulinezhad, Hao Zhou, Lingli Wang, Philip H. W. Leong

{"title":"PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks","authors":"Seyedramin Rasoulinezhad, Hao Zhou, Lingli Wang, Philip H. W. Leong","doi":"10.1109/FCCM.2019.00015","DOIUrl":null,"url":null,"abstract":"Quantisation is a key optimisation strategy to improve the performance of floating-point deep neural network (DNN) accelerators. Digital signal processing (DSP) blocks on field-programmable gate arrays are not efficiently utilised when the accelerator precision is much lower than the DSP precision. Through three modifications to Xilinx DSP48E2 DSP blocks, we address this issue for important computations in embedded DNN accelerators, namely the standard, depth-wise, and pointwise convolutional layers. First, we propose a flexible precision, run-time decomposable multiplier architecture for CNN implementations. Second, we propose a significant upgrade to DSPDSP interconnect, providing a semi-2D low precision chaining capability which supports our low-precision multiplier. Finally, we improve data reuse via a register file which can also be configured as FIFO. Compared with the 27 × 18-bit mode in the Xilinx DSP48E2, our Precision, Interconnect, and Reuseoptimised DSP (PIR-DSP) offers a 6× improvement in multiplyaccumulate operations per DSP in the 9 × 9-bit case, 12× for 4 × 4 bits, and 24× for 2 × 2 bits. We estimate that PIR-DSP decreases the run time energy to 31/19/13% of the original value in a 9/4/2-bit MobileNet-v2 DNN implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2019.00015","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 30

Abstract

Quantisation is a key optimisation strategy to improve the performance of floating-point deep neural network (DNN) accelerators. Digital signal processing (DSP) blocks on field-programmable gate arrays are not efficiently utilised when the accelerator precision is much lower than the DSP precision. Through three modifications to Xilinx DSP48E2 DSP blocks, we address this issue for important computations in embedded DNN accelerators, namely the standard, depth-wise, and pointwise convolutional layers. First, we propose a flexible precision, run-time decomposable multiplier architecture for CNN implementations. Second, we propose a significant upgrade to DSPDSP interconnect, providing a semi-2D low precision chaining capability which supports our low-precision multiplier. Finally, we improve data reuse via a register file which can also be configured as FIFO. Compared with the 27 × 18-bit mode in the Xilinx DSP48E2, our Precision, Interconnect, and Reuseoptimised DSP (PIR-DSP) offers a 6× improvement in multiplyaccumulate operations per DSP in the 9 × 9-bit case, 12× for 4 × 4 bits, and 24× for 2 × 2 bits. We estimate that PIR-DSP decreases the run time energy to 31/19/13% of the original value in a 9/4/2-bit MobileNet-v2 DNN implementation.

查看原文本刊更多论文

PIR-DSP:一种用于多精度深度神经网络的FPGA DSP块结构

量化是提高浮点深度神经网络(DNN)加速器性能的关键优化策略。当加速器精度远低于DSP精度时，现场可编程门阵列上的数字信号处理(DSP)模块得不到有效利用。通过对Xilinx DSP48E2 DSP模块的三次修改，我们解决了嵌入式DNN加速器中重要计算的这个问题，即标准层、深度层和点卷积层。首先，我们为CNN的实现提出了一个灵活的精度、运行时可分解的乘法器架构。其次，我们提出对DSPDSP互连进行重大升级，提供支持我们的低精度乘法器的半2d低精度链功能。最后，我们通过一个也可以配置为FIFO的寄存器文件来提高数据重用。与Xilinx DSP48E2中的27 × 18位模式相比，我们的Precision, Interconnect和reuse - optimization DSP (PIR-DSP)在9 × 9位情况下每个DSP的乘法累加操作提高了6倍，4× 4位提高了12倍，2× 2位提高了24倍。我们估计在9/4/2位MobileNet-v2 DNN实现中，PIR-DSP将运行时能量降低到原始值的31/19/13%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

自引率

0.00%

发文量