PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks

Seyedramin Rasoulinezhad, Hao Zhou, Lingli Wang, Philip H. W. Leong
Published in: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019-04-01
DOI: 10.1109/FCCM.2019.00015 (https://doi.org/10.1109/FCCM.2019.00015)
Citations: 30

Abstract

Quantisation is a key optimisation strategy to improve the performance of floating-point deep neural network (DNN) accelerators. Digital signal processing (DSP) blocks on field-programmable gate arrays are not efficiently utilised when the accelerator precision is much lower than the DSP precision. Through three modifications to Xilinx DSP48E2 DSP blocks, we address this issue for important computations in embedded DNN accelerators, namely the standard, depth-wise, and pointwise convolutional layers. First, we propose a flexible-precision, run-time decomposable multiplier architecture for CNN implementations. Second, we propose a significant upgrade to the DSP-DSP interconnect, providing a semi-2D low-precision chaining capability which supports our low-precision multiplier. Finally, we improve data reuse via a register file which can also be configured as a FIFO. Compared with the 27 × 18-bit mode in the Xilinx DSP48E2, our Precision, Interconnect, and Reuse-optimised DSP (PIR-DSP) offers a 6× improvement in multiply-accumulate operations per DSP in the 9 × 9-bit case, 12× for 4 × 4 bits, and 24× for 2 × 2 bits. We estimate that PIR-DSP decreases the run-time energy to 31/19/13% of the original value in a 9/4/2-bit MobileNet-v2 DNN implementation.
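To make the idea of a decomposable multiplier concrete, the following is a minimal Python sketch, not the authors' circuit. It assumes unsigned operands and uses hypothetical helper names (`split_chunks`, `decomposed_multiply`, `independent_dot_product`). It shows that a 27 × 18-bit product can be assembled from 3 × 2 = 6 partial products of 9 × 9 bits, which is consistent with the 6× figure quoted for the 9 × 9-bit case. The same six small multipliers could instead take six unrelated operand pairs and sum the results without shifts. How PIR-DSP reaches 12× and 24× at 4 and 2 bits is not spelled out in the abstract and is not modelled here.

```python
def split_chunks(value, width, chunk):
    """Split an unsigned `width`-bit value into little-endian `chunk`-bit pieces."""
    mask = (1 << chunk) - 1
    return [(value >> shift) & mask for shift in range(0, width, chunk)]


def decomposed_multiply(a, b, a_width=27, b_width=18, chunk=9):
    """Full-precision mode: build a*b from chunk-by-chunk partial products.

    Returns the product and the number of small multipliers used.
    Unsigned arithmetic only (an assumption made for simplicity).
    """
    a_parts = split_chunks(a, a_width, chunk)
    b_parts = split_chunks(b, b_width, chunk)
    total = 0
    for i, a_part in enumerate(a_parts):
        for j, b_part in enumerate(b_parts):
            # Each partial product is weighted by the position of its chunks.
            total += (a_part * b_part) << (chunk * (i + j))
    return total, len(a_parts) * len(b_parts)


def independent_dot_product(xs, ws):
    """Low-precision mode: each small multiplier gets its own operand pair,
    and the products are accumulated without positional shifts."""
    return sum(x * w for x, w in zip(xs, ws))


a, b = 123_456_789, 200_000          # fit in 27 and 18 bits respectively
product, multipliers = decomposed_multiply(a, b)
print(product == a * b, multipliers)
print(independent_dot_product([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]))
```

The design point illustrated is that the partial-product array is the shared resource: the two modes differ only in how operands are routed to it and how its outputs are combined.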