LBFP: Logarithmic Block Floating Point Arithmetic for Deep Neural Networks

Chao Ni, Jinming Lu, Jun Lin, Zhongfeng Wang
Published in: 2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)
Publication date: 2020-12-08
DOI: 10.1109/APCCAS50809.2020.9301687 (https://doi.org/10.1109/APCCAS50809.2020.9301687)
Citations: 2

Abstract

Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to preserve the accuracy of a quantized model. In addition, DNNs involve massive numbers of multiplication operations, which have much higher computational complexity than addition operations. To address these two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. First, logarithmic arithmetic is employed to convert multiplication operations into addition and shift operations. Then, Kullback-Leibler divergence is used to determine the shared exponent before inference. Thus, LBFP can significantly reduce hardware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support the computation of LBFP. Hardware synthesis results show that our 8-bit LBFP multiplier reduces power and area by 53% and 45%, respectively, compared with a traditional 8-bit fixed-point multiplier. Finally, a software library is developed in CUDA-C to evaluate the inference accuracy of LBFP. Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing great potential for efficient DNN inference acceleration.
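The abstract does not spell out the LBFP bit layout or the calibration procedure, so the following Python sketch is only an illustration of the two ideas it names, not the authors' implementation. It assumes a hypothetical encoding in which each value in a block is stored as a sign plus a small non-negative integer `code`, with magnitude `2**(shared_exp - code)`. Under that assumption a product needs only an integer addition of exponents followed by a shift, and a shared exponent can be picked by comparing histograms of the original and quantized values with Kullback-Leibler divergence. All function names, the bit width, and the histogram binning are assumptions made for this example.

```python
import math

def log_quantize(x, shared_exp, bits=4):
    """Encode x as (sign, code) with |x| ~= 2**(shared_exp - code).
    A code of None stands for zero. Hypothetical format, for illustration."""
    if x == 0:
        return (0, None)
    sign = -1 if x < 0 else 1
    code = round(shared_exp - math.log2(abs(x)))
    if code > 2 ** bits - 1:
        return (0, None)          # too small for this block: flush to zero
    return (sign, max(code, 0))   # too large for this block: saturate

def log_dequantize(q, shared_exp):
    sign, code = q
    if code is None:
        return 0.0
    return sign * math.ldexp(1.0, shared_exp - code)

def log_multiply(qa, qb, exp_a, exp_b):
    """Multiply two encoded values using exponent addition and a shift."""
    (sa, ca), (sb, cb) = qa, qb
    if ca is None or cb is None:
        return 0.0
    exponent = (exp_a + exp_b) - (ca + cb)      # integer additions only
    # sa * sb is just a sign flip; ldexp plays the role of the shift.
    return sa * sb * math.ldexp(1.0, exponent)

def histogram(values, num_bins, top):
    """Normalized histogram of |v| over [0, top], clamping overflow."""
    counts = [0] * num_bins
    for v in values:
        idx = min(int(abs(v) / top * num_bins), num_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), with q floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def choose_shared_exponent(values, candidates, bits=4, num_bins=32):
    """Pick the candidate exponent whose quantized values stay closest
    (in KL divergence) to the original distribution.
    Assumes values is non-empty and not all zero."""
    top = max(abs(v) for v in values)
    ref = histogram(values, num_bins, top)
    best = None
    for e in candidates:
        approx = [log_dequantize(log_quantize(v, e, bits), e) for v in values]
        score = kl_divergence(ref, histogram(approx, num_bins, top))
        if best is None or score < best[0]:
            best = (score, e)
    return best[1]

block = [1.0, 0.5, 0.25, 0.125, 0.0625]
shared = choose_shared_exponent(block, candidates=[4, 0], bits=2)
product = log_multiply(log_quantize(0.5, 0), log_quantize(-0.25, 0), 0, 0)
```

In this sketch the selection runs once, offline, on calibration data, which matches the abstract's statement that the shared exponent is determined before inference; at inference time only `log_multiply`-style additions and shifts would be needed.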