{"title":"深度神经网络的对数块浮点算法","authors":"Chao Ni, Jinming Lu, Jun Lin, Zhongfeng Wang","doi":"10.1109/APCCAS50809.2020.9301687","DOIUrl":null,"url":null,"abstract":"Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to keep the accuracy of a quantized model. Besides, DNNs involve massive multiplication operations, which are of much higher computational complexities compared with addition operations. To deal with the two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. Firstly, logarithmic arithmetic is employed to convert multiplication operations to addition and shift operations. Then, Kullback-Leibler divergence is used to determine the shared exponent before inference. Thus, LBFP can significantly reduce the hard-ware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support the computation of LBFP. Hardware synthesis results show that our 8-bit LBFP multiplier can reduce power and area by 53% and 45%, respectively, compared with the 8-bit traditional fixed-point multiplier. Finally, a software library is developed with the CUDA-C language to evaluate the inference accuracy of LBFP. Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing the great potential in efficient DNN inference acceleration.","PeriodicalId":127075,"journal":{"name":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"LBFP: Logarithmic Block Floating Point Arithmetic for Deep Neural Networks\",\"authors\":\"Chao Ni, Jinming Lu, Jun Lin, Zhongfeng Wang\",\"doi\":\"10.1109/APCCAS50809.2020.9301687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to keep the accuracy of a quantized model. Besides, DNNs involve massive multiplication operations, which are of much higher computational complexities compared with addition operations. To deal with the two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. Firstly, logarithmic arithmetic is employed to convert multiplication operations to addition and shift operations. Then, Kullback-Leibler divergence is used to determine the shared exponent before inference. Thus, LBFP can significantly reduce the hard-ware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support the computation of LBFP. Hardware synthesis results show that our 8-bit LBFP multiplier can reduce power and area by 53% and 45%, respectively, compared with the 8-bit traditional fixed-point multiplier. Finally, a software library is developed with the CUDA-C language to evaluate the inference accuracy of LBFP. 
Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing the great potential in efficient DNN inference acceleration.\",\"PeriodicalId\":127075,\"journal\":{\"name\":\"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/APCCAS50809.2020.9301687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APCCAS50809.2020.9301687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LBFP: Logarithmic Block Floating Point Arithmetic for Deep Neural Networks
Fixed-point quantization techniques have attracted considerable attention in deep neural network (DNN) inference acceleration. Nevertheless, they often require time-consuming fine-tuning or retraining to preserve the accuracy of the quantized model. In addition, DNNs involve massive numbers of multiplication operations, which have much higher computational complexity than additions. To address these two problems, we propose an improved numerical format named logarithmic block floating point (LBFP) for post-training quantization. First, logarithmic arithmetic is employed to convert multiplication operations into addition and shift operations. Then, the Kullback-Leibler divergence is used to determine the shared exponent before inference. Thus, LBFP can significantly reduce hardware complexity with negligible performance loss. Moreover, an efficient hardware architecture is designed to support LBFP computation. Hardware synthesis results show that our 8-bit LBFP multiplier reduces power and area by 53% and 45%, respectively, compared with a traditional 8-bit fixed-point multiplier. Finally, a software library is developed in CUDA C to evaluate the inference accuracy of LBFP. Without retraining, the accuracy of the selected DNN models with the 8-bit LBFP representation is comparable to that of the corresponding 32-bit floating-point baselines, showing great potential for efficient DNN inference acceleration.
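The abstract only outlines the LBFP scheme, so the NumPy sketch below illustrates the two ideas it names under stated assumptions: a block of values shares one exponent, each block-scaled value is rounded to a signed power of two so that a multiplication reduces to a sign flip plus an exponent addition (a shift in hardware), and the shared exponent is selected offline by minimizing the KL divergence between the original and quantized value distributions. The bit widths, exponent range, and function names are illustrative choices, not the paper's exact LBFP encoding or calibration procedure.

```python
# A minimal NumPy sketch of the ideas summarized above, under stated
# assumptions: values in a block share one exponent, each block-scaled value
# is rounded to a signed power of two so a multiply becomes a sign flip plus
# an exponent addition, and the shared exponent is picked offline by
# minimizing KL divergence between the original and quantized distributions.
# Bit widths, exponent ranges, and function names are illustrative only.
import numpy as np


def log_quantize(x, exp_min=-7):
    """Round |x| to the nearest power of two, clamp the per-element exponent
    to [exp_min, 0] (an illustrative 3-bit log field), and keep the sign.
    Exact zeros are encoded with exponent -inf."""
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.where(
        mag > 0,
        np.clip(np.round(np.log2(np.where(mag > 0, mag, 1.0))), exp_min, 0),
        -np.inf,
    )
    return sign, exp


def dequantize(sign, exp):
    """Map (sign, exponent) pairs back to real values; exp2(-inf) gives 0."""
    return sign * np.exp2(exp)


def log_multiply(s1, e1, s2, e2):
    """Multiplication in the log domain: signs multiply, exponents add,
    so hardware needs an adder and a shifter instead of a multiplier."""
    return s1 * s2, e1 + e2


def kl_divergence(p_hist, q_hist, eps=1e-12):
    """KL divergence between two histograms (smoothed to avoid empty bins)."""
    p = (p_hist + 1.0) / (p_hist + 1.0).sum()
    q = (q_hist + 1.0) / (q_hist + 1.0).sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def choose_shared_exponent(values, candidates=range(-8, 9), bins=64):
    """Calibration: pick the shared block exponent whose quantized value
    distribution is closest (smallest KL divergence) to the original one --
    a simplified stand-in for the selection step named in the abstract."""
    edges = np.linspace(0.0, np.abs(values).max() * 2.0, bins + 1)
    ref_hist, _ = np.histogram(np.abs(values), bins=edges)
    best_exp, best_kl = None, np.inf
    for e in candidates:
        sign, exp = log_quantize(values / np.exp2(e))
        recon = dequantize(sign, exp) * np.exp2(e)
        q_hist, _ = np.histogram(np.abs(recon), bins=edges)
        kl = kl_divergence(ref_hist, q_hist)
        if kl < best_kl:
            best_exp, best_kl = e, kl
    return best_exp


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=4096)  # stand-in for one weight block
    shared_exp = choose_shared_exponent(weights)
    sign, exp = log_quantize(weights / np.exp2(shared_exp))
    recon = dequantize(sign, exp) * np.exp2(shared_exp)
    print("shared exponent :", shared_exp)
    print("mean |error|    :", np.mean(np.abs(recon - weights)))
    # One multiply carried out in the log domain matches the product of the
    # dequantized values: only an exponent addition was needed.
    s01, e01 = log_multiply(sign[0], exp[0], sign[1], exp[1])
    print("w0*w1 (log path):", dequantize(s01, e01) * np.exp2(2 * shared_exp))
    print("w0*w1 (linear)  :", recon[0] * recon[1])
```

Running the script selects a shared exponent for a synthetic weight block, reports the reconstruction error, and shows that a product of two quantized values is recovered from a sign flip and an exponent sum; the paper's hardware design and CUDA C library address the same mapping at the circuit and kernel level.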