Sebastian Vogel, R. Raghunath, A. Guntoro, Kristof Van Laerhoven, G. Ascheid
Title: Bit-Shift-Based Accelerator for CNNs with Selectable Accuracy and Throughput
Published in: 2019 22nd Euromicro Conference on Digital System Design (DSD), 2019-08-01
DOI: 10.1109/DSD.2019.00106 (https://doi.org/10.1109/DSD.2019.00106)
Citations: 3
Abstract
Hardware accelerators for compute-intensive algorithms such as convolutional neural networks benefit from number representations with reduced precision. In this paper, we evaluate and extend a number representation based on power-of-two quantization, which enables multiplications to be processed as bit shifts. We found that the weights of a neural network can be represented either by a single 4-bit power-of-two value or by two 4-bit values, depending on accuracy requirements. We evaluate the classification accuracy of VGG-16 and ResNet50 on the ImageNet dataset with weights represented in our novel number format. To include a more complex task, we additionally evaluate the format on two networks for semantic segmentation. Furthermore, we design a novel processing element based on bit shifts that can be configured to favor throughput (4-bit mode) or accuracy (8-bit mode). We evaluate this processing element in an FPGA implementation of a dedicated accelerator for neural networks incorporating a 32-by-64 processing array running at 250 MHz with 1 TOp/s peak throughput in 8-bit mode. The accelerator is capable of processing regular convolutional layers and dilated convolutions in combination with pooling and upsampling. For a semantic segmentation network with 108.5 GOp/frame, our FPGA implementation achieves a throughput of 7.0 FPS in the accurate 8-bit mode and up to 11.2 FPS in the 4-bit mode, corresponding to 760.1 GOp/s and 1,218 GOp/s effective throughput, respectively. Finally, we compare the novel design to classical multiplier-based approaches in terms of FPGA utilization and power consumption. Our novel multiply-accumulate engines designed for the optimized number representation use 9% fewer logic elements while allowing double the throughput of a classical implementation. Moreover, a measurement shows a 25% reduction in power consumption at the same throughput.
Therefore, our flexible design offers a solution to the trade-off between energy efficiency, accuracy, and high throughput.
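To make the idea of power-of-two weights and shift-based multiplication concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the bit layout, so everything here is an assumption for illustration: each 4-bit code is modeled as one sign bit plus a 3-bit field, where field values 0 to 6 are shift amounts (magnitudes 2^0 down to 2^-6) and the hypothetical value 7 is reserved for zero. The second 4-bit value of the 8-bit mode is modeled as a power-of-two approximation of the residual left by the first. All function names are hypothetical.

```python
import math

# Assumed toy layout (not taken from the paper): sign bit + 3-bit field.
MAX_SHIFT = 6   # field values 0..6 encode magnitudes 2**0 .. 2**-6
ZERO_CODE = 7   # hypothetical reserved field value meaning "weight is zero"


def quantize_pow2(w):
    """Map a real weight to (sign, field) by rounding its exponent in the log domain."""
    if w == 0.0:
        return (1, ZERO_CODE)
    sign = -1 if w < 0 else 1
    shift = round(-math.log2(abs(w)))
    if shift > MAX_SHIFT:
        return (sign, ZERO_CODE)   # too small for this toy format: flush to zero
    return (sign, max(shift, 0))   # magnitudes above 1 saturate at 2**0


def dequantize(term):
    """Real value represented by one (sign, field) code."""
    sign, field = term
    return 0.0 if field == ZERO_CODE else sign * 2.0 ** -field


def quantize_two_terms(w):
    """8-bit mode sketch: the second code approximates the residual of the first."""
    first = quantize_pow2(w)
    second = quantize_pow2(w - dequantize(first))
    return (first, second)


def shift_multiply(activation, term):
    """Multiply an integer activation by one code using only a shift and a negation.

    The result is fixed-point with MAX_SHIFT fractional bits, so no bits are lost.
    """
    sign, field = term
    if field == ZERO_CODE:
        return 0
    product = activation << (MAX_SHIFT - field)
    return -product if sign < 0 else product


def shift_mac(activation, terms):
    """Accumulate the shifted products of all codes that make up one weight."""
    return sum(shift_multiply(activation, t) for t in terms)


if __name__ == "__main__":
    w, a = 0.3, 10
    one = (quantize_pow2(w),)
    two = quantize_two_terms(w)
    for name, terms in (("4-bit mode", one), ("8-bit mode", two)):
        approx = sum(dequantize(t) for t in terms)
        result = shift_mac(a, terms) / 2 ** MAX_SHIFT
        print(name, terms, "weight ~", approx, "a*w ~", result)
```

Under these assumptions, a weight of 0.3 becomes 0.25 with a single code and 0.25 + 0.0625 = 0.3125 with two codes, which illustrates why a second 4-bit value can buy accuracy at the cost of a second shift-and-add per weight.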