Bit-Shift-Based Accelerator for CNNs with Selectable Accuracy and Throughput
Sebastian Vogel, R. Raghunath, A. Guntoro, Kristof Van Laerhoven, G. Ascheid
{"title":"基于位偏移的CNNs加速器,具有可选择的精度和吞吐量","authors":"Sebastian Vogel, R. Raghunath, A. Guntoro, Kristof Van Laerhoven, G. Ascheid","doi":"10.1109/DSD.2019.00106","DOIUrl":null,"url":null,"abstract":"Hardware accelerators for compute intensive algorithms such as convolutional neural networks benefit from number representations with reduced precision. In this paper, we evaluate and extend a number representation based on power-of-two quantization enabling bit-shift-based processing of multiplications. We found that weights of a neural network can either be represented by a single 4 bit power-of-two value or with two 4 bit values depending on accuracy requirements. We evaluate the classification accuracy of VGG-16 and ResNet50 on the ImageNet dataset with weights represented in our novel number format. To include a more complex task, we additionally evaluate the format on two networks for semantic segmentation. In addition, we design a novel processing element based on bit-shifts which is configurable in terms of throughput (4 bit mode) and accuracy (8 bit mode). We evaluate this processing element in an FPGA implementation of a dedicated accelerator for neural networks incorporating a 32-by-64 processing array running at 250 MHz with 1 TOp/s peak throughput in 8 bit mode. The accelerator is capable of processing regular convolutional layers and dilated convolutions in combination with pooling and upsampling. For a semantic segmentation network with 108.5 GOp/frame, our FPGA implementation achieves a throughput of 7.0 FPS in the 8 bit accurate mode and upto 11.2 FPS in the 4 bit mode corresponding to 760.1 GOp/s and 1,218 GOp/s effective throughput, respectively. Finally, we compare the novel design to classical multiplier-based approaches in terms of FPGA utilization and power consumption. Our novel multiply-accumulate engines designed for the optimized number representation uses 9 % less logical elements while allowing double throughput compared to a classical implementation. Moreover, a measurement shows 25 % reduction of power consumption at same throughput. Therefore, our flexible design offers a solution to the trade-off between energy efficiency, accuracy, and high throughput.","PeriodicalId":217233,"journal":{"name":"2019 22nd Euromicro Conference on Digital System Design (DSD)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Bit-Shift-Based Accelerator for CNNs with Selectable Accuracy and Throughput\",\"authors\":\"Sebastian Vogel, R. Raghunath, A. Guntoro, Kristof Van Laerhoven, G. Ascheid\",\"doi\":\"10.1109/DSD.2019.00106\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hardware accelerators for compute intensive algorithms such as convolutional neural networks benefit from number representations with reduced precision. In this paper, we evaluate and extend a number representation based on power-of-two quantization enabling bit-shift-based processing of multiplications. We found that weights of a neural network can either be represented by a single 4 bit power-of-two value or with two 4 bit values depending on accuracy requirements. We evaluate the classification accuracy of VGG-16 and ResNet50 on the ImageNet dataset with weights represented in our novel number format. To include a more complex task, we additionally evaluate the format on two networks for semantic segmentation. 
In addition, we design a novel processing element based on bit-shifts which is configurable in terms of throughput (4 bit mode) and accuracy (8 bit mode). We evaluate this processing element in an FPGA implementation of a dedicated accelerator for neural networks incorporating a 32-by-64 processing array running at 250 MHz with 1 TOp/s peak throughput in 8 bit mode. The accelerator is capable of processing regular convolutional layers and dilated convolutions in combination with pooling and upsampling. For a semantic segmentation network with 108.5 GOp/frame, our FPGA implementation achieves a throughput of 7.0 FPS in the 8 bit accurate mode and upto 11.2 FPS in the 4 bit mode corresponding to 760.1 GOp/s and 1,218 GOp/s effective throughput, respectively. Finally, we compare the novel design to classical multiplier-based approaches in terms of FPGA utilization and power consumption. Our novel multiply-accumulate engines designed for the optimized number representation uses 9 % less logical elements while allowing double throughput compared to a classical implementation. Moreover, a measurement shows 25 % reduction of power consumption at same throughput. Therefore, our flexible design offers a solution to the trade-off between energy efficiency, accuracy, and high throughput.\",\"PeriodicalId\":217233,\"journal\":{\"name\":\"2019 22nd Euromicro Conference on Digital System Design (DSD)\",\"volume\":\"38 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 22nd Euromicro Conference on Digital System Design (DSD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSD.2019.00106\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 22nd Euromicro Conference on Digital System Design (DSD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSD.2019.00106","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hardware accelerators for compute-intensive algorithms such as convolutional neural networks benefit from number representations with reduced precision. In this paper, we evaluate and extend a number representation based on power-of-two quantization, which enables bit-shift-based processing of multiplications. We find that the weights of a neural network can be represented either by a single 4-bit power-of-two value or by two 4-bit values, depending on accuracy requirements. We evaluate the classification accuracy of VGG-16 and ResNet50 on the ImageNet dataset with weights represented in our novel number format. To include a more complex task, we additionally evaluate the format on two networks for semantic segmentation. In addition, we design a novel processing element based on bit-shifts that is configurable in terms of throughput (4-bit mode) and accuracy (8-bit mode). We evaluate this processing element in an FPGA implementation of a dedicated accelerator for neural networks incorporating a 32-by-64 processing array running at 250 MHz with 1 TOp/s peak throughput in 8-bit mode. The accelerator is capable of processing regular convolutional layers and dilated convolutions in combination with pooling and upsampling. For a semantic segmentation network requiring 108.5 GOp/frame, our FPGA implementation achieves a throughput of 7.0 FPS in the accurate 8-bit mode and up to 11.2 FPS in the 4-bit mode, corresponding to effective throughputs of 760.1 GOp/s and 1,218 GOp/s, respectively. Finally, we compare the novel design to classical multiplier-based approaches in terms of FPGA utilization and power consumption. Our novel multiply-accumulate engines, designed for the optimized number representation, use 9% fewer logic elements while allowing double the throughput compared to a classical implementation. Moreover, a measurement shows a 25% reduction in power consumption at the same throughput. Our flexible design therefore offers a solution to the trade-off between energy efficiency, accuracy, and high throughput.
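As a rough illustration (not the authors' implementation), the sketch below shows the core idea behind power-of-two quantization: a weight is approximated by a single signed power-of-two term, or by the sum of two such terms when higher accuracy is required, so that multiplication by the weight reduces to bit-shifts and additions. The function names, exponent range, and fixed-point scaling are hypothetical; the paper's exact 4-bit encoding is not reproduced here.

```python
# Minimal sketch of power-of-two weight quantization and shift-based
# multiplication, under assumed parameters (exp_range, frac_bits).
import numpy as np

def quantize_pow2(w, num_terms=1, exp_range=(-8, 0)):
    """Greedily approximate w as a sum of signed powers of two.

    Returns a list of (sign, exponent) terms. exp_range is a hypothetical
    exponent range; the paper packs sign and exponent into 4 bits, but the
    exact encoding is not reproduced here.
    """
    terms, residual = [], float(w)
    for _ in range(num_terms):
        if residual == 0.0:
            break
        sign = 1 if residual > 0 else -1
        # Nearest power-of-two exponent for the remaining residual.
        exp = int(np.clip(np.round(np.log2(abs(residual))), *exp_range))
        terms.append((sign, exp))
        residual -= sign * (2.0 ** exp)
    return terms

def shift_multiply(x, terms, frac_bits=8):
    """Multiply integer activation x by the quantized weight using shifts.

    Weights carry frac_bits fractional bits, so a term 2**exp becomes a
    shift by (frac_bits + exp); a negative amount means a right shift.
    """
    acc = 0
    for sign, exp in terms:
        shift = frac_bits + exp
        acc += sign * (x << shift if shift >= 0 else x >> -shift)
    return acc

# Example: w = 0.3 is poorly matched by one term (0.25), but two terms
# (0.25 + 0.0625 = 0.3125) come closer, mirroring the 4-bit vs. 8-bit
# accuracy trade-off described in the abstract.
terms = quantize_pow2(0.3, num_terms=2)
print(terms)                       # [(1, -2), (1, -4)]
print(shift_multiply(100, terms))  # 8000 == 100 * 0.3125 * 2**8
```

In hardware, each term costs only a shifter and an adder instead of a full multiplier, which is consistent with the reported reduction in logic elements; processing one term per weight doubles throughput at reduced accuracy, while processing both terms recovers accuracy at the lower rate.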