{"title":"FFConv:基于fpga的卷积神经网络快速卷积层加速器","authors":"Afzal Ahmad, Muhammad Adeel Pasha","doi":"10.1145/3380548","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 1539-9087/2020/03-ART15 $15.00 https://doi.org/10.1145/3380548 ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. 15:2 A. Ahmad and M. A. Pasha state-of-the-art algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation [14, 16, 21]. While the accuracy achieved by CNNs on these tasks is unparalleled, their extreme computational budgets limit wider implementation. While Graphical Processing Units (GPUs) are being used to deploy CNN architectures exploiting the algorithmic parallelism over the many-cores that they provide [6], their power consumption is high and their architectures are more generic. Owing to their reconfigurability, Field Programmable Gate Array (FPGA)-based implementations [22, 29, 41] are being explored to design parallel and pipelined network architectures that give improved performance and power efficiency compared to general-purpose processors (CPUs) and GPUs. While more custom solutions in the form of Application-Specific Integrated Circuits (ASICs) can be implemented that further improve the performance and power efficiency compared to their FPGA-based counterparts [3], ASIC-based designs are rigid, hence may only be justified at the stage of final implementation when thorough testing and prototyping has been done on a more reconfigurable FPGA-based platform. 
Significant research effort is also being put into optimizing different layers of CNNs to gain improvements in the performance metrics for a wider range of CNN architectures and hardware platforms. Fast Fourier Transforms (FFT)-based convolutions have shown significant gains in reducing the computational complexity of convolutional layers that use large kernel sizes (≥7 × 7) implemented on GPU platforms [31]. Although this reduction in computational complexity offered by FFT-based convolution is significant for large kernel sizes, modern neural network architectures, such as VGGNet [28], ResNet [16], MobileNets [17], and GoogLeNet [30], tend towards smaller kernel sizes and deeper topologies. FFT-based convolutions have actually been shown to increase the overall computation time of layers that use smaller kernel sizes by as much as 16× [31]. Winograd minimal filtering [34] based fast convolution algorithms (we will refer to them as “fast-conv” from here onwards) have also been proposed and have shown significant improvements for small kernel sizes, applicable to most modern networks [20]. Fast-conv algorithms work by reducing the computational complexity of expensive operations while adding transform stages that increase the number of cheaper operations involved in the convolution. Furthermore, the arithmetic cost of the additional transform stages can be amortized over layer dimensions, leading to an overall reduction in computation complexity. In this article, we present FFConv, an efficient FPGA-based fast-conv accelerator for CNNs. We explore custom bitwidth quantization schemes in fast-conv and their impact on classification accuracy of the system. We follow through with a modular hardware implementation that not only allows the system to run at high frequency but also uses the resources efficiently for a tradeoff in throughput and accuracy. 
To this objective, we explore challenges that bottleneck the performance of our design and find optimizations to curb them and give a balance between performance and accuracy. The main contributions of this work are as follows: • We model losses in classification accuracy for fast-conv based VGG16-D [28], AlexNet [19], and Shufflenet-v1 [43] for different quantization levels for feature and kernel maps. • We propose FFConv, an FPGA-based optimized pipelined fast-conv accelerator for CNNs. • We explore limitations in terms of memory-compute tradeoffs and introduce optimizations to our base design, resulting in performance improvements while costing a minute loss in classification accuracy. The rest of the article is structured as follows: In Section 2, we present a short primer on CNNs, spatial convolution, and fast-conv while also surveying the previous works. Section 3 covers a design space exploration to find appropriate parameters and control knobs of fast-conv algorithms to be used in our hardware design. We also model the losses in classification accuracy for ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. FFConv: FPGA-based Accelerator for Fast Convolution in CNNs 15:3 different quantization levels for fast-conv based VGG16-D, AlexNet, and Shufflenet-v1 architectures. In Section 4, we present stage-wise implementation of FFConv while discussing challenges and optimizations along with a qualitative discussion of the novelties of our design compared to the state-of-the-art. Section 5 contains a detailed discussion and comparison of implementation results of FFConv against the state-of-the-art implementations in terms of four metrics: accuracy, throughput, resource, and power efficiency. The article is then concluded in Section 6.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. 
Syst.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks\",\"authors\":\"Afzal Ahmad, Muhammad Adeel Pasha\",\"doi\":\"10.1145/3380548\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 1539-9087/2020/03-ART15 $15.00 https://doi.org/10.1145/3380548 ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. 15:2 A. Ahmad and M. A. Pasha state-of-the-art algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation [14, 16, 21]. While the accuracy achieved by CNNs on these tasks is unparalleled, their extreme computational budgets limit wider implementation. While Graphical Processing Units (GPUs) are being used to deploy CNN architectures exploiting the algorithmic parallelism over the many-cores that they provide [6], their power consumption is high and their architectures are more generic. Owing to their reconfigurability, Field Programmable Gate Array (FPGA)-based implementations [22, 29, 41] are being explored to design parallel and pipelined network architectures that give improved performance and power efficiency compared to general-purpose processors (CPUs) and GPUs. 
While more custom solutions in the form of Application-Specific Integrated Circuits (ASICs) can be implemented that further improve the performance and power efficiency compared to their FPGA-based counterparts [3], ASIC-based designs are rigid, hence may only be justified at the stage of final implementation when thorough testing and prototyping has been done on a more reconfigurable FPGA-based platform. Significant research effort is also being put into optimizing different layers of CNNs to gain improvements in the performance metrics for a wider range of CNN architectures and hardware platforms. Fast Fourier Transforms (FFT)-based convolutions have shown significant gains in reducing the computational complexity of convolutional layers that use large kernel sizes (≥7 × 7) implemented on GPU platforms [31]. Although this reduction in computational complexity offered by FFT-based convolution is significant for large kernel sizes, modern neural network architectures, such as VGGNet [28], ResNet [16], MobileNets [17], and GoogLeNet [30], tend towards smaller kernel sizes and deeper topologies. FFT-based convolutions have actually been shown to increase the overall computation time of layers that use smaller kernel sizes by as much as 16× [31]. Winograd minimal filtering [34] based fast convolution algorithms (we will refer to them as “fast-conv” from here onwards) have also been proposed and have shown significant improvements for small kernel sizes, applicable to most modern networks [20]. Fast-conv algorithms work by reducing the computational complexity of expensive operations while adding transform stages that increase the number of cheaper operations involved in the convolution. Furthermore, the arithmetic cost of the additional transform stages can be amortized over layer dimensions, leading to an overall reduction in computation complexity. In this article, we present FFConv, an efficient FPGA-based fast-conv accelerator for CNNs. 
We explore custom bitwidth quantization schemes in fast-conv and their impact on classification accuracy of the system. We follow through with a modular hardware implementation that not only allows the system to run at high frequency but also uses the resources efficiently for a tradeoff in throughput and accuracy. To this objective, we explore challenges that bottleneck the performance of our design and find optimizations to curb them and give a balance between performance and accuracy. The main contributions of this work are as follows: • We model losses in classification accuracy for fast-conv based VGG16-D [28], AlexNet [19], and Shufflenet-v1 [43] for different quantization levels for feature and kernel maps. • We propose FFConv, an FPGA-based optimized pipelined fast-conv accelerator for CNNs. • We explore limitations in terms of memory-compute tradeoffs and introduce optimizations to our base design, resulting in performance improvements while costing a minute loss in classification accuracy. The rest of the article is structured as follows: In Section 2, we present a short primer on CNNs, spatial convolution, and fast-conv while also surveying the previous works. Section 3 covers a design space exploration to find appropriate parameters and control knobs of fast-conv algorithms to be used in our hardware design. We also model the losses in classification accuracy for ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. FFConv: FPGA-based Accelerator for Fast Convolution in CNNs 15:3 different quantization levels for fast-conv based VGG16-D, AlexNet, and Shufflenet-v1 architectures. In Section 4, we present stage-wise implementation of FFConv while discussing challenges and optimizations along with a qualitative discussion of the novelties of our design compared to the state-of-the-art. 
Section 5 contains a detailed discussion and comparison of implementation results of FFConv against the state-of-the-art implementations in terms of four metrics: accuracy, throughput, resource, and power efficiency. The article is then concluded in Section 6.\",\"PeriodicalId\":183677,\"journal\":{\"name\":\"ACM Trans. Embed. Comput. Syst.\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-03-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Trans. Embed. Comput. Syst.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3380548\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Trans. Embed. Comput. Syst.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3380548","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks
© 2020 Association for Computing Machinery. 1539-9087/2020/03-ART15 $15.00. https://doi.org/10.1145/3380548. ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020.

Convolutional Neural Networks (CNNs) are the state-of-the-art algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation [14, 16, 21]. While the accuracy achieved by CNNs on these tasks is unparalleled, their extreme computational budgets limit wider deployment. Graphics Processing Units (GPUs) are widely used to deploy CNN architectures, exploiting algorithmic parallelism over the many cores they provide [6], but their power consumption is high and their architectures are generic. Owing to their reconfigurability, Field Programmable Gate Array (FPGA)-based implementations [22, 29, 41] are being explored to design parallel, pipelined network architectures that improve performance and power efficiency compared to general-purpose processors (CPUs) and GPUs. Although more custom solutions in the form of Application-Specific Integrated Circuits (ASICs) can further improve performance and power efficiency over their FPGA-based counterparts [3], ASIC-based designs are rigid and hence may only be justified at the final implementation stage, after thorough testing and prototyping on a more reconfigurable FPGA-based platform. Significant research effort is also being put into optimizing individual layers of CNNs to improve performance metrics across a wider range of CNN architectures and hardware platforms.
Fast Fourier Transform (FFT)-based convolutions have shown significant gains in reducing the computational complexity of convolutional layers with large kernel sizes (≥7 × 7) on GPU platforms [31]. Although this reduction in computational complexity is significant for large kernels, modern neural network architectures such as VGGNet [28], ResNet [16], MobileNets [17], and GoogLeNet [30] tend towards smaller kernel sizes and deeper topologies, and FFT-based convolutions have in fact been shown to increase the overall computation time of layers with smaller kernels by as much as 16× [31]. Fast convolution algorithms based on Winograd minimal filtering [34] (referred to as "fast-conv" from here onwards) have also been proposed and show significant improvements for the small kernel sizes used in most modern networks [20]. Fast-conv algorithms reduce the number of expensive operations (multiplications) while adding transform stages that increase the number of cheaper operations (additions) involved in the convolution. Furthermore, the arithmetic cost of the additional transform stages can be amortized over the layer dimensions, leading to an overall reduction in computational complexity. In this article, we present FFConv, an efficient FPGA-based fast-conv accelerator for CNNs. We explore custom bitwidth quantization schemes for fast-conv and their impact on the classification accuracy of the system. We follow through with a modular hardware implementation that not only allows the system to run at high frequency but also uses resources efficiently, trading off throughput against accuracy. To this end, we explore the challenges that bottleneck the performance of our design and introduce optimizations that curb them and balance performance against accuracy.
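To make the multiplication savings concrete, the following is a minimal sketch of the 1-D Winograd minimal filtering algorithm F(2,3) that underlies such fast-conv schemes: it computes two outputs of a 3-tap filter using 4 multiplications instead of the 6 required by direct convolution. The function names are illustrative, not taken from FFConv.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter over a 4-sample
    input tile, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (precomputable once per kernel, so its cost
    # amortizes over the whole feature map).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # Elementwise products: the only 4 multiplications.
    m0 = (d0 - d2) * G0
    m1 = (d1 + d2) * G1
    m2 = (d2 - d1) * G2
    m3 = (d1 - d3) * G3
    # Output transform: additions only.
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct_conv(d, g):
    """Reference: direct sliding-window filtering, 6 multiplications."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]
```

The 2-D algorithms used for 3 × 3 kernels (e.g., F(2×2, 3×3)) nest this construction along both axes, replacing 36 multiplications per output tile with 16.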
The main contributions of this work are as follows:
• We model losses in classification accuracy for fast-conv based VGG16-D [28], AlexNet [19], and Shufflenet-v1 [43] under different quantization levels for feature and kernel maps.
• We propose FFConv, an FPGA-based optimized pipelined fast-conv accelerator for CNNs.
• We explore limitations in terms of memory-compute tradeoffs and introduce optimizations to our base design, yielding performance improvements at the cost of a minute loss in classification accuracy.
The rest of the article is structured as follows: In Section 2, we present a short primer on CNNs, spatial convolution, and fast-conv, and survey previous work. Section 3 covers a design-space exploration to find appropriate parameters and control knobs of fast-conv algorithms for our hardware design; there we also model the losses in classification accuracy under different quantization levels for fast-conv based VGG16-D, AlexNet, and Shufflenet-v1 architectures. In Section 4, we present the stage-wise implementation of FFConv, discussing challenges and optimizations along with a qualitative discussion of the novelties of our design compared to the state-of-the-art. Section 5 contains a detailed discussion and comparison of the implementation results of FFConv against state-of-the-art implementations in terms of four metrics: accuracy, throughput, resource efficiency, and power efficiency. The article is concluded in Section 6.
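To illustrate the kind of custom bitwidth quantization such an accuracy study sweeps over, here is a hypothetical sketch of a uniform signed fixed-point quantizer; it is not the authors' actual scheme, and the parameter names are illustrative.

```python
def quantize(x, bits, frac_bits):
    """Quantize x to signed fixed-point with `bits` total bits and
    `frac_bits` fractional bits (round-to-nearest, saturating)."""
    scale = 1 << frac_bits
    lo = -(1 << (bits - 1))        # most negative representable code
    hi = (1 << (bits - 1)) - 1     # most positive representable code
    code = max(lo, min(hi, round(x * scale)))
    return code / scale            # dequantized value
```

Sweeping `bits` downward for the feature and kernel maps, and re-measuring top-1/top-5 accuracy at each setting, is the type of experiment that trades hardware resources against classification accuracy.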