FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks

Afzal Ahmad, Muhammad Adeel Pasha
{"title":"FFConv:基于fpga的卷积神经网络快速卷积层加速器","authors":"Afzal Ahmad, Muhammad Adeel Pasha","doi":"10.1145/3380548","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 1539-9087/2020/03-ART15 $15.00 https://doi.org/10.1145/3380548 ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. 15:2 A. Ahmad and M. A. Pasha state-of-the-art algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation [14, 16, 21]. While the accuracy achieved by CNNs on these tasks is unparalleled, their extreme computational budgets limit wider implementation. While Graphical Processing Units (GPUs) are being used to deploy CNN architectures exploiting the algorithmic parallelism over the many-cores that they provide [6], their power consumption is high and their architectures are more generic. Owing to their reconfigurability, Field Programmable Gate Array (FPGA)-based implementations [22, 29, 41] are being explored to design parallel and pipelined network architectures that give improved performance and power efficiency compared to general-purpose processors (CPUs) and GPUs. While more custom solutions in the form of Application-Specific Integrated Circuits (ASICs) can be implemented that further improve the performance and power efficiency compared to their FPGA-based counterparts [3], ASIC-based designs are rigid, hence may only be justified at the stage of final implementation when thorough testing and prototyping has been done on a more reconfigurable FPGA-based platform. Significant research effort is also being put into optimizing different layers of CNNs to gain improvements in the performance metrics for a wider range of CNN architectures and hardware platforms. Fast Fourier Transforms (FFT)-based convolutions have shown significant gains in reducing the computational complexity of convolutional layers that use large kernel sizes (≥7 × 7) implemented on GPU platforms [31]. Although this reduction in computational complexity offered by FFT-based convolution is significant for large kernel sizes, modern neural network architectures, such as VGGNet [28], ResNet [16], MobileNets [17], and GoogLeNet [30], tend towards smaller kernel sizes and deeper topologies. FFT-based convolutions have actually been shown to increase the overall computation time of layers that use smaller kernel sizes by as much as 16× [31]. Winograd minimal filtering [34] based fast convolution algorithms (we will refer to them as “fast-conv” from here onwards) have also been proposed and have shown significant improvements for small kernel sizes, applicable to most modern networks [20]. Fast-conv algorithms work by reducing the computational complexity of expensive operations while adding transform stages that increase the number of cheaper operations involved in the convolution. Furthermore, the arithmetic cost of the additional transform stages can be amortized over layer dimensions, leading to an overall reduction in computation complexity. In this article, we present FFConv, an efficient FPGA-based fast-conv accelerator for CNNs. We explore custom bitwidth quantization schemes in fast-conv and their impact on classification accuracy of the system. 
We follow through with a modular hardware implementation that not only allows the system to run at high frequency but also uses the resources efficiently for a tradeoff in throughput and accuracy. To this objective, we explore challenges that bottleneck the performance of our design and find optimizations to curb them and give a balance between performance and accuracy. The main contributions of this work are as follows: • We model losses in classification accuracy for fast-conv based VGG16-D [28], AlexNet [19], and Shufflenet-v1 [43] for different quantization levels for feature and kernel maps. • We propose FFConv, an FPGA-based optimized pipelined fast-conv accelerator for CNNs. • We explore limitations in terms of memory-compute tradeoffs and introduce optimizations to our base design, resulting in performance improvements while costing a minute loss in classification accuracy. The rest of the article is structured as follows: In Section 2, we present a short primer on CNNs, spatial convolution, and fast-conv while also surveying the previous works. Section 3 covers a design space exploration to find appropriate parameters and control knobs of fast-conv algorithms to be used in our hardware design. We also model the losses in classification accuracy for ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. FFConv: FPGA-based Accelerator for Fast Convolution in CNNs 15:3 different quantization levels for fast-conv based VGG16-D, AlexNet, and Shufflenet-v1 architectures. In Section 4, we present stage-wise implementation of FFConv while discussing challenges and optimizations along with a qualitative discussion of the novelties of our design compared to the state-of-the-art. Section 5 contains a detailed discussion and comparison of implementation results of FFConv against the state-of-the-art implementations in terms of four metrics: accuracy, throughput, resource, and power efficiency. The article is then concluded in Section 6.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks\",\"authors\":\"Afzal Ahmad, Muhammad Adeel Pasha\",\"doi\":\"10.1145/3380548\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 1539-9087/2020/03-ART15 $15.00 https://doi.org/10.1145/3380548 ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. 15:2 A. Ahmad and M. A. Pasha state-of-the-art algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation [14, 16, 21]. While the accuracy achieved by CNNs on these tasks is unparalleled, their extreme computational budgets limit wider implementation. While Graphical Processing Units (GPUs) are being used to deploy CNN architectures exploiting the algorithmic parallelism over the many-cores that they provide [6], their power consumption is high and their architectures are more generic. 
Owing to their reconfigurability, Field Programmable Gate Array (FPGA)-based implementations [22, 29, 41] are being explored to design parallel and pipelined network architectures that give improved performance and power efficiency compared to general-purpose processors (CPUs) and GPUs. While more custom solutions in the form of Application-Specific Integrated Circuits (ASICs) can be implemented that further improve the performance and power efficiency compared to their FPGA-based counterparts [3], ASIC-based designs are rigid, hence may only be justified at the stage of final implementation when thorough testing and prototyping has been done on a more reconfigurable FPGA-based platform. Significant research effort is also being put into optimizing different layers of CNNs to gain improvements in the performance metrics for a wider range of CNN architectures and hardware platforms. Fast Fourier Transforms (FFT)-based convolutions have shown significant gains in reducing the computational complexity of convolutional layers that use large kernel sizes (≥7 × 7) implemented on GPU platforms [31]. Although this reduction in computational complexity offered by FFT-based convolution is significant for large kernel sizes, modern neural network architectures, such as VGGNet [28], ResNet [16], MobileNets [17], and GoogLeNet [30], tend towards smaller kernel sizes and deeper topologies. FFT-based convolutions have actually been shown to increase the overall computation time of layers that use smaller kernel sizes by as much as 16× [31]. Winograd minimal filtering [34] based fast convolution algorithms (we will refer to them as “fast-conv” from here onwards) have also been proposed and have shown significant improvements for small kernel sizes, applicable to most modern networks [20]. Fast-conv algorithms work by reducing the computational complexity of expensive operations while adding transform stages that increase the number of cheaper operations involved in the convolution. Furthermore, the arithmetic cost of the additional transform stages can be amortized over layer dimensions, leading to an overall reduction in computation complexity. In this article, we present FFConv, an efficient FPGA-based fast-conv accelerator for CNNs. We explore custom bitwidth quantization schemes in fast-conv and their impact on classification accuracy of the system. We follow through with a modular hardware implementation that not only allows the system to run at high frequency but also uses the resources efficiently for a tradeoff in throughput and accuracy. To this objective, we explore challenges that bottleneck the performance of our design and find optimizations to curb them and give a balance between performance and accuracy. The main contributions of this work are as follows: • We model losses in classification accuracy for fast-conv based VGG16-D [28], AlexNet [19], and Shufflenet-v1 [43] for different quantization levels for feature and kernel maps. • We propose FFConv, an FPGA-based optimized pipelined fast-conv accelerator for CNNs. • We explore limitations in terms of memory-compute tradeoffs and introduce optimizations to our base design, resulting in performance improvements while costing a minute loss in classification accuracy. The rest of the article is structured as follows: In Section 2, we present a short primer on CNNs, spatial convolution, and fast-conv while also surveying the previous works. 
Section 3 covers a design space exploration to find appropriate parameters and control knobs of fast-conv algorithms to be used in our hardware design. We also model the losses in classification accuracy for ACM Transactions on Embedded Computing Systems, Vol. 19, No. 2, Article 15. Publication date: March 2020. FFConv: FPGA-based Accelerator for Fast Convolution in CNNs 15:3 different quantization levels for fast-conv based VGG16-D, AlexNet, and Shufflenet-v1 architectures. In Section 4, we present stage-wise implementation of FFConv while discussing challenges and optimizations along with a qualitative discussion of the novelties of our design compared to the state-of-the-art. Section 5 contains a detailed discussion and comparison of implementation results of FFConv against the state-of-the-art implementations in terms of four metrics: accuracy, throughput, resource, and power efficiency. The article is then concluded in Section 6.\",\"PeriodicalId\":183677,\"journal\":{\"name\":\"ACM Trans. Embed. Comput. Syst.\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-03-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Trans. Embed. Comput. Syst.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3380548\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Trans. Embed. Comput. Syst.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3380548","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Convolutional Neural Networks (CNNs) provide state-of-the-art algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation [14, 16, 21]. While the accuracy achieved by CNNs on these tasks is unparalleled, their extreme computational budgets limit wider deployment. Graphics Processing Units (GPUs) are used to deploy CNN architectures by exploiting algorithmic parallelism over the many cores they provide [6], but their power consumption is high and their architectures are generic.

Owing to their reconfigurability, Field-Programmable Gate Array (FPGA)-based implementations [22, 29, 41] are being explored to design parallel and pipelined network architectures that give improved performance and power efficiency compared to general-purpose processors (CPUs) and GPUs. While more customized solutions in the form of Application-Specific Integrated Circuits (ASICs) can further improve performance and power efficiency over their FPGA-based counterparts [3], ASIC-based designs are rigid and hence may only be justified at the stage of final implementation, after thorough testing and prototyping have been done on a more reconfigurable FPGA-based platform.

Significant research effort is also being put into optimizing individual layers of CNNs to improve performance metrics across a wider range of CNN architectures and hardware platforms. Fast Fourier Transform (FFT)-based convolutions have shown significant gains in reducing the computational complexity of convolutional layers with large kernel sizes (≥7 × 7) implemented on GPU platforms [31]. Although this reduction in computational complexity is significant for large kernels, modern neural network architectures such as VGGNet [28], ResNet [16], MobileNets [17], and GoogLeNet [30] tend towards smaller kernel sizes and deeper topologies, and FFT-based convolutions have in fact been shown to increase the overall computation time of layers with smaller kernels by as much as 16× [31]. Fast convolution algorithms based on Winograd minimal filtering [34] (referred to as "fast-conv" from here onwards) have also been proposed and have shown significant improvements for the small kernel sizes used in most modern networks [20]. Fast-conv algorithms reduce the number of expensive operations (multiplications) in a convolution by adding transform stages composed of cheaper operations (additions); the arithmetic cost of these additional transform stages can be amortized over the layer dimensions, leading to an overall reduction in computational complexity.

In this article, we present FFConv, an efficient FPGA-based fast-conv accelerator for CNNs. We explore custom bitwidth quantization schemes in fast-conv and their impact on the classification accuracy of the system. We follow through with a modular hardware implementation that not only allows the system to run at high frequency but also uses resources efficiently, trading off throughput against accuracy.
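For readers unfamiliar with the technique, the sketch below illustrates the minimal-filtering idea on the smallest 1-D instance, F(2,3), which produces two outputs of a 3-tap correlation with four multiplications instead of six; the transform matrices are the standard ones from the minimal-filtering literature [20, 34]. This is an illustrative NumPy sketch of the underlying algorithm, not the authors' FPGA implementation:

```python
import numpy as np

# Winograd minimal-filtering transforms for F(2, 3): two outputs of a
# 1-D correlation with a 3-tap kernel using 4 multiplications instead of 6.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Correlate a 4-element input tile d with a 3-tap kernel g,
    yielding 2 outputs via the Winograd F(2,3) algorithm."""
    U = G @ g    # kernel transform (can be precomputed once per layer)
    V = BT @ d   # input transform: additions/subtractions only
    M = U * V    # 4 element-wise multiplications -- the expensive part
    return AT @ M  # output transform: additions/subtractions only

# Sanity check against direct (spatial) correlation.
rng = np.random.default_rng(0)
d = rng.standard_normal(4)
g = rng.standard_normal(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The 2-D variant used for 3 × 3 convolutional layers nests the same transforms, Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)]A, and it is at the element-wise multiplication stage that most of the arithmetic savings accrue.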
To this end, we explore the challenges that bottleneck the performance of our design and find optimizations that curb them, striking a balance between performance and accuracy. The main contributions of this work are as follows:

• We model the losses in classification accuracy of fast-conv based VGG16-D [28], AlexNet [19], and Shufflenet-v1 [43] for different quantization levels of the feature and kernel maps (a sketch of such a quantizer follows this outline).
• We propose FFConv, an FPGA-based optimized pipelined fast-conv accelerator for CNNs.
• We explore limitations in terms of memory-compute tradeoffs and introduce optimizations to our base design, resulting in performance improvements at the cost of a minute loss in classification accuracy.

The rest of the article is structured as follows: In Section 2, we present a short primer on CNNs, spatial convolution, and fast-conv, while also surveying previous work. Section 3 covers a design space exploration to find appropriate parameters and control knobs of fast-conv algorithms for use in our hardware design; we also model the losses in classification accuracy under different quantization levels for fast-conv based VGG16-D, AlexNet, and Shufflenet-v1 architectures. In Section 4, we present the stage-wise implementation of FFConv, discussing challenges and optimizations along with a qualitative discussion of the novelties of our design compared to the state-of-the-art. Section 5 contains a detailed discussion and comparison of the implementation results of FFConv against state-of-the-art implementations in terms of four metrics: accuracy, throughput, resource efficiency, and power efficiency. The article is concluded in Section 6.
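To make the quantization contribution concrete: the study sweeps bitwidths for the feature and kernel maps and measures the resulting accuracy loss. The sketch below shows a generic uniform fixed-point quantizer of the kind such a sweep might use; the function name and the bit-split parameters are hypothetical illustrations, not the authors' actual scheme.

```python
import numpy as np

def quantize_fixed_point(x, total_bits=8, frac_bits=4):
    """Uniformly quantize x to a signed fixed-point format with the given
    total and fractional bit widths, saturating on overflow.
    Returns the de-quantized value that the quantized datapath would see."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))      # most negative representable code
    qmax = 2 ** (total_bits - 1) - 1     # most positive representable code
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale

# Example: measure the error such quantization introduces on a kernel map.
rng = np.random.default_rng(1)
w = rng.standard_normal((3, 3)) * 0.5
for bits in (16, 8, 6, 4):
    wq = quantize_fixed_point(w, total_bits=bits, frac_bits=bits // 2)
    print(bits, "bits -> mean abs error:", np.abs(w - wq).mean())
```

Sweeping such a quantizer over the feature and kernel maps of each layer, then re-evaluating top-1/top-5 accuracy, is the general recipe for the kind of accuracy-loss modeling described above.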