Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA

Jincheng Yu, Guangjun Ge, Yiming Hu, Xuefei Ning, Jiantao Qiu, Kaiyuan Guo, Yu Wang, Huazhong Yang
{"title":"基于FPGA的指令驱动跨层CNN快速检测加速器","authors":"Jincheng Yu, Guangjun Ge, Yiming Hu, Xuefei Ning, Jiantao Qiu, Kaiyuan Guo, Yu Wang, Huazhong Yang","doi":"10.1145/3283452","DOIUrl":null,"url":null,"abstract":"In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although there are many optimizing methods to speed up CNN-based detection algorithms, it is still difficult to deploy detection algorithms on real-time low-power systems. Field-Programmable Gate Array (FPGA) has been widely explored as a platform for accelerating CNN due to its promising performance, high energy efficiency, and flexibility. Previous works show that the energy consumption of CNN accelerators is dominated by the memory access. By fusing multiple layers in CNN, the intermediate data transfer can be reduced. However, previous accelerators with the cross-layer scheduling are designed for a particular CNN model. In addition to the memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution. In this article, to improve the flexibility of hardware, we design an instruction-driven CNN accelerator, supporting the Winograd algorithm and the cross-layer scheduling, for object detection. We modify the loop unrolling order of CNN, so that we can schedule a CNN across different layers with instructions and eliminate the intermediate data transfer. We propose a hardware architecture to support the instructions with Winograd computation units and reach the state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt the fixed-point fine-tune method, which can guarantee the accuracy of the detection algorithms. We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. The intermediate data transfer can be reduced by more than 90% on the VGG-D CNN model with the cross-layer strategy. Thus, the performance of our hardware accelerator reaches 1700GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× in energy efficiency compared with GPU and CPU, respectively. Compared with floating-point algorithms, the accuracy of the fixed-point detection algorithms only drops by less than 1%.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA\",\"authors\":\"Jincheng Yu, Guangjun Ge, Yiming Hu, Xuefei Ning, Jiantao Qiu, Kaiyuan Guo, Yu Wang, Huazhong Yang\",\"doi\":\"10.1145/3283452\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although there are many optimizing methods to speed up CNN-based detection algorithms, it is still difficult to deploy detection algorithms on real-time low-power systems. Field-Programmable Gate Array (FPGA) has been widely explored as a platform for accelerating CNN due to its promising performance, high energy efficiency, and flexibility. 
Previous works show that the energy consumption of CNN accelerators is dominated by the memory access. By fusing multiple layers in CNN, the intermediate data transfer can be reduced. However, previous accelerators with the cross-layer scheduling are designed for a particular CNN model. In addition to the memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution. In this article, to improve the flexibility of hardware, we design an instruction-driven CNN accelerator, supporting the Winograd algorithm and the cross-layer scheduling, for object detection. We modify the loop unrolling order of CNN, so that we can schedule a CNN across different layers with instructions and eliminate the intermediate data transfer. We propose a hardware architecture to support the instructions with Winograd computation units and reach the state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt the fixed-point fine-tune method, which can guarantee the accuracy of the detection algorithms. We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. The intermediate data transfer can be reduced by more than 90% on the VGG-D CNN model with the cross-layer strategy. Thus, the performance of our hardware accelerator reaches 1700GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× in energy efficiency compared with GPU and CPU, respectively. Compared with floating-point algorithms, the accuracy of the fixed-point detection algorithms only drops by less than 1%.\",\"PeriodicalId\":162787,\"journal\":{\"name\":\"ACM Transactions on Reconfigurable Technology and Systems (TRETS)\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Reconfigurable Technology and Systems (TRETS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3283452\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3283452","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 14

Abstract

In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although there are many optimization methods to speed up CNN-based detection algorithms, it is still difficult to deploy detection algorithms on real-time low-power systems. Field-Programmable Gate Arrays (FPGAs) have been widely explored as a platform for accelerating CNNs due to their promising performance, high energy efficiency, and flexibility. Previous works show that the energy consumption of CNN accelerators is dominated by memory access. By fusing multiple layers of a CNN, the intermediate data transfer can be reduced. However, previous accelerators with cross-layer scheduling are designed for a particular CNN model. In addition to memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution. In this article, to improve the flexibility of the hardware, we design an instruction-driven CNN accelerator for object detection that supports the Winograd algorithm and cross-layer scheduling. We modify the loop unrolling order of the CNN so that we can schedule a CNN across different layers with instructions and eliminate the intermediate data transfer. We propose a hardware architecture that supports these instructions with Winograd computation units and reaches state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt a fixed-point fine-tuning method, which guarantees the accuracy of the detection algorithms. We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. The intermediate data transfer is reduced by more than 90% on the VGG-D CNN model with the cross-layer strategy. Thus, the performance of our hardware accelerator reaches 1,700 GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× higher energy efficiency than a GPU and a CPU, respectively. Compared with the floating-point algorithms, the accuracy of the fixed-point detection algorithms drops by less than 1%.
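The abstract highlights two techniques the paper builds on: the Winograd algorithm for convolution and cross-layer (fused) scheduling. The sketch below is a minimal NumPy illustration of the standard Winograd F(2x2, 3x3) transform (Lavin and Gray), which produces a 2x2 output tile with 16 element-wise multiplications instead of 36. It is not the authors' fixed-point hardware implementation; the function name, data types, and random test data are illustrative assumptions.

```python
import numpy as np

# Standard transform matrices for Winograd F(2x2, 3x3).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """4x4 input tile d, 3x3 filter g -> 2x2 output tile of the convolution."""
    U = G @ g @ G.T         # transformed filter (precomputable once per filter)
    V = B_T @ d @ B_T.T     # transformed input tile
    M = U * V               # 16 element-wise multiplications instead of 36
    return A_T @ M @ A_T.T  # inverse transform back to the 2x2 output

# Sanity check against the direct 3x3 convolution (cross-correlation) on one tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_2x2_3x3(d, g), direct)
```

The cross-layer idea can be sketched in a similarly simplified, single-channel form: two consecutive 3x3 convolution layers are evaluated tile by tile, so each intermediate tile lives only in a small (on-chip-sized) buffer and the full intermediate feature map is never written back to external memory. The tile size, the use of 'valid' convolution, and the helper names are assumptions for illustration only; the accelerator in the paper expresses this schedule through instructions and handles multi-channel layers.

```python
import numpy as np

def conv3x3_valid(x, w):
    """Direct 3x3 'valid' convolution: output is 2 smaller in each dimension."""
    h, wd = x.shape
    return np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(wd - 2)]
                     for i in range(h - 2)])

def fused_two_layers(x, w1, w2, tile=8):
    """Compute conv3x3_valid(conv3x3_valid(x, w1), w2) one output tile at a time.

    For each output tile of the second layer, only the intermediate region it
    depends on (the tile plus a 1-pixel halo) is computed from the first layer,
    mimicking how a cross-layer schedule keeps intermediate data on chip
    instead of writing the whole feature map back to DRAM."""
    h_out, w_out = x.shape[0] - 4, x.shape[1] - 4
    out = np.zeros((h_out, w_out))
    for i in range(0, h_out, tile):
        for j in range(0, w_out, tile):
            th, tw = min(tile, h_out - i), min(tile, w_out - j)
            # Input patch: the output tile plus a 2-pixel halo (one per layer).
            patch = x[i:i + th + 4, j:j + tw + 4]
            # Layer 1 on the patch -> (th+2) x (tw+2) intermediate tile,
            # held only in a small on-chip-sized buffer.
            mid = conv3x3_valid(patch, w1)
            # Layer 2 consumes the intermediate tile directly.
            out[i:i + th, j:j + tw] = conv3x3_valid(mid, w2)
    return out

# The fused, tile-by-tile schedule matches the layer-by-layer reference.
rng = np.random.default_rng(1)
x = rng.standard_normal((20, 20))
w1, w2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
assert np.allclose(fused_two_layers(x, w1, w2),
                   conv3x3_valid(conv3x3_valid(x, w1), w2))
```

Even in this sketch the trade-off is visible: the 1-pixel halo of the intermediate tile is recomputed at tile borders in exchange for never materializing the full intermediate feature map off chip, which is the source of the reported reduction in intermediate data transfer.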