Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA

ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2018-12-20 DOI:10.1145/3242900

Shuanglong Liu, Hongxiang Fan, Xinyu Niu, Ho-Cheung Ng, Yang Chu, W. Luk

{"title":"Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA","authors":"Shuanglong Liu, Hongxiang Fan, Xinyu Niu, Ho-Cheung Ng, Yang Chu, W. Luk","doi":"10.1145/3242900","DOIUrl":null,"url":null,"abstract":"Convolutional Neural Networks-- (CNNs) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvement. In recent years, deconvolution layers are widely used as key components in the state-of-the-art CNNs for end-to-end training and models to support tasks such as image segmentation and super resolution. However, the deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. Particularly, there has been little research on the efficient implementations of deconvolution algorithms on FPGA platforms that have been widely used to accelerate CNN algorithms by practitioners and researchers due to their high performance and power efficiency. In this work, we propose and develop deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. Besides, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator as well as for other optimization techniques. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space to achieve optimal processing speed of the system and improve power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate the low-latency hardware design for any given CNN model on the target device. Finally, we implement our designs on Xilinx Zynq ZC706 board and the deconvolution accelerator achieves a performance of 90.1 giga operations per second (GOPS) under 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs. A real-time application of scene segmentation on Cityscapes Dataset is used to evaluate our CNN accelerator on Zynq ZC706 board, and the system achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization and supports up to 17 frames per second for 512 × 512 image inputs with a power consumption of only 9.6W.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"258 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3242900","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 42

Abstract

Convolutional Neural Networks-- (CNNs) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvement. In recent years, deconvolution layers are widely used as key components in the state-of-the-art CNNs for end-to-end training and models to support tasks such as image segmentation and super resolution. However, the deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. Particularly, there has been little research on the efficient implementations of deconvolution algorithms on FPGA platforms that have been widely used to accelerate CNN algorithms by practitioners and researchers due to their high performance and power efficiency. In this work, we propose and develop deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. Besides, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator as well as for other optimization techniques. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space to achieve optimal processing speed of the system and improve power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate the low-latency hardware design for any given CNN model on the target device. Finally, we implement our designs on Xilinx Zynq ZC706 board and the deconvolution accelerator achieves a performance of 90.1 giga operations per second (GOPS) under 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs. A real-time application of scene segmentation on Cityscapes Dataset is used to evaluate our CNN accelerator on Zynq ZC706 board, and the system achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization and supports up to 17 frames per second for 512 × 512 image inputs with a power consumption of only 9.6W.

查看原文本刊更多论文

基于FPGA的深度自定义卷积和反卷积架构优化cnn分割

基于卷积神经网络(cnn)的算法已经成功地解决了图像识别问题，显示出非常大的准确性提高。近年来，反卷积层被广泛用作最先进的cnn的关键组件，用于端到端训练和模型，以支持图像分割和超分辨率等任务。然而，反卷积算法的计算量很大，这限制了它们在实时应用中的适用性。特别是，关于在FPGA平台上高效实现反卷积算法的研究很少，由于其高性能和高能效，从业者和研究人员广泛使用反卷积算法来加速CNN算法。在这项工作中，我们提出并开发了高效FPGA实现的反卷积架构。针对反卷积算法和CNN算法，提出了基于fpga的加速器。此外，对于基于fpga的CNN加速器以及其他优化技术，提出了计算模块之间的内存共享。引入基于性能模型的非线性优化模型，有效地探索设计空间，实现系统的最优处理速度，提高功率效率。此外，开发了一个硬件映射框架，可以自动生成目标设备上任意给定CNN模型的低延迟硬件设计。最后，我们在Xilinx Zynq ZC706板上实现了我们的设计，反卷积加速器在200MHz工作频率下实现了每秒90.1千兆次操作(GOPS)的性能，使用32位量化实现了0.10 GOPS/DSP的性能密度，显着优于以前在fpga上的设计。在Zynq ZC706板上对CNN加速器进行了场景分割的实时应用，系统采用16位量化实现了107 GOPS和0.12 GOPS/DSP的性能，支持高达17帧/秒的512 × 512图像输入，功耗仅为9.6W。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Reconfigurable Technology and Systems (TRETS)

自引率

0.00%

发文量