Dawen Xu, Kaijie Tu, Y. Wang, Cheng Liu, Bingsheng He, Huawei Li
{"title":"FCN-Engine: Accelerating Deconvolutional Layers in Classic CNN Processors","authors":"Dawen Xu, Kaijie Tu, Y. Wang, Cheng Liu, Bingsheng He, Huawei Li","doi":"10.1145/3240765.3240810","DOIUrl":null,"url":null,"abstract":"Unlike standard Convolutional Neural Networks (CNNs) with fully-connected layers, Fully Convolutional Neural Networks (FCN) are prevalent in computer vision applications such as object detection, semantic/image segmentation, and the most popular generative tasks based on Generative Adversarial Networks (GAN). In an FCN, traditional convolutional layers and deconvolutional layers contribute to the majority of the computation complexity. However, prior deep learning accelerator designs mostly focus on CNN optimization. They either use independent compute-resources to handle deconvolution or convert deconvolutional layers (Deconv) into general convolution operations, which arouses considerable overhead. To address this problem, we propose a unified fully convolutional accelerator aiming to handle both the deconvolutional and convolutional layers with a single processing element (PE) array. We re-optimize the conventional CNN accelerator architecture of regular 2D processing elements array, to enable it more efficiently support the data flow of deconvolutional layer inference. By exploiting the locality in deconvolutional filters, this architecture reduces the consumption of on-chip memory communication from 24.79 GB to 6.56 GB and improves the power efficiency significantly. Compared to prior baseline deconvolution acceleration scheme, the proposed accelerator achieves 1.3X–44.9X speedup and reduces the energy consumption by 14.60/0-97.6% on a set of representative benchmark applications. Meanwhile, it keeps similar CNN inference performance to that of an optimized CNN-only accelerator with negligible power consumption and chip area overhead.","PeriodicalId":413037,"journal":{"name":"2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3240765.3240810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 25
Abstract
Unlike standard Convolutional Neural Networks (CNNs) with fully-connected layers, Fully Convolutional Neural Networks (FCNs) are prevalent in computer vision applications such as object detection, semantic/image segmentation, and popular generative tasks based on Generative Adversarial Networks (GANs). In an FCN, traditional convolutional layers and deconvolutional layers account for the majority of the computational complexity. However, prior deep learning accelerator designs mostly focus on CNN optimization: they either use independent compute resources to handle deconvolution or convert deconvolutional (Deconv) layers into general convolution operations, which incurs considerable overhead. To address this problem, we propose a unified fully convolutional accelerator that handles both deconvolutional and convolutional layers with a single processing element (PE) array. We re-optimize the conventional CNN accelerator architecture, built on a regular 2D processing element array, so that it supports the data flow of deconvolutional layer inference more efficiently. By exploiting the locality in deconvolutional filters, this architecture reduces on-chip memory traffic from 24.79 GB to 6.56 GB and significantly improves power efficiency. Compared to a prior baseline deconvolution acceleration scheme, the proposed accelerator achieves a 1.3X–44.9X speedup and reduces energy consumption by 14.6%–97.6% on a set of representative benchmark applications. Meanwhile, it matches the CNN inference performance of an optimized CNN-only accelerator with negligible power consumption and chip area overhead.
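The conversion scheme the abstract critiques can be made concrete: a deconvolutional (transposed-convolution) layer can be emulated by inserting zeros between input pixels and then running an ordinary convolution, so most multiply-accumulates land on inserted zeros. The sketch below is a minimal NumPy illustration of that equivalence, not the paper's design; the function name, shapes, and parameters are assumptions chosen for the demo.

```python
# Illustrative sketch (not the paper's implementation): deconvolution
# realized as zero-insertion followed by a standard convolution -- the
# conversion the abstract says incurs considerable overhead.
import numpy as np

def deconv_via_zero_insertion(x, kernel, stride=2):
    """2D transposed convolution emulated with a plain convolution.

    x      : (H, W) input feature map
    kernel : (K, K) deconvolution filter
    stride : upsampling factor of the deconvolutional layer
    """
    H, W = x.shape
    K = kernel.shape[0]
    # 1) Insert (stride - 1) zeros between neighboring input pixels.
    up = np.zeros((H * stride - (stride - 1), W * stride - (stride - 1)))
    up[::stride, ::stride] = x
    # 2) Zero-pad so the convolution covers the borders.
    up = np.pad(up, K - 1)
    # 3) Ordinary convolution over the zero-inflated map. Most of the
    #    multiply-accumulates hit inserted zeros -- the wasted work a
    #    unified Conv/Deconv PE array is meant to avoid.
    out_h = up.shape[0] - K + 1
    out_w = up.shape[1] - K + 1
    out = np.zeros((out_h, out_w))
    flipped = kernel[::-1, ::-1]  # true convolution flips the filter
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(up[i:i + K, j:j + K] * flipped)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)
k = np.ones((3, 3))
# Output size (H - 1) * stride + K = (3 - 1) * 2 + 3 = 7:
print(deconv_via_zero_insertion(x, k, stride=2).shape)  # (7, 7)
```

For a 3x3 input upsampled by stride 2 with a 3x3 filter, the zero-inserted map is 5x5, of which only 9 of 25 entries are nonzero; a unified PE array that skips this inflation avoids those redundant operations.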