Memory-Efficient Architecture for Accelerating Generative Networks on FPGA
Shuanglong Liu, Chenglong Zeng, Hongxiang Fan, Ho-Cheung Ng, Jiuxi Meng, Zhiqiang Que, Xinyu Niu, W. Luk
2018 International Conference on Field-Programmable Technology (FPT), published 2018-12-01
DOI: 10.1109/FPT.2018.00016 (https://doi.org/10.1109/FPT.2018.00016)
Citations: 13
Abstract
Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented by a system of two neural networks: a generative network (generator) and a discriminative network (discriminator). These two networks compete with each other to perform better at their respective tasks. The generator is typically a deconvolutional neural network and the discriminator is a convolutional neural network (CNN). Deconvolution is a fundamentally different mathematical operation from convolution. While FPGA-based CNN accelerators have been widely studied in prior work, the acceleration of deconvolutional networks on FPGAs has rarely been explored. This paper proposes a novel parametrized deconvolutional architecture based on an FPGA-friendly method, in contrast to the transposed convolution implementation used on CPUs and GPUs. Hardware design templates that map this architecture to FPGAs are provided with configurable deconvolutional layer parameters. Furthermore, a memory-efficient architecture with a new tiling method is proposed to accelerate the generator of GANs by storing all intermediate data in on-chip memories and significantly reducing off-chip data transfers. The performance of the proposed accelerator is evaluated using a variety of GANs on a Xilinx Zynq 706 board, showing a 2.3x speedup and an 8.2x reduction in off-chip memory accesses compared with an optimized vanilla FPGA design. Compared to the respective implementations on CPUs and GPUs, the achieved improvements are in the range of 30x-92x in speed over an Intel 8-core i7-950 CPU, and 8x-108x in terms of Performance-per-Watt over an NVIDIA Titan X GPU.
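The abstract does not describe the FPGA-friendly method in detail, so the following is only a minimal sketch of the general deconvolution operation, not the paper's architecture. It computes a 2D deconvolution in two commonly used, equivalent ways: an input-centric form in which each input pixel scatters a scaled copy of the kernel into the output, and a zero-insertion form that dilates the input with zeros and then runs an ordinary convolution with a flipped kernel. The function names, the square-kernel restriction, and the absence of padding are assumptions made for illustration.

```python
def deconv_output_size(in_size, kernel, stride, padding=0):
    """Output length of a deconvolution along one dimension."""
    return (in_size - 1) * stride - 2 * padding + kernel


def deconv2d_scatter(x, w, stride):
    """Input-centric form: every input pixel adds a scaled kernel copy to the output."""
    h, wd, k = len(x), len(x[0]), len(w)
    oh = deconv_output_size(h, k, stride)
    ow = deconv_output_size(wd, k, stride)
    y = [[0.0] * ow for _ in range(oh)]
    for i in range(h):
        for j in range(wd):
            for a in range(k):
                for b in range(k):
                    y[i * stride + a][j * stride + b] += x[i][j] * w[a][b]
    return y


def deconv2d_zero_insert(x, w, stride):
    """Zero-insertion form: dilate the input, pad it, convolve with the flipped kernel."""
    h, wd, k = len(x), len(x[0]), len(w)
    # Place inputs `stride` apart and surround them with a border of k - 1 zeros.
    ph = (h - 1) * stride + 1 + 2 * (k - 1)
    pw = (wd - 1) * stride + 1 + 2 * (k - 1)
    z = [[0.0] * pw for _ in range(ph)]
    for i in range(h):
        for j in range(wd):
            z[k - 1 + i * stride][k - 1 + j * stride] = x[i][j]
    oh, ow = ph - k + 1, pw - k + 1
    y = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    # Most z entries are inserted zeros, so many products are wasted.
                    acc += z[r + a][c + b] * w[k - 1 - a][k - 1 - b]
            y[r][c] = acc
    return y


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    out = deconv2d_scatter(x, w, stride=2)
    print(len(out), len(out[0]))
    print(out == deconv2d_zero_insert(x, w, stride=2))
```

In the zero-insertion form most multiplications involve an inserted zero, whereas the scatter form touches only real input values but accumulates into overlapping output positions. This difference is one general reason why deconvolution may call for a different hardware mapping than convolution.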