Scalable FPGA Accelerator for Deep Convolutional Neural Networks with Stochastic Streaming

Mohammed Alawad, Mingjie Lin
Journal: IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 888-899
DOI: 10.1109/TMSCS.2018.2886266
Publication date: 2018-10-01
URL: https://ieeexplore.ieee.org/document/8573843/
Citations: 11

Abstract

The FPGA-based heterogeneous computing platform, owing to its extreme logic reconfigurability, has emerged as a strong contender as a computing fabric for modern AI. As a result, various FPGA-based accelerators for deep CNNs, the key driver of modern AI, have been proposed for their advantages of high performance, reconfigurability, and fast development cycles. In general, the consensus among researchers is that, although FPGA-based accelerators can achieve much higher energy efficiency, their raw computing performance lags behind that of GPUs with similar logic density. In this paper, we develop an alternative methodology to efficiently implement CNNs with FPGAs that outperform GPUs in terms of both power consumption and performance. Our key idea is to design a scalable hardware architecture and circuit design for large-scale CNNs that leverages a stochastic computing principle. Specifically, there are three major performance advantages. First, all key components of our deep learning CNN are designed and implemented to compute stochastically, thus achieving excellent computing performance and energy efficiency. Second, because our proposed CNN architecture enables stream-mode computing, each of its stages can process even partial results from preceding stages, thereby avoiding unnecessary latency due to data dependencies. Finally, our FPGA-based deep CNN also provides superior hardware scalability compared with conventional FPGA implementations by reducing the bandwidth requirement between layers. The results show that our proposed CNN architecture significantly outperforms all previous FPGA-based deep CNN implementation approaches. It achieves 1.58x more GOPS, 6.42x more GOPS/Slice, and 10.92x more GOPS/W compared with a state-of-the-art CNN architecture. The top-5 accuracy of the stochastic VGG-16 CNN is 86.77 percent at a frame rate of 18.91 fps.
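The abstract rests on the stochastic computing principle: values are encoded as random bitstreams whose bit probability carries the value, so multiplication reduces to a single bitwise gate per bit. The paper implements this in FPGA logic; as a minimal software sketch of the standard bipolar encoding and XNOR-based multiply from the stochastic-computing literature (not the paper's actual circuits), where a value x in [-1, 1] is encoded as a stream with P(bit = 1) = (x + 1) / 2:

```python
import random

def to_stream(x, n, rng):
    """Encode x in [-1, 1] as an n-bit bipolar stochastic stream: P(1) = (x + 1) / 2."""
    p = (x + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    """Decode a bipolar stream back to an estimate of the encoded value."""
    return 2 * sum(bits) / len(bits) - 1

def sc_multiply(sa, sb):
    """Bipolar stochastic multiplication: bitwise XNOR of two independent streams.
    In hardware this is one XNOR gate per bit, which is why stochastic CNN
    arithmetic is so cheap in area and power."""
    return [1 - (a ^ b) for a, b in zip(sa, sb)]

rng = random.Random(0)
n = 1 << 14  # longer streams trade latency for accuracy
sa = to_stream(0.5, n, rng)
sb = to_stream(-0.8, n, rng)
prod = from_stream(sc_multiply(sa, sb))
# prod approximates 0.5 * -0.8 = -0.4, within stochastic noise on the order of 1/sqrt(n)
```

Because each output bit depends only on the current input bits, partial streams can be consumed as they arrive, which is the property the abstract's stream-mode pipeline exploits to start a layer before the previous one finishes.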