Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Pub Date: 2015-02-22
DOI: 10.1145/2684746.2689060

Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, J. Cong


Citations: 1754

Abstract

Convolutional neural networks (CNNs) have been widely employed for image recognition because they can achieve high accuracy by emulating the behavior of optic nerves in living creatures. Recently, the rapid growth of modern applications based on deep learning algorithms has further advanced research and implementation. In particular, various accelerators for deep CNNs have been proposed on FPGA platforms, which offer high performance, reconfigurability, and fast development cycles. Although current FPGA accelerators have demonstrated better performance than generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve the best performance because they under-utilize either logic resources or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. To overcome it, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with the best performance and the lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which significantly outperforms previous approaches.
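The roofline-based selection described in the abstract can be illustrated with a minimal Python sketch. It assumes the standard roofline relation, in which attainable performance is the smaller of a design's computational roof and its computation-to-communication (CTC) ratio multiplied by the available memory bandwidth. The `Design` class, the candidate names, every number, and the use of DSP count as a stand-in for FPGA resource cost are hypothetical and chosen for illustration only; none of them are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """One candidate accelerator configuration (all values hypothetical)."""
    name: str
    compute_roof_gflops: float  # throughput the logic could sustain if never starved
    ctc_ratio: float            # FLOPs performed per byte of off-chip traffic
    dsp_used: int               # stand-in for FPGA resource cost


def attainable_gflops(design: Design, bandwidth_gbs: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(design.compute_roof_gflops, design.ctc_ratio * bandwidth_gbs)


def pick_best(designs, bandwidth_gbs: float) -> Design:
    """Highest attainable performance first; ties go to the design using fewer DSPs."""
    return max(
        designs,
        key=lambda d: (attainable_gflops(d, bandwidth_gbs), -d.dsp_used),
    )


# Hypothetical candidates, e.g. produced by different loop-tiling choices.
candidates = [
    Design("A", compute_roof_gflops=80.0, ctc_ratio=10.0, dsp_used=2400),
    Design("B", compute_roof_gflops=60.0, ctc_ratio=20.0, dsp_used=2200),
    Design("C", compute_roof_gflops=60.0, ctc_ratio=40.0, dsp_used=1800),
]

best = pick_best(candidates, bandwidth_gbs=4.0)
print(best.name, attainable_gflops(best, 4.0))
```

In this made-up example, design A has the highest computational roof but is memory-bound at the assumed 4 GB/s, so it attains less than B and C. B and C attain the same performance, and the tie-break favors C because it uses fewer resources. This mirrors the abstract's point that a design under-utilizing either logic or bandwidth is not the best choice.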
