An OpenCL-Based Hybrid CNN-RNN Inference Accelerator on FPGA

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00048

Yunfei Sun, Brian Liu, Xianchao Xu


Citations: 10

Abstract




Recently, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and CNN-RNN hybrid networks have demonstrated great success in many deep learning scenarios. Although many dedicated FPGA accelerators have been proposed for one kind of network or the other, few of them combine CNN and RNN acceleration. In this paper we propose a high-throughput and resource-efficient CNN-RNN fusion accelerator on FPGA, built with commercial OpenCL, to support general-purpose DNNs. It uses a novel streaming architecture and mapping strategy to implement the most computation-intensive and resource-demanding parts of DNNs on the same computation logic. Through this hardware reuse, it achieves resource efficiency in accelerating CNNs, RNNs, and their hybrid networks. Our accelerator follows a layer-by-layer, subgraph-by-subgraph, or subnetwork-by-subnetwork execution mode, which lets it deploy most DNNs flexibly at runtime with the best performance. YOLOv2, LSTM, and CRNN are tested with our work on an Intel Arria 10 GX1150 FPGA. It achieves a throughput of 646 GOPS on CRNN, which is the best performance on CNN-RNN hybrid networks among high-level-synthesis (HLS) based FPGA accelerators. Moreover, its throughput for CNNs and RNNs is competitive with state-of-the-art specialized FPGA accelerators.
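The abstract does not spell out the mapping strategy, so the following is only a rough illustration of what running CNN and RNN workloads "on the same computation logic" can look like. It is a minimal Python sketch under an assumption of ours, not the authors' design: convolution is lowered to a matrix product (the common im2col technique), and the RNN gate pre-activation is written as one fused matrix-vector product, so both paths call a single shared multiply-accumulate routine. All function names (`matmul`, `im2col`, `conv2d`, `rnn_preactivation`) are hypothetical.

```python
# Illustrative sketch only: the abstract does not describe the actual mapping.
# Assumption: both the CNN path and the RNN path are expressed as matrix
# products so that they can share one multiply-accumulate routine.

def matmul(a, b):
    """Shared compute core: (m x k) times (k x n), using nested lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(k)) for j in range(n)]
            for row in a]

def im2col(image, kh, kw):
    """Unroll every kh x kw window of a 2-D image into one column."""
    h, w = len(image), len(image[0])
    windows = [[image[r + dr][c + dc] for dr in range(kh) for dc in range(kw)]
               for r in range(h - kh + 1) for c in range(w - kw + 1)]
    # Transpose so each window becomes a column: shape (kh*kw, num_windows).
    return [list(col) for col in zip(*windows)]

def conv2d(image, kernels):
    """CNN path: each kernel is flattened into one row of a weight matrix."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    weights = [[v for row in kernel for v in row] for kernel in kernels]
    return matmul(weights, im2col(image, kh, kw))

def rnn_preactivation(w, x, u, h):
    """RNN path: W x + U h computed as the single product [W | U] [x ; h]."""
    fused = [w_row + u_row for w_row, u_row in zip(w, u)]
    column = [[v] for v in x + h]
    return [row[0] for row in matmul(fused, column)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
conv_out = conv2d(image, [kernel])   # one output row per kernel
gate_out = rnn_preactivation([[1, 2], [3, 4]], [1, 1],
                             [[1, 0], [0, 1]], [2, 3])
```

In this sketch, `conv2d` and `rnn_preactivation` differ only in how they arrange their operands before calling `matmul`, which is the sense in which one block of compute logic can serve both network types. A real accelerator would add tiling, on-chip buffering, and fixed-point arithmetic, none of which is modelled here.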

