An Optimal Design Method of Conv2d Operator for TensorFlow Based on FPGA Accelerator

Rengang Li, Hongwei Kan, Dongdong Su, Yanwei Wang, Hongbo Zhao, Peilin Tong
DOI: 10.1145/3424978.3424987
Published in: Proceedings of the 4th International Conference on Computer Science and Application Engineering, 2020-10-20
Citations: 1

Abstract

Currently, the TensorFlow architecture supports only CPU and GPU programming and has not yet formed a unified support standard for FPGAs. To the best of our knowledge, when a forward operator in TensorFlow specifies a new device, the backward gradient operator in the same neural network cannot use that device, which violates TensorFlow's rules for node device allocation. We therefore propose an improved node device allocation algorithm based on the placement mechanism, together with an OpenCL-based optimization algorithm for the conv2d operator. The improved allocation algorithm lets forward and backward operators running on the FPGA accelerator satisfy the node and device allocation requirements imposed on all TensorFlow operators, while the OpenCL-based conv2d optimization exploits the FPGA's parallel computing capability. Finally, we conduct experiments with the CNN LeNet5 model on the MNIST dataset, implementing on the FPGA accelerator both the forward and backward operators involved in the first four layers of the model. The experimental results show that all three methods (CPU, GPU, and FPGA) reach an accuracy above 98%, with a difference of only about five thousandths between them. In addition, we measured the runtime of the conv2d operator in the first layer of the model under different batch sizes; when the input batch size grows to 10000, the FPGA runs 9 times faster than the CPU. This demonstrates that the proposed solution enables TensorFlow to use FPGA operators for neural network computation.
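For reference, the conv2d forward operator that the paper offloads to the FPGA computes a standard 2-D convolution. The following is a minimal NumPy sketch of that computation, not the authors' OpenCL kernel; the function name, NHWC/HWIO layout, and valid padding are illustrative assumptions, chosen to match TensorFlow's default conv2d conventions:

```python
import numpy as np

def conv2d_forward(x, w, stride=1):
    """Direct 2-D convolution with valid padding.

    Illustrative sketch only (not the paper's OpenCL implementation).
    x: input batch, shape (n, h, w, c_in)   -- NHWC, as in TensorFlow
    w: filters,     shape (kh, kw, c_in, c_out)  -- HWIO, as in TensorFlow
    """
    n, h, wd, c_in = x.shape
    kh, kw, _, c_out = w.shape
    oh = (h - kh) // stride + 1
    ow = (wd - kw) // stride + 1
    y = np.zeros((n, oh, ow, c_out))
    for i in range(oh):
        for j in range(ow):
            # Slice out the receptive field for this output position.
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw, :]
            # Contract the patch (kh, kw, c_in) against every output filter;
            # each (i, j) position is independent, which is exactly the
            # parallelism an FPGA/OpenCL kernel can exploit.
            y[:, i, j, :] = np.tensordot(patch, w, axes=([1, 2, 3], [0, 1, 2]))
    return y
```

Because every output position is computed independently, the two spatial loops map naturally onto parallel OpenCL work-items, which is the kind of data parallelism the paper's FPGA implementation relies on.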