深度卷积神经网络FPGA加速中的循环运算和数据流优化

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI:10.1145/3020078.3021736

Yufei Ma, Yu Cao, S. Vrudhula, Jae-sun Seo

{"title":"深度卷积神经网络FPGA加速中的循环运算和数据流优化","authors":"Yufei Ma, Yu Cao, S. Vrudhula, Jae-sun Seo","doi":"10.1145/3020078.3021736","DOIUrl":null,"url":null,"abstract":"As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing end-to-end VGG-16 CNN model and achieved 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2× enhancement compared to state-of-the-art FPGA implementations of VGG model.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"326","resultStr":"{\"title\":\"Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks\",\"authors\":\"Yufei Ma, Yu Cao, S. Vrudhula, Jae-sun Seo\",\"doi\":\"10.1145/3020078.3021736\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing end-to-end VGG-16 CNN model and achieved 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2× enhancement compared to state-of-the-art FPGA implementations of VGG model.\",\"PeriodicalId\":252039,\"journal\":{\"name\":\"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"326\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3020078.3021736\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021736","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 326

摘要

卷积层是卷积神经网络(CNN)算法的主要运算层，有效的卷积加速方案对卷积神经网络硬件加速器的效率和性能有重要影响。cnn中的卷积涉及到四层循环的三维乘法和累积(MAC)运算，这导致了很大的设计空间。之前的工作要么采用有限循环优化技术，例如循环展开、平铺和交换，要么在加速器架构和数据流已经固定之后才调整一些设计变量。在硬件设计阶段之前，如果不充分研究卷积循环优化，所得到的加速器很难有效地利用数据重用和管理数据移动。本工作通过基于多个设计变量定量分析和优化CNN加速器的设计目标(例如所需的内存访问)，克服了这些障碍。我们通过搜索设计变量配置，系统地探索了硬件成本的权衡，并提出了一种特定的硬件CNN加速数据流，以最大限度地减少内存访问和数据移动，同时最大限度地提高资源利用率，以实现高性能。通过实现端到端VGG-16 CNN模型，在独立的Altera Arria 10 GX 1150 FPGA上演示了所提出的CNN加速方案和架构，实现了645.25 GOPS的吞吐量和47.97 ms的延迟，与目前最先进的VGG模型FPGA实现相比，提高了3.2倍以上。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. required memory access) of the CNN accelerator based on multiple design variables. We systematically explore the trade-offs of hardware cost by searching the design variable configurations, and propose a specific dataflow of hardware CNN acceleration to minimize the memory access and data movement while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated on a standalone Altera Arria 10 GX 1150 FPGA by implementing end-to-end VGG-16 CNN model and achieved 645.25 GOPS of throughput and 47.97 ms of latency, which is a >3.2× enhancement compared to state-of-the-art FPGA implementations of VGG model.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

自引率

0.00%

发文量