{"title":"基于FPGA的大规模轻量级卷积神经网络的高效推理","authors":"Xiao Wu, Yufei Ma, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524773","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) have achieved significant accuracy improvement in many intelligent applications at the cost of intensive convolution operations and massive data movements. To efficiently deploy CNNs on low power embedded platforms in real time, the depthwise separable convolution has been proposed to replace the standard convolution, especially in lightweight CNNs, which remarkably reduces the computation complexity and model size. However, it is difficult for a general convolution engine to obtain the theoretical performance improvement as the decreased data dependency of depthwise convolution significantly reduces the data reuse opportunity. To address this issue, a flexible and highperformance accelerator based on FPGA is proposed to efficiently process the inference of both large-scale and lightweight CNNs. Firstly, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize the data utilization and minimize the logic overhead. Furthermore, these two layers can be processed either directly after standard convolutions to eliminate the external memory accesses or independently to gain better flexibility. Thirdly, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA\",\"authors\":\"Xiao Wu, Yufei Ma, Zhongfeng Wang\",\"doi\":\"10.1109/socc49529.2020.9524773\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Convolutional neural networks (CNNs) have achieved significant accuracy improvement in many intelligent applications at the cost of intensive convolution operations and massive data movements. To efficiently deploy CNNs on low power embedded platforms in real time, the depthwise separable convolution has been proposed to replace the standard convolution, especially in lightweight CNNs, which remarkably reduces the computation complexity and model size. However, it is difficult for a general convolution engine to obtain the theoretical performance improvement as the decreased data dependency of depthwise convolution significantly reduces the data reuse opportunity. To address this issue, a flexible and highperformance accelerator based on FPGA is proposed to efficiently process the inference of both large-scale and lightweight CNNs. Firstly, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize the data utilization and minimize the logic overhead. Furthermore, these two layers can be processed either directly after standard convolutions to eliminate the external memory accesses or independently to gain better flexibility. 
Thirdly, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.\",\"PeriodicalId\":114740,\"journal\":{\"name\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/socc49529.2020.9524773\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/socc49529.2020.9524773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA
Convolutional neural networks (CNNs) have achieved significant accuracy improvements in many intelligent applications, at the cost of intensive convolution operations and massive data movement. To deploy CNNs efficiently and in real time on low-power embedded platforms, depthwise separable convolution has been proposed to replace standard convolution, especially in lightweight CNNs, remarkably reducing computational complexity and model size. However, it is difficult for a general convolution engine to realize this theoretical improvement, because the reduced data dependency of depthwise convolution significantly limits opportunities for data reuse. To address this issue, a flexible, high-performance FPGA-based accelerator is proposed to efficiently perform inference for both large-scale and lightweight CNNs. First, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize data utilization and minimize logic overhead. Second, these two layers can be processed either directly after standard convolutions, eliminating external memory accesses, or independently, for greater flexibility. Third, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on an Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.
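The complexity reduction that motivates the abstract follows from the well-known factorization of a standard convolution into a depthwise stage and a 1x1 pointwise stage. The Python sketch below is illustrative only; the function names and layer dimensions are hypothetical and not taken from the paper. It compares the multiply-accumulate (MAC) counts of the two forms for a MobileNet-like layer.

```python
# Illustrative sketch (not from the paper): MAC counts of a standard convolution
# versus its depthwise separable counterpart.

def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k standard convolution producing an h x w x c_out output."""
    return h * w * c_out * c_in * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

if __name__ == "__main__":
    # Hypothetical MobileNet-like layer: 112 x 112 feature map, 32 -> 64 channels, 3 x 3 kernel.
    h, w, c_in, c_out, k = 112, 112, 32, 64, 3
    std = standard_conv_macs(h, w, c_in, c_out, k)
    dws = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard convolution:  {std:,} MACs")
    print(f"depthwise separable:   {dws:,} MACs")
    print(f"reduction factor:      {std / dws:.1f}x  (~ 1/c_out + 1/k^2 of the original)")
```

The roughly 1/c_out + 1/k^2 ratio is what makes lightweight CNNs attractive for embedded deployment, but, as the abstract notes, the depthwise stage reuses each activation far less, so a generic convolution engine cannot automatically turn the reduced MAC count into a proportional speedup.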