{"title":"基于FPGA的大规模轻量级卷积神经网络的高效推理","authors":"Xiao Wu, Yufei Ma, Zhongfeng Wang","doi":"10.1109/socc49529.2020.9524773","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) have achieved significant accuracy improvement in many intelligent applications at the cost of intensive convolution operations and massive data movements. To efficiently deploy CNNs on low power embedded platforms in real time, the depthwise separable convolution has been proposed to replace the standard convolution, especially in lightweight CNNs, which remarkably reduces the computation complexity and model size. However, it is difficult for a general convolution engine to obtain the theoretical performance improvement as the decreased data dependency of depthwise convolution significantly reduces the data reuse opportunity. To address this issue, a flexible and highperformance accelerator based on FPGA is proposed to efficiently process the inference of both large-scale and lightweight CNNs. Firstly, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize the data utilization and minimize the logic overhead. Furthermore, these two layers can be processed either directly after standard convolutions to eliminate the external memory accesses or independently to gain better flexibility. Thirdly, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.","PeriodicalId":114740,"journal":{"name":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA\",\"authors\":\"Xiao Wu, Yufei Ma, Zhongfeng Wang\",\"doi\":\"10.1109/socc49529.2020.9524773\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Convolutional neural networks (CNNs) have achieved significant accuracy improvement in many intelligent applications at the cost of intensive convolution operations and massive data movements. To efficiently deploy CNNs on low power embedded platforms in real time, the depthwise separable convolution has been proposed to replace the standard convolution, especially in lightweight CNNs, which remarkably reduces the computation complexity and model size. However, it is difficult for a general convolution engine to obtain the theoretical performance improvement as the decreased data dependency of depthwise convolution significantly reduces the data reuse opportunity. To address this issue, a flexible and highperformance accelerator based on FPGA is proposed to efficiently process the inference of both large-scale and lightweight CNNs. Firstly, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize the data utilization and minimize the logic overhead. Furthermore, these two layers can be processed either directly after standard convolutions to eliminate the external memory accesses or independently to gain better flexibility. 
Thirdly, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.\",\"PeriodicalId\":114740,\"journal\":{\"name\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 33rd International System-on-Chip Conference (SOCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/socc49529.2020.9524773\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/socc49529.2020.9524773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA
Convolutional neural networks (CNNs) have achieved significant accuracy improvements in many intelligent applications, at the cost of intensive convolution operations and massive data movement. To deploy CNNs efficiently and in real time on low-power embedded platforms, depthwise separable convolution has been proposed to replace standard convolution, especially in lightweight CNNs, remarkably reducing computational complexity and model size. However, it is difficult for a general convolution engine to realize this theoretical improvement, because the reduced data dependency of depthwise convolution significantly limits opportunities for data reuse. To address this issue, a flexible, high-performance FPGA-based accelerator is proposed to efficiently perform inference for both large-scale and lightweight CNNs. First, by sharing the activation dataflow between the depthwise convolution and pooling layers, the control logic and data bus of the two layers are reused to maximize data utilization and minimize logic overhead. Second, these two layers can be processed either directly after standard convolutions, eliminating external memory accesses, or independently, for greater flexibility. Third, a performance model is proposed to automatically explore the optimal design options of the accelerator. The proposed hardware accelerator is evaluated on an Intel Arria 10 SoC FPGA and demonstrates state-of-the-art performance on both large-scale CNNs, e.g., VGG, and lightweight ones, e.g., MobileNet.
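The complexity reduction that motivates the abstract follows from the well-known factorization of a standard convolution into a depthwise stage and a 1x1 pointwise stage. The Python sketch below is illustrative only; the function names and layer dimensions are hypothetical and not taken from the paper. It compares the multiply-accumulate (MAC) counts of the two forms for a MobileNet-like layer.

```python
# Illustrative sketch (not from the paper): MAC counts of a standard convolution
# versus its depthwise separable counterpart.

def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k standard convolution producing an h x w x c_out output."""
    return h * w * c_out * c_in * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

if __name__ == "__main__":
    # Hypothetical MobileNet-like layer: 112 x 112 feature map, 32 -> 64 channels, 3 x 3 kernel.
    h, w, c_in, c_out, k = 112, 112, 32, 64, 3
    std = standard_conv_macs(h, w, c_in, c_out, k)
    dws = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard convolution:  {std:,} MACs")
    print(f"depthwise separable:   {dws:,} MACs")
    print(f"reduction factor:      {std / dws:.1f}x  (~ 1/c_out + 1/k^2 of the original)")
```

The roughly 1/c_out + 1/k^2 ratio is what makes lightweight CNNs attractive for embedded deployment, but, as the abstract notes, the depthwise stage reuses each activation far less, so a generic convolution engine cannot automatically turn the reduced MAC count into a proportional speedup.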