High Throughput CNN Accelerator Design Based on FPGA
Authors: Liang Xie, Xitian Fan, Wei Cao, Lingli Wang
Venue: 2018 International Conference on Field-Programmable Technology (FPT)
Published: 2018-12-01
DOI: 10.1109/FPT.2018.00052 (https://doi.org/10.1109/FPT.2018.00052)
As FPGA on-chip memory capacity has grown significantly, the feature maps and weights of convolutional layers can be stored on chip, which reduces data movement between on-chip and off-chip memory. The bottleneck in convolutional layers can therefore shift from memory bandwidth to computing resources, which improves performance dramatically. Under this condition, this paper uses the roofline model to analyze quantitatively how to design the hardware architecture so as to optimize performance under the constraints of the available on-chip computing resources, and proposes an efficient architecture. The accelerator is implemented on a Xilinx UltraScale+ FPGA and achieves 9.39 TOPS on ResNet-50 and 6.86 TOPS on AlexNet with 8-bit data width, a 100 MHz main clock frequency, and a 400 MHz DSP clock frequency, which outperforms existing FPGA-based CNN accelerators.
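The roofline reasoning in the abstract can be illustrated with a minimal sketch. It is not the authors' model or code: it only encodes the standard roofline bound, attainable performance = min(peak compute, operational intensity x memory bandwidth), and shows how raising operational intensity (for example, by keeping feature maps and weights on chip) moves a layer from the bandwidth-bound region to the compute-bound region. All function names and all numeric values below are hypothetical and chosen only for illustration.

```python
def peak_ops_per_second(num_mac_units, clock_hz):
    """Peak throughput of a MAC array, counting each MAC as 2 operations.

    num_mac_units and clock_hz are hypothetical design parameters,
    not figures taken from the paper.
    """
    return 2 * num_mac_units * clock_hz


def conv_layer_ops(out_h, out_w, out_ch, in_ch, k_h, k_w):
    """Operation count of one convolutional layer (multiply + add = 2 ops)."""
    return 2 * out_h * out_w * out_ch * in_ch * k_h * k_w


def attainable_performance(peak_ops, bandwidth_bytes_per_s, ops_per_byte):
    """Standard roofline bound: the lower of the compute roof and the
    bandwidth roof (operational intensity times memory bandwidth)."""
    return min(peak_ops, bandwidth_bytes_per_s * ops_per_byte)


def is_compute_bound(peak_ops, bandwidth_bytes_per_s, ops_per_byte):
    """True when the bandwidth roof lies at or above the compute roof."""
    return bandwidth_bytes_per_s * ops_per_byte >= peak_ops


if __name__ == "__main__":
    # Hypothetical platform: 10,000 MAC units at 400 MHz, 20 GB/s off-chip.
    peak = peak_ops_per_second(10_000, 400_000_000)
    bandwidth = 20_000_000_000

    # Two hypothetical operational intensities (ops per off-chip byte):
    # a low one (data streamed from off chip) and a high one (data kept on chip).
    for intensity in (50, 1_000):
        perf = attainable_performance(peak, bandwidth, intensity)
        region = "compute" if is_compute_bound(peak, bandwidth, intensity) else "bandwidth"
        print(f"intensity={intensity} ops/byte -> {perf / 1e12:.2f} TOPS ({region}-bound)")
```

In this sketch, once the design is compute-bound, further gains can only come from raising the compute roof (more MAC units or a higher DSP clock), which is the regime the abstract describes.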