{"title":"基于高效lut的通用量化CNN推理FPGA加速设计","authors":"Yanpeng Cao, Changjun Song, Yongming Tang","doi":"10.1145/3456126.3456140","DOIUrl":null,"url":null,"abstract":"Deep learning has achieved remarkable success in a variety of tasks in real life, such as speech and vision. However, the vast computational complexity of convolution neural networks (CNN) has limited the speed of the network running in hardware. In recent years, network quantization technology has made it possible to quantize network into the 16-bit fixed point, 8-bit integer, and even binary, maintaining the original performance, while the computational complexity of the network inference is still considerable. Therefore, exploring high-performance and efficient hardware architecture designed for quantized neural networks (QNN) is necessary to eliminate the bottleneck of high-density computing requirements. FPGA is a highly parallelized hardware computing platform. The outstanding advantage is that it contains a large number of primary configurable logic resources. We explore the possibility of implementation for convolution calculations based on LUTs, introduce the integer multipliers and addition trees based on FPGAs, and propose an efficient computing architecture for QNN. With the optimization of Winograd convolution algorithm for QNN, we demonstrate that our scheme could significantly reduce the number of multipliers without using DSP resources, saving the usage of LUT resources by 2.25× at least. In the end, our LUT-based architecture for QNN will shorten the latency up to 19.3× and represent more effective performance compared other methods.","PeriodicalId":431685,"journal":{"name":"2021 2nd Asia Service Sciences and Software Engineering Conference","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Efficient LUT-based FPGA Accelerator Design for Universal Quantized CNN Inference\",\"authors\":\"Yanpeng Cao, Changjun Song, Yongming Tang\",\"doi\":\"10.1145/3456126.3456140\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning has achieved remarkable success in a variety of tasks in real life, such as speech and vision. However, the vast computational complexity of convolution neural networks (CNN) has limited the speed of the network running in hardware. In recent years, network quantization technology has made it possible to quantize network into the 16-bit fixed point, 8-bit integer, and even binary, maintaining the original performance, while the computational complexity of the network inference is still considerable. Therefore, exploring high-performance and efficient hardware architecture designed for quantized neural networks (QNN) is necessary to eliminate the bottleneck of high-density computing requirements. FPGA is a highly parallelized hardware computing platform. The outstanding advantage is that it contains a large number of primary configurable logic resources. We explore the possibility of implementation for convolution calculations based on LUTs, introduce the integer multipliers and addition trees based on FPGAs, and propose an efficient computing architecture for QNN. With the optimization of Winograd convolution algorithm for QNN, we demonstrate that our scheme could significantly reduce the number of multipliers without using DSP resources, saving the usage of LUT resources by 2.25× at least. 
In the end, our LUT-based architecture for QNN will shorten the latency up to 19.3× and represent more effective performance compared other methods.\",\"PeriodicalId\":431685,\"journal\":{\"name\":\"2021 2nd Asia Service Sciences and Software Engineering Conference\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 2nd Asia Service Sciences and Software Engineering Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3456126.3456140\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 2nd Asia Service Sciences and Software Engineering Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3456126.3456140","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Efficient LUT-based FPGA Accelerator Design for Universal Quantized CNN Inference
Deep learning has achieved remarkable success in a variety of real-life tasks, such as speech and vision. However, the vast computational complexity of convolutional neural networks (CNNs) limits how fast these networks can run on hardware. In recent years, network quantization techniques have made it possible to quantize networks to 16-bit fixed-point, 8-bit integer, and even binary representations while maintaining the original accuracy, yet the computational cost of network inference remains considerable. Exploring a high-performance, efficient hardware architecture designed for quantized neural networks (QNNs) is therefore necessary to eliminate the bottleneck of high-density computing requirements. The FPGA is a highly parallel hardware computing platform whose outstanding advantage is an abundance of elementary configurable logic resources. We explore implementing convolution calculations with LUTs, introduce FPGA-based integer multipliers and addition trees, and propose an efficient computing architecture for QNNs. By optimizing the Winograd convolution algorithm for QNNs, we demonstrate that our scheme significantly reduces the number of multipliers without using any DSP resources, cutting LUT usage by at least 2.25×. Finally, our LUT-based QNN architecture shortens latency by up to 19.3× and delivers more effective performance than other methods.
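The Winograd optimization cited in the abstract trades multiplications for additions, which is exactly the trade that favors a LUT-based datapath: adders map cheaply onto LUT fabric while multipliers do not. As a minimal sketch of the general technique (not the authors' FPGA implementation; the function names `winograd_f23` and `direct_conv3` are illustrative), the classic 1-D F(2,3) transform below produces two convolution outputs with 4 multiplications instead of the direct method's 6:

```python
# Minimal sketch of 1-D Winograd F(2,3): two outputs of a 3-tap
# convolution using 4 multiplications instead of the direct 6.
# Names are illustrative, not taken from the paper.

def winograd_f23(d, g):
    """Outputs y[0], y[1] of the valid convolution of input d[0..3]
    with filter g[0..2] via the Winograd F(2,3) transform."""
    # Filter transform (in a QNN accelerator these constants would be
    # precomputed offline, so the divisions cost nothing at run time).
    G0 = g[0]
    G1 = (g[0] + g[1] + g[2]) / 2
    G2 = (g[0] - g[1] + g[2]) / 2
    G3 = g[2]

    # Input transform: additions/subtractions only.
    D0 = d[0] - d[2]
    D1 = d[1] + d[2]
    D2 = d[2] - d[1]
    D3 = d[1] - d[3]

    # Element-wise products: the only 4 multiplications.
    m = [D0 * G0, D1 * G1, D2 * G2, D3 * G3]

    # Output transform: additions/subtractions again.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def direct_conv3(d, g):
    """Reference: direct 3-tap valid convolution (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
assert winograd_f23(d, g) == direct_conv3(d, g)  # both give [4.5, 6.0]
```

In the 2-D F(2×2, 3×3) case commonly used in CNN accelerators, the same nesting reduces 36 multiplications per output tile to 16, a 2.25× reduction that plausibly underlies the LUT-resource saving quoted in the abstract.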