Efficient LUT-based FPGA Accelerator Design for Universal Quantized CNN Inference

Yanpeng Cao, Changjun Song, Yongming Tang
DOI: 10.1145/3456126.3456140
Published: 2021-02-24, in *2021 2nd Asia Service Sciences and Software Engineering Conference*
Citations: 2

Abstract

Deep learning has achieved remarkable success in a variety of real-life tasks, such as speech and vision. However, the vast computational complexity of convolutional neural networks (CNNs) limits how fast these networks can run in hardware. In recent years, network quantization has made it possible to quantize networks to 16-bit fixed-point, 8-bit integer, and even binary representations while maintaining the original accuracy, yet the computational complexity of network inference remains considerable. Exploring a high-performance, efficient hardware architecture designed for quantized neural networks (QNNs) is therefore necessary to eliminate the bottleneck of high-density computing requirements. An FPGA is a highly parallel hardware computing platform whose outstanding advantage is its large number of basic configurable logic resources. We explore the feasibility of implementing convolution calculations with LUTs, introduce FPGA-based integer multipliers and adder trees, and propose an efficient computing architecture for QNNs. By optimizing the Winograd convolution algorithm for QNNs, we demonstrate that our scheme significantly reduces the number of multipliers without using DSP resources, saving at least 2.25× of LUT resource usage. Finally, our LUT-based architecture for QNNs shortens latency by up to 19.3× and delivers more effective performance than other methods.
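The abstract does not spell out how Winograd convolution trades multiplications for additions, which is what makes it attractive on LUT fabric. As a minimal illustration (not the paper's implementation), the classic 1-D Winograd F(2,3) algorithm below computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct convolution needs; the filter transform is assumed to be precomputed offline, so on hardware only the 4 multiplies and a few LUT-friendly additions remain per output pair.

```python
def winograd_f2_3(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter using 4
    multiplications instead of the 6 required by direct convolution."""
    d0, d1, d2, d3 = d  # 4 input samples
    g0, g1, g2 = g      # 3 filter taps
    # Filter transform: computed once per kernel, offline on the host.
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # The 4 multiplications -- the costly operations on the FPGA.
    m1 = (d0 - d2) * G0
    m2 = (d1 + d2) * G1
    m3 = (d2 - d1) * G2
    m4 = (d1 - d3) * G3
    # Output transform: additions only, cheap to build as LUT adder trees.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_3tap(d, g):
    """Reference: direct 3-tap convolution (6 multiplications)."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]
```

The 2-D variant used for CNN layers, F(2×2, 3×3), nests this scheme in both dimensions, cutting the multiplications per 2×2 output tile from 36 to 16 (a 2.25× reduction, matching the LUT saving the abstract reports).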