HCG: Streaming DCNN Accelerator With a Hybrid Computational Granularity Scheme on FPGA

IF 8.9 · CAS Region 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Wenjin Huang, Conghui Luo, Baoze Zhao, Han Jiao, Yihua Huang
DOI: 10.1109/tnnls.2025.3587694
Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication date: 2025-07-24 · Journal Article
Citations: 0

Abstract

With the growth of field-programmable gate array (FPGA) hardware resources, streaming DCNN accelerators leverage interconvolutional-layer parallelism to enhance throughput. In existing streaming accelerators, convolution nodes typically adopt layer- or column-based tiling methods, where the tiled input feature map (Ifmap) encompasses all input channels. This approach facilitates the comprehensive calculation of the output feature map (Ofmap) and maximizes interlayer parallelism. The computational granularity, defined in this study as the calculated rows or columns of Ofmap based on each tiled Ifmap data, significantly influences on-chip Ifmap storage and off-chip weight bandwidth (BW). The uniform application of computational granularity across all nodes inevitably impacts the memory-BW tradeoff. This article introduces a novel streaming accelerator with a hybrid computational granularity (HCG) scheme. Each node employs an independently optimized computational granularity, enabling a more flexible memory-BW tradeoff and more effective utilization of FPGA resources. However, this hybrid scheme can introduce pipeline bubbles and increase system pipeline complexity and control logic. To address these challenges, this article theoretically analyzes the impact of computational granularity on individual computing nodes and the overall system, aiming to establish a seamless system pipeline without pipeline bubbles and simplify system design. Furthermore, the article develops a hardware overhead model and employs a heuristic algorithm to optimize computational granularity for each computing node, achieving optimal memory-BW tradeoff and higher throughput. Finally, the effectiveness of the proposed design and optimization methodology is validated through the implementation of a 3-TOPS ResNet-18 accelerator on the Alveo U250 development board under BW constraints of 25, 20, and 15 GB/s. 
Additionally, accelerators for 4-TOPS VGG-16, 4-TOPS ResNet-34, 5-TOPS ResNet-50, 3-TOPS MobileNetV1, 4-TOPS ConvNeXt-T, and 4-TOPS ResNeXt-50 are implemented, surpassing the performance of most existing works.
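To make the central tradeoff concrete, the following sketch models, in simplified form, how per-node computational granularity trades on-chip Ifmap storage against off-chip weight bandwidth, and how a greedy heuristic could assign granularities under a bandwidth budget. This is an illustrative toy model, not the paper's actual cost model or algorithm: the cost functions, constants, and the `greedy_granularity` heuristic below are all hypothetical stand-ins for the hardware overhead model and heuristic the authors describe.

```python
# Hypothetical sketch of the memory-bandwidth tradeoff behind per-node
# computational granularity. All constants and cost functions are invented
# for illustration; the paper's real hardware overhead model differs.

def node_cost(granularity, ifmap_rows_per_tile=4, weight_bytes=1_000_000,
              ofmap_rows=56, bytes_per_row=256):
    """Toy cost model for one convolution node.

    A larger granularity (more Ofmap rows computed per tiled Ifmap chunk)
    buffers more Ifmap rows on chip, but reuses each off-chip weight fetch
    across more output rows, lowering weight bandwidth; a smaller
    granularity does the opposite.
    """
    on_chip_bytes = granularity * ifmap_rows_per_tile * bytes_per_row
    # Weights are refetched once per granularity-sized chunk of Ofmap rows.
    weight_traffic = weight_bytes * (ofmap_rows / granularity)
    return on_chip_bytes, weight_traffic


def greedy_granularity(nodes, bw_budget, max_granularity=56):
    """Toy greedy heuristic: start every node at the finest granularity,
    then repeatedly coarsen the node whose next step saves the most weight
    traffic, until total traffic fits the bandwidth budget (or every node
    is at the coarsest setting)."""
    grains = {n: 1 for n in nodes}

    def total_traffic():
        return sum(node_cost(g)[1] for g in grains.values())

    while total_traffic() > bw_budget:
        candidates = [n for n in grains if grains[n] < max_granularity]
        if not candidates:
            break  # budget unreachable even at the coarsest granularity
        # Coarsen the node whose step yields the largest traffic drop.
        best = max(candidates,
                   key=lambda n: node_cost(grains[n])[1]
                                 - node_cost(grains[n] + 1)[1])
        grains[best] += 1
    return grains
```

Even in this toy form, the sketch shows why a uniform granularity is suboptimal: nodes with heavy weights benefit from coarser granularity (more weight reuse), while nodes with large Ifmaps prefer finer granularity (smaller on-chip buffers), so an independently chosen granularity per node gives a better aggregate memory-bandwidth operating point.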
Source journal: IEEE Transactions on Neural Networks and Learning Systems
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
CiteScore: 23.80
Self-citation rate: 9.60%
Annual publications: 2102
Review time: 3-8 weeks
About the journal: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.