{"title":"HCG:基于FPGA的混合计算粒度方案的流DCNN加速器。","authors":"Wenjin Huang,Conghui Luo,Baoze Zhao,Han Jiao,Yihua Huang","doi":"10.1109/tnnls.2025.3587694","DOIUrl":null,"url":null,"abstract":"With the growth of field-programmable gate array (FPGA) hardware resources, streaming DCNN accelerators leverage interconvolutional-layer parallelism to enhance throughput. In existing streaming accelerators, convolution nodes typically adopt layer- or column-based tiling methods, where the tiled input feature map (Ifmap) encompasses all input channels. This approach facilitates the comprehensive calculation of the output feature map (Ofmap) and maximizes interlayer parallelism. The computational granularity, defined in this study as the calculated rows or columns of Ofmap based on each tiled Ifmap data, significantly influences on-chip Ifmap storage and off-chip weight bandwidth (BW). The uniform application of computational granularity across all nodes inevitably impacts the memory-BW tradeoff. This article introduces a novel streaming accelerator with a hybrid computational granularity (HCG) scheme. Each node employs an independently optimized computational granularity, enabling a more flexible memory-BW tradeoff and more effective utilization of FPGA resources. However, this hybrid scheme can introduce pipeline bubbles and increase system pipeline complexity and control logic. To address these challenges, this article theoretically analyzes the impact of computational granularity on individual computing nodes and the overall system, aiming to establish a seamless system pipeline without pipeline bubbles and simplify system design. Furthermore, the article develops a hardware overhead model and employs a heuristic algorithm to optimize computational granularity for each computing node, achieving optimal memory-BW tradeoff and higher throughput. Finally, the effectiveness of the proposed design and optimization methodology is validated through the implementation of a 3-TOPS ResNet-18 accelerator on the Alveo U250 development board under BW constraints of 25, 20, and 15 GB/s. Additionally, accelerators for 4-TOPS VGG-16, 4-TOPS ResNet-34, 5-TOPS ResNet-50, 3-TOPS MobileNetV1, 4-TOPS ConvNeXt-T, and 4-TOPS ResNeXt-50 are implemented, surpassing the performance of most existing works.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"65 1","pages":""},"PeriodicalIF":8.9000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HCG: Streaming DCNN Accelerator With a Hybrid Computational Granularity Scheme on FPGA.\",\"authors\":\"Wenjin Huang,Conghui Luo,Baoze Zhao,Han Jiao,Yihua Huang\",\"doi\":\"10.1109/tnnls.2025.3587694\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the growth of field-programmable gate array (FPGA) hardware resources, streaming DCNN accelerators leverage interconvolutional-layer parallelism to enhance throughput. In existing streaming accelerators, convolution nodes typically adopt layer- or column-based tiling methods, where the tiled input feature map (Ifmap) encompasses all input channels. This approach facilitates the comprehensive calculation of the output feature map (Ofmap) and maximizes interlayer parallelism. 
The computational granularity, defined in this study as the calculated rows or columns of Ofmap based on each tiled Ifmap data, significantly influences on-chip Ifmap storage and off-chip weight bandwidth (BW). The uniform application of computational granularity across all nodes inevitably impacts the memory-BW tradeoff. This article introduces a novel streaming accelerator with a hybrid computational granularity (HCG) scheme. Each node employs an independently optimized computational granularity, enabling a more flexible memory-BW tradeoff and more effective utilization of FPGA resources. However, this hybrid scheme can introduce pipeline bubbles and increase system pipeline complexity and control logic. To address these challenges, this article theoretically analyzes the impact of computational granularity on individual computing nodes and the overall system, aiming to establish a seamless system pipeline without pipeline bubbles and simplify system design. Furthermore, the article develops a hardware overhead model and employs a heuristic algorithm to optimize computational granularity for each computing node, achieving optimal memory-BW tradeoff and higher throughput. Finally, the effectiveness of the proposed design and optimization methodology is validated through the implementation of a 3-TOPS ResNet-18 accelerator on the Alveo U250 development board under BW constraints of 25, 20, and 15 GB/s. Additionally, accelerators for 4-TOPS VGG-16, 4-TOPS ResNet-34, 5-TOPS ResNet-50, 3-TOPS MobileNetV1, 4-TOPS ConvNeXt-T, and 4-TOPS ResNeXt-50 are implemented, surpassing the performance of most existing works.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"65 1\",\"pages\":\"\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2025-07-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/tnnls.2025.3587694\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tnnls.2025.3587694","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract: With the growth of field-programmable gate array (FPGA) hardware resources, streaming DCNN accelerators leverage interconvolutional-layer parallelism to enhance throughput. In existing streaming accelerators, convolution nodes typically adopt layer- or column-based tiling, where each tiled input feature map (Ifmap) encompasses all input channels. This approach facilitates complete calculation of the output feature map (Ofmap) and maximizes interlayer parallelism. The computational granularity, defined in this study as the number of Ofmap rows or columns computed from each tile of Ifmap data, significantly influences on-chip Ifmap storage and off-chip weight bandwidth (BW). Applying a uniform computational granularity across all nodes inevitably constrains the memory-BW tradeoff. This article introduces a novel streaming accelerator with a hybrid computational granularity (HCG) scheme. Each node employs an independently optimized computational granularity, enabling a more flexible memory-BW tradeoff and more effective utilization of FPGA resources. However, this hybrid scheme can introduce pipeline bubbles and increase the complexity of the system pipeline and its control logic. To address these challenges, this article theoretically analyzes the impact of computational granularity on individual computing nodes and on the overall system, aiming to establish a seamless system pipeline free of bubbles and to simplify system design. Furthermore, the article develops a hardware overhead model and employs a heuristic algorithm to optimize the computational granularity of each computing node, achieving an optimal memory-BW tradeoff and higher throughput. Finally, the effectiveness of the proposed design and optimization methodology is validated through the implementation of a 3-TOPS ResNet-18 accelerator on the Alveo U250 development board under BW constraints of 25, 20, and 15 GB/s. Additionally, accelerators for 4-TOPS VGG-16, 4-TOPS ResNet-34, 5-TOPS ResNet-50, 3-TOPS MobileNetV1, 4-TOPS ConvNeXt-T, and 4-TOPS ResNeXt-50 are implemented, surpassing the performance of most existing works.
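The abstract describes the memory-BW tradeoff only qualitatively, so a small model helps make it concrete. The following Python sketch is a hypothetical illustration under assumed definitions, not the paper's actual hardware overhead model or heuristic: it assumes the granularity g is the number of Ofmap rows produced per Ifmap tile, that every tile re-streams the node's full weight set from off-chip memory, and that the on-chip line buffer must hold (g - 1) * stride + K Ifmap rows. ConvNode, buffer_bytes, weight_bw, next_g, and optimize are invented names, and the layer shapes are toy ResNet-18-like values.

```python
"""Minimal sketch of the memory-BW tradeoff behind per-node computational
granularity in a streaming DCNN accelerator. All modeling assumptions and
shapes are illustrative, not taken from the paper."""
import math
from dataclasses import dataclass


@dataclass
class ConvNode:
    name: str
    ofmap_h: int       # Ofmap rows per frame
    ifmap_w: int       # Ifmap width (pixels per row)
    in_ch: int         # input channels
    kernel: int        # square kernel size K
    stride: int
    weight_bytes: int  # total weight footprint of this node


def buffer_bytes(node: ConvNode, g: int) -> int:
    """On-chip Ifmap bytes buffered to produce g Ofmap rows (1 B/activation)."""
    rows = (g - 1) * node.stride + node.kernel
    return rows * node.ifmap_w * node.in_ch


def weight_bw(node: ConvNode, g: int, fps: float) -> float:
    """Off-chip weight traffic (B/s): the full weight set streams once per tile."""
    tiles = math.ceil(node.ofmap_h / g)
    return tiles * node.weight_bytes * fps


def next_g(node: ConvNode, g: int):
    """Smallest granularity above g that actually reduces the tile count."""
    tiles = math.ceil(node.ofmap_h / g)
    if tiles == 1:
        return None  # already one tile per frame; no BW left to save
    return math.ceil(node.ofmap_h / (tiles - 1))


def optimize(nodes, bw_budget, fps):
    """Greedy heuristic: start at g=1 everywhere (min memory, max BW), then
    grow granularity where it saves the most BW per byte of extra buffer."""
    g = {n.name: 1 for n in nodes}
    while sum(weight_bw(n, g[n.name], fps) for n in nodes) > bw_budget:
        best, best_g, best_ratio = None, None, -1.0
        for n in nodes:
            g2 = next_g(n, g[n.name])
            if g2 is None:
                continue
            bw_saved = weight_bw(n, g[n.name], fps) - weight_bw(n, g2, fps)
            mem_cost = buffer_bytes(n, g2) - buffer_bytes(n, g[n.name])
            ratio = bw_saved / mem_cost
            if ratio > best_ratio:
                best, best_g, best_ratio = n, g2, ratio
        if best is None:
            raise RuntimeError("BW budget infeasible even at one tile per frame")
        g[best.name] = best_g
    return g


if __name__ == "__main__":
    # Toy three-node pipeline with ResNet-18-like shapes (made-up values).
    net = [
        ConvNode("conv1", ofmap_h=112, ifmap_w=224, in_ch=3,  kernel=7, stride=2, weight_bytes=9_408),
        ConvNode("conv2", ofmap_h=56,  ifmap_w=56,  in_ch=64, kernel=3, stride=1, weight_bytes=36_864),
        ConvNode("conv3", ofmap_h=28,  ifmap_w=56,  in_ch=64, kernel=3, stride=2, weight_bytes=73_728),
    ]
    g = optimize(net, bw_budget=2e9, fps=1000)  # 2 GB/s budget, 1000 frames/s
    for n in net:
        print(n.name, "granularity:", g[n.name],
              "buffer B:", buffer_bytes(n, g[n.name]),
              "weight BW GB/s:", weight_bw(n, g[n.name], 1000) / 1e9)
```

Under these assumptions, growing a node's granularity only at the breakpoints where its tile count actually drops keeps every greedy step strictly BW-reducing. The paper's optimization additionally accounts for pipeline-bubble elimination and per-node hardware overheads, which this sketch omits.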
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.