A High-Intensity Solution of Hardware Accelerator for Sparse and Redundant Computations in Semantic Segmentation Models

Impact Factor: 3.8 · CAS Zone 2 (Computer Science) · JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Jiahui Huang;Zhan Li;Yuxian Jiang;Zhihan Zhang;Hao Wang;Sheng Chang
DOI: 10.1109/TC.2025.3585354
Journal: IEEE Transactions on Computers, vol. 74, no. 9, pp. 3129-3142
Publication date: 2025-07-02 (Journal Article)
URL: https://ieeexplore.ieee.org/document/11062861/
Citations: 0

Abstract

The rapid development of artificial intelligence (AI) has enabled increasingly personalized applications. However, as data volumes and computing requirements grow, the imbalance between large-scale data transmission and limited network bandwidth has become increasingly prominent. To improve the responsiveness of embedded systems, real-time intelligent computing is gradually moving from the cloud to the edge. Traditional FPGA-based AI accelerators mainly adopt processing-element (PE) architectures, but their low computing throughput and resource utilization make it difficult to meet the power requirements of edge AI scenarios such as image segmentation. In recent years, AI accelerators based on streaming architectures have become a trend, and it is necessary to customize high-performance streaming accelerators for specific segmentation algorithms. In this paper, we design a high-intensity, pixel-level, fully pipelined accelerator with customized strategies that eliminate the sparse and redundant computations in specific semantic segmentation algorithms, significantly improving the accelerator's computing throughput and hardware resource utilization. On a Xilinx FPGA, our acceleration of two typical semantic segmentation networks, ESPNet and DeepLabV3, achieves optimized throughputs of 171.3 GOPS and 1324.8 GOPS, and computing efficiency of 9.26 and 9.01, respectively. This makes hardware deployment feasible for real-time applications with high computing intensity.
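The paper's specific elimination strategies are only described in the full text; as a generic software illustration of the underlying idea (not the authors' hardware design), the sketch below compares a naive dense convolution against a zero-skipping variant that suppresses multiply-accumulate (MAC) operations on sparse, zero-valued activations — the kind of wasted work a sparsity-aware pipeline avoids. All function names and the example data here are hypothetical.

```python
import numpy as np

def conv2d_dense(x, w):
    """Naive dense 2D convolution (valid padding): every MAC executes."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    macs = 0
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for di in range(k):
                for dj in range(k):
                    out[i, j] += x[i + di, j + dj] * w[di, dj]
                    macs += 1
    return out, macs

def conv2d_skip_zero(x, w):
    """Same convolution, but zero-valued inputs issue no MAC at all."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    macs = 0
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for di in range(k):
                for dj in range(k):
                    v = x[i + di, j + dj]
                    if v != 0:  # sparsity check: skip the MAC entirely
                        out[i, j] += v * w[di, dj]
                        macs += 1
    return out, macs

# Sparse input, e.g. the output of a ReLU layer (mostly zeros).
x = np.array([[0., 2., 0., 1.],
              [3., 0., 0., 0.],
              [0., 1., 4., 0.],
              [0., 0., 0., 5.]])
w = np.ones((3, 3))

out_dense, macs_dense = conv2d_dense(x, w)
out_skip, macs_skip = conv2d_skip_zero(x, w)
print(macs_dense, macs_skip)  # identical results, far fewer MACs issued
```

The result is numerically identical, but the issued MAC count drops in proportion to activation sparsity; in a pipelined hardware datapath, that saved work translates directly into higher effective throughput per DSP.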
Source Journal

IEEE Transactions on Computers (Engineering Technology - Electrical & Electronic Engineering)
CiteScore: 6.60
Self-citation rate: 5.40%
Annual publications: 199
Review time: 6.0 months
Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.