FPGA-based accelerator for YOLOv5 object detection with optimized computation and data access for edge deployment
Authors: Wei Qian, Zhengwei Zhu, Chenyang Zhu, Yanping Zhu
Journal: Parallel Computing, volume 124, Article 103138 (published 2025-05-05)
DOI: 10.1016/j.parco.2025.103138
URL: https://www.sciencedirect.com/science/article/pii/S0167819125000146
Impact factor: 2.0; JCR: Q2 (Computer Science, Theory & Methods)
Abstract
Convolutional neural networks have advanced object detection substantially, but their high computational and data access demands complicate deployment on edge devices. Field-programmable gate arrays (FPGAs) are well suited to the parallel computations inherent in convolutional neural networks, owing to their low power consumption and rapid response. We have developed an FPGA-based accelerator for the You Only Look Once version 5 (YOLOv5) object detection network, implemented in Verilog Hardware Description Language on the Xilinx XCZU15EG chip. The accelerator processes the convolutional layers, batch normalization fusion layers, and tensor addition operations of the YOLOv5 network. Our architecture separates the convolution computation into two computing units: multiplication and addition. The addition operations are accelerated by compressor adders and ternary adder trees. Off-chip bandwidth pressure is alleviated through dual-input single-output buffers and dedicated data access units. Experimental results show that the accelerator consumes 13.021 watts at an operating frequency of 200 megahertz and that it outperforms Amazon Web Services Graviton2 central processing units and Jetson Nano graphics processing units. Ablation experiments validate the contribution of each design. Overall, our approach boosts the inference speed of the YOLOv5 network, with improvements of 61.88%, 69.1%, 59.36%, 64.07%, and 65.92%, surpassing existing methods.
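The abstract does not give the circuit details of the compressor adders or the ternary adder trees, so the following is only a minimal software sketch of the general idea, not the authors' Verilog design. It assumes a standard 3:2 compressor (carry-save adder) and a tree that sums three operands per node; the names `compress_3_2`, `ternary_add`, and `ternary_adder_tree` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def compress_3_2(a: int, b: int, c: int) -> Tuple[int, int]:
    """Generic 3:2 compressor (carry-save adder), modeled on integers.

    Returns (sum_bits, carry_bits) with a + b + c == sum_bits + carry_bits.
    No carry propagates between bit positions inside this step.
    """
    sum_bits = a ^ b ^ c                              # per-bit sum, carries ignored
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted left
    return sum_bits, carry_bits


def ternary_add(a: int, b: int, c: int) -> int:
    """Add three operands: compress to two words, then one ordinary addition."""
    sum_bits, carry_bits = compress_3_2(a, b, c)
    return sum_bits + carry_bits


def ternary_adder_tree(values: List[int]) -> Tuple[int, int]:
    """Reduce the inputs three at a time.

    Returns (total, levels), where levels is the number of tree stages used.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 3):
            # Pad the last group with zeros when fewer than three operands remain.
            a, b, c = (level[i:i + 3] + [0, 0])[:3]
            next_level.append(ternary_add(a, b, c))
        level = next_level
        levels += 1
    return (level[0] if level else 0), levels


if __name__ == "__main__":
    # Nine partial products, e.g. from one 3x3 kernel window (illustrative values).
    products = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    total, levels = ternary_adder_tree(products)
    print(total, levels)
```

In this sketch, nine partial products are reduced in two tree levels, whereas a tree of two-input adders would need four. Fewer levels is the usual motivation for three-input adder trees, though the actual gain on the XCZU15EG depends on implementation details that the abstract does not state.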
About the journal:
Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems.
Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results.
Particular technical areas of interest include, but are not limited to:
-System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
-Enabling software including debuggers, performance tools, and system and numeric libraries.
-General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
-Software engineering and productivity as it relates to parallel computing.
-Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
-Performance measurement results on state-of-the-art systems.
-Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures.
-Parallel I/O systems both hardware and software.
-Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications.