Resource-aware strategies for real-time multi-person pose estimation

Impact factor 4.2 | Zone 3, Computer Science | JCR Q2, Computer Science, Artificial Intelligence

Image and Vision Computing | Pub Date: 2025-02-07 | DOI: 10.1016/j.imavis.2025.105441

Mohammed A. Esmail, Jinlei Wang, Yihao Wang, Li Sun, Guoliang Zhu, Guohe Zhang

Volume 155, Article 105441. Full text: https://www.sciencedirect.com/science/article/pii/S0262885625000290

Citations: 0

Abstract

Deep learning applications for human pose estimation (HPE) must balance accuracy and efficiency, especially on devices with limited resources. Common deep-learning architectures tend to consume a large amount of processing power while yielding low accuracy. To address these issues, this work proposes Efficient YoloPose, a new architecture based on You Only Look Once version 8 (YOLOv8)-Pose. Efficient YoloPose replaces traditional convolution and C2f (a faster implementation of the Cross Stage Partial Bottleneck) with lightweight methods: Depthwise Convolution, Ghost Convolution, and the C3Ghost module. This substitution greatly reduces inference time, parameter count, and computational complexity. To improve pose estimation further, Efficient YoloPose integrates the Squeeze-and-Excitation (SE) attention mechanism into the network, which focuses it on the significant areas of an image. Experimental results show that the proposed model outperforms current models on the COCO and OCHuman datasets. Compared with YOLOv8-Pose, it lowers inference time from 1.1 milliseconds (ms) to 0.9 ms, computational complexity from 9.2 giga floating-point operations (GFLOPs) to 4.8 GFLOPs, and parameter count from 3.3 million to 1.3 million, while maintaining an average precision (AP) score of 78.8 on the COCO dataset. The source code for Efficient YoloPose is publicly available at https://github.com/malareeqi/Efficient-YoloPose.
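To make the efficiency argument concrete, the sketch below counts the weights of a single convolutional layer under three designs and computes the relative savings implied by the figures quoted in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, bias and normalization parameters are ignored, the depthwise variant is assumed to be followed by a 1x1 pointwise convolution, and the Ghost variant assumes the common formulation in which a fraction `1/ratio` of the output channels comes from an ordinary convolution and the rest from cheap per-channel filters of size `cheap_k`.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Assumed design: k x k depthwise conv, then 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def ghost_conv_params(c_in, c_out, k, ratio=2, cheap_k=5):
    """Assumed Ghost design: 1/ratio of the outputs from a normal conv,
    the remaining outputs from cheap per-channel cheap_k x cheap_k filters.
    ratio and cheap_k are illustrative defaults, not values from the paper."""
    primary_channels = c_out // ratio
    cheap_channels = c_out - primary_channels
    primary = c_in * primary_channels * k * k
    cheap = cheap_channels * cheap_k * cheap_k
    return primary + cheap


def relative_reduction(before, after):
    """Fraction saved when going from `before` to `after`."""
    return (before - after) / before


# Figures quoted in the abstract: (YOLOv8-Pose, Efficient YoloPose).
REPORTED = {
    "inference time (ms)": (1.1, 0.9),
    "complexity (GFLOPs)": (9.2, 4.8),
    "parameters (millions)": (3.3, 1.3),
}


def main():
    c_in, c_out, k = 64, 128, 3   # hypothetical layer shape
    print("standard conv weights:       ", standard_conv_params(c_in, c_out, k))
    print("depthwise-separable weights: ", depthwise_separable_params(c_in, c_out, k))
    print("ghost conv weights:          ", ghost_conv_params(c_in, c_out, k))
    for name, (before, after) in REPORTED.items():
        saved = 100 * relative_reduction(before, after)
        print(f"{name}: {before} -> {after} ({saved:.1f}% lower)")


if __name__ == "__main__":
    main()
```

For the hypothetical 3x3 layer with 64 input and 128 output channels, the three designs need 73,728, 8,768 and 38,464 weights respectively. Applied to the abstract's figures, the same arithmetic gives reductions of roughly 18% in inference time, 48% in GFLOPs and 61% in parameter count.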




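Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months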

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.