Mobile-friendly and multi-feature aggregation via transformer for human pose estimation

IF 4.2 · Region 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Image and Vision Computing Pub Date : 2024-11-20 DOI:10.1016/j.imavis.2024.105343

Biao Li, Shoufeng Tang, Wenyi Li

Image and Vision Computing, volume 153, Article 105343. Journal article; not open access. URL: https://www.sciencedirect.com/science/article/pii/S0262885624004487

Citations: 0

Abstract

Human pose estimation is pivotal for human-centric visual tasks, yet deploying such models on mobile devices remains challenging due to high parameter counts and computational demands. In this paper, we study Mobile-Friendly and Multi-Feature Aggregation architectural designs for human pose estimation and propose a novel model called MobileMultiPose. Specifically, a lightweight aggregation method that incorporates multi-scale and multi-feature information mitigates redundant shallow semantic extraction and local deep semantic constraints. To efficiently aggregate diverse local and global features, we design a lightweight transformer module built on a self-attention mechanism with linear complexity, which achieves deep fusion of shallow and deep semantics. Furthermore, a multi-scale loss supervision method is incorporated into training to enhance model performance, facilitating the effective fusion of edge information across scales. Extensive experiments show that the smallest variant of MobileMultiPose outperforms the lightweight models MobileNetv2, ShuffleNetv2, and Small HRNet by 0.7, 5.4, and 10.1 points, respectively, on the COCO validation set, with fewer parameters and FLOPs. The largest MobileMultiPose variant achieves an AP of 72.4 on the COCO test-dev set; notably, its parameter count and FLOPs are only 16% and 18% of those of HRNet-W32, and 7% and 9% of those of DARK, respectively. We aim to offer novel insights into designing lightweight and efficient feature extraction networks that support mobile-friendly model deployment.
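
The abstract mentions a self-attention mechanism with linear complexity but does not say which formulation MobileMultiPose uses. As an illustration only, the sketch below shows one well-known way to obtain linear cost: kernelized attention with the feature map elu(x) + 1, where the key-value summary is computed once and reused for every query. The function names (`feature_map`, `linear_attention`, `quadratic_attention`) are hypothetical, and this is not claimed to be the paper's module.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice, not from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention whose cost grows linearly with the number of tokens."""
    d = len(keys[0])
    d_v = len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * d                     # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d))
        out.append(
            [sum(fq[a] * kv[a][b] for a in range(d)) / norm for b in range(d_v)]
        )
    return out


def quadratic_attention(queries, keys, values):
    """Reference version: same kernel, but every query visits every key."""
    mapped_keys = [feature_map(k) for k in keys]
    d_v = len(values[0])
    out = []
    for q in queries:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, fk)) for fk in mapped_keys]
        total = sum(weights)
        out.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total
             for b in range(d_v)]
        )
    return out
```

Both functions return the same values; the linear version simply reorders the products so that the per-query work no longer depends on the number of keys, which is the property that makes such attention attractive on mobile hardware.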

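Source journal

Image and Vision Computing (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore : 8.50
Self-citation rate : 8.50%
Articles published : 143
Review time : 7.8 months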

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.