{"title":"Mobile-friendly and multi-feature aggregation via transformer for human pose estimation","authors":"Biao Li , Shoufeng Tang , Wenyi Li","doi":"10.1016/j.imavis.2024.105343","DOIUrl":null,"url":null,"abstract":"<div><div>Human pose estimation is pivotal for human-centric visual tasks, yet deploying such models on mobile devices remains challenging due to high parameter counts and computational demands. In this paper, we study Mobile-Friendly and Multi-Feature Aggregation architectural designs for human pose estimation and propose a novel model called MobileMultiPose. Specifically, a lightweight aggregation method, incorporating multi-scale and multi-feature, mitigates redundant shallow semantic extraction and local deep semantic constraints. To efficiently aggregate diverse local and global features, a lightweight transformer module, constructed from a self-attention mechanism with linear complexity, is designed, achieving deep fusion of shallow and deep semantics. Furthermore, a multi-scale loss supervision method is incorporated into the training process to enhance model performance, facilitating the effective fusion of edge information across various scales. Extensive experiments show that the smallest variant of MobileMultiPose outperforms lightweight models (MobileNetv2, ShuffleNetv2, and Small HRNet) by 0.7, 5.4, and 10.1 points, respectively, on the COCO validation set, with fewer parameters and FLOPs. In particular, the largest MobileMultiPose variant achieves an impressive AP score of 72.4 on the COCO test-dev set, notably, its parameters and FLOPs are only 16% and 18% of HRNet-W32, and 7% and 9% of DARK, respectively. We aim to offer novel insights into designing lightweight and efficient feature extraction networks, supporting mobile-friendly model deployment.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"153 ","pages":"Article 105343"},"PeriodicalIF":4.2000,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004487","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Human pose estimation is pivotal for human-centric visual tasks, yet deploying such models on mobile devices remains challenging due to high parameter counts and computational demands. In this paper, we study mobile-friendly, multi-feature-aggregation architectural designs for human pose estimation and propose a novel model called MobileMultiPose. Specifically, a lightweight aggregation method that combines multi-scale and multi-feature information mitigates redundant shallow-semantic extraction and the locality constraints on deep semantics. To efficiently aggregate diverse local and global features, we design a lightweight transformer module built on a self-attention mechanism with linear complexity, achieving deep fusion of shallow and deep semantics. Furthermore, a multi-scale loss-supervision method is incorporated into the training process to enhance model performance, facilitating the effective fusion of edge information across scales. Extensive experiments show that the smallest MobileMultiPose variant outperforms lightweight models (MobileNetv2, ShuffleNetv2, and Small HRNet) by 0.7, 5.4, and 10.1 AP points, respectively, on the COCO validation set, with fewer parameters and FLOPs. Notably, the largest MobileMultiPose variant achieves an AP of 72.4 on the COCO test-dev set while requiring only 16% of the parameters and 18% of the FLOPs of HRNet-W32, and 7% and 9% of those of DARK, respectively. We aim to offer novel insights into the design of lightweight, efficient feature-extraction networks that support mobile-friendly model deployment.
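The abstract names two concrete techniques, a linear-complexity self-attention module and multi-scale loss supervision, without implementation detail. The PyTorch sketch below is a hypothetical illustration of both under assumed design choices (an ELU+1 feature map to linearize attention, and MSE heatmap supervision at each scale); `LinearSelfAttention`, `multi_scale_heatmap_loss`, and all parameter names are illustrative and are not the paper's actual MobileMultiPose code.

```python
# Minimal, hypothetical sketch of the two mechanisms the abstract describes.
# None of these names come from the paper; the real module may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


def elu_feature_map(x: torch.Tensor) -> torch.Tensor:
    # Positive feature map (ELU + 1) commonly used to linearize attention.
    return F.elu(x) + 1.0


class LinearSelfAttention(nn.Module):
    """Self-attention with O(N) complexity in the token count N.

    Instead of materializing the N x N attention matrix, keys are
    contracted with values first: softmax(QK^T)V is approximated by
    phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1).
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. flattened feature-map pixels.
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (b, h, n, d/h).
        q, k, v = (t.view(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        q, k = elu_feature_map(q), elu_feature_map(k)
        # Contract keys with values first: (b, h, d/h, d/h) -- linear in n.
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        # Normalizer: phi(Q) dotted with the keys summed over tokens.
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


def multi_scale_heatmap_loss(preds, target, weights=None):
    """Hypothetical multi-scale supervision: MSE summed over heatmaps
    predicted at several resolutions, each compared against the target
    resized to match (loosely mirroring the abstract's description)."""
    total = 0.0
    for i, p in enumerate(preds):  # preds: list of (b, joints, Hi, Wi)
        t = F.interpolate(target, size=p.shape[-2:],
                          mode="bilinear", align_corners=False)
        w = 1.0 if weights is None else weights[i]
        total = total + w * F.mse_loss(p, t)
    return total
```

The point of the contraction order is that multiplying phi(K)^T by V first produces a small (d/h x d/h) matrix, so the cost grows linearly with the number of tokens rather than quadratically, which is what makes transformer-style global aggregation plausible on mobile-scale feature maps.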
Journal overview:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image-interpretation and computer-vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.