LMFormer: Lightweight and multi-feature perspective via transformer for human pose estimation

IF 5.5; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neurocomputing, Pub Date: 2024-05-17, DOI: 10.1016/j.neucom.2024.127884

Biao Li, Shoufeng Tang, Wenyi Li

Neurocomputing, Volume 594, Article 127884. https://www.sciencedirect.com/science/article/pii/S0925231224006556

Citations: 0

Abstract

Token Mixers are well established as effective in visual tasks, but their high computational complexity and their reliance on a single perspective for modeling spatial relationships remain challenges. In this study, we propose LMFormer, a hybrid CNN-Transformer model for human pose estimation. First, we design a lightweight multi-feature perspective Token Mixer that uses a lightweight feature reconstruction strategy to aggregate spatial and channel feature information simultaneously, improving the model's performance and generalization. Next, we explore multi-scale information interaction by developing an iterative multi-feature weighting module and by incorporating a multi-scale information propagation mechanism into the skip connections. Finally, we validate the network on benchmark datasets, including COCO, MPII, and CrowdPose, using a multi-scale deep supervision strategy. Extensive experiments show that LMFormer captures multi-scale features comprehensively at reduced computational complexity, yielding significant performance improvements. Specifically, LMFormer-B achieves an AP of 65.8 on the COCO val set, surpassing MobileNetV2 and ShuffleNetV2 by 1.0 and 5.6 points, respectively. Moreover, its parameter count is only 19.8% and 25% of that of MobileNetV2 and ShuffleNetV2, respectively, and its GFLOPs are 43.8% and 50% of theirs. We aim to provide new insights into lightweight, efficient feature extraction strategies and efficient Token Mixer designs.
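The abstract does not specify how the Token Mixer combines spatial and channel information, so the following is only a minimal sketch of the general idea, not the authors' design. It assumes two swapped-in, well-known stand-ins: 3x3 average pooling as a parameter-free spatial mixer, and a per-channel sigmoid gate computed from the channel's global average (a simplification of squeeze-and-excitation without its learned layers). The function names `spatial_mix`, `channel_gate`, and `token_mixer` are hypothetical.

```python
import math


def spatial_mix(x):
    """Assumed spatial mixer: 3x3 average pooling per channel, stride 1.

    x is a feature map stored as nested lists with shape [C][H][W].
    Border pixels average over the neighbours that actually exist.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                vals = [
                    x[c][a][b]
                    for a in range(max(0, i - 1), min(H, i + 2))
                    for b in range(max(0, j - 1), min(W, j + 2))
                ]
                out[c][i][j] = sum(vals) / len(vals)
    return out


def channel_gate(x):
    """Assumed channel weighting: one sigmoid gate per channel,
    computed from that channel's global average (no learned weights)."""
    gates = []
    for channel in x:
        n = len(channel) * len(channel[0])
        mean = sum(sum(row) for row in channel) / n
        gates.append(1.0 / (1.0 + math.exp(-mean)))
    return gates


def token_mixer(x):
    """Residual block: y = x + gate[c] * spatial_mix(x)[c]."""
    mixed = spatial_mix(x)
    gates = channel_gate(x)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [
        [[x[c][i][j] + gates[c] * mixed[c][i][j] for j in range(W)]
         for i in range(H)]
        for c in range(C)
    ]


# Usage: a 2-channel 3x3 feature map; the output keeps the input shape.
features = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
]
mixed_features = token_mixer(features)
```

Both stand-ins have no learned parameters, which is why they illustrate the "lightweight" motivation; a trained model would normally replace them with learned layers.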




Source journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.