H2OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers.

Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe
{"title":"H<sub>2</sub>OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers.","authors":"Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe","doi":"10.1109/TPAMI.2025.3608284","DOIUrl":null,"url":null,"abstract":"<p><p>Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H<sub>2</sub>OT), for efficient transformer-based 3D human pose estimation from videos. H<sub>2</sub>OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H<sub>2</sub>OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2025.3608284","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H2OT), for efficient transformer-based 3D human pose estimation from videos. H2OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H2OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
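Below is a minimal PyTorch sketch of the pruning-and-recovering idea described in the abstract. The module names mirror TPM and TRM, but the learned scoring head, top-k frame selection, and cross-attention recovery are illustrative assumptions, not the authors' implementation (see https://github.com/NationalGAILab/HoT for the released code).

```python
import torch
import torch.nn as nn


class TokenPruningModule(nn.Module):
    """Keep a few representative per-frame pose tokens (assumed: top-k by a learned score)."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # hypothetical per-token saliency head
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, T, C) pose tokens, one per frame
        B, T, C = tokens.shape
        k = max(1, int(T * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)                  # (B, T)
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep temporal order
        kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
        return kept, idx                                         # (B, k, C), (B, k)


class TokenRecoveringModule(nn.Module):
    """Expand the pruned tokens back to full temporal resolution via cross-attention
    from learnable full-length queries (one plausible recovery strategy)."""

    def __init__(self, dim: int, full_len: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, full_len, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, kept: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(kept.size(0), -1, -1)  # (B, T, C) queries
        recovered, _ = self.attn(q, kept, kept)        # attend to the few kept tokens
        return recovered                               # (B, T, C)


if __name__ == "__main__":
    B, T, C = 2, 243, 256          # e.g., a 243-frame sequence of 256-d pose tokens
    x = torch.randn(B, T, C)
    tpm = TokenPruningModule(C, keep_ratio=0.1)
    trm = TokenRecoveringModule(C, full_len=T)
    kept, idx = tpm(x)             # intermediate blocks would run on `kept` only
    y = trm(kept)                  # (B, 243, C) recovered for seq2seq output
    print(kept.shape, y.shape)
```

In a video pose transformer, a pruning step like the TPM above would sit after the early blocks and a recovery step like the TRM before the regression head, so that only the few kept tokens pass through the intermediate transformer blocks; in a seq2frame pipeline the recovery step could be skipped when only the center frame is regressed.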
