DBMHT: A double-branch multi-hypothesis transformer for 3D human pose estimation in video
Xuezhi Xiang, Xiaoheng Li, Weijie Bao, Yulong Qiao, Abdulmotaleb El Saddik
Computer Vision and Image Understanding, published 2024-09-06
DOI: 10.1016/j.cviu.2024.104147
URL: https://www.sciencedirect.com/science/article/pii/S1077314224002285
Impact factor: 4.3 (JCR Q2, Computer Science, Artificial Intelligence)
Abstract
Estimating 3D human poses from monocular video is a significant challenge: existing methods struggle with depth ambiguity and self-occlusion. To address these problems, we propose a Double-Branch Multi-Hypothesis Transformer (DBMHT). Specifically, a double-branch architecture captures temporal and spatial information and generates multiple hypotheses, and a lightweight module then merges these hypotheses by integrating their spatial and temporal representations. DBMHT can thus capture spatial information from each joint of the human body and temporal information from each frame of the video, and it can also merge multiple hypotheses that carry different spatio-temporal information. Comprehensive evaluation on two challenging datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrates the superior performance of DBMHT, marking it as a robust and efficient approach for accurate 3D human pose estimation (HPE) in dynamic scenarios. With ground-truth 2D keypoints as input, our model outperforms the state-of-the-art approach by 1.9% in mean per-joint position error (MPJPE).
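For readers unfamiliar with the metric, MPJPE is the average Euclidean distance between predicted and ground-truth 3D joint positions, typically reported in millimetres. The sketch below is a minimal pure-Python illustration, not the authors' evaluation code. The function names and toy coordinates are hypothetical, and reading the 1.9% figure as a relative reduction is an assumption, since the abstract does not say how the percentage is computed.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred and target are lists of frames; each frame is a list of
    (x, y, z) joint positions. Returns the mean Euclidean distance
    over all joints in all frames.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance (Python 3.8+)
            count += 1
    return total / count


def relative_improvement(baseline_error, new_error):
    """Percentage reduction of new_error relative to baseline_error.

    Assumption: this is one plausible reading of "surpasses by 1.9% MPJPE".
    """
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy data (made up): one frame with two joints.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]]

error = mpjpe(pred, target)  # joint errors are 5.0 and 0.0
print(error)
```

Because lower MPJPE is better, an improvement under this reading means the new model's error is smaller than the baseline's by the stated percentage.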
About the journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems