Skeleton Cluster Tracking for robust multi-view multi-person 3D human pose estimation

IF 4.3, Zone 3 (Computer Science), Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Computer Vision and Image Understanding Pub Date : 2024-06-07 DOI:10.1016/j.cviu.2024.104059

Zehai Niu, Ke Lu, Jian Xue, Jinbao Wang

Citations: 0

Abstract

Multi-view 3D human pose estimation relies on 2D human pose estimation in each view; however, severe occlusion, truncation, and human interaction lead to incorrect 2D poses in some views. The traditional “Matching-Lifting-Tracking” paradigm amplifies an incorrect 2D human pose into an incorrect 3D human pose, which significantly undermines the robustness of multi-view 3D human pose estimation. In this paper, we propose a novel method that tackles the inherent difficulties of the traditional paradigm. The method is built on the newly devised “Skeleton Pooling-Clustering-Tracking (SPCT)” paradigm. It first performs 2D human pose estimation for each view. A symmetrical dilated network then estimates a skeleton pool. After clustering the skeleton pool, we apply a tracking method designed explicitly for the SPCT paradigm. The tracking method refines and filters the skeleton clusters, thereby enhancing the robustness of the multi-person 3D human pose estimation results. By coupling the skeleton pool with the tracking refinement process, our method obtains high-quality multi-person 3D human pose estimation results despite severe occlusions that produce erroneous 2D and 3D estimates. With the proposed SPCT paradigm and a computationally efficient network architecture, our method outperformed existing approaches in robustness on the Shelf, 4D Association, and CMU Panoptic datasets, and could be applied in practical scenarios such as markerless motion capture and animation production.
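
The abstract does not say how the skeleton pool is clustered or how clusters are filtered, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the pool is an array of candidate 3D skeletons and swaps in a simple greedy threshold clustering followed by a per-joint median. The function name `cluster_skeleton_pool` and the parameters `dist_thresh` and `min_size` are hypothetical and do not come from the paper.

```python
import numpy as np


def cluster_skeleton_pool(pool, dist_thresh=0.3, min_size=2):
    """Greedy clustering of candidate 3D skeletons (illustrative only).

    pool: array of shape (N, J, 3), N candidate skeletons with J joints each.
    Returns a list of fused skeletons, one (J, 3) array per kept cluster.
    """
    pool = np.asarray(pool, dtype=float)
    clusters = []  # each cluster is a list of indices into pool
    for i, skel in enumerate(pool):
        best, best_d = None, dist_thresh
        for c, members in enumerate(clusters):
            centroid = pool[members].mean(axis=0)
            # Mean per-joint Euclidean distance to the cluster centroid.
            d = np.linalg.norm(skel - centroid, axis=-1).mean()
            if d < best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([i])  # no cluster is close enough: start a new one
        else:
            clusters[best].append(i)
    # Assumed filtering rule: drop clusters with too few supporting candidates,
    # then fuse each remaining cluster with a per-joint median.
    return [np.median(pool[m], axis=0) for m in clusters if len(m) >= min_size]


# Toy pool: two people with three noisy candidates each, plus one spurious one.
rng = np.random.default_rng(0)
person_a = rng.normal(size=(15, 3))
person_b = person_a + np.array([2.0, 0.0, 0.0])
pool = np.stack(
    [person_a + rng.normal(scale=0.01, size=(15, 3)) for _ in range(3)]
    + [person_b + rng.normal(scale=0.01, size=(15, 3)) for _ in range(3)]
    + [person_a + np.array([0.0, 5.0, 0.0])]  # isolated, erroneous candidate
)
fused = cluster_skeleton_pool(pool)
print(len(fused))  # 2
```

In this toy setup the isolated candidate forms a cluster of size one and is discarded, which loosely mirrors the abstract's claim that filtering skeleton clusters suppresses erroneous estimates. A temporal tracking step, which the paper relies on for refinement, is deliberately left out here.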

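Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days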

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems