Playing for 3D Human Recovery

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-08-27. DOI: 10.1109/TPAMI.2024.3450537

Zhongang Cai; Mingyuan Zhang; Jiawei Ren; Chen Wei; Daxuan Ren; Zhengyu Lin; Haiyu Zhao; Lei Yang; Chen Change Loy; Ziwei Liu

Volume 46, Issue 12, pages 10533-10545. Full text: https://ieeexplore.ieee.org/document/10652891/

Citations: 0

Abstract

Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences with automatically annotated 3D ground truths by playing a video game. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to real data, which is typically collected indoors. Our investigation into the domain gap explains our simple yet useful data mixture strategies, offering new insights to the research community. Third, the scale of the dataset matters. The performance boost is closely related to the amount of additional data available. A systematic study of multiple key factors (such as camera angle and body pose) reveals that model performance is sensitive to data density. Fourth, the effectiveness of GTA-Human is also attributable to its rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work can pave the way for scaling up 3D human recovery to the real world.
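The abstract mentions data mixture strategies for combining real and synthetic data but does not describe them. As an illustration only, the sketch below assumes one common approach: drawing each training batch with a fixed fraction of synthetic samples. The function name `mixed_batch` and the parameter `synthetic_ratio` are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batch(
    real: Sequence[T],
    synthetic: Sequence[T],
    batch_size: int,
    synthetic_ratio: float,
    rng: random.Random,
) -> List[T]:
    """Draw one batch with a fixed share of synthetic samples.

    Assumption: fixed-ratio, with-replacement sampling. This is a generic
    mixing scheme, not necessarily the strategy used by the authors.
    """
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must lie in [0, 1]")

    # Number of synthetic samples in the batch; the rest come from real data.
    n_syn = round(batch_size * synthetic_ratio)
    n_real = batch_size - n_syn

    batch = [rng.choice(synthetic) for _ in range(n_syn)]
    batch += [rng.choice(real) for _ in range(n_real)]

    # Shuffle so that synthetic and real samples are interleaved.
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    real_data = [("real", i) for i in range(10)]
    synthetic_data = [("synthetic", i) for i in range(10)]
    print(mixed_batch(real_data, synthetic_data, 8, 0.25, rng))
```

Fixing the ratio per batch keeps the share of synthetic data constant throughout training, regardless of how much larger the synthetic set is than the real one.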


Homepage: https://caizhongang.github.io/projects/GTA-Human/
