PoseScript: 3D Human Poses from Natural Language

Venue: Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings
Pub Date: 2022-10-21
DOI: 10.48550/arXiv.2210.11795

Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, F. Moreno-Noguer, Grégory Rogez

Pages: 346-362

Citations: 16

Abstract

Natural language is leveraged in many computer vision tasks, such as image captioning, cross-modal retrieval or visual question answering, to provide fine-grained semantic information. While human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. In this work, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data-hungry learning algorithms, we propose an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information (the posecodes) using a set of simple but generic rules on the 3D keypoints. The posecodes are then combined into higher-level textual descriptions using syntactic rules. Automatic annotations substantially increase the amount of available data, and make it possible to effectively pretrain deep models for finetuning on human captions. To demonstrate the potential of annotated poses, we show applications of the PoseScript dataset to retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.
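The two-stage captioning idea in the abstract (rules on 3D keypoints yield posecodes, then syntactic rules turn posecodes into text) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the joint names, the y-up coordinate convention, the angle and height thresholds, and the sentence template are all hypothetical.

```python
import math

def angle_deg(a, b, c):
    """Angle at joint b, in degrees, between segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.sqrt(sum(p * p for p in v1)) * math.sqrt(sum(q * q for q in v2))
    if norm == 0.0:
        return 0.0  # degenerate case: two joints coincide
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def extract_posecodes(pose):
    """Stage 1: apply simple rules to 3D keypoints (hypothetical thresholds)."""
    codes = []
    # Angle rule: classify the left elbow as bent or straight.
    elbow = angle_deg(pose["left_shoulder"], pose["left_elbow"], pose["left_wrist"])
    if elbow < 105.0:
        codes.append(("left elbow", "bent"))
    elif elbow > 160.0:
        codes.append(("left elbow", "straight"))
    # Relative-position rule: compare heights, assuming y points up.
    if pose["left_wrist"][1] > pose["head"][1] + 0.05:
        codes.append(("left hand", "above the head"))
    return codes

def describe(codes):
    """Stage 2: combine posecodes into one sentence with a fixed template."""
    if not codes:
        return ""
    clauses = [f"the {part} is {state}" for part, state in codes]
    if len(clauses) <= 2:
        sentence = " and ".join(clauses)
    else:
        sentence = ", ".join(clauses[:-1]) + ", and " + clauses[-1]
    return sentence[0].upper() + sentence[1:] + "."

# Hypothetical keypoints in metres: left arm raised straight up.
raised_arm = {
    "head": (0.0, 1.7, 0.0),
    "left_shoulder": (0.2, 1.4, 0.0),
    "left_elbow": (0.2, 1.7, 0.0),
    "left_wrist": (0.2, 2.0, 0.0),
}

print(describe(extract_posecodes(raised_arm)))
```

For the raised-arm keypoints, the sketch prints "The left elbow is straight and the left hand is above the head." A real pipeline of this kind would cover many more joints and relations and would vary the phrasing; this example only shows how threshold rules and templates fit together.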

