A Tool for Extracting 3D Avatar-Ready Gesture Animations from Monocular Videos
Andrew W. Feng, Samuel Shin, Youngwoo Yoon
Proceedings of the 15th ACM SIGGRAPH Conference on Motion, Interaction and Games, 3 November 2022
DOI: 10.1145/3561975.3562953
Modeling and generating realistic human gesture animation from speech audio is key to creating believable virtual humans that can interact with human users and mimic real-world face-to-face communication. Large-scale datasets are essential for data-driven research, but creating multi-modal gesture datasets with 3D gesture motions and corresponding speech audio is difficult: traditional workflows such as motion capture are expensive, while pose estimation on in-the-wild videos yields subpar results. Because of these limitations, existing gesture datasets suffer from either short duration or low animation quality, making them less than ideal for training gesture synthesis models. Motivated by the key limitations of previous datasets and by recent progress in human mesh recovery (HMR), we developed a tool for extracting avatar-ready gesture motions from monocular videos with improved animation quality. The tool uses a variational autoencoder (VAE) to refine raw gesture motions. The resulting gestures are expressed in a unified pose representation that includes both body and finger motions and can be readily applied to a virtual avatar via online motion retargeting. We validated the proposed tool on existing datasets and created a refined dataset, TED-SMPLX, by re-processing videos from the original TED dataset. The new dataset is available at https://andrewfengusa.github.io/TED_SMPLX_Dataset.
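The core refinement idea described above can be sketched as follows. This is a minimal illustration of passing a noisy pose estimate through a VAE bottleneck, not the authors' implementation: the dimensions, weight shapes, and the `refine_pose` function name are assumptions, and the randomly initialized weights stand in for a trained encoder/decoder. Projecting a pose through a low-dimensional latent space and decoding it back is what lets a trained VAE suppress estimation noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper): a pose frame is a
# flattened vector of joint rotations; the latent space is much smaller.
POSE_DIM = 165   # e.g. 55 joints x 3 axis-angle components
LATENT_DIM = 32

# Randomly initialized linear layers stand in for a trained VAE.
W_enc = rng.normal(0, 0.01, (POSE_DIM, 2 * LATENT_DIM))  # outputs [mu, logvar]
W_dec = rng.normal(0, 0.01, (LATENT_DIM, POSE_DIM))

def refine_pose(raw_pose: np.ndarray) -> np.ndarray:
    """Pass a raw (noisy) pose estimate through the VAE bottleneck.

    Encoding to a low-dimensional latent and decoding back projects the
    pose onto the learned motion manifold, which is what removes
    estimation jitter once the VAE has been trained on clean motion.
    """
    h = raw_pose @ W_enc
    mu, logvar = h[:LATENT_DIM], h[LATENT_DIM:]
    # Reparameterization trick: z = mu + sigma * eps
    eps = rng.normal(size=LATENT_DIM)
    z = mu + np.exp(0.5 * logvar) * eps
    return z @ W_dec

raw_pose = rng.normal(size=POSE_DIM)   # stand-in for an HMR pose estimate
refined = refine_pose(raw_pose)
print(refined.shape)                   # (165,)
```

In practice the refinement would operate on whole motion windows rather than single frames, and the decoded pose would then be retargeted onto the avatar skeleton; this sketch only shows the encode-sample-decode structure of the VAE step.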