Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao, Zhenyu He
{"title":"Audio2Gestures:从音频生成不同的手势","authors":"Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao, Zhenyu He","doi":"10.48550/arXiv.2301.06690","DOIUrl":null,"url":null,"abstract":"People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. So we propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated to the audio while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including relaxed motion loss, bicycle constraint, and diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, quantitatively and qualitatively. Besides, our formulation is compatible with discrete cosine transformation (DCT) modeling and other popular backbones (i.e. RNN, Transformer). As for motion losses and quantitative motion evaluation, we find structured losses/metrics (e.g. STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g. PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline.","PeriodicalId":13376,"journal":{"name":"IEEE Transactions on Visualization and Computer Graphics","volume":" ","pages":""},"PeriodicalIF":4.7000,"publicationDate":"2023-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Audio2Gestures: Generating Diverse Gestures from Audio\",\"authors\":\"Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao, Zhenyu He\",\"doi\":\"10.48550/arXiv.2301.06690\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. So we propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated to the audio while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. 
Several crucial training losses/strategies, including relaxed motion loss, bicycle constraint, and diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, quantitatively and qualitatively. Besides, our formulation is compatible with discrete cosine transformation (DCT) modeling and other popular backbones (i.e. RNN, Transformer). As for motion losses and quantitative motion evaluation, we find structured losses/metrics (e.g. STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g. PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline.\",\"PeriodicalId\":13376,\"journal\":{\"name\":\"IEEE Transactions on Visualization and Computer Graphics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2023-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Visualization and Computer Graphics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2301.06690\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Visualization and Computer Graphics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.48550/arXiv.2301.06690","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Audio2Gestures: Generating Diverse Gestures from Audio
People may perform diverse gestures, affected by various mental and physical factors, when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, which easily results in plain, boring motions at inference time. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts introduces extra training difficulties. Several crucial training losses/strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, both quantitatively and qualitatively. In addition, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (i.e., RNN and Transformer). Regarding motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT-based) that consider temporal and/or spatial context complement the most commonly used point-wise losses/metrics (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can readily be used to generate motion sequences with user-specified motion clips on the timeline.
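The latent-space split described above can be made concrete with a minimal sketch. The snippet below shows one way to wire an audio-driven VAE whose latent code is divided into a shared (audio-correlated) part and a motion-specific (audio-independent) part; all module names, layer choices, and dimensions are illustrative assumptions and not the authors' released implementation.

```python
# Minimal PyTorch-style sketch of the shared / motion-specific latent split.
# All names, layer sizes, and the sampling scheme are illustrative assumptions.
import torch
import torch.nn as nn


class SplitLatentVAE(nn.Module):
    def __init__(self, audio_dim=128, motion_dim=64, shared_dim=32, specific_dim=32):
        super().__init__()
        # Audio encoder -> shared code (the audio-correlated motion component).
        self.audio_enc = nn.GRU(audio_dim, shared_dim, batch_first=True)
        # Motion encoder -> mean / log-variance of the motion-specific code
        # (the diverse, audio-independent component).
        self.motion_enc = nn.GRU(motion_dim, 2 * specific_dim, batch_first=True)
        # Decoder maps the concatenated codes back to a motion sequence.
        self.decoder = nn.GRU(shared_dim + specific_dim, motion_dim, batch_first=True)
        self.specific_dim = specific_dim

    def forward(self, audio, motion=None):
        shared, _ = self.audio_enc(audio)  # (B, T, shared_dim)
        if motion is not None:
            # Training: infer the motion-specific code from the target motion.
            stats, _ = self.motion_enc(motion)
            mu, logvar = stats.chunk(2, dim=-1)
            specific = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        else:
            # Inference: sample the motion-specific code from the prior,
            # so the same audio can yield diverse gestures.
            B, T, _ = shared.shape
            specific = torch.randn(B, T, self.specific_dim, device=shared.device)
            mu = logvar = None
        recon, _ = self.decoder(torch.cat([shared, specific], dim=-1))
        return recon, mu, logvar
```

At inference time, sampling different motion-specific codes for the same audio clip is what yields the diversity described above; the paper's relaxed motion loss, bicycle constraint, and diversity loss would be applied on top of the standard VAE reconstruction and KL terms.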
Journal description:
TVCG is a scholarly, archival journal published monthly. Its Editorial Board strives to publish important research results and state-of-the-art seminal papers in computer graphics, visualization, and virtual reality. Specific topics include, but are not limited to: rendering technologies; geometric modeling and processing; shape analysis; graphics hardware; animation and simulation; perception, interaction and user interfaces; haptics; computational photography; high-dynamic range imaging and display; user studies and evaluation; biomedical visualization; volume visualization and graphics; visual analytics for machine learning; topology-based visualization; visual programming and software visualization; visualization in data science; virtual reality, augmented reality and mixed reality; advanced display technology (e.g., 3D, immersive, and multi-modal displays); applications of computer graphics and visualization.