S2M-Net: Speech Driven Three-party Conversational Motion Synthesis Networks

Proceedings of the 15th ACM SIGGRAPH Conference on Motion, Interaction and Games. Pub Date: 2022-11-03. DOI: 10.1145/3561975.3562954

Aobo Jin, Qixin Deng, Zhiwei Deng


Citations: 1

Abstract

In this paper, we propose a novel conditional generative adversarial network (cGAN) architecture, called S2M-Net, to holistically synthesize realistic three-party conversational animations from acoustic speech input together with speaker marking (i.e., the speaking time of each interlocutor). We design and train S2M-Net on a pre-collected three-party conversational motion dataset. In the architecture, the generator contains an LSTM encoder that encodes a sequence of acoustic speech features into a latent vector, which a transform unit then maps into a gesture kinematics space. The output of this transform unit is fed into an LSTM decoder to generate the corresponding three-party conversational gesture kinematics. Meanwhile, a discriminator checks whether an input sequence of three-party conversational gesture kinematics is real or fake. To evaluate our method, in addition to quantitative and qualitative evaluations, we conducted paired-comparison user studies to compare it with the state of the art.
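The data flow described in the abstract (LSTM encoder, transform unit, LSTM decoder, plus a discriminator over motion sequences) can be illustrated with a minimal, untrained sketch in pure Python. It is not the authors' implementation. Everything the abstract does not state is an assumption made for illustration: the feature and pose dimensions, the bias-free LSTM cell, the transform unit being a single linear layer with tanh, the decoder receiving the transformed latent vector at every time step, the speaker marking being concatenated to the acoustic features as a per-frame binary vector, and all class and variable names.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LSTM:
    """Single-layer, bias-free LSTM with random (untrained) weights."""
    def __init__(self, rng, in_dim, hid_dim):
        self.hid_dim = hid_dim
        # One matrix per gate (input, forget, candidate, output) acting on [x; h].
        self.W = {g: rand_matrix(rng, hid_dim, in_dim + hid_dim) for g in "ifgo"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation of input and previous hidden state
        i = [sigmoid(a) for a in matvec(self.W["i"], xh)]
        f = [sigmoid(a) for a in matvec(self.W["f"], xh)]
        g = [math.tanh(a) for a in matvec(self.W["g"], xh)]
        o = [sigmoid(a) for a in matvec(self.W["o"], xh)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def run(self, seq):
        # Return the hidden state after every time step.
        h, c = [0.0] * self.hid_dim, [0.0] * self.hid_dim
        outputs = []
        for x in seq:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs

class Generator:
    """Encoder -> transform unit -> decoder (layer shapes are assumptions)."""
    def __init__(self, rng, in_dim, latent_dim, pose_dim):
        self.encoder = LSTM(rng, in_dim, latent_dim)
        self.transform = rand_matrix(rng, latent_dim, latent_dim)
        self.decoder = LSTM(rng, latent_dim, latent_dim)
        self.readout = rand_matrix(rng, pose_dim, latent_dim)

    def __call__(self, speech_seq):
        latent = self.encoder.run(speech_seq)[-1]  # final state summarises the speech
        z = [math.tanh(a) for a in matvec(self.transform, latent)]
        hidden = self.decoder.run([z] * len(speech_seq))  # assumed: z fed at each step
        return [matvec(self.readout, h) for h in hidden]

class Discriminator:
    """Scores a motion sequence with a value in (0, 1); higher means 'real'."""
    def __init__(self, rng, pose_dim, hid_dim):
        self.lstm = LSTM(rng, pose_dim, hid_dim)
        self.score = rand_matrix(rng, 1, hid_dim)

    def __call__(self, motion_seq):
        last = self.lstm.run(motion_seq)[-1]
        return sigmoid(matvec(self.score, last)[0])

# Toy input: T frames of acoustic features plus a 3-way speaker marking.
rng = random.Random(0)
T, ACOUSTIC_DIM, N_SPEAKERS, POSE_DIM = 8, 4, 3, 6  # illustrative sizes only
frames = []
for t in range(T):
    acoustic = [rng.uniform(-1, 1) for _ in range(ACOUSTIC_DIM)]
    speaking = [1.0 if t % N_SPEAKERS == k else 0.0 for k in range(N_SPEAKERS)]
    frames.append(acoustic + speaking)

generator = Generator(rng, ACOUSTIC_DIM + N_SPEAKERS, latent_dim=5,
                      pose_dim=N_SPEAKERS * POSE_DIM)
discriminator = Discriminator(rng, N_SPEAKERS * POSE_DIM, hid_dim=5)
motion = generator(frames)          # T frames, one pose vector per interlocutor
realness = discriminator(motion)    # untrained, so the score is uninformative
```

With these toy sizes, the generator returns one 18-dimensional frame per input frame (three interlocutors times six pose values). The sketch covers only the forward pass; the adversarial training loop, the real feature extraction, and the motion representation are outside what the abstract specifies.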



