Dual-Path Transformer-Based GAN for Co-speech Gesture Synthesis

IF 3.8 | Zone 2, Computer Science | JCR Q2, ROBOTICS

International Journal of Social Robotics Pub Date : 2024-05-13 DOI:10.1007/s12369-024-01136-y

Xinyuan Qian, Hao Tang, Jichen Yang, Hongxu Zhu, Xu-Cheng Yin


Citations: 0

Abstract


Co-speech gestures play a significant role in conveying information. For social agents, producing realistic and smooth gestures is crucial for natural interaction with humans, but it is a challenging task that depends on many factors (e.g., speech audio, content, and the interacting person). In this paper, we tackle the cross-modal fusion problem through a novel fusion mechanism for end-to-end learning-based co-speech gesture generation. In particular, we employ parallel directional cross-modal transformers and an interactive, cascaded 2D attention module to achieve selective fusion of gesture-related cues. In addition, we propose new metrics to evaluate gesture diversity and speech-gesture correspondence without requiring 3D pose annotations. Experiments on a public dataset indicate that the proposed method successfully produces diverse, human-like poses and outperforms other competitive state-of-the-art methods in both objective and subjective evaluations.


Source journal

International Journal of Social Robotics (ROBOTICS)

CiteScore

9.80

Self-citation rate

8.50%

Articles published

About the journal: Social Robotics is the study of robots that are able to interact and communicate among themselves, with humans, and with the environment, within the social and cultural structure attached to their role. The journal covers a broad spectrum of topics related to the latest technologies, new research results and developments in the area of social robotics on all levels, from developments in core enabling technologies to system integration, aesthetic design, applications and social implications. It provides a platform for like-minded researchers to present their findings and latest developments in social robotics, covering relevant advances in engineering, computing, arts and social sciences. The journal publishes original, peer-reviewed articles and contributions on innovative ideas and concepts, new discoveries and improvements, as well as novel applications, by leading researchers and developers regarding the latest fundamental advances in the core technologies that form the backbone of social robotics, distinguished developmental projects in the area, as well as seminal works in aesthetic design, ethics and philosophy, and studies on social impact and influence pertaining to social robotics.