{"title":"Speech-Driven Gesture Generation Using Transformer-Based Denoising Diffusion Probabilistic Models","authors":"Bowen Wu;Chaoran Liu;Carlos Toshinori Ishi;Hiroshi Ishiguro","doi":"10.1109/THMS.2024.3456085","DOIUrl":null,"url":null,"abstract":"While it is crucial for human-like avatars to perform co-speech gestures, existing approaches struggle to generate natural and realistic movements. In the present study, a novel transformer-based denoising diffusion model is proposed to generate co-speech gestures. Moreover, we introduce a practical sampling trick for diffusion models to maintain the continuity between the generated motion segments while improving the within-segment motion likelihood and naturalness. Our model can be used for online generation since it generates gestures for a short segment of speech, e.g., 2 s. We evaluate our model on two large-scale speech-gesture datasets with finger movements using objective measurements and a user study, showing that our model outperforms all other baselines. Our user study is based on the Metahuman platform in the Unreal Engine, a popular tool for creating human-like avatars and motions.","PeriodicalId":48916,"journal":{"name":"IEEE Transactions on Human-Machine Systems","volume":"54 6","pages":"733-742"},"PeriodicalIF":3.5000,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10712170","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Human-Machine Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10712170/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
While it is crucial for human-like avatars to perform co-speech gestures, existing approaches struggle to generate natural and realistic movements. In the present study, a novel transformer-based denoising diffusion model is proposed to generate co-speech gestures. Moreover, we introduce a practical sampling trick for diffusion models to maintain the continuity between the generated motion segments while improving the within-segment motion likelihood and naturalness. Our model can be used for online generation since it generates gestures for a short segment of speech, e.g., 2 s. We evaluate our model on two large-scale speech-gesture datasets with finger movements using objective measurements and a user study, showing that our model outperforms all other baselines. Our user study is based on the Metahuman platform in the Unreal Engine, a popular tool for creating human-like avatars and motions.
期刊介绍:
The scope of the IEEE Transactions on Human-Machine Systems includes the fields of human machine systems. It covers human systems and human organizational interactions including cognitive ergonomics, system test and evaluation, and human information processing concerns in systems and organizations.