Generating Diverse Gestures from Speech Using Memory Networks as Dynamic Dictionaries
Zeyu Zhao, Nan Gao, Zhi Zeng, Shuwu Zhang
2022 International Conference on Culture-Oriented Science and Technology (CoST), August 2022
DOI: 10.1109/CoST57098.2022.00042
Citations: 1
Abstract
People naturally enhance their speech with body motion and gestures. Generating human gestures for digital humans or virtual avatars from speech audio or text remains challenging due to its nondeterministic nature. We observe that existing neural methods often produce gestures with an inadequate amount of movement, which makes them appear slow or dull. We therefore propose a novel generative model coupled with memory networks that act as dynamic dictionaries, generating gestures with improved diversity. Under the hood, a dictionary network dynamically stores previously seen pose features, keyed by the corresponding text features, for the generator to look up, while a pose generation network takes in audio and pose features and outputs the resulting gesture sequences. Seed poses are used during generation to guarantee continuity between consecutive speech segments. We also propose a new objective metric for evaluating the diversity of generated gestures and demonstrate that the proposed model generates gestures with improved diversity.
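The dynamic-dictionary idea described above can be sketched as a key-value memory: text features serve as keys, previously seen pose features as values, and a lookup returns a similarity-weighted readout for the generator. The class below is a minimal illustrative sketch under that reading; all names, the cosine-similarity lookup, the softmax weighting, and the FIFO eviction policy are assumptions, not the paper's actual architecture.

```python
import numpy as np

class DynamicDictionary:
    """Illustrative key-value memory: text-feature keys map to pose-feature
    values; a read returns a similarity-weighted mix of stored pose features.
    (Hypothetical sketch -- the paper's real network is learned end to end
    and may store and address memory differently.)"""

    def __init__(self, key_dim, value_dim, capacity=512):
        self.keys = np.zeros((0, key_dim))
        self.values = np.zeros((0, value_dim))
        self.capacity = capacity

    def write(self, text_feat, pose_feat):
        # Append the newest (text, pose) pair; evict the oldest when full.
        self.keys = np.vstack([self.keys, text_feat])[-self.capacity:]
        self.values = np.vstack([self.values, pose_feat])[-self.capacity:]

    def read(self, text_feat, top_k=4):
        # Cosine similarity between the query text feature and stored keys.
        sims = self.keys @ text_feat / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(text_feat) + 1e-8)
        idx = np.argsort(sims)[-top_k:]          # indices of the best matches
        w = np.exp(sims[idx])
        w /= w.sum()                             # softmax over the top-k matches
        return w @ self.values[idx]              # weighted pose-feature readout
```

In this sketch the generator would call `read` with the current text feature to retrieve a pose-feature prior, then condition its output on that readout together with the audio features.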
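The abstract mentions a new objective diversity metric but does not define it here. A common baseline way to quantify diversity in a set of generated gesture sequences is the average pairwise L2 distance between them; the function below is that baseline only, an assumption for illustration rather than the paper's proposed metric.

```python
import numpy as np

def gesture_diversity(sequences):
    """Average pairwise L2 distance between flattened gesture sequences
    (each sequence: an array of shape [frames, joints]).
    Illustrative baseline only -- NOT the paper's proposed metric.
    Higher values indicate more varied motion across samples."""
    flat = np.stack([np.asarray(s, dtype=float).ravel() for s in sequences])
    n = len(flat)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.linalg.norm(flat[i] - flat[j])
    return total / (n * (n - 1) / 2)   # mean over all unordered pairs
```

Identical sequences score zero under this measure, matching the "slow or dull" failure mode the abstract attributes to low-diversity generators.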