DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework
Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma
arXiv:2408.00370 (arXiv - CS - Sound), published 2024-08-01
Abstract
Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly rely on Transformer-based architectures that require extensive memory and suffer from slow inference. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model that creates highly personalized 3D full-body gestures solely from raw speech audio using Mamba-based architectures. The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, built on a Mamba framework and a pre-trained WavLM model, autonomously derives implicit, continuous fuzzy features, which are then fused into a single latent feature. This feature is processed by the AdaLN Mamba-2, which applies a uniform conditioning mechanism across all tokens to robustly model the interplay between the fuzzy features and the resulting gesture sequence. This approach ensures high fidelity in gesture-speech synchronization while preserving the naturalness of the gestures. Employing a diffusion model for training and inference, our framework has undergone extensive subjective and objective evaluations on the ZEGGS and BEAT datasets. These assessments confirm the model's improved performance relative to contemporary state-of-the-art methods, achieving results competitive with the DiT-based architecture (Persona-Gestors) while reducing memory usage and accelerating inference.
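
To make the "uniform conditioning mechanism across all tokens" concrete, the sketch below illustrates AdaLN-style conditioning in PyTorch. It is an illustrative assumption rather than the authors' code: the `AdaLNBlock` class, its dimensions, and the small MLP standing in for the Mamba-2 sequence layer are hypothetical placeholders; only the idea of regressing per-channel scale/shift/gate parameters from a single fused condition vector shared by every token comes from the abstract.

```python
import torch
import torch.nn as nn


class AdaLNBlock(nn.Module):
    """Minimal AdaLN-conditioned block (sketch, not the paper's implementation).

    A single condition vector (e.g., the fused speech/fuzzy latent plus a
    diffusion-timestep embedding) regresses per-channel shift, scale, and gate
    parameters that modulate the normalized token sequence around a
    sequence-mixing layer. The paper uses a Mamba-2 layer as the mixer; a
    placeholder MLP is used here to keep the sketch self-contained.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Regress shift, scale, and gate from the condition vector.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * dim))
        # Placeholder sequence mixer standing in for a Mamba-2 layer.
        self.mixer = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, cond_dim), shared by all tokens.
        shift, scale, gate = self.ada(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.mixer(h)


# Toy usage: 120 gesture-frame tokens conditioned on one fused latent vector.
x = torch.randn(2, 120, 256)
cond = torch.randn(2, 512)
out = AdaLNBlock(dim=256, cond_dim=512)(x, cond)  # -> (2, 120, 256)
```

Because the same condition vector modulates every token, the conditioning cost is independent of sequence length, which is consistent with the abstract's emphasis on lower memory use and faster inference than Transformer-based alternatives.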