DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework
Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma
arXiv:2408.00370 (arXiv - CS - Sound), published 2024-08-01
Abstract
Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly rely on Transformer-based architectures that require extensive memory and suffer from slow inference. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model that creates highly personalized 3D full-body gestures solely from raw speech audio using Mamba-based architectures. The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, built on a Mamba framework and a pre-trained WavLM model, autonomously derives implicit, continuous fuzzy features, which are then fused into a single latent feature. This feature is processed by the AdaLN Mamba-2, which applies a uniform conditioning mechanism across all tokens to robustly model the interplay between the fuzzy features and the resulting gesture sequence. This approach ensures high fidelity in gesture-speech synchronization while preserving the naturalness of the gestures. Employing a diffusion model for training and inference, our framework has undergone extensive subjective and objective evaluations on the ZEGGS and BEAT datasets. These assessments confirm the model's improved performance relative to contemporary state-of-the-art methods, achieving results competitive with the DiT-based architecture (Persona-Gestors) while reducing memory usage and accelerating inference.
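
To make the "uniform conditioning mechanism across all tokens" concrete, the sketch below illustrates AdaLN-style conditioning in PyTorch. It is an illustrative assumption rather than the authors' code: the `AdaLNBlock` class, its dimensions, and the small MLP standing in for the Mamba-2 sequence layer are hypothetical placeholders; only the idea of regressing per-channel scale/shift/gate parameters from a single fused condition vector shared by every token comes from the abstract.

```python
import torch
import torch.nn as nn


class AdaLNBlock(nn.Module):
    """Minimal AdaLN-conditioned block (sketch, not the paper's implementation).

    A single condition vector (e.g., the fused speech/fuzzy latent plus a
    diffusion-timestep embedding) regresses per-channel shift, scale, and gate
    parameters that modulate the normalized token sequence around a
    sequence-mixing layer. The paper uses a Mamba-2 layer as the mixer; a
    placeholder MLP is used here to keep the sketch self-contained.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Regress shift, scale, and gate from the condition vector.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * dim))
        # Placeholder sequence mixer standing in for a Mamba-2 layer.
        self.mixer = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, cond_dim), shared by all tokens.
        shift, scale, gate = self.ada(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.mixer(h)


# Toy usage: 120 gesture-frame tokens conditioned on one fused latent vector.
x = torch.randn(2, 120, 256)
cond = torch.randn(2, 512)
out = AdaLNBlock(dim=256, cond_dim=512)(x, cond)  # -> (2, 120, 256)
```

Because the same condition vector modulates every token, the conditioning cost is independent of sequence length, which is consistent with the abstract's emphasis on lower memory use and faster inference than Transformer-based alternatives.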