Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

arXiv - CS - Graphics | Pub Date: 2024-09-03 | DOI: arxiv-2409.01591

Sohan Anisetty, James Hays


Abstract

We present a motion generation framework that produces whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. By using Vector Quantized Variational Autoencoders (VQ-VAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.
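The two ideas named in the abstract, discretizing motion with a vector-quantized codebook and training on masked tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the codebook values, the `MASK` id, the masking ratio, and the function names `quantize` and `mask_tokens` are all assumptions made for illustration, and the text/audio conditioning, the transformer, and the token critic are omitted entirely.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: dist2(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and the sorted list of masked positions,
    which a bidirectional model would be trained to predict.
    """
    n_mask = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


# Toy 2-D "motion frames" and a 3-entry codebook (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.1]]

tokens = quantize(frames, codebook)  # -> [0, 1, 2, 1]
masked, positions = mask_tokens(tokens, ratio=0.5, rng=random.Random(0))
```

In a real VQ-VAE the frames would first pass through a learned encoder and the codebook would be trained jointly; here the nearest-neighbour lookup alone shows how continuous motion becomes a sequence of discrete tokens that a masked model can fill in.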
