PMMTalk: Speech-Driven 3D Facial Animation From Complementary Pseudo Multi-Modal Features

Impact factor 8.4; Region 1, Computer Science; JCR Q1, Computer Science, Information Systems

IEEE Transactions on Multimedia Pub Date : 2024-12-25 DOI:10.1109/TMM.2024.3521701

Tianshun Han;Shengnan Gui;Yiqing Huang;Baihui Li;Lijian Liu;Benjia Zhou;Ning Jiang;Quan Lu;Ruicong Zhi;Yanyan Liang;Du Zhang;Jun Wan

Volume 27, pages 2570-2581. Publication type: Journal Article. Source: https://ieeexplore.ieee.org/document/10814703/

Citations: 0

Abstract

Speech-driven 3D facial animation has improved considerably in recent years, yet most existing methods rely solely on the acoustic modality and neglect visual and textual cues, which leads to unsatisfactory precision and coherence. We argue that visual and textual cues carry non-trivial information. We therefore present PMMTalk, a novel framework that uses complementary Pseudo Multi-Modal features to improve the accuracy of facial animation. The framework comprises three modules: the PMMTalk encoder, the cross-modal alignment module, and the PMMTalk decoder. The PMMTalk encoder uses an off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. The cross-modal alignment module then aligns the audio, image, and text features at the temporal and semantic levels. Finally, the PMMTalk decoder predicts lip-synced facial blendshape coefficients. Unlike prior methods, PMMTalk requires only an additional random reference face image, yet it yields more accurate results. It is also artist-friendly: because it outputs facial blendshape coefficients, it integrates seamlessly into standard animation production workflows. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. Code and datasets are available at PMMTalk.
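The data flow described in the abstract (three feature streams, alignment, then blendshape prediction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the linear-interpolation alignment, the concatenation-based fusion, and the single clamped linear layer standing in for the decoder are all assumptions made for illustration. Only the last function, which applies blendshape coefficients to a mesh as a weighted sum of per-shape offsets, reflects the standard linear blendshape model used in animation tools.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]


def resample_to_length(features: Sequence, target_len: int) -> Sequence:
    """Hypothetical temporal alignment: linearly interpolate a feature
    sequence so that it has exactly target_len frames."""
    if not features or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive length")
    n = len(features)
    if n == 1 or target_len == 1:
        return [list(features[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(features[lo], features[hi])])
    return out


def fuse_frames(streams: List[Sequence]) -> Sequence:
    """Hypothetical fusion: concatenate the aligned streams frame by frame."""
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("streams must be aligned to the same length")
    fused = []
    for t in range(length):
        frame: Vector = []
        for stream in streams:
            frame.extend(stream[t])
        fused.append(frame)
    return fused


def decode_frame(fused: Vector, weights: List[Vector]) -> Vector:
    """Placeholder decoder (assumption): one linear layer whose outputs
    are clamped to [0, 1], the usual range of blendshape coefficients."""
    return [min(1.0, max(0.0, sum(w * x for w, x in zip(row, fused))))
            for row in weights]


def apply_blendshapes(neutral: Vector, deltas: List[Vector],
                      coeffs: Vector) -> Vector:
    """Linear blendshape model: neutral mesh plus weighted shape offsets."""
    out = list(neutral)
    for c, delta in zip(coeffs, deltas):
        for i, d in enumerate(delta):
            out[i] += c * d
    return out


if __name__ == "__main__":
    audio = [[0.0], [1.0], [2.0]]   # 3 audio feature frames (toy values)
    visual = [[0.2], [0.4]]         # 2 pseudo-visual frames (toy values)
    text = [[1.0]]                  # 1 text embedding (toy value)
    aligned = [resample_to_length(s, 5) for s in (audio, visual, text)]
    fused = fuse_frames(aligned)
    weights = [[0.25, 0.0, 0.0], [0.0, 1.0, 0.0]]  # made-up decoder weights
    for frame in fused:
        coeffs = decode_frame(frame, weights)
        print(apply_blendshapes([0.0, 0.0, 0.0],
                                [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]], coeffs))
```

Because the output is a vector of per-frame coefficients rather than raw vertex positions, the same coefficients can drive any rig that exposes matching blendshapes, which is the workflow advantage the abstract points to.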




Source journal

IEEE Transactions on Multimedia (Engineering and Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors.