MusicFace: Music-driven expressive singing face synthesis

IF 18.3 | Zone 3, Computer Science | JCR Q1, Computer Science, Software Engineering

Computational Visual Media | Pub Date: 2023-11-30 | DOI: 10.1007/s41095-023-0343-7

Pengfei Liu, Wenjin Deng, Hengda Li, Jintai Wang, Yinglin Zheng, Yiwei Ding, Xiaohu Guo, Ming Zeng


Citations: 0

Abstract

It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task with natural motions for the lips, facial expression, head pose, and eyes. Due to the coupling of mixed information for the human voice and backing music in common music audio signals, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Due to the implicit and complicated correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation in terms of speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state-of-the-art.
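The abstract states that head movement is decomposed into speed and direction but does not say how either quantity is defined. As a minimal sketch only, the Python below assumes that speed is the magnitude of the frame-to-frame change in a (yaw, pitch, roll) pose vector and that direction is the corresponding unit vector. The function names `decompose_head_motion` and `recompose_head_motion` and the sample poses are hypothetical and are not taken from the paper.

```python
import math


def decompose_head_motion(poses):
    """Split frame-to-frame pose changes into speed and direction.

    Assumption (not from the paper): speed is the Euclidean norm of the
    pose delta, and direction is the delta divided by that norm.
    """
    speeds, directions = [], []
    for prev, curr in zip(poses, poses[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        speed = math.sqrt(sum(d * d for d in delta))
        if speed > 0.0:
            direction = [d / speed for d in delta]
        else:
            # A still head has no meaningful direction; use a zero vector.
            direction = [0.0] * len(delta)
        speeds.append(speed)
        directions.append(direction)
    return speeds, directions


def recompose_head_motion(start, speeds, directions):
    """Rebuild the pose sequence by accumulating speed * direction."""
    poses = [list(start)]
    for speed, direction in zip(speeds, directions):
        poses.append([p + speed * d for p, d in zip(poses[-1], direction)])
    return poses


# Hypothetical (yaw, pitch, roll) poses in degrees, one per frame.
poses = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [3.0, 4.0, 0.0], [3.0, 4.0, 2.0]]
speeds, directions = decompose_head_motion(poses)
rebuilt = recompose_head_motion(poses[0], speeds, directions)
```

Under these assumptions the two signals can be modeled separately and then multiplied back together, so a model could, for example, vary how fast the head moves without changing where it is heading. Whether the method in the paper uses this exact parameterization is not stated in the abstract.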




Source journal

Computational Visual Media (Computer Science: Computer Graphics and Computer-Aided Design)

CiteScore: 16.90
Self-citation rate: 5.80%
Articles published: 243
Review time: 6 weeks

About the journal: Computational Visual Media is a peer-reviewed open access journal. It publishes original high-quality research papers and significant review articles on novel ideas, methods, and systems relevant to visual media. Computational Visual Media publishes articles that focus on, but are not limited to, the following areas:
• Editing and composition of visual media
• Geometric computing for images and video
• Geometry modeling and processing
• Machine learning for visual media
• Physically based animation
• Realistic rendering
• Recognition and understanding of visual media
• Visual computing for robotics
• Visualization and visual analytics
Other interdisciplinary research into visual media that combines aspects of computer graphics, computer vision, image and video processing, geometric computing, and machine learning is also within the journal's scope. This is an open access journal, published quarterly by Tsinghua University Press and Springer. The open access fees (article-processing charges) are fully sponsored by Tsinghua University, China. Authors can publish in the journal without any additional charges.