Say Anything with Any Style

ArXiv Pub Date: 2024-03-11 DOI: 10.1609/aaai.v38i5.28314

Shuai Tan, Bin Ji, Yu Ding, Ye Pan


Citations: 2

Abstract

Generating stylized talking heads with diverse head motions is crucial for natural-looking videos, yet it remains challenging. Previous works either adopt a regressive method to capture the speaking style, which yields a coarse style averaged across all training data, or employ a universal network to synthesize videos with different styles, which leads to suboptimal performance. To address these issues, we propose a novel dynamic-weight method, Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, together with the generative model, improves precision and robustness when extracting the speaking styles of the given style clips. Using the extracted style, a residual architecture comprising a canonical branch and a style-specific branch predicts the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we avoid a universal network and instead design an elaborate HyperStyle that produces style-specific weight offsets for the style branch. Furthermore, we construct a pose generator and a pose codebook that stores the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in both lip synchronization and stylized expression. We also extend SAAS to video-driven style editing and achieve satisfactory performance there as well.
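The two core ideas in the abstract, querying a discrete codebook and adding style-dependent weight offsets to a residual branch, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`quantize`, `hyper_offset`, `styled_mouth`), the purely linear layers, and the tiny dimensions are all assumptions made for illustration, since the abstract does not specify the network architectures.

```python
def quantize(z, codebook):
    """Return (index, entry) of the codebook entry nearest to z
    under squared Euclidean distance (the usual VQ lookup)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, codebook[idx]


def hyper_offset(style, proj):
    """Hypothetical stand-in for a hypernetwork: each weight offset is a
    dot product between the style code and a learned vector proj[r][c]."""
    return [[sum(p * s for p, s in zip(cell, style)) for cell in row]
            for row in proj]


def linear(weights, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def styled_mouth(audio_feat, style, w_canon, w_style, proj):
    """Residual prediction: canonical branch output plus the output of a
    style branch whose weights are shifted by a style-dependent offset."""
    delta = hyper_offset(style, proj)
    w_dyn = [[w + d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(w_style, delta)]
    canon = linear(w_canon, audio_feat)
    resid = linear(w_dyn, audio_feat)
    return [c + r for c, r in zip(canon, resid)]


# Toy values, chosen only to make the arithmetic easy to follow.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
w_canon = [[1.0, 0.0], [0.0, 1.0]]
w_style = [[0.0, 0.0], [0.0, 0.0]]
proj = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]]]

idx, style = quantize([0.9, 1.2], codebook)
mouth = styled_mouth([1.0, 2.0], style, w_canon, w_style, proj)
```

In this toy setup a continuous style feature is snapped to its nearest codebook entry, and that discrete code then changes the style branch's weights rather than being fed in as an extra input. With an all-zero style code the offsets vanish and the output reduces to the canonical branch alone, which mirrors the residual design described in the abstract.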


