Multimodal attention for lip synthesis using conditional generative adversarial networks

IF 2.4 | Region 3, Computer Science | JCR Q2, ACOUSTICS

Speech Communication Pub Date : 2023-09-01 DOI:10.1016/j.specom.2023.102959

Andrea Vidal, Carlos Busso

Speech Communication, Volume 153, Article 102959. Full text: https://www.sciencedirect.com/science/article/pii/S0167639323000936

Citations: 0

Abstract


The synthesis of lip movements is an important problem for a socially interactive agent (SIA). The generated lip movements must be synchronized with speech and exhibit realistic co-articulation. We hypothesize that combining lexical information (i.e., the sequence of phonemes) with acoustic features can lead not only to models that generate correct lip movements matching the articulatory movements, but also to trajectories that are well synchronized with speech emphasis and emotional content. This work presents attention-based frameworks that use acoustic and lexical information to enhance the synthesis of lip movements. The lexical information is obtained from automatic speech recognition (ASR) transcriptions, which broadens the range of applications of the proposed solution. We propose models based on conditional generative adversarial networks (CGAN) with self-modality and cross-modality attention mechanisms. These models allow us to understand which frames receive more weight in the generation of lip movements. We animate the synthesized lip movements using blendshapes. These animations are used to compare our proposed multimodal models with alternative methods, including unimodal models implemented with either text or acoustic features. We rely on subjective metrics from perceptual evaluations and an objective metric based on the LipSync model. The results show that our proposed models with attention mechanisms are preferred over the baselines in perceived naturalness. Adding cross-modality and self-modality attention has a significant positive impact on the performance of the generated sequences. We observe that lexical information remains valuable even when the transcriptions are not perfect. The improved performance of the multimodal system confirms that the speech and text modalities provide complementary information.
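The abstract does not give the architecture in detail, so the following is only a minimal sketch of what self-modality and cross-modality attention can look like. It uses generic scaled dot-product attention without learned projections, which is an assumption made for illustration and not the authors' implementation. The feature sizes, the random inputs, and the names `attention`, `acoustic`, and `lexical` are all hypothetical, and the phoneme features are assumed to be already aligned to the acoustic frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Generic scaled dot-product attention. Returns (context, weights)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (T_query, T_key)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights

# Hypothetical inputs: 6 frames, 8-dimensional features per modality.
rng = np.random.default_rng(0)
T, D = 6, 8
acoustic = rng.normal(size=(T, D))  # stand-in for acoustic features
lexical = rng.normal(size=(T, D))   # stand-in for phoneme embeddings

# Self-modality attention: acoustic frames attend to acoustic frames.
self_ctx, self_w = attention(acoustic, acoustic, acoustic)

# Cross-modality attention: acoustic frames attend to lexical frames.
cross_ctx, cross_w = attention(acoustic, lexical, lexical)

# One possible conditioning vector for a generator: concatenate both contexts.
fused = np.concatenate([self_ctx, cross_ctx], axis=-1)  # (T, 2 * D)

# For each acoustic frame, the lexical frame with the largest weight.
most_attended = cross_w.argmax(axis=-1)
```

Inspecting `self_w` and `cross_w` is the kind of analysis the abstract refers to when it says the models show which frames are considered more during generation. In a trained model, the queries, keys, and values would come from learned projections and `fused` would condition the CGAN generator.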


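Source journal

Speech Communication | Engineering and Technology: Computer Science, Interdisciplinary Applications

CiteScore

6.80

Self-citation rate

6.20%

Articles published

Review time

19.2 weeks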

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.