Lip to Speech Synthesis Based on Speaker Characteristics Feature Fusion

Rui Zeng, Shengwu Xiong
Proceedings of the 4th International Conference on Information Technology and Computer Communications. Published 2022-06-23. DOI: 10.1145/3548636.3548648
Citations: 1

Abstract

Lip to speech synthesis (Lip2Speech) is a technology that reconstructs speech from silent talking-face video. With the development of deep learning, notable progress has been made in this field. Because silent talking-face video does not contain speaker-characteristic information, reconstructing speech directly from it loses the speaker's characteristics and thus reduces the quality of the reconstructed speech. In this paper we propose a new framework that uses a pre-trained speaker encoder network to extract speaker-characteristic information. More specifically: (1) the pretrained speaker encoder network generates a fixed-dimensional embedding vector, which carries the speaker-characteristic information, from a few seconds of a given speaker's speech; (2) the content encoder uses a stack of 3D convolutions to extract the content information of the video; (3) a sequence-to-sequence synthesis network based on Tacotron2 generates a Mel-spectrogram from the silent video, conditioned on the speaker's identity embedding. Experimental results show that using the pretrained speaker encoder improves the speech reconstruction quality.
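The abstract does not specify how the fixed-dimensional speaker embedding is combined with the frame-level content features before the Tacotron2-style decoder. A common conditioning pattern, shown here as a minimal NumPy sketch with hypothetical dimensions (the frame count and feature sizes are assumptions, not values from the paper), is to repeat the utterance-level embedding across time and concatenate it with each content frame:

```python
import numpy as np

# Hypothetical shapes: T video frames of content features (dim 512)
# from the 3D-conv content encoder, and one fixed-dimensional speaker
# embedding (dim 256) from the pretrained speaker encoder.
T, D_CONTENT, D_SPK = 75, 512, 256

rng = np.random.default_rng(0)
content_features = rng.standard_normal((T, D_CONTENT))  # per-frame content
speaker_embedding = rng.standard_normal(D_SPK)          # one vector per speaker

# Broadcast the utterance-level speaker embedding across time and
# concatenate it with each content frame, giving the decoder both
# "what is being said" and "who is saying it" at every step.
conditioned = np.concatenate(
    [content_features, np.tile(speaker_embedding, (T, 1))], axis=1
)
print(conditioned.shape)  # (75, 768)
```

This time-broadcast concatenation is one plausible reading of "conditioned on the speaker's identity embedding"; other designs (e.g. adding the embedding to the decoder's initial state) are equally possible.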