{"title":"基于说话人特征融合的唇语合成","authors":"Rui Zeng, Shengwu Xiong","doi":"10.1145/3548636.3548648","DOIUrl":null,"url":null,"abstract":"Lip to speech synthesis (Lip2Speech) is a technology that reconstructs speech from the silent talking face video. With the development of deep learning, achievements have been made in this field. Due to the silent talking face video does not contain the speaker characteristics information, reconstructing speech directly from the silent talking video will lose the characteristic information of the speaker, thus reducing the quality of the reconstructed speech. In this paper we proposed a new framework using the pre-trained speaker encoder network which extract the speaker characteristics information. More specially: (1) The pretrained speaker encoder network generates a fixed-dimensional embedding vector from a few seconds of given speaker's speech, which contains the speaker characteristics information, (2) The content encoder uses a stack of 3D convolutions to extracts the content information of the video, (3) a sequence-to-sequence synthesis network based on Tacotron2 that generates Mel-spectrogram from silent video, conditioned on the speaker's identity embedding. Experimental results show that, using the pretrained speaker encoder can improved the speech reconstruction quality.","PeriodicalId":384376,"journal":{"name":"Proceedings of the 4th International Conference on Information Technology and Computer Communications","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Lip to Speech Synthesis Based on Speaker Characteristics Feature Fusion\",\"authors\":\"Rui Zeng, Shengwu Xiong\",\"doi\":\"10.1145/3548636.3548648\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lip to speech synthesis (Lip2Speech) is a technology that reconstructs speech from the silent talking face video. With the development of deep learning, achievements have been made in this field. Due to the silent talking face video does not contain the speaker characteristics information, reconstructing speech directly from the silent talking video will lose the characteristic information of the speaker, thus reducing the quality of the reconstructed speech. In this paper we proposed a new framework using the pre-trained speaker encoder network which extract the speaker characteristics information. More specially: (1) The pretrained speaker encoder network generates a fixed-dimensional embedding vector from a few seconds of given speaker's speech, which contains the speaker characteristics information, (2) The content encoder uses a stack of 3D convolutions to extracts the content information of the video, (3) a sequence-to-sequence synthesis network based on Tacotron2 that generates Mel-spectrogram from silent video, conditioned on the speaker's identity embedding. 
Experimental results show that, using the pretrained speaker encoder can improved the speech reconstruction quality.\",\"PeriodicalId\":384376,\"journal\":{\"name\":\"Proceedings of the 4th International Conference on Information Technology and Computer Communications\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th International Conference on Information Technology and Computer Communications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3548636.3548648\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Conference on Information Technology and Computer Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3548636.3548648","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Lip-to-speech synthesis (Lip2Speech) is a technology that reconstructs speech from silent talking-face video. With the development of deep learning, notable progress has been made in this field. Because silent talking-face video does not contain speaker characteristic information, speech reconstructed directly from the silent video loses the speaker's characteristics, which reduces the quality of the reconstructed speech. In this paper we propose a new framework that uses a pretrained speaker encoder network to extract speaker characteristic information. More specifically: (1) the pretrained speaker encoder network generates a fixed-dimensional embedding vector, containing the speaker characteristic information, from a few seconds of the given speaker's speech; (2) the content encoder uses a stack of 3D convolutions to extract the content information of the video; (3) a sequence-to-sequence synthesis network based on Tacotron2 generates a mel-spectrogram from the silent video, conditioned on the speaker identity embedding. Experimental results show that using the pretrained speaker encoder improves the speech reconstruction quality.
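The abstract does not include code, so the following is only a minimal PyTorch sketch of the fusion idea it describes: a 3D-convolutional content encoder over the silent video, a fixed-dimensional speaker embedding assumed to come from a pretrained speaker encoder, and a sequence-to-sequence decoder that predicts mel-spectrogram frames conditioned on that embedding. The class names, layer sizes, and the GRU decoder (standing in for the paper's Tacotron2-based synthesizer) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of speaker-conditioned Lip2Speech (not the paper's code).
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Extracts per-frame content features from a silent talking-face clip
    using a stack of 3D convolutions. Input video is (B, 3, T, H, W)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.ReLU(),
        )

    def forward(self, video):                 # video: (B, 3, T, H, W)
        feats = self.conv(video)              # (B, D, T, H', W'); time axis preserved
        feats = feats.mean(dim=(3, 4))        # spatial average pooling -> (B, D, T)
        return feats.transpose(1, 2)          # (B, T, D)


class SpeakerConditionedSynthesizer(nn.Module):
    """Toy sequence-to-sequence synthesizer: the speaker embedding is broadcast
    and concatenated with every content frame, then decoded to mel frames
    (80 mel bins per video frame here, for simplicity)."""

    def __init__(self, content_dim=256, speaker_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + speaker_dim, 512, batch_first=True)
        self.mel_proj = nn.Linear(512, n_mels)

    def forward(self, content, speaker_emb):  # content: (B, T, Dc), speaker_emb: (B, Ds)
        spk = speaker_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, spk], dim=-1) # fuse speaker identity with lip content
        out, _ = self.rnn(x)
        return self.mel_proj(out)             # (B, T, n_mels)


if __name__ == "__main__":
    video = torch.randn(2, 3, 25, 96, 96)     # 25 silent frames per clip (dummy data)
    speaker_emb = torch.randn(2, 256)         # assumed output of a pretrained speaker encoder
    content = ContentEncoder()(video)
    mel = SpeakerConditionedSynthesizer()(content, speaker_emb)
    print(mel.shape)                          # torch.Size([2, 25, 80])
```

In this sketch the speaker embedding is simply concatenated with each content frame before decoding; the paper's framework conditions a Tacotron2-style synthesis network on the identity embedding in an analogous way, and the embedding itself comes from a speaker encoder pretrained on a few seconds of reference speech.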