{"title":"基于说话人特征融合的唇语合成","authors":"Rui Zeng, Shengwu Xiong","doi":"10.1145/3548636.3548648","DOIUrl":null,"url":null,"abstract":"Lip to speech synthesis (Lip2Speech) is a technology that reconstructs speech from the silent talking face video. With the development of deep learning, achievements have been made in this field. Due to the silent talking face video does not contain the speaker characteristics information, reconstructing speech directly from the silent talking video will lose the characteristic information of the speaker, thus reducing the quality of the reconstructed speech. In this paper we proposed a new framework using the pre-trained speaker encoder network which extract the speaker characteristics information. More specially: (1) The pretrained speaker encoder network generates a fixed-dimensional embedding vector from a few seconds of given speaker's speech, which contains the speaker characteristics information, (2) The content encoder uses a stack of 3D convolutions to extracts the content information of the video, (3) a sequence-to-sequence synthesis network based on Tacotron2 that generates Mel-spectrogram from silent video, conditioned on the speaker's identity embedding. Experimental results show that, using the pretrained speaker encoder can improved the speech reconstruction quality.","PeriodicalId":384376,"journal":{"name":"Proceedings of the 4th International Conference on Information Technology and Computer Communications","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Lip to Speech Synthesis Based on Speaker Characteristics Feature Fusion\",\"authors\":\"Rui Zeng, Shengwu Xiong\",\"doi\":\"10.1145/3548636.3548648\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Lip to speech synthesis (Lip2Speech) is a technology that reconstructs speech from the silent talking face video. With the development of deep learning, achievements have been made in this field. Due to the silent talking face video does not contain the speaker characteristics information, reconstructing speech directly from the silent talking video will lose the characteristic information of the speaker, thus reducing the quality of the reconstructed speech. In this paper we proposed a new framework using the pre-trained speaker encoder network which extract the speaker characteristics information. More specially: (1) The pretrained speaker encoder network generates a fixed-dimensional embedding vector from a few seconds of given speaker's speech, which contains the speaker characteristics information, (2) The content encoder uses a stack of 3D convolutions to extracts the content information of the video, (3) a sequence-to-sequence synthesis network based on Tacotron2 that generates Mel-spectrogram from silent video, conditioned on the speaker's identity embedding. 
Experimental results show that, using the pretrained speaker encoder can improved the speech reconstruction quality.\",\"PeriodicalId\":384376,\"journal\":{\"name\":\"Proceedings of the 4th International Conference on Information Technology and Computer Communications\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th International Conference on Information Technology and Computer Communications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3548636.3548648\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Conference on Information Technology and Computer Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3548636.3548648","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Lip-to-speech synthesis (Lip2Speech) is a technology that reconstructs speech from silent talking-face video. With the development of deep learning, notable progress has been made in this field. Because silent talking-face video does not contain speaker characteristic information, speech reconstructed directly from the silent video loses the speaker's characteristics, which reduces the quality of the reconstructed speech. In this paper we propose a new framework that uses a pretrained speaker encoder network to extract speaker characteristic information. More specifically: (1) the pretrained speaker encoder network generates a fixed-dimensional embedding vector, containing the speaker characteristic information, from a few seconds of the given speaker's speech; (2) the content encoder uses a stack of 3D convolutions to extract the content information of the video; (3) a sequence-to-sequence synthesis network based on Tacotron2 generates a mel-spectrogram from the silent video, conditioned on the speaker identity embedding. Experimental results show that using the pretrained speaker encoder improves the speech reconstruction quality.
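The abstract does not include code, so the following is only a minimal PyTorch sketch of the fusion idea it describes: a 3D-convolutional content encoder over the silent video, a fixed-dimensional speaker embedding assumed to come from a pretrained speaker encoder, and a sequence-to-sequence decoder that predicts mel-spectrogram frames conditioned on that embedding. The class names, layer sizes, and the GRU decoder (standing in for the paper's Tacotron2-based synthesizer) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of speaker-conditioned Lip2Speech (not the paper's code).
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Extracts per-frame content features from a silent talking-face clip
    using a stack of 3D convolutions. Input video is (B, 3, T, H, W)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.ReLU(),
        )

    def forward(self, video):                 # video: (B, 3, T, H, W)
        feats = self.conv(video)              # (B, D, T, H', W'); time axis preserved
        feats = feats.mean(dim=(3, 4))        # spatial average pooling -> (B, D, T)
        return feats.transpose(1, 2)          # (B, T, D)


class SpeakerConditionedSynthesizer(nn.Module):
    """Toy sequence-to-sequence synthesizer: the speaker embedding is broadcast
    and concatenated with every content frame, then decoded to mel frames
    (80 mel bins per video frame here, for simplicity)."""

    def __init__(self, content_dim=256, speaker_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + speaker_dim, 512, batch_first=True)
        self.mel_proj = nn.Linear(512, n_mels)

    def forward(self, content, speaker_emb):  # content: (B, T, Dc), speaker_emb: (B, Ds)
        spk = speaker_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, spk], dim=-1) # fuse speaker identity with lip content
        out, _ = self.rnn(x)
        return self.mel_proj(out)             # (B, T, n_mels)


if __name__ == "__main__":
    video = torch.randn(2, 3, 25, 96, 96)     # 25 silent frames per clip (dummy data)
    speaker_emb = torch.randn(2, 256)         # assumed output of a pretrained speaker encoder
    content = ContentEncoder()(video)
    mel = SpeakerConditionedSynthesizer()(content, speaker_emb)
    print(mel.shape)                          # torch.Size([2, 25, 80])
```

In this sketch the speaker embedding is simply concatenated with each content frame before decoding; the paper's framework conditions a Tacotron2-style synthesis network on the identity embedding in an analogous way, and the embedding itself comes from a speaker encoder pretrained on a few seconds of reference speech.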