{"title":"Voicifier-LN: An Novel Approach to Elevate the Speaker Similarity for General Zero-shot Multi-Speaker TTS","authors":"Dengfeng Ke, Liangjie Huang, Wenhan Yao, Ruixin Hu, Xueyin Zu, Yanlu Xie, Jinsong Zhang","doi":"10.1145/3573942.3574120","DOIUrl":null,"url":null,"abstract":"Speeches generated from neural network-based Text-to-Speech (TTS) have been becoming more natural and intelligible. However, the evident dropping performance still exists when synthesizing multi-speaker speeches in zero-shot manner, especially for those from different countries with different accents. To bridge this gap, we propose a novel method, called Voicifier. It firstly operates on high frequency mel-spectrogram bins to approximately remove the content and rhythm. Then Voicifier uses two strategies, from the shallow to the deep mixing, to further destroy the content and rhythm but retain the timbre. Furthermore, for better zero-shot performance, we propose Voice-Pin Layer Normalization (VPLN) which pins down the timbre according with the text feature. During inference, the model is allowed to synthesize high quality and similarity speeches with just around 1 sec target speech audio. Experiments and ablation studies prove that the methods are able to retain more target timbre while abandoning much more of the content and rhythm-related information. To our best knowledge, the methods are found to be universal that is to say it can be applied to most of the existing TTS systems to enhance the ability of cross-speaker synthesis.","PeriodicalId":103293,"journal":{"name":"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573942.3574120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Speech generated by neural network-based Text-to-Speech (TTS) has become increasingly natural and intelligible. However, performance still drops noticeably when synthesizing multi-speaker speech in a zero-shot manner, especially for speakers from different countries with different accents. To bridge this gap, we propose a novel method called Voicifier. It first operates on high-frequency mel-spectrogram bins to approximately remove content and rhythm. Voicifier then uses two mixing strategies, from shallow to deep, to further destroy content and rhythm while retaining timbre. Furthermore, for better zero-shot performance, we propose Voice-Pin Layer Normalization (VPLN), which pins down the timbre according to the text feature. During inference, the model can synthesize speech of high quality and high speaker similarity from only about 1 second of target speech audio. Experiments and ablation studies show that these methods retain more of the target timbre while discarding far more of the content- and rhythm-related information. To the best of our knowledge, the methods are universal; that is, they can be applied to most existing TTS systems to enhance cross-speaker synthesis.
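The abstract describes two mechanisms: keeping only high-frequency mel-spectrogram bins to suppress content and rhythm, and a speaker-conditioned normalization (VPLN) that injects timbre into the text features. The paper page provides no code, so the following is a minimal PyTorch sketch under assumptions: the `keep_ratio` cutoff, all names (`high_freq_bins`, `VoicePinLayerNorm`, `to_gamma`, `to_beta`), and the reading of VPLN as a conditional layer normalization are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of the two ideas in the abstract; names, shapes, and the
# cutoff fraction are assumptions, not the authors' exact formulation.
import torch
import torch.nn as nn


def high_freq_bins(mel: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the upper mel-spectrogram bins, where timbre cues dominate
    and content/rhythm information is weaker. mel: [batch, n_mels, frames]."""
    cutoff = int(mel.size(1) * (1.0 - keep_ratio))
    return mel[:, cutoff:, :]


class VoicePinLayerNorm(nn.Module):
    """Sketch of VPLN in the spirit of conditional layer norm: the scale and
    shift of the normalized text features are predicted from a timbre
    embedding, 'pinning' the timbre to the text feature."""

    def __init__(self, d_model: int, d_speaker: int):
        super().__init__()
        # Affine parameters come from the speaker, not from learned constants.
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_gamma = nn.Linear(d_speaker, d_model)
        self.to_beta = nn.Linear(d_speaker, d_model)

    def forward(self, text_feat: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # text_feat: [batch, tokens, d_model]; spk_emb: [batch, d_speaker]
        gamma = self.to_gamma(spk_emb).unsqueeze(1)  # [batch, 1, d_model]
        beta = self.to_beta(spk_emb).unsqueeze(1)
        return gamma * self.norm(text_feat) + beta
```

Under this reading, a short (~1 s) reference clip would be encoded into `spk_emb`, and VPLN would modulate every text-encoder output with it, which is consistent with the abstract's claim that the timbre is pinned to the text feature at inference time.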