Yuewen Cao, Songxiang Liu, Shiyin Kang, Na Hu, Peng Liu, Xunying Liu, Dan Su, Dong Yu, H. Meng
{"title":"Exploring Cross-lingual Singing Voice Synthesis Using Speech Data","authors":"Yuewen Cao, Songxiang Liu, Shiyin Kang, Na Hu, Peng Liu, Xunying Liu, Dan Su, Dong Yu, H. Meng","doi":"10.1109/ISCSLP49672.2021.9362077","DOIUrl":null,"url":null,"abstract":"State-of-the-art singing voice synthesis (SVS) models can generate natural singing voice of a target speaker, given his/her speaking/singing data in the same language. However, there may be challenging conditions where only speech data in a non-target language of the target speaker is available. In this paper, we present a cross-lingual SVS system that can synthesize an English speaker’s singing voice in Mandarin from musical scores with only her speech data in English. The pro-posed cross-lingual SVS system contains four parts: a BLSTM based duration model, a pitch model, a cross-lingual acoustic model and a neural vocoder. The acoustic model employs encoder-decoder architecture conditioned on pitch, phoneme duration, speaker information and language information. An adversarially-trained speaker classifier is employed to discourage the text encodings from capturing speaker information. Objective evaluation and subjective listening tests demonstrate that the proposed cross-lingual SVS system can generate singing voice with decent naturalness and fair speaker similarity. We also find that adding singing data or multi-speaker monolingual speech data further improves generalization on pronunciation and pitch accuracy.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":" 34","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCSLP49672.2021.9362077","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
State-of-the-art singing voice synthesis (SVS) models can generate natural singing voice of a target speaker, given his/her speaking/singing data in the same language. However, there may be challenging conditions where only speech data in a non-target language of the target speaker is available. In this paper, we present a cross-lingual SVS system that can synthesize an English speaker’s singing voice in Mandarin from musical scores with only her speech data in English. The pro-posed cross-lingual SVS system contains four parts: a BLSTM based duration model, a pitch model, a cross-lingual acoustic model and a neural vocoder. The acoustic model employs encoder-decoder architecture conditioned on pitch, phoneme duration, speaker information and language information. An adversarially-trained speaker classifier is employed to discourage the text encodings from capturing speaker information. Objective evaluation and subjective listening tests demonstrate that the proposed cross-lingual SVS system can generate singing voice with decent naturalness and fair speaker similarity. We also find that adding singing data or multi-speaker monolingual speech data further improves generalization on pronunciation and pitch accuracy.