A light-weight method of building an LSTM-RNN-based bilingual TTS system
Huaiping Ming, Yanfeng Lu, Zhengchen Zhang, M. Dong
2017 International Conference on Asian Language Processing (IALP), December 2017. DOI: 10.1109/IALP.2017.8300579
Citations: 20
Abstract
For a long time, text-to-speech (TTS) synthesis systems could handle only one language. Early bilingual TTS systems were built by directly combining two monolingual systems and switching between them by language. The bilingual speech generated by such systems normally contained two different voices, producing unnatural and sometimes disturbing effects. A genuine bilingual TTS system should use a single voice and avoid switching between two independent monolingual systems. Accordingly, the difficulties of building genuine bilingual speech synthesizers lie in merging two different languages into one system and in preparing bilingual speech data from a single speaker. Various methods have been proposed to overcome these difficulties, including soft prosody prediction; phone, state, and frame mapping; and, most recently, speaker and language factorization. Professional speakers who speak two languages fluently are hard to find; in many cases a speaker speaks one language well but the second only fairly well. In this paper we propose a simple linguistic-feature concatenation method for building a bilingual TTS system from data recorded by such a speaker, using an LSTM-RNN-based speech synthesizer. Both objective and subjective evaluations show the effectiveness of the method.
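The abstract does not spell out implementation details, but the core idea, placing each language's linguistic features in its own slot of one shared input vector so a single LSTM-RNN acoustic model serves both languages with one voice, can be sketched as below. The feature dimensions, the zero-padding scheme, the layer sizes, and the use of PyTorch are all assumptions for illustration, not details from the paper.

```python
# A minimal sketch of linguistic-feature concatenation for a bilingual
# LSTM-RNN acoustic model, as suggested by the abstract. All sizes are
# hypothetical; the paper's actual feature sets are not given here.
import torch
import torch.nn as nn

DIM_L1 = 300        # assumed linguistic-feature size for language 1
DIM_L2 = 250        # assumed linguistic-feature size for language 2
DIM_ACOUSTIC = 187  # assumed acoustic output size (e.g. MGC + lf0 + bap)


def concat_features(feats: torch.Tensor, lang: int) -> torch.Tensor:
    """Place a frame sequence's features in its language's slot of the
    concatenated input vector; the other language's slot stays zero.

    feats: (T, DIM_L1) if lang == 0, or (T, DIM_L2) if lang == 1.
    Returns a tensor of shape (T, DIM_L1 + DIM_L2).
    """
    T = feats.size(0)
    out = torch.zeros(T, DIM_L1 + DIM_L2)
    if lang == 0:
        out[:, :DIM_L1] = feats
    else:
        out[:, DIM_L1:] = feats
    return out


class BilingualAcousticModel(nn.Module):
    """One LSTM-RNN acoustic model shared by both languages."""

    def __init__(self, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(DIM_L1 + DIM_L2, hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, DIM_ACOUSTIC)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)      # (B, T, hidden)
        return self.out(h)       # (B, T, DIM_ACOUSTIC)


# Usage: frames from either language pass through the same network,
# so the synthesized speech keeps a single voice.
model = BilingualAcousticModel()
lang1 = concat_features(torch.randn(100, DIM_L1), lang=0)
lang2 = concat_features(torch.randn(80, DIM_L2), lang=1)
y1 = model(lang1.unsqueeze(0))   # (1, 100, DIM_ACOUSTIC)
y2 = model(lang2.unsqueeze(0))   # (1, 80, DIM_ACOUSTIC)
```

Because the concatenated input keeps the two feature sets in disjoint slots, no mapping between phone sets is required, which is what makes the approach light-weight compared with phone, state, or frame mapping.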