A light-weight method of building an LSTM-RNN-based bilingual TTS system
Huaiping Ming, Yanfeng Lu, Zhengchen Zhang, M. Dong
2017 International Conference on Asian Language Processing (IALP), December 2017. DOI: 10.1109/IALP.2017.8300579
Citations: 20
Abstract
For a long time, text-to-speech (TTS) synthesis systems could handle only one language. Early bilingual TTS systems were built by directly combining two monolingual systems and switching between them by language. The bilingual speech generated by such systems normally contained two different voices, producing unnatural and sometimes disturbing effects. A genuine bilingual TTS system should use a single voice and avoid switching between two independent monolingual systems. Accordingly, the difficulties of building genuine bilingual speech synthesizers lie in merging two different languages into one system and in preparing bilingual speech data from a single speaker. Various methods have been proposed to overcome these difficulties, including soft prosody prediction; phone, state, and frame mapping; and, most recently, speaker and language factorization. Professional speakers who speak two languages fluently are hard to find; in many cases a speaker speaks one language well but the second only fairly well. In this paper we propose a simple linguistic-feature concatenation method for building a bilingual TTS system from data recorded by such a speaker, using an LSTM-RNN-based speech synthesizer. Both objective and subjective evaluations show the effectiveness of the method.
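The abstract does not spell out implementation details, but the core idea, placing each language's linguistic features in its own slot of one shared input vector so a single LSTM-RNN acoustic model serves both languages with one voice, can be sketched as below. The feature dimensions, the zero-padding scheme, the layer sizes, and the use of PyTorch are all assumptions for illustration, not details from the paper.

```python
# A minimal sketch of linguistic-feature concatenation for a bilingual
# LSTM-RNN acoustic model, as suggested by the abstract. All sizes are
# hypothetical; the paper's actual feature sets are not given here.
import torch
import torch.nn as nn

DIM_L1 = 300        # assumed linguistic-feature size for language 1
DIM_L2 = 250        # assumed linguistic-feature size for language 2
DIM_ACOUSTIC = 187  # assumed acoustic output size (e.g. MGC + lf0 + bap)


def concat_features(feats: torch.Tensor, lang: int) -> torch.Tensor:
    """Place a frame sequence's features in its language's slot of the
    concatenated input vector; the other language's slot stays zero.

    feats: (T, DIM_L1) if lang == 0, or (T, DIM_L2) if lang == 1.
    Returns a tensor of shape (T, DIM_L1 + DIM_L2).
    """
    T = feats.size(0)
    out = torch.zeros(T, DIM_L1 + DIM_L2)
    if lang == 0:
        out[:, :DIM_L1] = feats
    else:
        out[:, DIM_L1:] = feats
    return out


class BilingualAcousticModel(nn.Module):
    """One LSTM-RNN acoustic model shared by both languages."""

    def __init__(self, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(DIM_L1 + DIM_L2, hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, DIM_ACOUSTIC)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)      # (B, T, hidden)
        return self.out(h)       # (B, T, DIM_ACOUSTIC)


# Usage: frames from either language pass through the same network,
# so the synthesized speech keeps a single voice.
model = BilingualAcousticModel()
lang1 = concat_features(torch.randn(100, DIM_L1), lang=0)
lang2 = concat_features(torch.randn(80, DIM_L2), lang=1)
y1 = model(lang1.unsqueeze(0))   # (1, 100, DIM_ACOUSTIC)
y2 = model(lang2.unsqueeze(0))   # (1, 80, DIM_ACOUSTIC)
```

Because the concatenated input keeps the two feature sets in disjoint slots, no mapping between phone sets is required, which is what makes the approach light-weight compared with phone, state, or frame mapping.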