Word level prosody prediction using large audiobook dataset
Authors: Yanfeng Lu, Chenyu Yang, M. Dong
Venue: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Publication date: 2017-12-01
DOI: 10.1109/APSIPA.2017.8282218 (https://doi.org/10.1109/APSIPA.2017.8282218)
Citations: 0
Abstract
Prosody modelling is an essential part of a text-to-speech synthesis system. In this paper we propose and investigate a way to leverage public domain audiobook data for word-level prosody modelling. Specifically, we base our work on the LibriSpeech project, in which a large quantity of public domain audiobook data from LibriVox was processed, selected and aligned with text. We choose a deep long short-term memory (LSTM) recurrent neural network as the modelling tool. The input word features range from the phonetic, through the syntactic, to the semantic level. The word prosody features include log F0, energy and after-word break. A way of incorporating the word prosody model into the speech synthesis system is also proposed. Experiments show that this is an effective way to leverage a large quantity and variety of speech data for prosody modelling in speech synthesis.
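As an illustration of the three word prosody features named in the abstract, the following minimal sketch derives per-word targets from frame-level F0 and energy tracks plus word alignments. The abstract does not say how the features are computed, so everything here is an assumption: the function name `word_prosody_targets`, the 10 ms frame shift, averaging log F0 over voiced frames only, averaging energy over all frames of the word, and measuring the after-word break as the silent gap before the next word.

```python
import math

def word_prosody_targets(words, f0_hz, energy, frame_shift=0.01):
    """Compute hypothetical word-level prosody targets.

    words: list of (label, start_sec, end_sec) tuples in time order.
    f0_hz, energy: per-frame values; an F0 of 0 marks an unvoiced frame.
    frame_shift: frame step in seconds (assumed, not from the paper).
    """
    targets = []
    for i, (label, start, end) in enumerate(words):
        lo = int(round(start / frame_shift))
        hi = int(round(end / frame_shift))
        # Mean log F0 over voiced frames only (assumed definition).
        voiced = [math.log(f) for f in f0_hz[lo:hi] if f > 0]
        mean_log_f0 = sum(voiced) / len(voiced) if voiced else None
        # Mean energy over all frames of the word (assumed definition).
        frames = energy[lo:hi]
        mean_energy = sum(frames) / len(frames) if frames else None
        # After-word break: gap between this word's end and the next
        # word's start; 0.0 for the last word (assumed definition).
        if i + 1 < len(words):
            pause = max(0.0, words[i + 1][1] - end)
        else:
            pause = 0.0
        targets.append({
            "word": label,
            "log_f0": mean_log_f0,
            "energy": mean_energy,
            "break": pause,
        })
    return targets

# Toy input: two words separated by a 20 ms pause.
words = [("hello", 0.00, 0.03), ("world", 0.05, 0.08)]
f0_hz = [100, 0, 100, 0, 0, 200, 200, 200]
energy = [1, 2, 3, 0, 0, 4, 4, 4]
for row in word_prosody_targets(words, f0_hz, energy):
    print(row)
```

Targets of this kind could serve as the regression outputs of a sequence model over the words of a sentence. The actual feature definitions and normalisation used in the paper may differ.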