Éva Székely, T. Csapó, B. Tóth, P. Mihajlik, Julie Carson-Berndsen
{"title":"从业余有声读物录音中合成富有表现力的言语","authors":"Éva Székely, T. Csapó, B. Tóth, P. Mihajlik, Julie Carson-Berndsen","doi":"10.1109/SLT.2012.6424239","DOIUrl":null,"url":null,"abstract":"Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter and contain less expressive (less emphatic, less emotional, etc.) speech both in terms of quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments and build expressive voices for hidden Markov-model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment, in order to show that the method is generally applicable to most typical audiobooks widely available online.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"Synthesizing expressive speech from amateur audiobook recordings\",\"authors\":\"Éva Székely, T. Csapó, B. Tóth, P. Mihajlik, Julie Carson-Berndsen\",\"doi\":\"10.1109/SLT.2012.6424239\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter and contain less expressive (less emphatic, less emotional, etc.) speech both in terms of quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments and build expressive voices for hidden Markov-model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment, in order to show that the method is generally applicable to most typical audiobooks widely available online.\",\"PeriodicalId\":375378,\"journal\":{\"name\":\"2012 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2012.6424239\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2012.6424239","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Synthesizing expressive speech from amateur audiobook recordings
Freely available audiobooks are a rich resource of expressive speech recordings that can be used for the purposes of speech synthesis. Natural sounding, expressive synthetic voices have previously been built from audiobooks that contained large amounts of highly expressive speech recorded from a professionally trained speaker. The majority of freely available audiobooks, however, are read by amateur speakers, are shorter and contain less expressive (less emphatic, less emotional, etc.) speech both in terms of quality and quantity. Synthesizing expressive speech from a typical online audiobook therefore poses many challenges. In this work we address these challenges by applying a method consisting of minimally supervised techniques to align the text with the recorded speech, select groups of expressive speech segments and build expressive voices for hidden Markov-model based synthesis using speaker adaptation. Subjective listening tests have shown that the expressive synthetic speech generated with this method is often able to produce utterances suited to an emotional message. We used a restricted amount of speech data in our experiment, in order to show that the method is generally applicable to most typical audiobooks widely available online.