耳语和伦巴第神经语音合成

Qiong Hu, T. Bleisch, Petko N. Petkov, T. Raitio, E. Marchi, V. Lakshminarasimhan
{"title":"耳语和伦巴第神经语音合成","authors":"Qiong Hu, T. Bleisch, Petko N. Petkov, T. Raitio, E. Marchi, V. Lakshminarasimhan","doi":"10.1109/SLT48900.2021.9383454","DOIUrl":null,"url":null,"abstract":"It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard and whisper voice is used for pretrain this system, SV model can be used as style encoder for generating different style embeddings as input for Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Whispered and Lombard Neural Speech Synthesis\",\"authors\":\"Qiong Hu, T. Bleisch, Petko N. Petkov, T. Raitio, E. Marchi, V. Lakshminarasimhan\",\"doi\":\"10.1109/SLT48900.2021.9383454\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard and whisper voice is used for pretrain this system, SV model can be used as style encoder for generating different style embeddings as input for Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.\",\"PeriodicalId\":243211,\"journal\":{\"name\":\"2021 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-01-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT48900.2021.9383454\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT48900.2021.9383454","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

摘要

对于文本到语音的系统来说,考虑到呈现合成语音的环境,并向用户提供适当的上下文相关输出是可取的。在本文中,我们仅使用有限的数据,提出并比较了生成不同说话风格的各种方法,即正常,伦巴第语和耳语语。提出并评估了以下系统:1)对每种风格的模型进行预训练和微调。2)通过基于信号处理的方法实现伦巴第语和耳语语音的转换。3)基于说话人验证模型的单模型多风格生成。我们的平均意见得分和AB偏好听力测试表明,1)我们可以通过对所有说话风格的预训练/微调方法生成高质量的语音。2)虽然我们的说话人验证(SV)模型没有被明确训练来区分不同的说话风格,也没有使用伦巴第语和耳语语音对系统进行预训练,但SV模型可以作为风格编码器来生成不同的风格嵌入作为Tacotron系统的输入。我们还表明,由此产生的合成伦巴第语对可理解性增益有显著的积极影响。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Whispered and Lombard Neural Speech Synthesis
It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard and whisper voice is used for pretrain this system, SV model can be used as style encoder for generating different style embeddings as input for Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信