Two efforts towards natural speech synthesis: Incorporating disfluency and speaking style change based on the interlocutor

Akiko Mokhtari, Nick Campbell, Toshiyuki Sadanobu
The Journal of the Acoustical Society of America
Published: 2023-10-01
DOI: 10.1121/10.0023286 (https://doi.org/10.1121/10.0023286)

Abstract

Between 2000 and 2005, a female Japanese speaker recorded her everyday conversations with many different interlocutors using a headset microphone, yielding 600 hours of natural Japanese speech. This study describes a DNN-based speech synthesis system trained on 300 hours of that data, focusing on two efforts to make it more expressive in a human-like way: (1) allowing for disfluencies, and (2) accounting for the category of interlocutor. Incorporating disfluent patterns that are frequently observed in general Japanese speech, such as fillers, phrase-final rising intonation, and word-internal prolongation or suspension, is believed to be effective in practical applications because certain disfluencies are connected to a speaker's attitude in Japanese communication. For example, word-internal prolongation can signal hesitation or politeness, and word-internal suspension can signal surprise. Interlocutors in the original data were categorized into four groups: family, friend, child, and others. This information was used in the training process, so the synthesizer can generate different speaking styles according to the interlocutor setting. Being able to generate disfluent speech and to change speaking style depending on the interlocutor can make the synthesizer more expressive.
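
The abstract does not say how the interlocutor category or the disfluency information is presented to the network. One common approach in DNN-based synthesis is to append categorical codes to each input feature vector, and the sketch below illustrates only that general idea. The four interlocutor groups come from the abstract; the tag names in `DISFLUENCIES`, the function names, and the feature layout are assumptions made for illustration, not the authors' design.

```python
from typing import List

# The four interlocutor groups named in the abstract.
INTERLOCUTORS = ["family", "friend", "child", "others"]

# Hypothetical tag set for the disfluent patterns the abstract lists.
DISFLUENCIES = ["none", "filler", "final_rise", "prolongation", "suspension"]


def one_hot(label: str, vocabulary: List[str]) -> List[float]:
    """Return a one-hot vector for label; raises ValueError if it is unknown."""
    index = vocabulary.index(label)
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def condition_frame(linguistic: List[float], interlocutor: str,
                    disfluency: str = "none") -> List[float]:
    """Append interlocutor and disfluency codes to one input feature vector."""
    return (list(linguistic)
            + one_hot(interlocutor, INTERLOCUTORS)
            + one_hot(disfluency, DISFLUENCIES))


if __name__ == "__main__":
    # Two placeholder linguistic features, spoken to a child, with a
    # word-internal prolongation on this unit.
    print(condition_frame([0.2, 0.7], "child", "prolongation"))
```

Under this assumed layout, changing only the `interlocutor` argument at synthesis time would correspond to the "interlocutor setting" the abstract mentions, while the linguistic features stay the same.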