Emo-Tts: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Mohamed Osman
{"title":"Emo-Tts:Parallel Transformer-based Text-to-Speech Model with Emotional Awareness","authors":"Mohamed Osman","doi":"10.1109/icci54321.2022.9756092","DOIUrl":null,"url":null,"abstract":"One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. In recent years, there has been a fast growth in research on the subject of emotional voice synthesis. Regardless, the results leave something to be desired in terms of the clarity of the emotions expressed. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model modified to model emotions in speech. We use a conformer-based architecture that has been augmented with speaker and emotion embedding. An external speech emotion recognition (SER) model is utilized to incorporate classification loss and perceptual loss into the TTS model, which improves emotional expressiveness and allows it to train in a self-supervised way when no emotion ground truth is available. Improving speaker embedding is critical for training hundreds of speakers with minimal valid data, allowing us to generate realistic-sounding emotional voices with only minutes of audio. By combining effective emotion and speaker embedding, we may be able to model emotions for speakers with unseen emotions. Achieving strong emotional expressiveness with a small amount of viable data could significantly improve many fields, including automated audio-book reading and possibly replacing voice actors. We achieve an accuracy of 80% on a combination of 5 datasets in our SER task.","PeriodicalId":122550,"journal":{"name":"2022 5th International Conference on Computing and Informatics (ICCI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Conference on Computing and Informatics (ICCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icci54321.2022.9756092","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. In recent years, research on emotional voice synthesis has grown rapidly. Nevertheless, the results still leave something to be desired in terms of the clarity of the emotions expressed. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model modified to model emotions in speech. We use a conformer-based architecture augmented with speaker and emotion embeddings. An external speech emotion recognition (SER) model is used to incorporate a classification loss and a perceptual loss into the TTS model, which improves emotional expressiveness and allows the model to train in a self-supervised way when no emotion ground truth is available. Improving the speaker embedding is critical for training on hundreds of speakers with limited usable data, allowing us to generate realistic-sounding emotional voices with only minutes of audio. By combining effective emotion and speaker embeddings, we may be able to model emotions for speakers whose emotional speech was not seen during training. Achieving strong emotional expressiveness from a small amount of usable data could benefit many applications, including automated audiobook reading, and could potentially substitute for voice actors. Our SER model achieves an accuracy of 80% on a combination of five datasets.
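The central mechanism described above, using a frozen external SER network to supply a classification loss and a perceptual (feature-matching) loss alongside the usual TTS reconstruction loss, with a self-supervised fallback when no emotion labels exist, can be illustrated with a short sketch. The following PyTorch-style code is a minimal illustration under assumed interfaces: the tts_model and ser_model call signatures, the return_features flag, and the loss weights w_cls and w_perc are hypothetical and are not taken from the paper.

import torch
import torch.nn.functional as F

def emo_tts_loss(tts_model, ser_model, text, mel_target,
                 speaker_emb, emotion_label=None,
                 w_cls=1.0, w_perc=1.0):
    """Total loss = mel reconstruction + SER classification + SER perceptual.
    (Illustrative sketch only; module interfaces are assumptions.)"""
    # 1. TTS forward pass conditioned on speaker and (optionally) emotion.
    mel_pred = tts_model(text, speaker_emb=speaker_emb, emotion_id=emotion_label)

    # 2. Standard mel-spectrogram reconstruction loss.
    loss_recon = F.l1_loss(mel_pred, mel_target)

    # 3. A frozen external SER model scores both reference and synthesized mels.
    #    Reference features carry no gradient; gradients flow to mel_pred only.
    with torch.no_grad():
        feats_ref, logits_ref = ser_model(mel_target, return_features=True)
    feats_pred, logits_pred = ser_model(mel_pred, return_features=True)

    if emotion_label is not None:
        # Classification loss: the synthesized speech should be recognized
        # as the target emotion by the SER model.
        loss_cls = F.cross_entropy(logits_pred, emotion_label)
    else:
        # Self-supervised case: no emotion ground truth, so match the SER
        # posterior of the synthesized speech to that of the reference audio.
        loss_cls = F.kl_div(F.log_softmax(logits_pred, dim=-1),
                            F.softmax(logits_ref, dim=-1),
                            reduction="batchmean")

    # 4. Perceptual loss: match SER intermediate features of synthesized
    #    and reference speech.
    loss_perc = F.l1_loss(feats_pred, feats_ref)

    return loss_recon + w_cls * loss_cls + w_perc * loss_perc

Because the SER model is kept frozen, it acts purely as a differentiable critic: only the TTS parameters are updated, while the classification and perceptual terms push the generated spectrograms toward regions the SER model recognizes as the intended emotion.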