{"title":"Emo-Tts:Parallel Transformer-based Text-to-Speech Model with Emotional Awareness","authors":"Mohamed Osman","doi":"10.1109/icci54321.2022.9756092","DOIUrl":null,"url":null,"abstract":"One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. In recent years, there has been a fast growth in research on the subject of emotional voice synthesis. Regardless, the results leave something to be desired in terms of the clarity of the emotions expressed. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model modified to model emotions in speech. We use a conformer-based architecture that has been augmented with speaker and emotion embedding. An external speech emotion recognition (SER) model is utilized to incorporate classification loss and perceptual loss into the TTS model, which improves emotional expressiveness and allows it to train in a self-supervised way when no emotion ground truth is available. Improving speaker embedding is critical for training hundreds of speakers with minimal valid data, allowing us to generate realistic-sounding emotional voices with only minutes of audio. By combining effective emotion and speaker embedding, we may be able to model emotions for speakers with unseen emotions. Achieving strong emotional expressiveness with a small amount of viable data could significantly improve many fields, including automated audio-book reading and possibly replacing voice actors. We achieve an accuracy of 80% on a combination of 5 datasets in our SER task.","PeriodicalId":122550,"journal":{"name":"2022 5th International Conference on Computing and Informatics (ICCI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Conference on Computing and Informatics (ICCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icci54321.2022.9756092","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. Research on emotional voice synthesis has grown rapidly in recent years, yet the clarity of the emotions expressed in synthesized speech still leaves something to be desired. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model adapted to model emotion in speech. We use a conformer-based architecture augmented with speaker and emotion embeddings. An external speech emotion recognition (SER) model is used to add a classification loss and a perceptual loss to the TTS model, which improves emotional expressiveness and allows training in a self-supervised manner when no emotion ground truth is available. Improving the speaker embeddings is critical for training on hundreds of speakers with minimal usable data, allowing us to generate realistic-sounding emotional voices from only minutes of audio. By combining effective emotion and speaker embeddings, we may be able to model emotions for speakers whose emotional speech was never seen during training. Achieving strong emotional expressiveness from a small amount of usable data could significantly benefit many applications, including automated audiobook narration, and could possibly substitute for voice actors. We achieve an accuracy of 80% on a combination of five datasets in our SER task.
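As a concrete illustration of the loss design described in the abstract, the following is a minimal sketch (not the authors' released code) of how a frozen external SER model can contribute a classification loss and a feature-level perceptual loss to a TTS training objective. The module structure, the assumption that the SER model returns both logits and intermediate features, and the loss weights w_cls and w_perc are illustrative assumptions only.

```python
# Hedged sketch: combining a TTS reconstruction loss with SER-derived
# classification and perceptual losses. Names and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionAwareTTSLoss(nn.Module):
    """Combine a mel reconstruction loss with losses from a frozen SER model."""

    def __init__(self, ser_model: nn.Module, w_cls: float = 0.1, w_perc: float = 0.1):
        super().__init__()
        self.ser = ser_model.eval()           # external SER model, kept frozen
        for p in self.ser.parameters():
            p.requires_grad_(False)
        self.w_cls = w_cls
        self.w_perc = w_perc

    def forward(self, mel_pred, mel_target, emotion_label=None):
        # Standard mel-spectrogram reconstruction loss for the TTS model.
        recon = F.l1_loss(mel_pred, mel_target)

        # Assumed interface: the SER model returns (logits, hidden_features).
        logits_pred, feats_pred = self.ser(mel_pred)
        with torch.no_grad():
            logits_ref, feats_ref = self.ser(mel_target)

        if emotion_label is not None:
            # Supervised case: classification loss against the emotion label.
            cls = F.cross_entropy(logits_pred, emotion_label)
        else:
            # Self-supervised case (no emotion ground truth): match the SER
            # posterior of the reference recording instead.
            cls = F.kl_div(F.log_softmax(logits_pred, dim=-1),
                           F.softmax(logits_ref, dim=-1),
                           reduction="batchmean")

        # Perceptual loss: match intermediate SER features of the synthesized
        # and reference speech.
        perc = F.l1_loss(feats_pred, feats_ref)

        return recon + self.w_cls * cls + self.w_perc * perc
```

Because gradients flow through the frozen SER model into the predicted mel-spectrogram, the TTS model is pushed toward outputs that the recognizer perceives as carrying the intended emotion, even when no emotion labels are available.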