{"title":"Emo-Tts:Parallel Transformer-based Text-to-Speech Model with Emotional Awareness","authors":"Mohamed Osman","doi":"10.1109/icci54321.2022.9756092","DOIUrl":null,"url":null,"abstract":"One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. In recent years, there has been a fast growth in research on the subject of emotional voice synthesis. Regardless, the results leave something to be desired in terms of the clarity of the emotions expressed. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model modified to model emotions in speech. We use a conformer-based architecture that has been augmented with speaker and emotion embedding. An external speech emotion recognition (SER) model is utilized to incorporate classification loss and perceptual loss into the TTS model, which improves emotional expressiveness and allows it to train in a self-supervised way when no emotion ground truth is available. Improving speaker embedding is critical for training hundreds of speakers with minimal valid data, allowing us to generate realistic-sounding emotional voices with only minutes of audio. By combining effective emotion and speaker embedding, we may be able to model emotions for speakers with unseen emotions. Achieving strong emotional expressiveness with a small amount of viable data could significantly improve many fields, including automated audio-book reading and possibly replacing voice actors. We achieve an accuracy of 80% on a combination of 5 datasets in our SER task.","PeriodicalId":122550,"journal":{"name":"2022 5th International Conference on Computing and Informatics (ICCI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Conference on Computing and Informatics (ICCI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icci54321.2022.9756092","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
One of the pillars of human social interaction is the ability to communicate one's feelings and emotions. Research on emotional voice synthesis has grown rapidly in recent years, yet the clarity of the emotions expressed in synthesized speech still leaves something to be desired. In this study, we propose Emo-Tts, a parallel transformer-based text-to-speech (TTS) model adapted to model emotion in speech. We use a conformer-based architecture augmented with speaker and emotion embeddings. An external speech emotion recognition (SER) model is used to add a classification loss and a perceptual loss to the TTS model, which improves emotional expressiveness and allows training in a self-supervised manner when no emotion ground truth is available. Improving the speaker embeddings is critical for training on hundreds of speakers with minimal usable data, allowing us to generate realistic-sounding emotional voices from only minutes of audio. By combining effective emotion and speaker embeddings, we may be able to model emotions for speakers whose emotional speech was never seen during training. Achieving strong emotional expressiveness from a small amount of usable data could significantly benefit many applications, including automated audiobook narration, and could possibly substitute for voice actors. We achieve an accuracy of 80% on a combination of five datasets in our SER task.
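As a concrete illustration of the loss design described in the abstract, the following is a minimal sketch (not the authors' released code) of how a frozen external SER model can contribute a classification loss and a feature-level perceptual loss to a TTS training objective. The module structure, the assumption that the SER model returns both logits and intermediate features, and the loss weights w_cls and w_perc are illustrative assumptions only.

```python
# Hedged sketch: combining a TTS reconstruction loss with SER-derived
# classification and perceptual losses. Names and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionAwareTTSLoss(nn.Module):
    """Combine a mel reconstruction loss with losses from a frozen SER model."""

    def __init__(self, ser_model: nn.Module, w_cls: float = 0.1, w_perc: float = 0.1):
        super().__init__()
        self.ser = ser_model.eval()           # external SER model, kept frozen
        for p in self.ser.parameters():
            p.requires_grad_(False)
        self.w_cls = w_cls
        self.w_perc = w_perc

    def forward(self, mel_pred, mel_target, emotion_label=None):
        # Standard mel-spectrogram reconstruction loss for the TTS model.
        recon = F.l1_loss(mel_pred, mel_target)

        # Assumed interface: the SER model returns (logits, hidden_features).
        logits_pred, feats_pred = self.ser(mel_pred)
        with torch.no_grad():
            logits_ref, feats_ref = self.ser(mel_target)

        if emotion_label is not None:
            # Supervised case: classification loss against the emotion label.
            cls = F.cross_entropy(logits_pred, emotion_label)
        else:
            # Self-supervised case (no emotion ground truth): match the SER
            # posterior of the reference recording instead.
            cls = F.kl_div(F.log_softmax(logits_pred, dim=-1),
                           F.softmax(logits_ref, dim=-1),
                           reduction="batchmean")

        # Perceptual loss: match intermediate SER features of the synthesized
        # and reference speech.
        perc = F.l1_loss(feats_pred, feats_ref)

        return recon + self.w_cls * cls + self.w_perc * perc
```

Because gradients flow through the frozen SER model into the predicted mel-spectrogram, the TTS model is pushed toward outputs that the recognizer perceives as carrying the intended emotion, even when no emotion labels are available.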