{"title":"A method for emotional speech synthesis based on the position of emotional state in Valence-Activation space","authors":"Yasuhiro Hamada, Reda Elbarougy, M. Akagi","doi":"10.1109/APSIPA.2014.7041729","DOIUrl":null,"url":null,"abstract":"Speech to Speech translation (S2ST) systems are very important for processing by which a spoken utterance in one language is used to produce a spoken output in another language. In S2ST techniques, so far, linguistic information has been mainly adopted without para- and non-linguistic information (emotion, individuality and gender, etc.). Therefore, this systems have a limitation in synthesizing affective speech, for example emotional speech, instead of neutral one. To deal with affective speech, a system that can recognize and synthesize emotional speech is required. Although most studies focused on emotions categorically, emotional styles are not categorical but continuously spread in emotion space that are spanned by two dimensions (Valence and Activation). This paper proposes a method for synthesizing emotional speech based on the positions in Valence-Activation (V-A) space. In order to model relationships between acoustic features and V-A space, Fuzzy Inference Systems (FISs) were constructed. Twenty-one acoustic features were morphed using FISs. To verify whether synthesized speech can be perceived as the same intended position in V-A space, listening tests were carried out. The results indicate that the synthesized speech can give the same impression in the V-A space as the intended speech does.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":"603 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSIPA.2014.7041729","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7
Abstract
Speech-to-speech translation (S2ST) systems are important for converting a spoken utterance in one language into spoken output in another language. To date, S2ST techniques have mainly exploited linguistic information while ignoring para- and non-linguistic information (emotion, individuality, gender, etc.). As a result, such systems are limited to synthesizing neutral speech and cannot produce affective speech, for example emotional speech. To deal with affective speech, a system that can recognize and synthesize emotional speech is required. Although most studies have treated emotions categorically, emotional styles are not categorical; they spread continuously in an emotion space spanned by two dimensions, Valence and Activation. This paper proposes a method for synthesizing emotional speech based on positions in the Valence-Activation (V-A) space. To model the relationships between acoustic features and the V-A space, Fuzzy Inference Systems (FISs) were constructed, and twenty-one acoustic features were morphed using these FISs. Listening tests were carried out to verify whether the synthesized speech is perceived at the intended position in the V-A space. The results indicate that the synthesized speech gives the same impression in the V-A space as the intended speech does.
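To make the V-A-to-acoustics mapping concrete, the sketch below shows how a small fuzzy inference system could translate a target (valence, activation) position into a morphing ratio for a single acoustic feature, here taken to be mean F0. This is only a minimal illustration built with the scikit-fuzzy library; the membership functions, the rules, the choice of feature, and the output range are assumptions made for demonstration and are not the 21-feature FISs fitted in the paper.

```python
# Minimal sketch (not the paper's trained FISs): map a target position in
# Valence-Activation space to a morphing ratio for one acoustic feature
# (mean F0).  Membership functions and rules are illustrative guesses.
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

# Inputs: valence and activation, each normalized to [-1, 1].
valence = ctrl.Antecedent(np.linspace(-1.0, 1.0, 201), 'valence')
activation = ctrl.Antecedent(np.linspace(-1.0, 1.0, 201), 'activation')
# Output: multiplicative ratio applied to the neutral utterance's mean F0.
f0_ratio = ctrl.Consequent(np.linspace(0.5, 2.0, 301), 'f0_ratio')

# Triangular membership functions (hypothetical shapes).
valence['negative'] = fuzz.trimf(valence.universe, [-1.0, -1.0, 0.0])
valence['neutral']  = fuzz.trimf(valence.universe, [-0.5,  0.0, 0.5])
valence['positive'] = fuzz.trimf(valence.universe, [ 0.0,  1.0, 1.0])
activation['low']  = fuzz.trimf(activation.universe, [-1.0, -1.0, 0.0])
activation['mid']  = fuzz.trimf(activation.universe, [-0.5,  0.0, 0.5])
activation['high'] = fuzz.trimf(activation.universe, [ 0.0,  1.0, 1.0])
f0_ratio['lower'] = fuzz.trimf(f0_ratio.universe, [0.5, 0.75, 1.0])
f0_ratio['keep']  = fuzz.trimf(f0_ratio.universe, [0.9, 1.0, 1.1])
f0_ratio['raise'] = fuzz.trimf(f0_ratio.universe, [1.0, 1.5, 2.0])

# Illustrative rules: high activation tends to raise F0, low activation lowers it.
rules = [
    ctrl.Rule(activation['high'], f0_ratio['raise']),
    ctrl.Rule(activation['low'], f0_ratio['lower']),
    ctrl.Rule(activation['mid'] & valence['neutral'], f0_ratio['keep']),
    ctrl.Rule(activation['mid'] & (valence['positive'] | valence['negative']),
              f0_ratio['keep']),
]

sim = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
sim.input['valence'] = 0.6      # target position: pleasant ...
sim.input['activation'] = 0.8   # ... and highly aroused (e.g. joyful speech)
sim.compute()
print(f"F0 morphing ratio: {sim.output['f0_ratio']:.2f}")
```

A full system along these lines would need one such mapping per morphed acoustic feature (the paper uses twenty-one), with membership functions and rules derived from perceptual data rather than hand-picked as above; the resulting ratios would then drive the speech morphing before the listening tests described in the abstract.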