{"title":"A method for emotional speech synthesis based on the position of emotional state in Valence-Activation space","authors":"Yasuhiro Hamada, Reda Elbarougy, M. Akagi","doi":"10.1109/APSIPA.2014.7041729","DOIUrl":null,"url":null,"abstract":"Speech to Speech translation (S2ST) systems are very important for processing by which a spoken utterance in one language is used to produce a spoken output in another language. In S2ST techniques, so far, linguistic information has been mainly adopted without para- and non-linguistic information (emotion, individuality and gender, etc.). Therefore, this systems have a limitation in synthesizing affective speech, for example emotional speech, instead of neutral one. To deal with affective speech, a system that can recognize and synthesize emotional speech is required. Although most studies focused on emotions categorically, emotional styles are not categorical but continuously spread in emotion space that are spanned by two dimensions (Valence and Activation). This paper proposes a method for synthesizing emotional speech based on the positions in Valence-Activation (V-A) space. In order to model relationships between acoustic features and V-A space, Fuzzy Inference Systems (FISs) were constructed. Twenty-one acoustic features were morphed using FISs. To verify whether synthesized speech can be perceived as the same intended position in V-A space, listening tests were carried out. The results indicate that the synthesized speech can give the same impression in the V-A space as the intended speech does.","PeriodicalId":231382,"journal":{"name":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSIPA.2014.7041729","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
Abstract
Speech to Speech translation (S2ST) systems are very important for processing by which a spoken utterance in one language is used to produce a spoken output in another language. In S2ST techniques, so far, linguistic information has been mainly adopted without para- and non-linguistic information (emotion, individuality and gender, etc.). Therefore, this systems have a limitation in synthesizing affective speech, for example emotional speech, instead of neutral one. To deal with affective speech, a system that can recognize and synthesize emotional speech is required. Although most studies focused on emotions categorically, emotional styles are not categorical but continuously spread in emotion space that are spanned by two dimensions (Valence and Activation). This paper proposes a method for synthesizing emotional speech based on the positions in Valence-Activation (V-A) space. In order to model relationships between acoustic features and V-A space, Fuzzy Inference Systems (FISs) were constructed. Twenty-one acoustic features were morphed using FISs. To verify whether synthesized speech can be perceived as the same intended position in V-A space, listening tests were carried out. The results indicate that the synthesized speech can give the same impression in the V-A space as the intended speech does.