Prediction of Emotions from Text using Sentiment Analysis for Expressive Speech Synthesis
Authors: Eva Vanmassenhove, João P. Cabral, F. Haider
Published in: Speech Synthesis Workshop, 2016-09-13
DOI: 10.21437/SSW.2016-4 (https://doi.org/10.21437/SSW.2016-4)
Citations: 10
Abstract
The generation of expressive speech is a great challenge for text-to-speech synthesis in audiobooks. One of the most important factors is the variation in speech emotion or voice style. In this work, we developed a method to predict the emotion of a sentence so that we can convey it through the synthetic voice. It consists of combining a standard emotion-lexicon based technique with the polarity scores (positive/negative polarity) provided by a less fine-grained sentiment analysis tool, in order to obtain more accurate emotion labels. The primary goal of this emotion prediction tool was to select the type of voice (one of the emotions or neutral) given the input sentence to a state-of-the-art HMM-based Text-to-Speech (TTS) system. In addition, we also combined the emotion prediction from text with a speech clustering method to select the utterances with emotion during the process of building the emotional corpus for the speech synthesizer. Speech clustering is a popular approach to dividing speech data into subsets associated with different voice styles. The challenge here is to determine the clusters that map onto the basic emotions in an audiobook corpus containing a wide variety of speaking styles, in a way that minimizes the need for human annotation. The evaluation of emotion classification from text showed that, in general, our system can obtain accuracy results close to those of human annotators. Results also indicate that this technique is useful in the selection of utterances with emotion for building expressive synthetic voices.
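The abstract describes combining an emotion lexicon with coarse polarity scores to label sentences for voice selection. The following is a minimal, hypothetical sketch of that general idea, not the paper's actual system: the toy lexicon, the emotion set, and the polarity-filtering rule are all illustrative assumptions, and a real system would use a full emotion lexicon and a trained sentiment analyzer.

```python
# Hypothetical sketch: combine a tiny emotion lexicon with a coarse
# polarity score to pick an emotion label (or "neutral") for TTS voice
# selection. Lexicon entries and thresholds are illustrative only.

# Toy emotion lexicon (word -> emotion); a real system would use a
# large published lexicon instead.
EMOTION_LEXICON = {
    "happy": "joy", "delighted": "joy", "wonderful": "joy",
    "terrified": "fear", "afraid": "fear",
    "furious": "anger", "hate": "anger",
    "miserable": "sadness", "cried": "sadness",
}

POSITIVE_EMOTIONS = {"joy"}
NEGATIVE_EMOTIONS = {"fear", "anger", "sadness"}


def predict_emotion(sentence: str, polarity: float) -> str:
    """Return an emotion label or 'neutral' for a sentence.

    polarity: coarse sentiment score in [-1, 1], assumed to come from
    a separate, less fine-grained sentiment analysis tool.
    """
    counts: dict[str, int] = {}
    for word in sentence.lower().split():
        emotion = EMOTION_LEXICON.get(word.strip(".,!?;:"))
        if emotion:
            counts[emotion] = counts.get(emotion, 0) + 1
    if not counts:
        return "neutral"
    # Use the polarity sign to discard lexicon hits that conflict with
    # the overall sentiment of the sentence.
    if polarity > 0:
        counts = {e: c for e, c in counts.items() if e in POSITIVE_EMOTIONS}
    elif polarity < 0:
        counts = {e: c for e, c in counts.items() if e in NEGATIVE_EMOTIONS}
    if not counts:
        return "neutral"
    return max(counts, key=counts.get)


print(predict_emotion("What a wonderful, happy day", 0.9))   # joy
print(predict_emotion("happy but terrified", -0.5))          # fear
```

In the second call, the negative polarity score overrules the lexicon hit for "happy", which is the kind of disambiguation the abstract attributes to combining the two tools.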