{"title":"Speech Emotion Classification using Raw Audio Input and Transcriptions","authors":"Gabriel Lima, Jinyeong Bak","doi":"10.1145/3297067.3297089","DOIUrl":null,"url":null,"abstract":"As new gadgets that interact with the user through voice become accessible, the importance of not only the content of the speech increases, but also the significance of the way the user has spoken. Even though many techniques have been developed to indicate emotion on speech, none of them can fully grasp the real emotion of the speaker. This paper presents a neural network model capable of predicting emotions in conversations by analyzing transcriptions and raw audio waveforms, focusing on feature extraction using convolutional layers and feature combination. The model achieves an accuracy of over 71% across four classes: Anger, Happiness, Neutrality and Sadness. We also analyze the effect of audio and textual features on the classification task, by interpreting attention scores and parts of speech. This paper explores the use of raw audio waveforms, that in the best of our knowledge, have not yet been used deeply in the emotion classification task, achieving close to state of art results.","PeriodicalId":340004,"journal":{"name":"International Conference on Signal Processing and Machine Learning","volume":"82 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Signal Processing and Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3297067.3297089","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
As devices that interact with users through voice become more widespread, not only the content of a user's speech but also the way it is spoken grows in importance. Although many techniques have been developed to detect emotion in speech, none can fully capture the speaker's true emotion. This paper presents a neural network model that predicts emotions in conversations by analyzing transcriptions and raw audio waveforms, focusing on feature extraction with convolutional layers and on feature combination. The model achieves an accuracy of over 71% across four classes: Anger, Happiness, Neutrality, and Sadness. We also analyze the effect of audio and textual features on the classification task by interpreting attention scores and parts of speech. This paper explores the use of raw audio waveforms, which, to the best of our knowledge, have not yet been explored in depth for emotion classification, achieving results close to the state of the art.
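To make the described pipeline concrete, here is a minimal PyTorch sketch of a two-branch network matching the abstract's outline: convolutional feature extraction over the raw waveform, a parallel branch over transcription tokens, and concatenation ("feature combination") before a four-class output. All layer sizes, names, and hyperparameters are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of a two-branch audio + text emotion classifier.
# Layer shapes and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class SpeechEmotionNet(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=4):
        super().__init__()
        # Audio branch: 1-D convolutions extract features directly from the
        # raw waveform (input shape: batch x 1 x samples).
        self.audio_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one vector
        )
        # Text branch: embedded transcription tokens, also processed with
        # convolutions (input shape: batch x seq_len of token ids).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_branch = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Feature combination: concatenate both modality vectors, then
        # classify into the four emotion classes.
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, waveform, tokens):
        a = self.audio_branch(waveform).squeeze(-1)                    # (batch, 64)
        t = self.text_branch(self.embed(tokens).transpose(1, 2)).squeeze(-1)
        return self.classifier(torch.cat([a, t], dim=-1))              # (batch, 4)

# Example usage with dummy inputs: one second of 16 kHz audio and 20 tokens.
model = SpeechEmotionNet()
logits = model(torch.randn(2, 1, 16000), torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point the abstract emphasizes is that the audio branch consumes the raw waveform itself rather than hand-crafted features such as MFCCs, leaving feature extraction entirely to the convolutional layers.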