Image based Emotional State Prediction from Multiparty Audio Conversation

S. Jaiswal, Ayush Jain, G. Nandi
DOI: 10.1109/PuneCon50868.2020.9362475
Published in: 2020 IEEE Pune Section International Conference (PuneCon), 2020-12-16
Citations: 1

Abstract

Recognizing human emotion is a complex task that has been researched for decades. The problem remains popular because of its need across various domains, particularly human-computer interaction and human-robot interaction. According to researchers, humans infer another person's state of mind by observing various cues, about 70% of which are non-verbal. Humans embed emotion in their speech, pose, gesture, context, facial expressions, and even the prior history of a conversation or situation. Each of these sub-problems can be solved effectively using learning-based techniques. Predicting emotion in multi-party audio conversation adds complexity to the problem, since it requires accounting for speech intent, culture, accent, gender, and many other sources of diversity. Researchers have made various attempts to classify human audio into the required classes using Support Vector Machine models, Long Short-Term Memory (LSTM) networks, and bi-LSTMs on audio input. We propose an image-based emotion classification approach for audio conversation: the spectrogram of an audio signal, plotted as an image, is used as the input to a Convolutional Neural Network model, which learns the patterns needed for classification. The proposed approach achieves an accuracy of around 86% on the test dataset, a considerable improvement over state-of-the-art models.
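The core idea of the abstract, converting an audio signal into a spectrogram image that a CNN can consume, can be sketched as follows. This is an illustrative pipeline only: the paper does not specify its STFT parameters or image size, so `n_fft=512` and the 64x64 output are assumptions.

```python
# Sketch of the spectrogram-as-image input representation described in
# the abstract (illustrative; the authors' exact parameters are unknown).
import numpy as np
from scipy import signal


def audio_to_spectrogram_image(waveform, sample_rate, n_fft=512, size=(64, 64)):
    """Convert a 1-D audio waveform into a fixed-size log-spectrogram
    array normalised to [0, 1], usable as a grayscale CNN input."""
    freqs, times, Sxx = signal.spectrogram(
        waveform, fs=sample_rate, nperseg=n_fft, noverlap=n_fft // 2)
    log_spec = 10 * np.log10(Sxx + 1e-10)  # power -> dB scale
    # Min-max normalise so the array behaves like an image in [0, 1].
    img = (log_spec - log_spec.min()) / (np.ptp(log_spec) + 1e-10)
    # Crude fixed-size resampling by index selection; a real pipeline
    # would use proper image resizing (e.g. bilinear interpolation).
    rows = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
    return img[np.ix_(rows, cols)]


# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)
image = audio_to_spectrogram_image(wave, sr)
print(image.shape)  # (64, 64)
```

A batch of such arrays would then be stacked and fed to a standard 2-D convolutional classifier; the CNN architecture itself is not detailed in the abstract.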