Study on CNN in the recognition of emotion in audio and images

Bin Zhang, Changqin Quan, F. Ren
Published in: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), 2016-06-26
DOI: https://doi.org/10.1109/ICIS.2016.7550778
Citations: 48

Abstract

This paper compares the performance of a Convolutional Neural Network (CNN) on two tasks: image recognition and emotion recognition in speech. Feature extraction and selection are important and frequently discussed issues in pattern recognition. Moreover, two-dimensional signals such as images and voice are hard to model well with traditional models like the SVM. CNNs are well suited to characterizing two-dimensional signals, and they extract features adaptively, which removes the dependence on human subjectivity or experience. A CNN mimics the local filtering performed by cells in the visual cortex to exploit local correlation in the natural dimensional space of the signal. In this work, CNN is applied to both tasks, with SVM as the baseline for comparing recognition performance. For image recognition, several SVM kernel functions were tested and the best accuracy was 94.17%, whereas the CNN reached 95.5% (7291 images for training and 2007 for testing) in less time. For emotion recognition in speech, the CNN reached 97.6% accuracy versus 55.5% for the baseline model (4000 utterances for training, 1500 for validation, 500 for testing). The experimental results show that CNN extracts features effectively and models two-dimensional signals well.
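The local filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' network: the function names `conv2d_valid` and `relu`, the toy 4x4 image, and the hand-picked edge kernel are all assumptions made for illustration. In a real CNN the kernel weights would be learned from data rather than chosen by hand.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` with stride 1 and no padding.

    Like most CNN libraries, this computes cross-correlation
    (the kernel is not flipped). Each output value depends only on
    a small local patch of the input, which is the "local filtering"
    property.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    if kh > ih or kw > iw:
        raise ValueError("kernel must not be larger than the image")
    out: Matrix = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map: Matrix) -> Matrix:
    """Element-wise rectifier: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical toy input: a 4x4 "image" with a vertical edge
# between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]

# Hand-picked horizontal difference filter (an assumption for this sketch;
# a trained CNN would learn such weights).
kernel = [[-1.0, 1.0]]

feature_map = relu(conv2d_valid(image, kernel))
for row in feature_map:
    print(row)
```

The feature map responds only at the column where the intensity changes, which shows how a small shared filter picks out local structure in a two-dimensional signal regardless of where it occurs.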