Study on CNN in the recognition of emotion in audio and images

2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS) Pub Date : 2016-06-26 DOI:10.1109/ICIS.2016.7550778

Bin Zhang, Changqin Quan, F. Ren


Citations: 48

Abstract

This paper compares the performance of a Convolutional Neural Network (CNN) on image recognition and on emotion recognition in speech. Feature extraction and selection is an important and frequently discussed issue in pattern recognition. Moreover, two-dimensional signals such as images and voice are hard to model well with traditional models such as SVM. CNNs are particularly good at characterizing two-dimensional signals, and they extract features adaptively, which removes the dependence on human subjectivity or experience. A CNN mimics the local filtering performed by cells in the visual cortex to exploit local correlations in the natural dimensions of the signal. In this work, CNN is applied to both image recognition and emotion recognition in speech, with SVM used as the baseline for comparing recognition performance. For image recognition, several SVM kernel functions were tried, and the best accuracy was 94.17%. The CNN reached 95.5% (7291 images for training and 2007 for testing) and took less time. For emotion recognition in speech, the CNN reached 97.6% accuracy, compared with 55.5% for the baseline model (4000 utterances for training, 1500 for validation, and 500 for testing). The experimental results show that CNN extracts features effectively and models two-dimensional signals well.
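The local-filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' network: it is a single hand-written 3x3 filter slid over a toy 4x4 array, assuming no padding and a stride of 1. The function name `conv2d_valid`, the input `toy_image`, and the weights in `edge_kernel` are all hypothetical and chosen only for illustration; a trained CNN would learn its filter weights from data.

```python
def conv2d_valid(signal, kernel):
    """Slide `kernel` over `signal` (both lists of rows) with no padding.

    Each output value depends only on the patch under the kernel, which is
    the local-filtering behaviour the abstract attributes to CNNs.
    Like most CNN libraries, this computes cross-correlation (no kernel flip).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(signal) - kh + 1
    out_w = len(signal[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Weighted sum over the kh x kw patch whose top-left corner is (i, j).
            for di in range(kh):
                for dj in range(kw):
                    acc += signal[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical toy input: a vertical edge between columns 1 and 2.
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical hand-picked filter that responds to left-to-right increases.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(toy_image, edge_kernel)
print(feature_map)  # [[3, 3], [3, 3]]
```

Every position in the 2x2 feature map responds because each 3x3 patch of the toy input overlaps the edge. The same filter applied to a constant region would give 0, since its weights sum to zero.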

