Title: Study on CNN in the recognition of emotion in audio and images
Authors: Bin Zhang, Changqin Quan, F. Ren
Venue: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS)
Publication date: 2016-06-26
DOI: 10.1109/ICIS.2016.7550778
URL: https://doi.org/10.1109/ICIS.2016.7550778
Citation count: 48
Abstract
This paper compares the performance of a convolutional neural network (CNN) on image recognition and on emotion recognition in speech. Feature extraction and selection is an important and frequently discussed issue in pattern recognition. Moreover, two-dimensional signals such as images and voice are hard to model well with traditional models such as the support vector machine (SVM). CNNs are notably good at characterizing two-dimensional signals, and they can extract features adaptively, which removes the dependence on human subjectivity or experience. A CNN mimics the local filtering performed by cells in the visual cortex in order to exploit local correlations in the natural dimensional space of the signal. In this work, a CNN is applied to both image recognition and speech emotion recognition, with an SVM as the baseline for comparing recognition performance. For image recognition, several SVM kernel functions were tried, and the best accuracy was 94.17%. The CNN reached 95.5% (7291 images for training and 2007 for testing) and took less time. For speech emotion recognition, the CNN reached an accuracy of 97.6%, compared with 55.5% for the baseline model (4000 utterances for training, 1500 for validation, and 500 for testing). The experimental results show that a CNN can extract features effectively and that its modeling capability for two-dimensional signals is strong.
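The local filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' network: the function name `local_filter`, the toy image, and the hand-picked kernel are all assumptions made for illustration, and in a real CNN the kernel weights would be learned from data rather than fixed by hand. The sketch only shows that each output value depends on a small neighbourhood of the input, which is how a convolutional layer picks up local correlation in a two-dimensional signal.

```python
def local_filter(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and return the
    map of local weighted sums as a list of rows.

    Illustrative helper, not taken from the paper. No kernel flipping is
    applied, so strictly speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output value depends only on a kh x kw neighbourhood.
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 "image" with a vertical edge between columns 1 and 2 (assumed data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked 2x2 kernel that responds to a left-to-right increase.
kernel = [
    [-1, 1],
    [-1, 1],
]

response = local_filter(image, kernel)
for row in response:
    print(row)
```

The response is nonzero only in the column of windows that straddle the edge, so the filter localizes the structure it is tuned to. A convolutional layer applies many such filters in parallel and adjusts their weights during training.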