Title: Deep Neural Network for Visual Emotion Recognition Based on ResNet50 Using Song-Speech Characteristics
Authors: Souha Ayadi, Z. Lachiri
Venue: 2022 5th International Conference on Advanced Systems and Emergent Technologies (IC_ASET), pp. 363-368
Publication date: 2022-03-22
DOI: 10.1109/IC_ASET53395.2022.9765898 (https://doi.org/10.1109/IC_ASET53395.2022.9765898)
Citations: 3
Abstract
Visual emotion recognition is a broad field that plays an important role in domains such as security, robotics, and medical applications. Visual tasks may involve either images or video. Unlike image processing, video processing remains challenging because the information changes over time. Applying deep learning algorithms to video processing yields significant performance improvements. This paper presents a deep neural network based on the ResNet50 model, trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), chosen because it contains two distinct kinds of data: speech and song. ResNet was selected for its resistance to vanishing gradients, the stable performance it offers, the feature-extraction capability of the CNN architecture on which ResNet is built, and its ability to improve accuracy while minimizing loss. The achieved accuracies are 57.73% for song and 55.52% for speech. The results show that the ResNet50 model is suitable for both speech and song while maintaining stable performance.
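The abstract's rationale for choosing ResNet50 is its resistance to vanishing gradients. The mechanism behind that property is the residual block with an identity shortcut: the input is added back to the transformed branch, so gradients can propagate through the addition unchanged. The following is a minimal illustrative sketch of that idea in NumPy, not the authors' implementation; the layer sizes and random weights are assumptions for demonstration only.

```python
import numpy as np

# Illustrative residual block: output = ReLU(F(x) + x).
# The "+ x" identity shortcut is what lets gradients bypass the
# transformed branch F, mitigating vanishing gradients in deep
# stacks like ResNet50. Dimensions and weights here are arbitrary.

rng = np.random.default_rng(0)

def relu(z):
    # Elementwise rectified linear unit
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # Transformed branch F(x): two linear maps with a ReLU between
    f = relu(x @ W1) @ W2
    # Identity shortcut: add the input back before the final ReLU
    return relu(f + x)

d = 16                                   # hypothetical feature width
x = rng.standard_normal(d)               # a single feature vector
W1 = rng.standard_normal((d, d)) * 0.1   # small random weights
W2 = rng.standard_normal((d, d)) * 0.1

y = residual_block(x, W1, W2)
print(y.shape)  # (16,) - same shape as the input, as the shortcut requires
```

Note that the shortcut forces the block's input and output to share a shape; in the full ResNet50, 1x1 convolutions on the shortcut path handle the cases where the branch changes the channel count.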