{"title":"基于ResNet50的基于歌曲-语音特征的深度神经网络视觉情感识别","authors":"Souha Ayadi, Z. Lachiri","doi":"10.1109/IC_ASET53395.2022.9765898","DOIUrl":null,"url":null,"abstract":"Visual emotion recognition is a very large field. It plays a very important role in different domains such as security, robotics, and medical tasks. The visual tasks could be either image or video. Unlike the image processing, the difficulty of video processing is always a challenge due to changes in information over time variation. Significant performance improvements when applying deep learning algorithms to video processing. This paper presents a deep neural network based on ResNet50 model. The latter is conducted on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) due to the variance of the nature of the data exists which is speech and song. The choice of ResNet model is based on the ability of facing different problems such as of vanishing gradients, the performing stability offered by this model, the ability of CNN for feature extraction which is considered to be the base architecture for ResNet, and the ability of improving the accuracy results and minimizing the loss. The achieved results are 57.73% for song and 55.52% for speech. Results shows that the Resnet50 model is suitable for both speech and song while maintaining performance stability.","PeriodicalId":6874,"journal":{"name":"2022 5th International Conference on Advanced Systems and Emergent Technologies (IC_ASET)","volume":"3 1","pages":"363-368"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Deep Neural Network for visual Emotion Recognition based on ResNet50 using Song-Speech characteristics\",\"authors\":\"Souha Ayadi, Z. 
Lachiri\",\"doi\":\"10.1109/IC_ASET53395.2022.9765898\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual emotion recognition is a very large field. It plays a very important role in different domains such as security, robotics, and medical tasks. The visual tasks could be either image or video. Unlike the image processing, the difficulty of video processing is always a challenge due to changes in information over time variation. Significant performance improvements when applying deep learning algorithms to video processing. This paper presents a deep neural network based on ResNet50 model. The latter is conducted on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) due to the variance of the nature of the data exists which is speech and song. The choice of ResNet model is based on the ability of facing different problems such as of vanishing gradients, the performing stability offered by this model, the ability of CNN for feature extraction which is considered to be the base architecture for ResNet, and the ability of improving the accuracy results and minimizing the loss. The achieved results are 57.73% for song and 55.52% for speech. 
Results shows that the Resnet50 model is suitable for both speech and song while maintaining performance stability.\",\"PeriodicalId\":6874,\"journal\":{\"name\":\"2022 5th International Conference on Advanced Systems and Emergent Technologies (IC_ASET)\",\"volume\":\"3 1\",\"pages\":\"363-368\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 5th International Conference on Advanced Systems and Emergent Technologies (IC_ASET)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IC_ASET53395.2022.9765898\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 5th International Conference on Advanced Systems and Emergent Technologies (IC_ASET)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IC_ASET53395.2022.9765898","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deep Neural Network for visual Emotion Recognition based on ResNet50 using Song-Speech characteristics
Visual emotion recognition is a broad field that plays an important role in domains such as security, robotics, and medicine. The visual input may be either images or video; unlike image processing, video processing remains challenging because the information changes over time. Applying deep learning algorithms to video processing yields significant performance improvements. This paper presents a deep neural network based on the ResNet50 model, evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), chosen because it contains two kinds of data: speech and song. ResNet50 was selected for its ability to address problems such as vanishing gradients, the stable performance it offers, the feature-extraction power of the CNN architecture on which ResNet is built, and its capacity to improve accuracy and minimize loss. The achieved accuracies are 57.73% for song and 55.52% for speech. The results show that the ResNet50 model is suitable for both speech and song while maintaining performance stability.
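The residual (skip) connection that the abstract credits with easing vanishing gradients can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the actual ResNet50 uses convolutional bottleneck blocks with batch normalization, whereas this block is a fully connected stand-in.

```python
# Toy sketch of a ResNet-style residual block: y = ReLU(F(x) + x).
# Illustrative assumption only; ResNet50 itself uses convolutional
# bottleneck blocks with batch normalization, not dense layers.
import numpy as np

def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Apply two weight layers with ReLU, then add the input back (skip path)."""
    h = np.maximum(0.0, x @ w1)          # first transform + ReLU
    return np.maximum(0.0, h @ w2 + x)   # second transform, skip add, final ReLU

# With zero weights the block reduces to ReLU(x): the identity skip path
# alone carries the signal, so a deep stack of such blocks cannot
# "erase" the input, which is what mitigates vanishing gradients.
x = np.array([[1.0, -2.0, 3.0]])
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))  # [[1. 0. 3.]]
```

Because the gradient of the skip path is the identity, error signals always have a direct route back through each block, regardless of how small the gradients through the weight layers become.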