Title: Spatiotemporal Features Learning from Song for Emotions Recognition with Time Distributed CNN
Author: Andry Chowanda
Venue: 2021 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI)
Publication date: 2021-10-28
DOI: https://doi.org/10.1109/iccsai53272.2021.9609722
Abstract
Building systems that interact naturally with humans has long been one of the ultimate goals of computer science research. Such a system should be able to interpret both the verbal and non-verbal meaning of the messages conveyed by its interlocutors. A song is also a vehicle for conveying a message to listeners, and capturing the emotions in a song automatically can give a system a digital sense of feeling when it listens to the song. Emotions can be captured and processed automatically through several modalities via sensors. Deep learning has become the gold standard among learning approaches in many fields, and emotion recognition models can be trained well with several deep learning architectures. Convolutional Neural Networks (CNNs) are well known for training models with multi-dimensional input features, but they are limited when the features carry temporal information. This research applies Time Distributed layers to the CNN architecture to learn spatiotemporal features from songs (audio signals). Eight architectures are proposed to explore the potential of learning spatiotemporal features from songs with a CNN. The best model presented in this paper achieved 99.95% training accuracy, 93.41% testing accuracy, a training loss of 1.84, and a testing loss of 2.03.
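The abstract does not give the layer configuration or the framework behind the eight architectures, so the following is only a minimal sketch of the general idea of a time-distributed layer: the same small convolutional feature extractor, with the same weights, is applied independently to every time frame of an audio signal. The function names, the toy signal, the kernel values, and the frame length are all hypothetical and chosen purely for illustration; they are not the paper's model.

```python
from typing import Callable, List, Sequence


def conv1d_valid(frame: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid-mode 1-D cross-correlation of one frame with one kernel."""
    k = len(kernel)
    return [
        sum(frame[i + j] * kernel[j] for j in range(k))
        for i in range(len(frame) - k + 1)
    ]


def relu(values: Sequence[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def frame_features(
    frame: Sequence[float], kernels: Sequence[Sequence[float]]
) -> List[float]:
    """Tiny stand-in for a CNN on one frame: conv -> ReLU -> global max pool.

    Returns one feature per kernel.
    """
    return [max(relu(conv1d_valid(frame, kernel))) for kernel in kernels]


def time_distributed(
    fn: Callable[[Sequence[float]], List[float]],
    frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Apply the same function (hence the same weights) to every time step."""
    return [fn(frame) for frame in frames]


def split_into_frames(signal: Sequence[float], frame_len: int) -> List[List[float]]:
    """Cut a signal into non-overlapping frames, dropping an incomplete tail."""
    n_frames = len(signal) // frame_len
    return [list(signal[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


# Hypothetical toy input standing in for an audio signal.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0, 1.0, 1.0, 1.0, 1.0]
# Hypothetical fixed weights; a trained model would learn these.
kernels = [[1.0, -1.0], [0.5, 0.5]]

frames = split_into_frames(signal, frame_len=4)
features = time_distributed(lambda f: frame_features(f, kernels), frames)
print(features)  # one feature vector per time frame
```

The result is a sequence of per-frame feature vectors in time order. In a full model, a sequence like this would typically be passed to a component that combines information across time steps, which is where the temporal part of the spatiotemporal features would come from.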