Spatiotemporal Features Learning from Song for Emotions Recognition with Time Distributed CNN

Andry Chowanda
Published in: 2021 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI), 2021-10-28
DOI: 10.1109/iccsai53272.2021.9609722 (https://doi.org/10.1109/iccsai53272.2021.9609722)

Abstract

Building a system that can interact naturally with humans has been one of the ultimate goals of computer science research. Such a system should be able to interpret both the verbal and the non-verbal meaning of the messages conveyed by its interlocutors. A song can also be a vehicle for conveying a message to listeners, and capturing the emotions in a song automatically can give a system a kind of digital feeling while it listens to the song. Emotions can be captured and processed automatically through several modalities via sensors. Deep learning has become the gold standard of learning architectures in many fields, and emotion recognition models can be trained well with some deep learning architectures. Convolutional Neural Networks (CNNs) are well known for training models with multi-dimensional input features; however, they are limited when dealing with features that carry temporal information. This research applies Time Distributed layers to a CNN architecture to learn spatio-temporal features from songs (audio signals). Eight architectures are proposed to explore the potential of learning spatio-temporal features from songs with a CNN architecture. The best model presented in this paper achieved a training accuracy of 99.95%, a testing accuracy of 93.41%, a training loss of 1.84, and a testing loss of 2.03.
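The abstract does not spell out the eight architectures, so the following is only a minimal pure-Python sketch of the general idea behind a time-distributed convolution: the audio signal is cut into consecutive segments, and the same convolutional block (same weights) is applied to each segment independently, which yields one feature per time step. All names (`split_into_segments`, `time_distributed`, `cnn_block`), the toy signal, and the fixed kernel are assumptions made for illustration; they are not taken from the paper, and a real model would learn its kernels and typically operate on spectrogram-like inputs.

```python
from typing import Callable, List

Segment = List[float]


def conv1d_valid(segment: Segment, kernel: List[float]) -> List[float]:
    """1D cross-correlation without padding ("valid" mode)."""
    k = len(kernel)
    return [
        sum(segment[i + j] * kernel[j] for j in range(k))
        for i in range(len(segment) - k + 1)
    ]


def relu(values: List[float]) -> List[float]:
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def split_into_segments(signal: List[float], segment_len: int) -> List[Segment]:
    """Cut the signal into non-overlapping segments; a trailing remainder is dropped."""
    count = len(signal) // segment_len
    return [signal[i * segment_len:(i + 1) * segment_len] for i in range(count)]


def time_distributed(layer: Callable[[Segment], float],
                     segments: List[Segment]) -> List[float]:
    """Apply one layer, with shared weights, to every time step independently."""
    return [layer(segment) for segment in segments]


# Hypothetical toy input: 12 samples, i.e. three time steps of 4 samples each.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0, 1.0, 1.0, 1.0, 1.0]
# Illustrative fixed kernel (assumption); a trained CNN would learn this.
kernel = [1.0, -1.0]


def cnn_block(segment: Segment) -> float:
    """Convolution -> ReLU -> global max pooling, giving one feature per segment."""
    return max(relu(conv1d_valid(segment, kernel)))


segments = split_into_segments(signal, segment_len=4)
features = time_distributed(cnn_block, segments)
print(features)  # [1.0, 2.0, 0.0]
```

The resulting list is ordered in time, so a later stage (for example a recurrent layer or a pooling step over time) can use it to model how the signal evolves across segments; which such stage the paper uses, if any, is not stated in the abstract.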