{"title":"基于原始音频的音乐情感识别的多视图神经网络","authors":"Na He, Sam Ferguson","doi":"10.1109/ISM.2020.00037","DOIUrl":null,"url":null,"abstract":"In Music Emotion Recognition (MER) research, most existing research uses human engineered audio features as learning model inputs, which require domain knowledge and much effort for feature extraction. We propose a novel end-to-end deep learning approach to address music emotion recognition as a regression problem, using the raw audio signal as input. We adopt multi-view convolutional neural networks as feature extractors to learn feature representations automatically. Then the extracted feature vectors are merged and fed into two layers of Bidirectional Long Short-Term Memory to capture temporal context sufficiently. In this way, our model is capable of recognizing dynamic music emotion without requiring too much workload on domain knowledge learning and audio feature processing. Combined with data augmentation strategies, the experimental results show that our model outperforms the state-of-the-art baseline with a significant margin in terms of R2 score (approximately 16%) on the Emotion in Music Database.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Multi-view Neural Networks for Raw Audio-based Music Emotion Recognition\",\"authors\":\"Na He, Sam Ferguson\",\"doi\":\"10.1109/ISM.2020.00037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In Music Emotion Recognition (MER) research, most existing research uses human engineered audio features as learning model inputs, which require domain knowledge and much effort for feature extraction. We propose a novel end-to-end deep learning approach to address music emotion recognition as a regression problem, using the raw audio signal as input. We adopt multi-view convolutional neural networks as feature extractors to learn feature representations automatically. Then the extracted feature vectors are merged and fed into two layers of Bidirectional Long Short-Term Memory to capture temporal context sufficiently. In this way, our model is capable of recognizing dynamic music emotion without requiring too much workload on domain knowledge learning and audio feature processing. 
Combined with data augmentation strategies, the experimental results show that our model outperforms the state-of-the-art baseline with a significant margin in terms of R2 score (approximately 16%) on the Emotion in Music Database.\",\"PeriodicalId\":120972,\"journal\":{\"name\":\"2020 IEEE International Symposium on Multimedia (ISM)\",\"volume\":\"59 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE International Symposium on Multimedia (ISM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISM.2020.00037\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on Multimedia (ISM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISM.2020.00037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In Music Emotion Recognition (MER) research, most existing work uses human-engineered audio features as learning-model inputs, which require domain knowledge and considerable feature-extraction effort. We propose a novel end-to-end deep learning approach that treats music emotion recognition as a regression problem, using the raw audio signal as input. We adopt multi-view convolutional neural networks as feature extractors to learn feature representations automatically. The extracted feature vectors are then merged and fed into two layers of Bidirectional Long Short-Term Memory (BiLSTM) to capture sufficient temporal context. In this way, our model can recognize dynamic music emotion without heavy investment in domain knowledge or hand-crafted audio feature processing. Combined with data augmentation strategies, the experimental results show that our model outperforms the state-of-the-art baseline by a significant margin in terms of R2 score (approximately 16%) on the Emotion in Music Database.
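To make the described pipeline concrete, below is a minimal PyTorch sketch of such an architecture: several parallel CNN "views" over the raw waveform, merged and passed to a two-layer BiLSTM regressor. All layer widths, kernel sizes, the number of views, and the per-time-step valence/arousal output are illustrative assumptions, not values taken from the paper.

```python
# Sketch of a multi-view CNN + 2-layer BiLSTM regressor over raw audio.
# Hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn

class MultiViewCNNBiLSTM(nn.Module):
    def __init__(self, n_views=3, view_channels=32, hidden=64):
        super().__init__()
        # Each "view" is a 1-D CNN with a different kernel size, so the
        # views see the waveform at different temporal resolutions.
        kernel_sizes = [64, 128, 256][:n_views]  # assumed values
        self.views = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(1, view_channels, k, stride=k // 4, padding=k // 2),
                nn.BatchNorm1d(view_channels),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(60),  # align all views to one sequence length
            )
            for k in kernel_sizes
        )
        # Two layers of Bidirectional LSTM over the merged feature sequence.
        self.bilstm = nn.LSTM(
            input_size=n_views * view_channels,
            hidden_size=hidden,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # Regression head: assumed per-time-step valence/arousal pair.
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, wave):                  # wave: (batch, samples)
        x = wave.unsqueeze(1)                 # -> (batch, 1, samples)
        feats = torch.cat([v(x) for v in self.views], dim=1)
        feats = feats.transpose(1, 2)         # -> (batch, time, channels)
        out, _ = self.bilstm(feats)
        return self.head(out)                 # -> (batch, time, 2)

model = MultiViewCNNBiLSTM()
preds = model(torch.randn(4, 22050))          # 4 one-second clips at 22.05 kHz
print(preds.shape)                            # torch.Size([4, 60, 2])
```

Because the model emits a prediction per time step rather than one label per clip, it fits the dynamic-emotion regression setting the abstract describes; training would minimize a regression loss (e.g., MSE) against continuous annotations such as those in the Emotion in Music Database.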