Speech Emotion Recognition on Indonesian YouTube Web Series Using Deep Learning Approach

2020 Fifth International Conference on Informatics and Computing (ICIC); Pub Date: 2020-11-03; DOI: 10.1109/ICIC50835.2020.9288650

H. N. Zahra, Muhammad Okky Ibrohim, Junaedi Fahmi, Rike Adelia, Fandy Akhmad Nur Febryanto, Oskar Riandi

Citations: 1

Abstract

These days, human-computer interaction is developing at an alarmingly fast rate. To keep up with this development, one of the many capabilities that must advance is a machine's ability to recognize human emotions from speech, known as Speech Emotion Recognition (SER). Various SER studies have been carried out using different data sources, such as TV shows, movies, and actor voice recordings. While the results may be satisfactory, collecting data from TV shows and actor recordings can be difficult and costly. YouTube, on the other hand, is an open and free platform for data gathering, and retrieving data from it takes little effort. Despite that, almost no SER studies have tried this method of data collection. This paper presents SER for the Indonesian language, using a dataset of Indonesian YouTube web series with four emotion labels. First, several experiments were carried out to determine which deep learning approach, trained with which combination of features, would yield the most favorable result. This initial stage showed that a Convolutional Neural Network (CNN) using a feature combination of MFCC, Contrast, and Tonnetz performs better than the other deep learning approaches we tried. After parameter tuning, the CNN with the combination of MFCC, Contrast, and Tonnetz achieves an F1-score of 62.30%.
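For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score over four emotion classes can be computed. The abstract does not name the four labels or state which averaging scheme was used, so the label names below are hypothetical placeholders and the macro (unweighted) average is an assumption; the toy predictions are illustrative only and are not data from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical label set: the abstract only says there are 4 emotion labels.
LABELS = ["angry", "happy", "sad", "neutral"]

# Toy ground truth and predictions, purely illustrative.
y_true = ["angry", "happy", "sad", "neutral", "happy", "sad"]
y_pred = ["angry", "happy", "happy", "neutral", "sad", "sad"]

score = macro_f1(y_true, y_pred, LABELS)
print(f"macro F1 = {score:.4f}")
```

Macro averaging weights every emotion class equally regardless of how many utterances it has; a weighted or micro average would give different numbers on an imbalanced dataset, so the 62.30% above should be read with the paper's own definition in mind.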
