Implementation of Mel-Frequency Cepstral Coefficient as Feature Extraction using K-Nearest Neighbor for Emotion Detection Based on Voice Intonation

Revanto Alif Nawasta, Nurheri Cahyana, H. Heriyanto
{"title":"Implementation of Mel-Frequency Cepstral Coefficient as Feature Extraction using K-Nearest Neighbor for Emotion Detection Based on Voice Intonation","authors":"Revanto Alif Nawasta, Nurheri Cahyana, H. Heriyanto","doi":"10.31315/telematika.v20i1.9518","DOIUrl":null,"url":null,"abstract":"Purpose: To determine emotions based on voice intonation by implementing MFCC as a feature extraction method and KNN as an emotion detection method.Design/methodology/approach: In this study, the data used was downloaded from several video podcasts on YouTube. Some of the methods used in this study are pitch shifting for data augmentation, MFCC for feature extraction on audio data, basic statistics for taking the mean, median, min, max, standard deviation for each coefficient, Min max scaler for the normalization process and KNN for the method classification.Findings/result: Because testing is carried out separately for each gender, there are two classification models. In the male model, the highest accuracy was obtained at 88.8% and is included in the good fit model. In the female model, the highest accuracy was obtained at 92.5%, but the model was unable to correctly classify emotions in the new data. This condition is called overfitting. After testing, the cause of this condition was because the pitch shifting augmentation process of one tone in women was unable to solve the problem of the training data size being too small and not containing enough data samples to accurately represent all possible input data values.Originality/value/state of the art: The research data used in this study has never been used in previous studies because the research data is obtained by downloading from Youtube and then processed until the data is ready to be used for research.","PeriodicalId":31716,"journal":{"name":"Telematika","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Telematika","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31315/telematika.v20i1.9518","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Purpose: To detect emotions from voice intonation by implementing MFCC as the feature extraction method and KNN as the emotion classification method.

Design/methodology/approach: The data used in this study were downloaded from several video podcasts on YouTube. The methods applied include pitch shifting for data augmentation, MFCC for feature extraction from the audio data, basic statistics (mean, median, minimum, maximum, and standard deviation) computed for each coefficient, min-max scaling for normalization, and KNN for classification (minimal sketches of the augmentation step and the classification pipeline appear after the abstract).

Findings/result: Because testing was carried out separately for each gender, there are two classification models. The male model reached a highest accuracy of 88.8% and is a good fit. The female model reached a highest accuracy of 92.5%, but it was unable to classify emotions correctly on new data; this condition is called overfitting. Testing traced the cause to the augmentation step: pitch shifting by one tone could not compensate for a female training set that was too small and did not contain enough samples to represent all possible input values.

Originality/value/state of the art: The research data have never been used in previous studies, because they were obtained by downloading from YouTube and then processed by the authors until ready for use in research.
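The abstract describes augmenting the data by pitch shifting of one tone but gives no implementation details. Below is a minimal sketch using librosa, reading "one tone" as a whole tone (two semitones); the library choice, sampling rate, and file names are assumptions, not details taken from the paper.

```python
# Minimal sketch of the pitch-shift augmentation described in the abstract,
# assuming WAV clips on disk. Reading "one tone" as two semitones (n_steps=2)
# is an interpretation; librosa and all file names are illustrative.
import librosa
import soundfile as sf

def pitch_shift_clip(path, n_steps=2, target_sr=22050):
    """Load a clip and return a copy shifted by `n_steps` semitones."""
    y, sr = librosa.load(path, sr=target_sr)
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    return y_shifted, sr

# Hypothetical usage: write the augmented copy next to the original clip.
y_aug, sr = pitch_shift_clip("clip_happy_01.wav", n_steps=2)
sf.write("clip_happy_01_shifted.wav", y_aug, sr)
```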
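The classification pipeline itself could look like the sketch below, assuming librosa and scikit-learn; the number of coefficients (n_mfcc=13), k=3, the train/test split, and the file list are my assumptions rather than settings reported in the abstract.

```python
# Hedged sketch of the described pipeline: MFCC extraction, per-coefficient
# statistics, min-max normalization, and KNN. n_mfcc=13, k=3, the file list,
# and the labels are illustrative assumptions, not the authors' settings.
import numpy as np
import librosa
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def mfcc_stats(path, n_mfcc=13):
    """Summarize each MFCC coefficient with mean, median, min, max, std."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return np.concatenate([
        mfcc.mean(axis=1), np.median(mfcc, axis=1),
        mfcc.min(axis=1), mfcc.max(axis=1), mfcc.std(axis=1),
    ])  # 5 * n_mfcc features per clip

# Hypothetical clips and emotion labels; the paper builds these from YouTube podcasts.
files = ["angry_01.wav", "happy_01.wav", "sad_01.wav", "angry_02.wav"]
labels = ["angry", "happy", "sad", "angry"]

X = np.array([mfcc_stats(f) for f in files])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)

scaler = MinMaxScaler().fit(X_train)        # fit normalization on training data only
knn = KNeighborsClassifier(n_neighbors=3)   # k=3 is an illustrative choice
knn.fit(scaler.transform(X_train), y_train)
print("accuracy:", knn.score(scaler.transform(X_test), y_test))
```

Summarizing each coefficient with five statistics collapses the variable-length MFCC matrix of every clip into a fixed-length vector, which KNN and min-max scaling require; fitting the scaler on the training split only avoids leaking test statistics into the model.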