Gender and region detection from human voice using the three-layer feature extraction method with 1D CNN

IF 1.7 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Information and Telecommunication, Pub Date: 2021-10-10, DOI: 10.1080/24751839.2021.1983318

Mohammad Amaz Uddin, Refat Khan Pathan, Md Sayem Hossain, Munmun Biswas

Volume 6, Issue 1, Pages 27-42

Citations: 6

Abstract


Gender and region detection from human voice using the three-layer feature extraction method with 1D CNN

Analysing the human voice has long been a challenge for the engineering community, with applications such as product review, emotional state detection, and AI development. Two basic tasks in voice or speech analysis are detecting a speaker's gender and, based on accent, their geographical region. This study presents a three-layer feature extraction method that works on the raw human voice to classify gender as male or female and to identify the region the voice comes from. The first layer computes fundamental frequency, spectral entropy, spectral flatness, and mode frequency. The second layer extracts Mel Frequency Cepstral Coefficients (MFCC), and the third uses linear predictive coding (LPC). Ordinary voice recordings contain noise, which is removed through multiple audio filtering steps to obtain smooth, noise-free data. A multi-output 1D Convolutional Neural Network recognizes gender and region from a combined dataset consisting of the TIMIT, RAVDESS, and BGC datasets. The model predicts gender with 93.01% accuracy and region with 97.07% accuracy. The authors report that the method outperforms the usual state-of-the-art methods on the separate datasets as well as on the combined dataset, for both gender and region classification.
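
To illustrate two of the first-layer features named in the abstract, spectral flatness and spectral entropy, here is a minimal Python sketch. It uses common textbook definitions (geometric over arithmetic mean of the power spectrum, and normalised Shannon entropy of the power spectrum). The abstract does not give the paper's exact formulas, frame length, or preprocessing, so the function names, the naive DFT, and the `eps` guard are all assumptions made for illustration only.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum of a real-valued frame (naive O(n^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(power, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum.

    Close to 1 for a flat (noise-like) spectrum, close to 0 for a pure tone.
    eps is an assumed guard against log(0) and division by zero.
    """
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    arith_mean = sum(power) / len(power) + eps
    return math.exp(log_mean) / arith_mean


def spectral_entropy(power):
    """Shannon entropy of the normalised power spectrum, divided by log(bins)."""
    total = sum(power)
    if total == 0:
        return 0.0  # silent frame: treat as zero entropy (assumption)
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(power))


if __name__ == "__main__":
    # Hypothetical 64-sample frames: a pure tone and a single impulse.
    tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
    impulse = [1.0] + [0.0] * 63
    for name, frame in (("tone", tone), ("impulse", impulse)):
        power = power_spectrum(frame)
        print(name, spectral_flatness(power), spectral_entropy(power))
```

A pure tone concentrates energy in one frequency bin, so both values come out near 0, while an impulse has a flat spectrum and both values come out near 1. In practice an FFT routine would replace the naive DFT, and the features would be computed per frame and then aggregated.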


Source journal

Journal of Information and Telecommunication Multiple-

CiteScore

7.50

Self-citation rate

0.00%

Number of articles published

Review time

27 weeks