Gender and region detection from human voice using the three-layer feature extraction method with 1D CNN

Impact factor: 2.7 | JCR quartile: Q2 | Category: Computer Science, Information Systems
Mohammad Amaz Uddin, Refat Khan Pathan, Md Sayem Hossain, Munmun Biswas
Journal of Information and Telecommunication, vol. 6, no. 1, pp. 27-42
Publication date: 2021-10-10
Publication type: Journal Article
DOI: 10.1080/24751839.2021.1983318 (https://doi.org/10.1080/24751839.2021.1983318)
Citations: 6

Abstract

Analysing the human voice has long been a challenge for the engineering community, with applications such as product review, emotional state detection, and AI development. Two basic tasks in voice or speech analysis are detecting the speaker's gender and, from the accent, the speaker's geographical region. This study presents a three-layer feature extraction method that works on the raw human voice to classify gender as male or female and to identify the region to which the voice belongs. The first layer computes fundamental frequency, spectral entropy, spectral flatness, and mode frequency. The second layer extracts Mel-frequency cepstral coefficients (MFCCs), and the third layer uses linear predictive coding (LPC). Because ordinary voice recordings contain noise, the audio is passed through multiple filtering steps to obtain smooth, noise-free data. A multi-output 1D convolutional neural network recognizes gender and region on a combined dataset made up of the TIMIT, RAVDESS, and BGC datasets. The model predicts gender with 93.01% accuracy and region with 97.07% accuracy. The method performs better than the usual state-of-the-art methods on the separate datasets as well as on the combined dataset, for both gender and region classification.
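The first-layer spectral features named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the naive DFT, the `eps` smoothing constant, and the test tone are all assumptions made for illustration, and `peak_frequency` is only a crude stand-in for the paper's mode frequency. Fundamental frequency, MFCC, and LPC extraction are left out.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum of a real-valued frame (naive O(n^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(power, eps=1e-12):
    """Geometric mean divided by arithmetic mean (near 0 = tonal, near 1 = flat)."""
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    arith_mean = sum(power) / len(power)
    return math.exp(log_mean) / (arith_mean + eps)


def spectral_entropy(power, eps=1e-12):
    """Shannon entropy of the normalised power spectrum, scaled by log2(bins)."""
    total = sum(power) + eps
    probs = [p / total for p in power]
    entropy = -sum(p * math.log2(p + eps) for p in probs)
    return entropy / math.log2(len(power))


def peak_frequency(power, sample_rate, frame_length):
    """Frequency in Hz of the strongest bin (assumed proxy for mode frequency)."""
    k = max(range(len(power)), key=lambda i: power[i])
    return k * sample_rate / frame_length


# Hypothetical test signal: a pure 200 Hz tone, 400 samples at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(400)]
power = power_spectrum(frame)
features = {
    "flatness": spectral_flatness(power),
    "entropy": spectral_entropy(power),
    "peak_hz": peak_frequency(power, sample_rate, len(frame)),
}
```

For a pure tone, both flatness and entropy come out close to zero, while a perfectly flat spectrum gives values close to one. In a real pipeline an FFT routine would replace the naive DFT, and the features would be computed per frame after the noise filtering the abstract describes.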
Source journal metrics: CiteScore 7.50; self-citation rate 0.00%; articles published 18; review time 27 weeks