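Text-independent speaker identification using modified SincNet with robust features from suitable acoustic region and appropriate optimizer for raw audio analysis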

IF 4 | Zone 3, Computer Science | JCR Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Nirupam Shome, Richik Kashyap, Rabul Hussain Laskar
Computers & Electrical Engineering, vol. 121, Article 109915. Published 2024-11-26. Journal Article.
DOI: 10.1016/j.compeleceng.2024.109915
https://www.sciencedirect.com/science/article/pii/S0045790624008413
Citations: 0

Abstract

Speaker identification is a method of identifying an individual from a set of speakers, and text-independent speaker identification systems allow speakers to utter any phrase without constraints. This study focuses on raw audio analysis because phase, fine-grained frequency patterns, timing cues, and other minute characteristics are preserved when raw waveforms are processed, in contrast to handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCC) and visual representations of audio such as spectrograms. This depth of information, which includes variations in speech rhythm, pitch, and vocal tract shape, is beneficial for identifying speakers. The deep learning architecture known as SincNet has gained popularity in speaker identification because its parametric sinc functions allow it to operate directly on the raw audio input. In this paper, we take SincNet as the baseline model for speaker identification. The effects of proper speech boundary detection, the inclusion of high-level features, and effective optimizer selection are analysed. Precise identification of the signal start and end points is important for eliminating redundant non-speech regions, so we include an endpoint detection module as a pre-processing step in the system. Proper feature extraction and selection are crucial to the model's success. To extract more abstract features from the data, we add more convolution layers to the original SincNet model. Further, we investigate the sensitivity of the hyperparameter tuning protocol to the optimizer and select a suitable optimizer for raw audio analysis. With all the modifications to the system architecture, we achieve improvements of 12.76%, 13.33%, and 13.39% for training, validation, and testing, respectively, over the original SincNet model. In terms of validation loss, our proposed approach attains 0.35, compared with 1.02 for the original SincNet. This improvement comes with a marginal increase of 20 minutes in total training time for the proposed model. We performed our investigation on the LibriSpeech dataset to check the effectiveness of the proposed system in comparison with the other model.
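To make the two front-end ideas in the abstract concrete, the following is a minimal sketch in plain Python and not the authors' implementation. `sinc_bandpass` builds a SincNet-style band-pass kernel as the difference of two Hamming-windowed low-pass sinc filters, parameterized only by a low and a high cutoff. `find_endpoints` is a hypothetical stand-in for the endpoint detection module: the abstract does not say which method is used, so a simple short-time energy threshold is assumed here. The function names, frame length, threshold, and kernel size are all illustrative assumptions.

```python
import math


def sinc_bandpass(f1_hz, f2_hz, sample_rate, kernel_size=101):
    """Band-pass FIR kernel defined only by its low (f1) and high (f2) cutoffs.

    The kernel is the difference of two ideal low-pass sinc filters,
    smoothed with a Hamming window. kernel_size is an illustrative default.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel is symmetric")
    f1 = f1_hz / sample_rate  # normalized cutoffs in cycles per sample
    f2 = f2_hz / sample_rate
    half = kernel_size // 2

    def sinc(x):
        return 1.0 if x == 0 else math.sin(x) / x

    kernel = []
    for i in range(kernel_size):
        n = i - half  # centre the kernel on n = 0
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * hamming)
    return kernel


def find_endpoints(samples, frame_len=400, rel_threshold=0.05):
    """Assumed endpoint detector: keep the span of frames whose short-time
    energy exceeds rel_threshold times the peak frame energy.

    Returns (start, end) sample indices; end is exclusive.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    if not energies or max(energies) == 0:
        return 0, len(samples)  # nothing to trim
    limit = rel_threshold * max(energies)
    active = [i for i, e in enumerate(energies) if e > limit]
    return active[0] * frame_len, (active[-1] + 1) * frame_len


if __name__ == "__main__":
    rate = 16000
    # Silence, then a 200 Hz tone, then silence again.
    tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(1600)]
    signal = [0.0] * 800 + tone + [0.0] * 800
    start, end = find_endpoints(signal)
    print("speech region:", start, end)
    kernel = sinc_bandpass(1000.0, 3000.0, rate)
    print("kernel taps:", len(kernel))
```

In a SincNet-style model the two cutoffs of each first-layer filter would be learned during training rather than fixed as they are here, and the trimmed region `signal[start:end]` would be what is passed on to the network.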
Source journal: Computers & Electrical Engineering (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 9.20
Self-citation rate: 7.00%
Articles published: 661
Review time: 47 days
Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.