A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

IF 3.7 | Zone 4 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence)
V. Karthikeyan, S. Suja Priyadharsini
DOI: 10.1007/s10044-024-01278-9 (https://doi.org/10.1007/s10044-024-01278-9)
Journal: Pattern Analysis and Applications
Published: 2024-04-27
Citations: 0

Abstract


Short-utterance speaker identification is a difficult area of study in natural language processing (NLP). Most state-of-the-art approaches to speech processing use convolutional neural networks (CNNs) and deep neural networks and analyse data as a unidirectional stream in time. Earlier CNN-based approaches to speaker identification often relied on very dense or very large layers, leading to a large number of parameters and significant computational expense. In this article, we present a novel multi-scale attention-focused 1-dimensional convolutional neural network (MSA-CNN) for speaker recognition that combines L1 and L2 norms. The multi-scale convolutional architecture was developed to extract multi-scale characteristics from raw audio data automatically by employing a variety of filter banks. A novel attention mechanism was built so that the multi-scale system emphasises important speaker characteristics in varying settings. Finally, the two were combined and applied within the proposed multi-layered convolutional neural network framework to identify the speakers' labels. The proposed network model was tested on a number of standard voice databases and a real-time recorded corpus. The experimental findings demonstrate that our methodology outperformed both a baseline CNN scheme (without an attention mechanism) and conventional speaker identification techniques involving feature engineering, achieving an accuracy of 97.94% across numerous databases and distortion conditions.
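
The abstract gives no layer sizes, filter definitions, or attention formula, so the following is only a rough illustration of the general idea it describes: filters of several widths applied to raw 1-D audio, followed by an attention weighting over the resulting scales. Everything in it is an assumption made for illustration, including the function names, the averaging filters standing in for learned filter banks, the `score_weights` parameter, and the softmax scoring; the L1/L2 norm combination mentioned in the abstract is not modelled.

```python
import math
import random


def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation of a signal with one kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention(signal, kernels, score_weights):
    """Pool one feature per kernel scale, then reweight the scales by attention.

    kernels: one 1-D filter per scale (different lengths = different scales).
    score_weights: hypothetical learned scalars, one per scale, that turn each
    pooled feature into an attention score.
    """
    # ReLU followed by global max pooling gives one summary feature per scale.
    pooled = [max(max(v, 0.0) for v in conv1d(signal, k)) for k in kernels]
    scores = [w * p for w, p in zip(score_weights, pooled)]
    attention = softmax(scores)
    weighted = [a * p for a, p in zip(attention, pooled)]
    return weighted, attention


random.seed(0)
# Toy "raw audio": a 200 Hz tone sampled at 8 kHz plus a little noise.
signal = [math.sin(2 * math.pi * 200 * t / 8000) + 0.05 * random.gauss(0, 1)
          for t in range(400)]
# Three averaging filters of different widths stand in for learned filter banks.
kernels = [[1.0 / n] * n for n in (3, 9, 27)]
features, attention = multi_scale_attention(signal, kernels, [1.0, 1.0, 1.0])
```

Here `features` holds one attention-weighted value per scale and `attention` holds the weights, which sum to 1. In a trained network the filters and scoring weights would be learned and the weighted features would feed a classifier over speaker labels.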

Source journal
Pattern Analysis and Applications (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 7.40
Self-citation rate: 2.60%
Articles published: 76
Review time: 13.5 months
About the journal: The journal publishes high-quality articles in areas of fundamental research in intelligent pattern analysis and applications in computer science and engineering. It aims to provide a forum for original research which describes novel pattern analysis techniques and industrial applications of the current technology. In addition, the journal will also publish articles on pattern analysis applications in medical imaging. The journal solicits articles that detail new technology and methods for pattern recognition and analysis in applied domains including, but not limited to, computer vision and image processing, speech analysis, robotics, multimedia, document analysis, character recognition, knowledge engineering for pattern recognition, fractal analysis, and intelligent control. The journal publishes articles on the use of advanced pattern recognition and analysis methods including statistical techniques, neural networks, genetic algorithms, fuzzy pattern recognition, machine learning, and hardware implementations which are either relevant to the development of pattern analysis as a research area or detail novel pattern analysis applications. Papers proposing new classifier systems or their development, pattern analysis systems for real-time applications, fuzzy and temporal pattern recognition, and uncertainty management in applied pattern recognition are particularly solicited.