A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

IF 3.7, CAS Zone 4 (Computer Science), JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Pattern Analysis and Applications, Pub Date: 2024-04-27, DOI: 10.1007/s10044-024-01278-9

V. Karthikeyan, S. Suja Priyadharsini


Citations: 0



A stacked convolutional neural network framework with multi-scale attention mechanism for text-independent voiceprint recognition

Short-utterance speaker identification is a difficult area of study in natural language processing (NLP). Most cutting-edge approaches to speech processing use convolutional neural networks (CNNs) and deep neural networks, and analyse data as a unidirectional stream in time. Earlier CNN-based approaches to speaker identification often relied on very dense or very large layers, leading to a large number of parameters and significant computational expense. In this article, we propose a novel multi-scale attention-focused 1-dimensional convolutional neural network (MSA-CNN) for speaker recognition that combines L1 and L2 norms. The multi-scale convolutional architecture was developed to automatically extract multi-scale characteristics from raw audio data by employing a variety of filter banks. A novel attention mechanism was built so that the multi-scale system emphasises important speaker characteristics in varying settings. Finally, these components were combined and applied within the proposed multi-layered convolutional neural network framework to identify the speakers' labels. The proposed network model was tested on a number of standard voice databases and on a corpus recorded in real time. The experiments show that our method outperformed both a baseline CNN scheme (without an attention mechanism) and conventional speaker identification techniques based on feature engineering, achieving an accuracy of 97.94% across numerous databases and distortion conditions.
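The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: several 1-D convolution branches with different kernel lengths applied to a raw signal, followed by attention weights that decide how much each scale contributes. Everything specific here is an assumption made for illustration and is not taken from the paper: the averaging kernels, the per-scale scoring weights (`score_weights`), the global-average pooling, and the reading of "combines L1 and L2 norms" as an elastic-net style weight penalty.

```python
import math
import random


def conv1d_valid(signal, kernel):
    """Plain 'valid' 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = sum(signal[i + j] * kernel[j] for j in range(k))
        out.append(max(acc, 0.0))
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention(signal, kernels, score_weights):
    """Fuse per-scale features with softmax attention weights.

    kernels: one filter per scale (different lengths = different scales).
    score_weights: hypothetical learned scalars that score each scale.
    """
    # One pooled feature per scale: global average of the ReLU response.
    pooled = []
    for kernel in kernels:
        response = conv1d_valid(signal, kernel)
        pooled.append(sum(response) / len(response))
    # Attention: score each scale, normalise, then take the weighted sum.
    scores = [w * p for w, p in zip(score_weights, pooled)]
    attention = softmax(scores)
    fused = sum(a * p for a, p in zip(attention, pooled))
    return fused, attention


def l1_l2_penalty(weights, l1=1e-4, l2=1e-4):
    """Assumed elastic-net style penalty: l1 * sum|w| + l2 * sum(w^2)."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


random.seed(0)
# Toy "raw audio": 200 samples of a noisy sine wave (not real speech).
signal = [math.sin(0.1 * n) + 0.05 * random.uniform(-1, 1) for n in range(200)]
# Three scales: short, medium and long averaging filters (illustrative only).
kernels = [[1.0 / k] * k for k in (3, 9, 27)]
score_weights = [1.0, 1.0, 1.0]

fused, attention = multi_scale_attention(signal, kernels, score_weights)
penalty = l1_l2_penalty([w for kernel in kernels for w in kernel])
print(attention, fused, penalty)
```

In a trained model the kernels and scoring weights would be learned, each branch would have many filters, and the fused features would feed a classifier over speaker labels; the sketch only shows how attention weights that sum to one can blend features extracted at different temporal scales.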


Source journal

Pattern Analysis and Applications (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore

7.40

Self-citation rate

2.60%

Articles published

Review time

13.5 months

About the journal: The journal publishes high-quality articles in areas of fundamental research in intelligent pattern analysis and applications in computer science and engineering. It aims to provide a forum for original research which describes novel pattern analysis techniques and industrial applications of the current technology. In addition, the journal will also publish articles on pattern analysis applications in medical imaging. The journal solicits articles that detail new technology and methods for pattern recognition and analysis in applied domains including, but not limited to, computer vision and image processing, speech analysis, robotics, multimedia, document analysis, character recognition, knowledge engineering for pattern recognition, fractal analysis, and intelligent control. The journal publishes articles on the use of advanced pattern recognition and analysis methods, including statistical techniques, neural networks, genetic algorithms, fuzzy pattern recognition, machine learning, and hardware implementations, which are either relevant to the development of pattern analysis as a research area or detail novel pattern analysis applications. Papers proposing new classifier systems or their development, pattern analysis systems for real-time applications, fuzzy and temporal pattern recognition, and uncertainty management in applied pattern recognition are particularly solicited.