DKSCNN: Deep Kronecker Siamese Convolutional Neural Network enabled speaker identification

IF 7.5, Zone 1 (Computer Science), JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications Pub Date : 2025-05-23 DOI:10.1016/j.eswa.2025.127946

Karthikeyan Chinnasamy , Rajesh Kumar Thevasigamani , Rajiv Vincent , Sam Kumar Gopalsamy Venkatesan , Deepa Thilak Kanniyappan , Kalaiselvi Kaliannan

Expert Systems with Applications, Volume 288, Article 127946. Full text: https://www.sciencedirect.com/science/article/pii/S0957417425015684

Citations: 0

Abstract

Speaker identification is the task of distinguishing between different voices in audio recordings or streams. Several factors complicate it, including differences in frameworks, overlapping sound events, and multiple sound sources present during recording. To address this complexity, a hybrid Deep Kronecker Siamese Convolutional Neural Network (DKSCNN) is proposed for speaker identification. Speech signals are first collected from the VoxCeleb dataset and passed to a pre-processing step that uses a Gaussian distribution-based method. After pre-processing, feature extraction yields features such as spectral centroid, pitch chroma, spectral skewness, Power Spectral Density (PSD), Mel-Scale Frequency Cepstral Coefficients (MFCC), logarithmic band power, Hjorth parameters, and tonal power ratio. Speaker identification is then performed on the extracted features using the hybrid DKSCNN, which combines a Deep Kronecker Network (DKN) with a Siamese Convolutional Neural Network (SCNN). The proposed model attained an accuracy of 90.682%, a True Positive Rate (TPR) of 91.362%, and a False Positive Rate (FPR) of 0.086. According to the authors, the DKSCNN model improves the accuracy and reliability of speaker identification, addresses the problems of real-world audio environments, and allows speaker identification to be applied across diverse applications.
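As an illustration of the terms used in the abstract, the following minimal sketch computes two of the listed features (spectral centroid and Hjorth parameters) for a single audio frame, along with the reported evaluation metrics (accuracy, TPR, FPR) from confusion-matrix counts. It is not the authors' implementation: the function names, the frame-based processing, and the use of NumPy are assumptions made here for illustration only, and the standard textbook definitions of each quantity are used.

```python
import numpy as np


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame, in Hz (hypothetical helper)."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitudes.sum()
    # A silent frame has no spectral content; return 0.0 by convention.
    return float((freqs * magnitudes).sum() / total) if total > 0 else 0.0


def hjorth_parameters(frame):
    """Return (activity, mobility, complexity) of one frame (hypothetical helper)."""
    d1 = np.diff(frame)  # first difference approximates the first derivative
    d2 = np.diff(d1)     # second difference approximates the second derivative
    var0, var1, var2 = np.var(frame), np.var(d1), np.var(d2)
    if var0 == 0 or var1 == 0:
        # Constant frame: mobility and complexity are undefined, report zeros.
        return float(var0), 0.0, 0.0
    mobility = np.sqrt(var1 / var0)
    complexity = np.sqrt(var2 / var1) / mobility
    return float(var0), float(mobility), float(complexity)


def identification_rates(tp, fp, tn, fn):
    """Accuracy, TPR, and FPR as fractions, from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)  # true positive rate (recall)
    fpr = fp / (fp + tn)  # false positive rate
    return accuracy, tpr, fpr


# Usage on a synthetic 440 Hz tone (assumed 16 kHz sample rate, 0.1 s frame).
sample_rate = 16000
t = np.arange(1600) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
print(spectral_centroid(tone, sample_rate))
print(hjorth_parameters(tone))
```

Note that these helper functions return rates as fractions. The abstract reports accuracy and TPR as percentages, while the FPR value of 0.086 is given without a percent sign.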

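Source journal

Expert Systems with Applications (Engineering & Technology; Engineering: Electronic & Electrical)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months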

Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.