DKSCNN: Deep Kronecker Siamese Convolutional Neural Network enabled speaker identification
Authors: Karthikeyan Chinnasamy, Rajesh Kumar Thevasigamani, Rajiv Vincent, Sam Kumar Gopalsamy Venkatesan, Deepa Thilak Kanniyappan, Kalaiselvi Kaliannan
Journal: Expert Systems with Applications, volume 288, Article 127946 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 7.5)
Published: 2025-05-23
DOI: 10.1016/j.eswa.2025.127946
URL: https://www.sciencedirect.com/science/article/pii/S0957417425015684
Citations: 0
Abstract
Speaker identification is the task of distinguishing between different voices in audio recordings or streams. Several factors complicate this task, including differences in frameworks, overlapping sound events, and multiple sound sources present during recording. To address this complexity, a hybrid Deep Kronecker Siamese Convolutional Neural Network (DKSCNN) method is proposed for speaker identification. Speech signals are first collected from the VoxCeleb dataset and passed to a pre-processing step that uses a Gaussian distribution-based method. After pre-processing, features are extracted, including spectral centroid, pitch chroma, spectral skewness, Power Spectral Density (PSD), Mel-Scale Frequency Cepstral Coefficients (MFCC), logarithmic band power, Hjorth parameters, and tonal power ratio. Speaker identification is then performed on the extracted features using the hybrid DKSCNN, which combines a Deep Kronecker Network (DKN) with a Siamese Convolutional Neural Network (SCNN). The proposed model attained an accuracy of 90.682%, a True Positive Rate (TPR) of 91.362%, and a False Positive Rate (FPR) of 0.086. These results indicate that the DKSCNN model improves the accuracy and reliability of speaker identification. The research contributes to speaker identification technology by addressing the problems of real-world audio environments, and it supports applying speaker identification across diverse applications.
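As a rough illustration of two of the features named in the abstract (spectral centroid and Hjorth parameters) and of how TPR and FPR are derived from confusion counts, here is a minimal NumPy sketch. It uses the standard textbook definitions and is not the authors' implementation; the function names `spectral_centroid`, `hjorth_parameters`, and `tpr_fpr` are hypothetical, and the paper may compute these quantities per frame or with different normalization.

```python
import numpy as np


def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency (Hz) over the whole signal.

    Assumption: a single FFT over the full signal; real pipelines
    usually compute this per short frame.
    """
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = magnitudes.sum()
    if total == 0:
        return 0.0  # silent input: centroid is undefined, return 0 by convention
    return float((freqs * magnitudes).sum() / total)


def hjorth_parameters(signal):
    """Return (activity, mobility, complexity) of a 1-D signal.

    activity   = var(x)
    mobility   = sqrt(var(x') / var(x))
    complexity = mobility(x') / mobility(x)
    Derivatives are approximated with first differences.
    Assumes a non-constant signal (non-zero variance).
    """
    d1 = np.diff(signal)
    d2 = np.diff(d1)
    activity = np.var(signal)
    mobility = np.sqrt(np.var(d1) / activity)
    complexity = np.sqrt(np.var(d2) / np.var(d1)) / mobility
    return float(activity), float(mobility), float(complexity)


def tpr_fpr(tp, fp, tn, fn):
    """True positive rate and false positive rate from confusion counts."""
    return tp / (tp + fn), fp / (fp + tn)


# Usage on a synthetic 440 Hz tone (illustrative data, not from VoxCeleb).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

centroid = spectral_centroid(tone, sample_rate)
activity, mobility, complexity = hjorth_parameters(tone)
print(centroid, activity, mobility, complexity)
```

For a pure tone the centroid sits at the tone frequency and the Hjorth complexity is close to 1, which makes a synthetic sine a convenient sanity check before running on real speech.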
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.