Semi-automatic approach utilizing Siamese Neural Network for forensic voice comparison

S.G. Kruthika, Trisiladevi C. Nagavi, P. Mahesha, H.T. Chethana, Vinayakumar Ravi, Alanoud Al Mazroa

Franklin Open, Volume 14, Article 100527. DOI: 10.1016/j.fraope.2026.100527. Published online 7 February 2026; issue date March 2026. URL: https://www.sciencedirect.com/science/article/pii/S2773186326000435
Citations: 0
Abstract
Forensic Voice Comparison (FVC) remains a critical yet challenging task in digital forensics, often hindered by manual subjectivity, background noise, and speaker variability. This paper presents a novel semi-automatic FVC framework based on a Siamese Neural Network (SNN), a discriminative metric-learning architecture, combined with stationary noise reduction for robust voice similarity assessment. The proposed framework leverages the SNN's ability to learn a shared embedding space in which Euclidean distance reflects speaker identity. Using a jurisdiction-specific dataset of 3899 Australian English speech samples (FLAC format), the framework achieves 96.02% accuracy, 94.00% precision, and 92.10% recall in distinguishing same-speaker from different-speaker pairs. It is evaluated against strong baselines, including Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and a Gaussian Mixture Model-Universal Background Model (GMM-UBM), and validated via 5-fold cross-validation (mean ± standard deviation) to ensure statistical robustness. The framework fills a critical gap in forensic phonetics by demonstrating that lightweight, interpretable, pairwise deep learning models can outperform complex generative or ensemble systems in real-world FVC scenarios. All preprocessing, training protocols, and hyperparameters are documented for reproducibility.
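The abstract states that the SNN learns a shared embedding space in which Euclidean distance reflects speaker identity. The paper's actual architecture, loss function, and decision threshold are not reproduced here, so the following is only a minimal numpy sketch of that decision rule, assuming a standard contrastive loss; all embedding values, the margin, and the threshold are illustrative, not taken from the paper.

```python
import numpy as np

def euclidean_distance(a, b):
    """Distance between two embedding vectors from the twin networks."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def contrastive_loss(dist, same_speaker, margin=1.0):
    """Contrastive loss: pull same-speaker pairs together and push
    different-speaker pairs at least `margin` apart."""
    if same_speaker:
        return 0.5 * dist ** 2
    return 0.5 * max(margin - dist, 0.0) ** 2

# Toy embeddings standing in for the shared-weight encoder's outputs.
emb_a = np.array([0.10, 0.90, 0.20])
emb_b = np.array([0.12, 0.88, 0.21])  # close to emb_a: likely same speaker
emb_c = np.array([0.90, 0.10, 0.70])  # far from emb_a: likely different speaker

d_same = euclidean_distance(emb_a, emb_b)
d_diff = euclidean_distance(emb_a, emb_c)

# A threshold on the distance yields the same/different-speaker verdict.
threshold = 0.5
print(d_same < threshold)  # True  -> same speaker
print(d_diff < threshold)  # False -> different speakers
```

In this formulation the "pairwise" character the abstract emphasizes is explicit: the model is trained on pairs of samples, and the same/different decision reduces to comparing one scalar distance against a calibrated threshold, which is part of what makes such systems comparatively interpretable.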