Delta feature maps with application to spoofed speech detection

IF 4.9 | Zone 3, Computer Science | JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Computers & Electrical Engineering, vol. 128, Article 110748 | Pub Date: 2025-10-08 | DOI: 10.1016/j.compeleceng.2025.110748

Gökay Dişken

Citations: 0

Delta feature maps with application to spoofed speech detection

Convolutional layers are used in many deep learning architectures because of their feature extraction capabilities. Besides standard convolution, several modified convolution techniques have been proposed. Among them, differential convolution generates additional feature maps from the differences between activations along a selected direction. Using pre-defined fixed filters that span two adjacent activations, it was found to be effective for image recognition. For speech-related tasks, tracking dynamic information over a broader range may be beneficial. To this end, this paper proposes delta feature maps, in which the fixed filters of differential convolution are modified according to the computation of handcrafted delta cepstral features. The proposed filters can extract dynamic information, similar to delta cepstral features, within a convolutional neural network. Handcrafted delta and/or delta-delta features have been shown to be effective, especially for synthetic speech detection. Hence, the logical access (LA) condition of ASVspoof 2019 and the recent ASVspoof 5 dataset are used to verify the effectiveness of delta feature maps. For ASVspoof 2019, the residual time-domain synthetic speech detection net (Res-TSSDNet) is used as a 1-D model and the one-class neural network with directed statistics pooling (OCNet-DSP) as a 2-D model, verifying that delta feature maps work in both settings. Because ASVspoof 5 is a more challenging dataset, data augmentation, a foundation model front-end, and a Nes2Net-X back-end are used. Delta feature maps are incorporated into Nes2Net-X in two different configurations. One configuration reduced the back-end size from 291 K to 76 K while preserving performance. The other achieved an equal error rate of 4.33%, the lowest among the reported single systems with a pre-trained foundation model.
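The contrast between a two-point differential filter and a wider delta-style filter can be sketched in a few lines. This is a minimal illustration only, assuming the standard regression formula for handcrafted delta cepstral features, d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2). The abstract does not give the paper's actual filter coefficients, window width, or placement inside the network, so the names `delta_kernel` and `apply_fixed_filter`, the choice N = 2, and the edge padding are all hypothetical.

```python
import numpy as np


def delta_kernel(N=2):
    """Fixed filter taken from the standard delta regression formula.

    Assumption: this is the textbook delta computation, not necessarily
    the exact filter used in the paper. Index 0 multiplies c_{t-N}.
    """
    n = np.arange(-N, N + 1, dtype=float)
    return n / (2.0 * np.sum(np.arange(1, N + 1) ** 2))


def apply_fixed_filter(feature_map, kernel):
    """Slide a fixed (non-trainable), odd-length filter along the last axis.

    Edge padding keeps the output the same length as the input.
    """
    half = len(kernel) // 2
    pad = [(0, 0)] * (feature_map.ndim - 1) + [(half, half)]
    padded = np.pad(feature_map, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, len(kernel), axis=-1
    )
    return windows @ kernel


# Two-point difference x[t+1] - x[t], written as an odd-length kernel.
adjacent_diff = np.array([0.0, -1.0, 1.0])

# Toy activation map: 3 channels, 10 time steps, each a linear ramp.
feature_map = np.tile(np.arange(10.0), (3, 1))

diff_map = apply_fixed_filter(feature_map, adjacent_diff)
delta_map = apply_fixed_filter(feature_map, delta_kernel(N=2))

print(delta_kernel(N=2))
print(diff_map.shape, delta_map.shape)
```

With N = 2 the delta-style filter spans five activations instead of two, which is one way to read "tracking dynamic information over a broader range". In a network, the outputs of such fixed filters would presumably be stacked alongside the learned feature maps, but that wiring is not specified in the abstract.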

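Source journal

Computers & Electrical Engineering (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 9.20

Self-citation rate: 7.00%

Articles published: 661

Review time: 47 days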

About the journal: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.