Fake speech detection using VGGish with attention block

Impact Factor 1.7 · CAS Tier 3 (Computer Science) · JCR Q2 (Acoustics)
Tahira Kanwal, Rabbia Mahum, Abdul Malik AlSalman, Mohamed Sharaf, Haseeb Hassan
{"title":"Fake speech detection using VGGish with attention block","authors":"Tahira Kanwal, Rabbia Mahum, Abdul Malik AlSalman, Mohamed Sharaf, Haseeb Hassan","doi":"10.1186/s13636-024-00348-4","DOIUrl":null,"url":null,"abstract":"While deep learning technologies have made remarkable progress in generating deepfakes, their misuse has become a well-known concern. As a result, the ubiquitous usage of deepfakes for increasing false information poses significant risks to the security and privacy of individuals. The primary objective of audio spoofing detection is to identify audio generated through numerous AI-based techniques. Several techniques for fake audio detection already exist using machine learning algorithms. However, they lack generalization and may not identify all types of AI-synthesized audios such as replay attacks, voice conversion, and text-to-speech (TTS). In this paper, a deep layered model, i.e., VGGish, along with an attention block, namely Convolutional Block Attention Module (CBAM) for spoofing detection, is introduced. Our suggested model successfully classifies input audio into two classes: Fake and Real, converting them into mel-spectrograms, and extracting their most representative features due to the attention block. Our model is a significant technique to utilize for audio spoofing detection due to a simple layered architecture. It captures complex relationships in audio signals due to both spatial and channel features present in an attention module. To evaluate the effectiveness of our model, we have conducted in-depth testing using the ASVspoof 2019 dataset. The proposed technique achieved an EER of 0.52% for Physical Access (PA) attacks and 0.07 % for Logical Access (LA) attacks.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"169 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-024-00348-4","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0

Abstract

While deep learning technologies have made remarkable progress in generating deepfakes, their misuse has become a well-known concern. The widespread use of deepfakes to spread false information poses significant risks to the security and privacy of individuals. The primary objective of audio spoofing detection is to identify audio generated by AI-based techniques. Several machine learning techniques for fake audio detection already exist; however, they lack generalization and may not identify all types of AI-synthesized audio, such as replay attacks, voice conversion, and text-to-speech (TTS). In this paper, a deep layered model, VGGish, combined with an attention block, the Convolutional Block Attention Module (CBAM), is introduced for spoofing detection. The suggested model converts input audio into mel-spectrograms, extracts the most representative features through the attention block, and classifies each input into one of two classes: Fake or Real. Thanks to its simple layered architecture, the model is a practical technique for audio spoofing detection, and it captures complex relationships in audio signals through both the spatial and channel attention of the CBAM. To evaluate its effectiveness, we conducted in-depth testing on the ASVspoof 2019 dataset. The proposed technique achieved an EER of 0.52% for Physical Access (PA) attacks and 0.07% for Logical Access (LA) attacks.
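The abstract does not include implementation details, so the following is a minimal PyTorch sketch of the kind of pipeline it describes: a VGGish-style convolutional stack over mel-spectrogram input, a CBAM attention block, and a two-way (Real/Fake) classifier head. Layer sizes, the placement of CBAM after the deepest convolutional block, and the classifier head are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a VGGish-style network with a CBAM block (assumed architecture details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention, then spatial attention."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP applied to avg- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over concatenated channel-wise avg/max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)           # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))                   # spatial attention


class VGGishCBAM(nn.Module):
    """VGGish-style convolutional stack with CBAM, ending in a 2-way (Real/Fake) classifier."""

    def __init__(self, num_classes=2):
        super().__init__()

        def block(cin, cout, n=1):
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool2d(2))
            return layers

        self.features = nn.Sequential(
            *block(1, 64), *block(64, 128), *block(128, 256, 2), *block(256, 512, 2),
            CBAM(512),                        # attention block on the deepest feature maps
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 128), nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),      # logits for Real vs. Fake
        )

    def forward(self, mel):                   # mel: (batch, 1, n_mels, time)
        return self.classifier(self.features(mel))


# Example: one VGGish-style 96x64 log-mel patch -> class logits.
logits = VGGishCBAM()(torch.randn(1, 1, 96, 64))
print(logits.shape)  # torch.Size([1, 2])
```

In this sketch the input is a log-mel spectrogram patch of the size VGGish conventionally uses (96 frames by 64 mel bands); training such a model on ASVspoof 2019 would presumably minimize cross-entropy over the bona fide/spoof labels, with EER computed from the resulting scores.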
Source journal
EURASIP Journal on Audio, Speech, and Music Processing (Acoustics; Engineering, Electrical & Electronic)
CiteScore: 4.10
Self-citation rate: 4.20%
Review time: 12 months
Journal description: The aim of "EURASIP Journal on Audio, Speech, and Music Processing" is to bring together researchers, scientists, and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. It is an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processing.