Toward Robust ASR System against Audio Adversarial Examples using Agitated Logit

IF 2.8 · Zone 4 (Computer Science) · JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Privacy and Security Pub Date : 2024-04-26 DOI:10.1145/3661822

Namgyu Park, Jong Kim


Citations: 0

Abstract

Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples, which aim to deceive ASR systems by adding perturbations to benign speech signals. These audio adversarial examples appear indistinguishable from benign audio waves, but the ASR system decodes them as intentional malicious commands. Previous studies have demonstrated the feasibility of such attacks in simulated environments (over-line) and have further showcased the creation of robust physical audio adversarial examples (over-air). Various defense techniques have been proposed to counter these attacks. However, most of them have either failed to handle various types of attacks effectively or have resulted in significant time overhead.

In this paper, we propose a novel method for detecting audio adversarial examples. Our approach involves feeding both smoothed audio and original audio inputs into the ASR system. Subsequently, we introduce noise to the logits before providing them to the decoder of the ASR. We demonstrate that carefully selected noise can considerably influence the transcription results of audio adversarial examples while having minimal impact on the transcription of benign audio waves. Leveraging this characteristic, we detect audio adversarial examples by comparing the altered transcription, resulting from logit noising, with the original transcription. The proposed method can be easily applied to ASR systems without requiring any structural modifications or additional training. Experimental results indicate that the proposed method exhibits robustness against both over-line and over-air audio adversarial examples, outperforming state-of-the-art detection methods.
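The logit-noising check described above can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the toy alphabet, the greedy CTC-style decoder, the Gaussian noise, the normalized edit-distance score, and the `sigma`, `threshold`, and `trials` values are all hypothetical. The abstract says only that the noise is "carefully selected", and it does not specify how the smoothed-audio input is combined, so that step is omitted.

```python
import random

# Hypothetical toy alphabet; index 0 plays the role of the CTC blank symbol.
ALPHABET = "_ab"


def greedy_ctc_decode(logits):
    """Take the arg-max symbol per frame, collapse repeats, drop blanks."""
    out, prev = [], None
    for frame in logits:
        idx = max(range(len(frame)), key=frame.__getitem__)
        if idx != prev and idx != 0:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)


def add_noise(logits, sigma, rng):
    """Return a copy of the logits with i.i.d. Gaussian noise added.

    Assumption: plain Gaussian noise stands in for the paper's
    'carefully selected' noise, which the abstract does not describe.
    """
    return [[v + rng.gauss(0.0, sigma) for v in frame] for frame in logits]


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def looks_adversarial(logits, sigma=1.0, threshold=0.2, trials=5, seed=0):
    """Flag an input whose transcription changes a lot under logit noise.

    The score is the mean edit distance between the original and the
    noised transcriptions, normalized by the original length.
    """
    rng = random.Random(seed)
    base = greedy_ctc_decode(logits)
    total = 0.0
    for _ in range(trials):
        noisy = greedy_ctc_decode(add_noise(logits, sigma, rng))
        total += edit_distance(base, noisy) / max(len(base), 1)
    return total / trials > threshold
```

In this toy setting, frames with a large margin between the top logit and the rest keep the same transcription after noising, while frames with a very small margin are easily flipped. That mirrors the abstract's claim that noise affects adversarial inputs far more than benign ones, although whether real adversarial examples have such small margins is something the paper, not this sketch, establishes.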




Source journal

ACM Transactions on Privacy and Security · Computer Science: General Computer Science

CiteScore: 5.20

Self-citation rate: 0.00%


About the journal: ACM Transactions on Privacy and Security (TOPS) (formerly known as TISSEC) publishes high-quality research results in the fields of information and system security and privacy. Studies addressing all aspects of these fields are welcomed, ranging from technologies, to systems and applications, to the crafting of policies.