The Crux of Voice (In)Security: A Brain Study of Speaker Legitimacy Detection

Ajaya Neupane, Nitesh Saxena, Leanne M. Hirshfield, Sarah E. Bratt
{"title":"The Crux of Voice (In)Security: A Brain Study of Speaker Legitimacy Detection","authors":"Ajaya Neupane, Nitesh Saxena, Leanne M. Hirshfield, Sarah E. Bratt","doi":"10.14722/ndss.2019.23206","DOIUrl":null,"url":null,"abstract":"A new generation of scams has emerged that uses voice impersonation to obtain sensitive information, eavesdrop over voice calls and extort money from unsuspecting human users. Research demonstrates that users are fallible to voice impersonation attacks that exploit the current advancement in speech synthesis. In this paper, we set out to elicit a deeper understanding of such human-centered “voice hacking” based on a neuro-scientific methodology (thereby corroborating and expanding the traditional behavioral-only approach in significant ways). Specifically, we investigate the neural underpinnings of voice security through functional near-infrared spectroscopy (fNIRS), a cutting-edge neuroimaging technique, that captures neural signals in both temporal and spatial domains. We design and conduct an fNIRS study to pursue a thorough investigation of users’ mental processing related to speaker legitimacy detection – whether a voice sample is rendered by a target speaker, a different other human speaker or a synthesizer mimicking the speaker. We analyze the neural activity associated within this task as well as the brain areas that may control such activity. Our key insight is that there may be no statistically significant differences in the way the human brain processes the legitimate speakers vs. synthesized speakers, whereas clear differences are visible when encountering legitimate vs. different other human speakers. This finding may help to explain users’ susceptibility to synthesized attacks, as seen from the behavioral self-reported analysis. That is, the impersonated synthesized voices may seem indistinguishable from the real voices in terms of both behavioral and neural perspectives. 
In sharp contrast, prior studies showed subconscious neural differences in other real vs. fake artifacts (e.g., paintings and websites), despite users failing to note these differences behaviorally. Overall, our work dissects the fundamental neural patterns underlying voice-based insecurity and reveals users’ susceptibility to voice synthesis attacks at a biological level. We believe that this could be a significant insight for the security community suggesting that the human detection of voice synthesis attacks may not improve over time, especially given that voice synthesis techniques will likely continue to improve, calling for the design of careful machine-assisted techniques to help humans counter these attacks. *Work done while being a student at UAB","PeriodicalId":20444,"journal":{"name":"Proceedings 2019 Network and Distributed System Security Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 2019 Network and Distributed System Security Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14722/ndss.2019.23206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

A new generation of scams has emerged that uses voice impersonation to obtain sensitive information, eavesdrop over voice calls, and extort money from unsuspecting human users. Research demonstrates that users are susceptible to voice impersonation attacks that exploit current advances in speech synthesis. In this paper, we set out to elicit a deeper understanding of such human-centered “voice hacking” based on a neuro-scientific methodology (thereby corroborating and expanding the traditional behavioral-only approach in significant ways). Specifically, we investigate the neural underpinnings of voice security through functional near-infrared spectroscopy (fNIRS), a cutting-edge neuroimaging technique that captures neural signals in both temporal and spatial domains. We design and conduct an fNIRS study to pursue a thorough investigation of users’ mental processing related to speaker legitimacy detection – whether a voice sample is rendered by a target speaker, a different human speaker, or a synthesizer mimicking the speaker. We analyze the neural activity associated with this task as well as the brain areas that may control such activity. Our key insight is that there may be no statistically significant differences in the way the human brain processes legitimate speakers vs. synthesized speakers, whereas clear differences are visible when encountering legitimate vs. different human speakers. This finding may help to explain users’ susceptibility to synthesized attacks, as seen from the behavioral self-reported analysis. That is, the impersonated synthesized voices may seem indistinguishable from the real voices from both behavioral and neural perspectives. In sharp contrast, prior studies showed subconscious neural differences in other real vs. fake artifacts (e.g., paintings and websites), despite users failing to note these differences behaviorally.
Overall, our work dissects the fundamental neural patterns underlying voice-based insecurity and reveals users’ susceptibility to voice synthesis attacks at a biological level. We believe that this could be a significant insight for the security community, suggesting that human detection of voice synthesis attacks may not improve over time, especially given that voice synthesis techniques will likely continue to improve, calling for the design of careful machine-assisted techniques to help humans counter these attacks.

*Work done while a student at UAB.
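To make the abstract's central statistical claim concrete, the sketch below illustrates the kind of per-condition contrast such a study reports: a paired t-test on participants' mean hemodynamic responses under two listening conditions. This is not the authors' analysis pipeline; the simulated data, the region of interest, and the alpha level are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)
n = 20  # participants

# Simulated per-participant mean oxy-Hb change in one region of interest
# under two conditions: original speaker vs. synthesized (morphed) voice.
# Both conditions are drawn from the same distribution, mirroring the
# paper's finding of no significant neural difference between them.
original = [random.gauss(0.50, 0.10) for _ in range(n)]
morphed = [random.gauss(0.50, 0.10) for _ in range(n)]

# Paired t-test computed by hand: t = mean(diff) / (sd(diff) / sqrt(n))
diffs = [a - b for a, b in zip(original, morphed)]
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
t_stat = mean_d / (sd_d / math.sqrt(n))

# Two-tailed critical value for df = 19 at alpha = 0.05
t_crit = 2.093
significant = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, significant at alpha = 0.05: {significant}")
```

A non-significant result under this contrast would correspond to the paper's key insight: the brain's response to the synthesized voice is statistically indistinguishable from its response to the legitimate speaker.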