{"title":"你说话太多:通过语音输入限制隐私暴露","authors":"Tavish Vaidya, M. Sherr","doi":"10.1109/SPW.2019.00026","DOIUrl":null,"url":null,"abstract":"Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can then in turn be used to generate spoofed audio in his or her voice. Generating an accurate voice model of target's voice requires the availability of a corpus of the target's speech. This paper makes the observation that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public's vulnerability to voice synthesis attacks. That is, our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable. This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, such methods employ audio processing techniques to remove distinctive voice characteristics, leaving only the information that is necessary for the cloud-based services to perform speech recognition. Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user's speech, while still permitting accurate voice recognition.","PeriodicalId":125351,"journal":{"name":"2019 IEEE Security and Privacy Workshops (SPW)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":"{\"title\":\"You Talk Too Much: Limiting Privacy Exposure Via Voice Input\",\"authors\":\"Tavish Vaidya, M. Sherr\",\"doi\":\"10.1109/SPW.2019.00026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can then in turn be used to generate spoofed audio in his or her voice. Generating an accurate voice model of target's voice requires the availability of a corpus of the target's speech. This paper makes the observation that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public's vulnerability to voice synthesis attacks. That is, our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable. This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, such methods employ audio processing techniques to remove distinctive voice characteristics, leaving only the information that is necessary for the cloud-based services to perform speech recognition. 
Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user's speech, while still permitting accurate voice recognition.\",\"PeriodicalId\":125351,\"journal\":{\"name\":\"2019 IEEE Security and Privacy Workshops (SPW)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"26\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE Security and Privacy Workshops (SPW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPW.2019.00026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Security and Privacy Workshops (SPW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPW.2019.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 26
Abstract
Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can in turn be used to generate spoofed audio in his or her voice. Generating an accurate model of a target's voice requires a corpus of the target's speech. This paper makes the observation that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public's vulnerability to voice synthesis attacks. That is, our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable. This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, such methods employ audio processing techniques to remove distinctive voice characteristics, leaving only the information that is necessary for the cloud-based services to perform speech recognition. Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user's speech, while still permitting accurate voice recognition.
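The abstract does not specify which audio transformations the sanitizer applies. As a minimal, hypothetical sketch of the general idea (not the authors' actual pipeline), the Python snippet below pitch-shifts a locally captured recording with librosa before it would be uploaded to a cloud recognizer; the file names and the n_steps parameter are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: locally "sanitize" a voice recording by shifting
# its pitch before sending it to a cloud speech recognizer. A moderate
# pitch shift perturbs speaker-identifying characteristics while, for
# small shifts, leaving the spoken words intelligible to an ASR system.
# This is an illustration of the paper's general idea, not its method.
import librosa
import soundfile as sf


def sanitize_voice(in_path: str, out_path: str, n_steps: float = 4.0) -> None:
    """Load audio, shift pitch by n_steps semitones, and save the result.

    n_steps is an assumed tuning knob: larger shifts hide more of the
    speaker's identity but may begin to degrade recognition accuracy.
    """
    y, sr = librosa.load(in_path, sr=None)  # keep the native sample rate
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)


if __name__ == "__main__":
    # Example: sanitize an utterance before any cloud upload.
    sanitize_voice("utterance.wav", "utterance_sanitized.wav")
```

In a real deployment, the sanitization step would run on the local device, so the cloud service only ever receives the transformed audio and never accumulates samples of the user's natural voice.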