Invertible generative speech hiding with normalizing flow for secure IoT voice

IF 7.6 | Zone 3, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

Internet of Things | Pub Date: 2025-04-17 | DOI: 10.1016/j.iot.2025.101606

Xiaoyi Ge, Xiongwei Zhang, Meng Sun, Kunkun SongGong, Xia Zou

Internet of Things, Volume 32, Article 101606. Publication type: Journal Article. Open access: no. Source: https://www.sciencedirect.com/science/article/pii/S2542660525001192

Citations: 0

Abstract

Invertible generative speech hiding with normalizing flow for secure IoT voice

Speech-based control is widely used to operate Internet of Things (IoT) devices remotely, but it is exposed to eavesdropping and cyberattacks. Speech hiding improves security by embedding secret speech in cover speech, concealing the fact that communication is taking place. However, existing methods suffer from poor intelligibility of the extracted secret speech and insufficient security of the stego speech. To address these challenges, we propose a novel invertible generative speech hiding framework that integrates the embedding process into the speech synthesis pipeline. Our method establishes a bijective mapping between secret speech inputs and stego speech outputs, conditioned on text-derived Mel-spectrograms. The embedding process employs a normalizing-flow-based SecFlow module to map secret speech into Gaussian-distributed latent codes, which are subsequently synthesized into stego speech by a flow-based vocoder. Crucially, because both SecFlow and the vocoder are invertible, the secret speech can be recovered precisely at extraction time. Extensive evaluation shows that the generated stego speech achieves high quality, with a Perceptual Evaluation of Speech Quality (PESQ) score of 3.40 and a Short-Time Objective Intelligibility (STOI) score of 0.96. The extracted secret speech is of high quality and intelligibility, with a character error rate (CER) of 0.021. The latent codes mapped from secret speech are also very close to randomly sampled Gaussian noise, effectively guaranteeing security. The framework achieves real-time performance, taking 1.28 s to generate the stego speech for a 2.22 s speech segment (a real-time factor (RTF) of 0.577), which enables efficient covert communication for latency-sensitive IoT applications.
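The abstract does not describe SecFlow's internals, so the following is only a minimal sketch of the general idea it relies on: a conditional normalizing flow is a bijection, so whatever is mapped forward can be mapped back exactly when the same conditioning is available. The sketch uses a generic affine coupling layer (the building block popularized by RealNVP-style flows), not the authors' module. The function names, the toy `_scale_shift` "network", and the short vectors standing in for secret speech and Mel-spectrogram frames are all assumptions made for illustration.

```python
import math
import random


def _scale_shift(kept, cond):
    """Toy stand-in for a learned network (an assumption, not SecFlow).

    Derives a bounded log-scale and shift from the untouched half of the
    input and from the conditioning vector.
    """
    h = (sum(kept) + sum(cond)) / (len(kept) + len(cond))
    log_s = 0.5 * math.tanh(h)  # bounded so exp() stays well-behaved
    t = math.tanh(2.0 * h)
    return log_s, t


def coupling_forward(x, cond):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = _scale_shift(xa, cond)
    yb = [v * math.exp(log_s) + t for v in xb]
    return xa + yb


def coupling_inverse(y, cond):
    """Exact inverse: the kept half lets us recompute the same scale/shift."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = _scale_shift(ya, cond)
    xb = [(v - t) * math.exp(-log_s) for v in yb]
    return ya + xb


def flow_forward(x, cond, n_layers=4):
    """Stack coupling layers, reversing the order so both halves get transformed."""
    for _ in range(n_layers):
        x = coupling_forward(x, cond)
        x = x[::-1]
    return x


def flow_inverse(z, cond, n_layers=4):
    """Undo the stack in the opposite order: un-reverse, then invert the coupling."""
    for _ in range(n_layers):
        z = z[::-1]
        z = coupling_inverse(z, cond)
    return z


# Hypothetical stand-ins: 8 "secret speech" samples and a 3-value "Mel" condition.
random.seed(0)
secret = [random.gauss(0.0, 1.0) for _ in range(8)]
mel = [0.1, -0.2, 0.3]

latent = flow_forward(secret, mel)
recovered = flow_inverse(latent, mel)
max_err = max(abs(a - b) for a, b in zip(secret, recovered))
print("max round-trip error:", max_err)
```

The round trip works because the inverse pass recomputes the same scale and shift from the half of the vector that the forward pass left untouched. A real flow would replace `_scale_shift` with a trained network and would operate on spectrogram or waveform tensors rather than short lists.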

Source journal

Internet of Things Multiple-

CiteScore: 3.60
Self-citation rate: 5.10%
Articles published: 115
Review time: 37 days

About the journal: Internet of Things; Engineering Cyber Physical Human Systems is a comprehensive journal encouraging cross-collaboration between researchers, engineers and practitioners in the field of IoT & Cyber Physical Human Systems. The journal offers a unique platform to exchange scientific information on the entire breadth of technology, science, and societal applications of the IoT. The journal will place a high priority on timely publication and provide a home for high-quality work. Furthermore, IOT is interested in publishing topical Special Issues on any aspect of IoT.