Xiaoyi Ge, Xiongwei Zhang, Meng Sun, Kunkun SongGong, Xia Zou
Journal: Internet of Things, vol. 32, Article 101606 (JCR Q1, Computer Science, Information Systems; impact factor 6.0)
Published: 2025-04-17
DOI: 10.1016/j.iot.2025.101606
URL: https://www.sciencedirect.com/science/article/pii/S2542660525001192
Invertible generative speech hiding with normalizing flow for secure IoT voice
Speech-based control is widely used to operate Internet of Things (IoT) devices remotely, but it is exposed to eavesdropping and cyberattacks. Speech hiding improves security by embedding secret speech in cover speech so that the communication itself is concealed. However, existing methods suffer from poor intelligibility of the extracted secret speech and insufficient security of the stego speech. To address these challenges, we propose a novel invertible generative speech hiding framework that integrates the embedding process into the speech synthesis pipeline. Our method establishes a bijective mapping between secret speech inputs and stego speech outputs, conditioned on text-derived Mel-spectrograms. The embedding process employs a normalizing flow-based SecFlow module to map secret speech into Gaussian-distributed latent codes, which are subsequently synthesized into stego speech through a flow-based vocoder. Crucially, the invertibility of both SecFlow and the vocoder enables precise recovery of the secret speech at extraction time. Extensive evaluation shows that the generated stego speech achieves high quality, with a Perceptual Evaluation of Speech Quality (PESQ) score of 3.40 and a Short-Time Objective Intelligibility (STOI) score of 0.96. The extracted secret speech is also of high quality and intelligibility, with a character error rate (CER) of 0.021. In addition, the latent codes mapped from secret speech are very close to randomly sampled Gaussian noise, which helps ensure security. The framework runs faster than real time: embedding a 2.22 s speech segment takes 1.28 s of generation latency, a real-time factor (RTF) of 0.577 (1.28 / 2.22), which makes covert communication practical for latency-sensitive IoT applications.
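The bijective, condition-dependent mapping described in the abstract can be illustrated with a single conditional affine coupling layer, the standard building block of many normalizing flows. The sketch below is a toy under stated assumptions, not the SecFlow module or the vocoder from the paper: the function names, the dimensions, and the fixed random weights (standing in for a trained network) are all hypothetical. It shows only why such a layer can be inverted exactly when the same conditioning vector is available at both ends.

```python
import math
import random


def make_weights(rows, cols, rng):
    """Fixed random weights standing in for a trained network (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def coupling_params(x_a, cond, w_scale, w_shift):
    """Derive log-scale and shift for the second half from the first half and the condition."""
    h = x_a + cond  # list concatenation of pass-through half and conditioning vector
    n_out = len(w_scale[0])
    # tanh bounds the log-scale so exp() stays numerically stable
    log_s = [math.tanh(sum(h[i] * w_scale[i][j] for i in range(len(h))))
             for j in range(n_out)]
    shift = [sum(h[i] * w_shift[i][j] for i in range(len(h)))
             for j in range(n_out)]
    return log_s, shift


def forward(x, cond, w_scale, w_shift):
    """Map an input vector to a latent vector; the first half passes through unchanged."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, shift = coupling_params(x_a, cond, w_scale, w_shift)
    z_b = [b * math.exp(s) + t for b, s, t in zip(x_b, log_s, shift)]
    return x_a + z_b


def inverse(z, cond, w_scale, w_shift):
    """Undo forward(): the unchanged first half reproduces the same scale and shift."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, shift = coupling_params(z_a, cond, w_scale, w_shift)
    x_b = [(b - t) * math.exp(-s) for b, s, t in zip(z_b, log_s, shift)]
    return z_a + x_b


def roundtrip_error(dim=8, cond_dim=3, seed=0):
    """Largest absolute difference between a toy 'secret' vector and its reconstruction."""
    rng = random.Random(seed)
    half = dim // 2
    w_scale = make_weights(half + cond_dim, dim - half, rng)
    w_shift = make_weights(half + cond_dim, dim - half, rng)
    secret = [rng.gauss(0.0, 1.0) for _ in range(dim)]     # stand-in for secret speech features
    cond = [rng.gauss(0.0, 1.0) for _ in range(cond_dim)]  # stand-in for a Mel-spectrogram frame
    latent = forward(secret, cond, w_scale, w_shift)
    recovered = inverse(latent, cond, w_scale, w_shift)
    return max(abs(r - s) for r, s in zip(recovered, secret))


if __name__ == "__main__":
    print(roundtrip_error())
```

Because the first half of the vector is copied through unchanged, the receiver can recompute exactly the same scale and shift and invert the transform up to floating-point rounding. A practical flow would stack many such layers with permutations between them and train the weights so that the latent codes follow a Gaussian distribution.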
About the journal:
Internet of Things; Engineering Cyber Physical Human Systems is a comprehensive journal encouraging cross collaboration between researchers, engineers and practitioners in the field of IoT & Cyber Physical Human Systems. The journal offers a unique platform to exchange scientific information on the entire breadth of technology, science, and societal applications of the IoT.
The journal will place a high priority on timely publication, and provide a home for high quality.
Furthermore, IOT is interested in publishing topical Special Issues on any aspect of IOT.