Cross-domain redundancy exploration by a deep encoder–decoder network for speech steganography

IF 3.7 | Zone 2 (Computer Science) | JCR Q2: COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Information Security and Applications Pub Date : 2025-06-30 DOI:10.1016/j.jisa.2025.104150

Xiaoyi Ge, Xiongwei Zhang, Meng Sun, Yimin Wang, Li Li, Kunkun SongGong

Volume 93, Article 104150. Full text: https://www.sciencedirect.com/science/article/pii/S2214212625001875

Citations: 0

Abstract

Speech steganography embeds messages within openly transmitted speech channels without arousing suspicion. Nevertheless, current methods for embedding speech in speech suffer from weak imperceptibility and low intelligibility of the message speech. In this paper, we introduce a novel approach that explores cross-domain redundancy by leveraging a deep encoder–decoder neural network architecture to embed Mel-spectrograms into magnitude spectrograms. Specifically, the message is transformed into its Mel-spectrogram, while the cover is transformed into its magnitude spectrogram. The Mel-spectrogram is then embedded as residuals in the magnitude spectrogram through an encoder known as the spectrogram super-resolution network (SSRN). Upon receiving the stego, a decoder network recovers the Mel-spectrograms of the messages, and a high-fidelity HiFi-GAN vocoder then recovers the message waveform. The encoder–decoder network's parameters are optimized to ensure imperceptibility and high quality. To validate the proposed method, we compare it with recently proposed baselines on two common databases, the LJ Speech and VCTK datasets. Experimental results demonstrate that our method achieves SNRs of 33.83 dB and 30.28 dB for the cover signals on these two datasets, respectively. Furthermore, both the content and speaker identity of the recovered messages are well preserved, and the experiments also confirm the robustness against noise and the security of our approach.
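The abstract reports imperceptibility as an SNR on the cover signal but does not say how that SNR is computed. As a minimal sketch, assuming the conventional definition in which the difference between stego and cover is treated as noise, the metric could be computed as below. The function name `snr_db` and the toy sine-wave signals are hypothetical illustrations, not the authors' code or data.

```python
import math

def snr_db(cover, stego):
    """SNR in dB, treating (stego - cover) as the noise (assumed definition)."""
    if len(cover) != len(stego):
        raise ValueError("cover and stego must have the same length")
    signal_power = sum(c * c for c in cover)
    noise_power = sum((s - c) ** 2 for c, s in zip(cover, stego))
    if noise_power == 0:
        return math.inf   # stego is identical to the cover
    if signal_power == 0:
        return -math.inf  # silent cover with a non-zero residual
    return 10.0 * math.log10(signal_power / noise_power)

# Toy signals (hypothetical): a 440 Hz cover tone sampled at 16 kHz,
# plus a small additive residual standing in for the embedded message.
sample_rate = 16000
cover = [math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]
residual = [0.02 * math.sin(2 * math.pi * 3000 * n / sample_rate)
            for n in range(sample_rate)]
stego = [c + r for c, r in zip(cover, residual)]

print(f"SNR = {snr_db(cover, stego):.2f} dB")  # about 33.98 dB
```

With a residual amplitude of 0.02 relative to a unit-amplitude cover, the power ratio is 2500, which gives roughly 34 dB. This is the same order as the figures quoted in the abstract, and it shows how small the embedded residual has to be to reach that level.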


Source journal

Journal of Information Security and Applications (Computer Science: Computer Networks and Communications)

CiteScore: 10.90

Self-citation rate: 5.40%

Articles published: 206

Review time: 56 days

About the journal: Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications relevant to information security. JISA links the scientific and research community with industry professionals by offering a clear view of modern problems and challenges in information security and by identifying promising scientific and "best-practice" solutions. JISA issues balance original research work with innovative industrial approaches from internationally renowned information security experts and researchers.