WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification
Authors: Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang
Venue: arXiv - EE - Audio and Speech Processing
Published: 2024-09-18
DOI: arxiv-2409.12121 (https://doi.org/arxiv-2409.12121)
Abstract
Recent advances in speech spoofing necessitate stronger verification
mechanisms in neural speech codecs to ensure authenticity. Current methods
embed numerical watermarks before compression and extract them from
reconstructed speech for verification, but face limitations such as separate
training processes for the watermark and codec, and insufficient cross-modal
information integration, leading to reduced watermark imperceptibility,
extraction accuracy, and capacity. To address these issues, we propose WMCodec,
the first neural speech codec to jointly train compression-reconstruction and
watermark embedding-extraction in an end-to-end manner, optimizing both
imperceptibility and extractability of the watermark. Furthermore, we design an
iterative Attention Imprint Unit (AIU) for deeper feature integration of
watermark and speech, reducing the impact of quantization noise on the
watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec
in most quality metrics for watermark imperceptibility and consistently exceeds
both AudioSeal with Encodec and reinforced TraceableSpeech in watermark
extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16
bps, WMCodec maintains over 99% extraction accuracy under common attacks,
demonstrating strong robustness.
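
To make the reported figures concrete, the following is a minimal sketch of how a bit-level extraction accuracy could be computed, and of how many watermark bits a clip of a given length would carry at the stated capacity. It assumes that "bps" denotes bits per second and that accuracy is the fraction of correctly recovered bits; the abstract does not define either term, and the function names bit_accuracy and payload_bits are hypothetical, not taken from the WMCodec code.

```python
import random


def bit_accuracy(embedded, extracted):
    """Fraction of watermark bits recovered correctly (assumed metric)."""
    if len(embedded) != len(extracted):
        raise ValueError("bit sequences must have the same length")
    if not embedded:
        raise ValueError("bit sequences must be non-empty")
    matches = sum(1 for a, b in zip(embedded, extracted) if a == b)
    return matches / len(embedded)


def payload_bits(capacity_bps, duration_s):
    """Whole watermark bits carried by a clip, assuming bps = bits per second."""
    return int(capacity_bps * duration_s)


# Illustrative 16-bit payload; the seed only makes the example repeatable.
rng = random.Random(0)
payload = [rng.randint(0, 1) for _ in range(16)]

# Simulate an attack that flips exactly one extracted bit.
received = list(payload)
received[3] ^= 1

print(bit_accuracy(payload, received))  # 0.9375, i.e. 15 of 16 bits correct
print(payload_bits(16, 2.0))            # 32 bits in a 2-second clip
```

Under these assumptions, a single flipped bit in a 16-bit payload already lowers accuracy to 93.75%, so an accuracy above 99% corresponds to fewer than one bit error per 100 embedded bits on average.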