WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification

Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang

arXiv - EE - Audio and Speech Processing, published 2024-09-18. DOI: https://doi.org/arxiv-2409.12121
Abstract
Recent advances in speech spoofing necessitate stronger verification
mechanisms in neural speech codecs to ensure authenticity. Current methods
embed numerical watermarks before compression and extract them from
reconstructed speech for verification, but face limitations such as separate
training processes for the watermark and codec, and insufficient cross-modal
information integration, leading to reduced watermark imperceptibility,
extraction accuracy, and capacity. To address these issues, we propose WMCodec,
the first neural speech codec to jointly train compression-reconstruction and
watermark embedding-extraction in an end-to-end manner, optimizing both
imperceptibility and extractability of the watermark. Furthermore, we design an
iterative Attention Imprint Unit (AIU) for deeper feature integration of
watermark and speech, reducing the impact of quantization noise on the
watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec
in most quality metrics for watermark imperceptibility and consistently exceeds
both AudioSeal with Encodec and reinforced TraceableSpeech in watermark
extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16
bps, WMCodec maintains over 99% extraction accuracy under common attacks,
demonstrating strong robustness.
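The abstract describes the Attention Imprint Unit as fusing watermark and speech features via attention before quantization. The paper's exact AIU architecture is not given here, so the following is only a hypothetical sketch of the general idea: each speech frame (query) attends over watermark token embeddings (keys/values), and the attended watermark context is added residually to the frame. All names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_imprint(speech, watermark):
    """Hypothetical attention-based imprint step (NOT the paper's AIU).

    speech:    list of frame vectors, each of dimension d (queries).
    watermark: list of watermark token embeddings, same dimension d
               (keys and values).
    Returns speech frames with an attention-weighted watermark context
    added residually, so the watermark is spread across all frames.
    """
    d = len(speech[0])
    scale = math.sqrt(d)  # scaled dot-product attention
    out = []
    for q in speech:
        # Similarity of this frame to each watermark token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in watermark]
        w = softmax(scores)
        # Attention-weighted combination of watermark embeddings.
        ctx = [sum(wj * watermark[j][i] for j, wj in enumerate(w))
               for i in range(d)]
        # Residual add: frame carries both speech and watermark content.
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out
```

In an end-to-end codec such a step would sit before the quantizer, and "iterative" application (repeating the imprint across several layers) is one plausible reading of how the unit deepens the integration so the embedded bits better survive quantization noise.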