{"title":"Reversible natural language watermarking with augmented word prediction and compression","authors":"Lingyun Xiang , Yangfan Liu , Yuling Liu","doi":"10.1016/j.jisa.2025.104211","DOIUrl":null,"url":null,"abstract":"<div><div>Reversible natural language watermarking presents a significant challenge due to the dual requirements of perfect content recovery and maintaining high-quality, natural outputs. Existing methods often struggle with limited embedding capacity or noticeable degradation in text fluency and semantics. To overcome these limitations, this paper proposes a novel reversible watermarking method that improves embedding capacity while preserving text naturalness by leveraging augmented word prediction and compression techniques. Specifically, the proposed method utilizes the masked language model BERT to predict high-quality candidate substitutable words at selected embedding positions. Based on prediction results, original words across the entire text are mapped into an unbalanced binary sequence, which is then compressed via arithmetic coding to create additional space to accommodate the watermark information. The compressed sequence and the watermark bits are jointly embedded by replacing the original words with their predicted substitutable ones. During watermark extraction, the words at the embedding positions in the watermarked text are decoded to recover the embedded watermark and the original binary sequence, enabling lossless restoration of the original text. Moreover, to further improve compression efficiency, which in turn increases embedding capacity, a lexical substitution-based data augmentation strategy is proposed to expand the corpus for fine-tuning the BERT model. This enhancement improves prediction consistency, increasing the likelihood that more original words are accurately predicted as the most probable candidates. As a result, more original words are mapped to the same value, intensifying the imbalance in the binary sequence and thus favoring better compression rates and more available embedding space. Experimental results demonstrate that, compared to existing similar reversible natural language watermarking methods, the proposed method achieves higher watermark embedding capacity, and renders better security and higher imperceptibility under the same embedding rate.</div></div>","PeriodicalId":48638,"journal":{"name":"Journal of Information Security and Applications","volume":"94 ","pages":"Article 104211"},"PeriodicalIF":3.7000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Security and Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2214212625002480","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Reversible natural language watermarking faces a significant challenge: it must simultaneously guarantee perfect recovery of the original content and produce high-quality, natural output text. Existing methods often suffer from limited embedding capacity or noticeable degradation in fluency and semantics. To overcome these limitations, this paper proposes a novel reversible watermarking method that improves embedding capacity while preserving text naturalness by leveraging augmented word prediction and compression. Specifically, the method uses the masked language model BERT to predict high-quality substitutable candidate words at selected embedding positions. Based on the prediction results, the original words across the entire text are mapped into an unbalanced binary sequence, which is then compressed via arithmetic coding to free up space for the watermark information. The compressed sequence and the watermark bits are jointly embedded by replacing the original words with their predicted substitutes. During extraction, the words at the embedding positions of the watermarked text are decoded to recover both the embedded watermark and the original binary sequence, enabling lossless restoration of the original text. Moreover, to further improve compression efficiency, and thereby embedding capacity, a lexical substitution-based data augmentation strategy expands the corpus used to fine-tune the BERT model. This improves prediction consistency, making it more likely that original words are predicted as the most probable candidates; more original words are therefore mapped to the same value, intensifying the imbalance of the binary sequence and yielding better compression rates and more embedding space. Experimental results demonstrate that, compared with existing reversible natural language watermarking methods, the proposed method achieves higher embedding capacity and provides better security and imperceptibility at the same embedding rate.
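To make the pipeline concrete, here is a minimal Python sketch of the word-to-bit mapping, embedding, extraction, and restoration steps the abstract describes. It is an illustrative assumption throughout: the BERT predictor is mocked by a fixed two-candidate table, the bit convention (top-1 candidate encodes 1) and all helper names are invented for this sketch, and the arithmetic-coding step is represented only by its idealized entropy rate rather than an actual coder.

```python
# Toy sketch of the scheme in the abstract (not the paper's implementation).
import math

# Mock of the masked-LM step: top-2 candidate substitutes per position.
# In the paper this would come from a fine-tuned BERT fill-mask model.
CANDIDATES = {
    0: ["big", "large"],
    1: ["quick", "fast"],
    2: ["happy", "glad"],
    3: ["small", "tiny"],
}

def words_to_bits(words):
    """Map each original word to 1 if it is the top-1 prediction, else 0.
    A well fine-tuned predictor makes this sequence heavily unbalanced
    (mostly 1s), which is what makes it compressible."""
    return [1 if w == CANDIDATES[i][0] else 0 for i, w in enumerate(words)]

def entropy_bits(bits):
    """Shannon entropy per symbol: the idealized arithmetic-coding rate."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def embed(original_words, payload_bits):
    """Replace each word with the candidate whose rank encodes one bit
    (top-1 candidate encodes 1, second candidate encodes 0)."""
    assert len(payload_bits) == len(original_words)
    return [CANDIDATES[i][1 - b] for i, b in enumerate(payload_bits)]

def extract(marked_words):
    """Read the embedded bits back from the candidate ranks."""
    return [1 if w == CANDIDATES[i][0] else 0 for i, w in enumerate(marked_words)]

original = ["big", "quick", "happy", "small"]  # all top-1 -> bits [1,1,1,1]
recovery_bits = words_to_bits(original)
# In the real scheme these recovery bits are arithmetic-coded first, and the
# freed space carries the watermark; here the payload stands in for that
# joint (compressed-sequence + watermark) bitstream for brevity.
payload = [1, 0, 1, 1]
marked = embed(original, payload)
assert extract(marked) == payload
# Restoration: map each recovered bit back to the corresponding candidate.
restored = [CANDIDATES[i][1 - b] for i, b in enumerate(recovery_bits)]
assert restored == original
print(f"recovery-bit entropy: {entropy_bits(recovery_bits):.3f} bits/symbol")
```

The printed entropy is 0.000 bits/symbol here because every original word is the top-1 prediction, which is the limiting case of the paper's argument: the more consistent the predictor, the more skewed the recovery sequence, the smaller its compressed size, and the more room is left for watermark bits.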
Journal introduction:
Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications relevant to information security. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view of modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.