{"title":"Reversible natural language watermarking with augmented word prediction and compression","authors":"Lingyun Xiang , Yangfan Liu , Yuling Liu","doi":"10.1016/j.jisa.2025.104211","DOIUrl":null,"url":null,"abstract":"<div><div>Reversible natural language watermarking presents a significant challenge due to the dual requirements of perfect content recovery and maintaining high-quality, natural outputs. Existing methods often struggle with limited embedding capacity or noticeable degradation in text fluency and semantics. To overcome these limitations, this paper proposes a novel reversible watermarking method that improves embedding capacity while preserving text naturalness by leveraging augmented word prediction and compression techniques. Specifically, the proposed method utilizes the masked language model BERT to predict high-quality candidate substitutable words at selected embedding positions. Based on prediction results, original words across the entire text are mapped into an unbalanced binary sequence, which is then compressed via arithmetic coding to create additional space to accommodate the watermark information. The compressed sequence and the watermark bits are jointly embedded by replacing the original words with their predicted substitutable ones. During watermark extraction, the words at the embedding positions in the watermarked text are decoded to recover the embedded watermark and the original binary sequence, enabling lossless restoration of the original text. Moreover, to further improve compression efficiency, which in turn increases embedding capacity, a lexical substitution-based data augmentation strategy is proposed to expand the corpus for fine-tuning the BERT model. This enhancement improves prediction consistency, increasing the likelihood that more original words are accurately predicted as the most probable candidates. As a result, more original words are mapped to the same value, intensifying the imbalance in the binary sequence and thus favoring better compression rates and more available embedding space. Experimental results demonstrate that, compared to existing similar reversible natural language watermarking methods, the proposed method achieves higher watermark embedding capacity, and renders better security and higher imperceptibility under the same embedding rate.</div></div>","PeriodicalId":48638,"journal":{"name":"Journal of Information Security and Applications","volume":"94 ","pages":"Article 104211"},"PeriodicalIF":3.7000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Security and Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2214212625002480","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Reversible natural language watermarking faces a significant challenge: it must simultaneously guarantee perfect recovery of the original content and produce high-quality, natural output text. Existing methods often suffer from limited embedding capacity or noticeable degradation in fluency and semantics. To overcome these limitations, this paper proposes a novel reversible watermarking method that improves embedding capacity while preserving text naturalness by leveraging augmented word prediction and compression. Specifically, the method uses the masked language model BERT to predict high-quality substitutable candidate words at selected embedding positions. Based on the prediction results, the original words across the entire text are mapped into an unbalanced binary sequence, which is then compressed via arithmetic coding to free up space for the watermark information. The compressed sequence and the watermark bits are jointly embedded by replacing the original words with their predicted substitutes. During extraction, the words at the embedding positions of the watermarked text are decoded to recover both the embedded watermark and the original binary sequence, enabling lossless restoration of the original text. Moreover, to further improve compression efficiency, and thereby embedding capacity, a lexical substitution-based data augmentation strategy expands the corpus used to fine-tune the BERT model. This improves prediction consistency, making it more likely that original words are predicted as the most probable candidates; more original words are therefore mapped to the same value, intensifying the imbalance of the binary sequence and yielding better compression rates and more embedding space. Experimental results demonstrate that, compared with existing reversible natural language watermarking methods, the proposed method achieves higher embedding capacity and provides better security and imperceptibility at the same embedding rate.
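To make the pipeline concrete, here is a minimal Python sketch of the word-to-bit mapping, embedding, extraction, and restoration steps the abstract describes. It is an illustrative assumption throughout: the BERT predictor is mocked by a fixed two-candidate table, the bit convention (top-1 candidate encodes 1) and all helper names are invented for this sketch, and the arithmetic-coding step is represented only by its idealized entropy rate rather than an actual coder.

```python
# Toy sketch of the scheme in the abstract (not the paper's implementation).
import math

# Mock of the masked-LM step: top-2 candidate substitutes per position.
# In the paper this would come from a fine-tuned BERT fill-mask model.
CANDIDATES = {
    0: ["big", "large"],
    1: ["quick", "fast"],
    2: ["happy", "glad"],
    3: ["small", "tiny"],
}

def words_to_bits(words):
    """Map each original word to 1 if it is the top-1 prediction, else 0.
    A well fine-tuned predictor makes this sequence heavily unbalanced
    (mostly 1s), which is what makes it compressible."""
    return [1 if w == CANDIDATES[i][0] else 0 for i, w in enumerate(words)]

def entropy_bits(bits):
    """Shannon entropy per symbol: the idealized arithmetic-coding rate."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def embed(original_words, payload_bits):
    """Replace each word with the candidate whose rank encodes one bit
    (top-1 candidate encodes 1, second candidate encodes 0)."""
    assert len(payload_bits) == len(original_words)
    return [CANDIDATES[i][1 - b] for i, b in enumerate(payload_bits)]

def extract(marked_words):
    """Read the embedded bits back from the candidate ranks."""
    return [1 if w == CANDIDATES[i][0] else 0 for i, w in enumerate(marked_words)]

original = ["big", "quick", "happy", "small"]  # all top-1 -> bits [1,1,1,1]
recovery_bits = words_to_bits(original)
# In the real scheme these recovery bits are arithmetic-coded first, and the
# freed space carries the watermark; here the payload stands in for that
# joint (compressed-sequence + watermark) bitstream for brevity.
payload = [1, 0, 1, 1]
marked = embed(original, payload)
assert extract(marked) == payload
# Restoration: map each recovered bit back to the corresponding candidate.
restored = [CANDIDATES[i][1 - b] for i, b in enumerate(recovery_bits)]
assert restored == original
print(f"recovery-bit entropy: {entropy_bits(recovery_bits):.3f} bits/symbol")
```

The printed entropy is 0.000 bits/symbol here because every original word is the top-1 prediction, which is the limiting case of the paper's argument: the more consistent the predictor, the more skewed the recovery sequence, the smaller its compressed size, and the more room is left for watermark bits.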
Journal introduction:
Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications relevant to information security. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view of modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.