Combining style and semantics for robust authorship verification

Impact factor: 4.9

Machine learning with applications | Pub Date: 2025-09-23 | DOI: 10.1016/j.mlwa.2025.100732

Britt van Leeuwen, Sandjai Bhulai, Rob van der Mei

Volume 22, Article 100732. Full text: https://www.sciencedirect.com/science/article/pii/S266682702500115X

Citations: 0

Abstract

Authorship Verification is a key task in Natural Language Processing, essential for applications like plagiarism detection and content authentication. This paper analyzes the use of deep learning models for Authorship Verification, focusing on combining semantic and style features to enhance model performance. We propose three models: the Feature Interaction Network, Pairwise Concatenation Network, and Siamese Network, which aim to determine if two texts are written by the same author. Each model uses RoBERTa embeddings to capture semantic content and incorporates style features such as sentence length, word frequency, and punctuation to differentiate authors based on writing style.
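To make the style features named above (sentence length, word frequency, punctuation) concrete, here is a minimal sketch in plain Python. The function names `style_features` and `pair_difference`, the choice of type-token ratio as a stand-in for word-frequency information, and the absolute-difference comparison are assumptions made for illustration. They are not taken from the paper, and the RoBERTa embedding side of the models is left out entirely.

```python
import re
import string
from collections import Counter


def style_features(text: str) -> dict:
    """Compute three simple, hand-crafted style features for one text.

    Illustrative only: the exact feature set used in the paper is not
    specified here, so these three are assumptions.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lower-cased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words)
    n_chars = len(text)
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    return {
        # Average number of words per sentence.
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
        # Distinct words divided by total words (vocabulary richness).
        "type_token_ratio": len(counts) / n_words if n_words else 0.0,
        # Share of characters that are ASCII punctuation.
        "punct_per_char": n_punct / n_chars if n_chars else 0.0,
    }


def pair_difference(text_a: str, text_b: str) -> list:
    """Absolute per-feature difference between two texts (hypothetical helper).

    A verification model could consume a vector like this alongside
    semantic embeddings of the two texts.
    """
    fa, fb = style_features(text_a), style_features(text_b)
    return [abs(fa[k] - fb[k]) for k in sorted(fa)]
```

A pair of texts with similar values on these features yields a difference vector close to zero. How such a vector is merged with the semantic embeddings differs between the three architectures and is not reproduced here.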

Our results confirm that incorporating style features consistently improves model performance, with the extent of improvement varying by architecture. This demonstrates the value of combining semantic and stylistic information for Authorship Verification. While limitations such as RoBERTa’s fixed input length and the use of predefined style features exist, they do not hinder model effectiveness and point to clear opportunities for future enhancement through extended input handling and dynamic style feature extraction.
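One common way to work around a fixed input length is to split a long token sequence into overlapping windows, encode each window separately, and average the resulting vectors. The sketch below illustrates that idea under stated assumptions: the 512-token default reflects the limit usually cited for RoBERTa-base, the stride value is arbitrary, and `sliding_windows` and `mean_pool` are hypothetical helper names. The abstract does not say that the authors use this approach; it only names extended input handling as future work.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a token-id sequence into overlapping windows of at most max_len.

    stride is the offset between window starts; it must not exceed
    max_len, otherwise tokens between windows would be skipped.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("require 0 < stride <= max_len")
    token_ids = list(token_ids)
    if len(token_ids) <= max_len:
        return [token_ids]
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (e.g. one per window)."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

In use, each window would be passed through the encoder and the per-window embeddings combined with `mean_pool` to obtain one vector for the whole text.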

In contrast to prior studies such as Bevendorff et al. (2020) and Kestemont et al. (2022), which relied on balanced and homogeneous datasets with consistent topics and well-formed language, our work evaluates models on a more challenging, imbalanced, and stylistically diverse dataset, better reflecting real-world Authorship Verification conditions. Despite the increased difficulty, our models achieve competitive results, underscoring their robustness and practical applicability.
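A small worked example shows why an imbalanced dataset is harder to evaluate on than a balanced one: plain accuracy can look good for a model that never predicts the minority class. The sketch below is illustrative only. The label convention (1 = same author, 0 = different author), the toy class ratio, and the helper name `binary_metrics` are assumptions, and the abstract does not state which metrics the authors report.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy imbalanced set: 2 same-author pairs, 8 different-author pairs.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A degenerate model that always answers "different author".
always_negative = [0] * len(y_true)
metrics = binary_metrics(y_true, always_negative)
```

Here the degenerate model reaches 0.8 accuracy while its recall and F1 on the same-author class are both 0, which is why results on imbalanced data are usually read through class-sensitive metrics as well.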

These findings support the value of combining semantic and style features for real-world Authorship Verification.




Source journal

Machine learning with applications (Management Science and Operations Research, Artificial Intelligence, Computer Science Applications)

Self-citation rate: 0.00%

Review time: 98 days