T-SRE: Transformer-based semantic Relation extraction for contextual paraphrased plagiarism detection

IF 5.2 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of King Saud University-Computer and Information Sciences Pub Date : 2024-12-01 DOI:10.1016/j.jksuci.2024.102257

Pon Abisheka, C. Deisy, P. Sharmila

Volume 36, Issue 10, Article 102257
https://www.sciencedirect.com/science/article/pii/S131915782400346X

Citations: 0

Abstract

Plagiarism has become a pervasive issue for academics and professionals seeking to safeguard academic integrity and intellectual property rights. The growing sophistication of plagiarized content, achieved through semantic manipulation and structural reorganization, poses significant challenges to existing detection systems that rely primarily on lexical similarity measures. The proposed T-SRE (Transformer-based Semantic Relation Extraction) framework addresses the limitations of traditional n-gram and string-matching approaches by leveraging deep semantic analysis. It combines Dependency Parsing (DP) for syntactic relationship mapping and Named Entity Recognition (NER) for contextual entity identification, augmented by a transformer-based neural network that captures long-range contextual dependencies. The learning methodology incorporates three key components: a position-aware word reordering algorithm, a Levenshtein distance metric for structural similarity, and contextual word embeddings for detecting semantic preservation. T-SRE enhances text structure recognition by combining position-aware reordering with semantic preservation through ensemble learning. The system implements a hierarchical classification scheme that quantifies plagiarism severity through a four-tier taxonomy: heavy, low, non-plagiarized, and verbatim copy. On the Udacity benchmark dataset, the model achieves 92% precision, 89% recall, and an F1-score of 90.5%, with particularly strong results on lightweight textual modifications. The framework achieves a granularity score of 1.28, outperforming existing approaches.
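For readers unfamiliar with two of the quantities named in the abstract, the following minimal Python sketch illustrates a word-level Levenshtein distance turned into a similarity score, and the F1-score as the harmonic mean of precision and recall. It is an illustration only and not the authors' implementation: the whitespace tokenization, the normalization by the longer sequence, and the function names `levenshtein`, `structural_similarity`, and `f1_score` are all assumptions made here, and the abstract does not say how T-SRE combines this metric with its other components.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev holds the distances from the first i-1 items of a to every prefix of b.
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # distance from the first i items of a to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def structural_similarity(source: str, suspect: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical token sequences.

    Assumption: lowercase whitespace tokens, normalized by the longer text.
    """
    s, t = source.lower().split(), suspect.lower().split()
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(structural_similarity("the cat sat on the mat", "the cat sat on a mat"))
    print(round(f1_score(0.92, 0.89), 3))
```

With the reported 92% precision and 89% recall, `f1_score` gives roughly 0.905, which agrees with the 90.5% F1-score stated in the abstract.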

Source journal

Journal of King Saud University-Computer and Information Sciences (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 10.50

Self-citation rate: 8.70%

Articles published: 656

Review time: 29 days

About the journal: In 2022 the Journal of King Saud University - Computer and Information Sciences will become an author-paid open access journal. Authors who submit their manuscript after October 31st 2021 will be asked to pay an Article Processing Charge (APC) after acceptance of their paper to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University - Computer and Information Sciences is a refereed, international journal that covers all aspects of both the foundations of computing and its practical applications.