Has machine paraphrasing skills approached humans? Detecting automatically and manually generated paraphrased cases
Iqra Muneer, Aysha Shehzadi, Muhammad Adnan Ashraf, Rao Muhammad Adeel Nawab
Big Data Research, volume 39, Article 100507. Published 2025-01-22. DOI: 10.1016/j.bdr.2025.100507
Impact factor: 3.5. JCR: Q2 (Computer Science, Artificial Intelligence).
https://www.sciencedirect.com/science/article/pii/S2214579625000024
Citation count: 0
Abstract
In recent years, automatic text rewriting (or paraphrasing) tools have become readily and publicly available. These tools make paraphrasing exceptionally easy, which encourages effortless plagiarism and text reuse. In the literature, most efforts have focused on detecting real cases of paraphrasing (manual, human-written), mainly in the domain of journalism. However, paraphrase detection has not been thoroughly explored for artificial cases (machine-paraphrased text), mainly due to a lack of standard resources for evaluation. To fill this gap, this study proposes three benchmark corpora of artificial paraphrase cases at the sentence level, and one real-case corpus containing examples from daily life activities. Three popular and widely used online text rewriting tools, i.e., paraphrasing-tools, articlerewritetool and rewritertools, were used to develop the artificial-case corpora. Further, we used two real-case corpora: the Microsoft Paraphrase Corpus (MSRP), from the domain of journalism, and a proposed real-case corpus (Q-MSRP) that combines carefully extracted Quora question pairs with MSRP. Both real-case and artificial-case paraphrases were evaluated using classical machine learning, transfer learning, large language models and a proposed model, to investigate which of the two types of paraphrasing is more difficult to detect. The results show that our proposed model outperforms all the other approaches for both artificial and real-case paraphrase detection. A thorough analysis of the results suggests that manual paraphrasing is still generally harder to detect, but certain machine-paraphrased texts are equally difficult to detect. All proposed corpora are freely available to promote research on artificial-case paraphrase detection.
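To make the sentence-level detection task concrete, the following is a minimal sketch of a lexical-overlap baseline: a sentence pair is labeled a paraphrase when the Jaccard similarity of its word sets reaches a threshold. This is an illustrative assumption and not the model proposed in the paper; the function names and the threshold of 0.5 are hypothetical.

```python
import re


def tokens(sentence: str) -> set[str]:
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty sentences are treated as identical.
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> bool:
    """Label a pair as a paraphrase when token overlap reaches the threshold.

    The default threshold is an arbitrary, hypothetical value; in practice
    it would be tuned on a labeled corpus.
    """
    return jaccard(a, b) >= threshold
```

A baseline like this scores only shared surface vocabulary, so it would need to be compared against learned models on a labeled corpus before drawing any conclusion about which paraphrase type is harder to detect.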
About the journal:
The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic.
The journal will accept papers on foundational aspects in dealing with big data, as well as papers on specific Platforms and Technologies used to deal with big data. To promote Data Science and interdisciplinary collaboration between fields, and to showcase the benefits of data driven research, papers demonstrating applications of big data in domains as diverse as Geoscience, Social Web, Finance, e-Commerce, Health Care, Environment and Climate, Physics and Astronomy, Chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.