A Novel Framework for Plagiarism Detection: A Case Study for Urdu Language

Waqar Ali, Tanveer Ahmad, Zobia Rehman, A. Rehman, M. A. Shah, Ansar Abbas, Ghulam Dustgeer
Published in: 2018 24th International Conference on Automation and Computing (ICAC), 2018-09-01
DOI: 10.23919/IConAC.2018.8749122 (https://doi.org/10.23919/IConAC.2018.8749122)
Citations: 1

Abstract

Plagiarism is the act of presenting someone else's ideas, words, or original work as one's own without acknowledging the original source. It creates many problems, especially for academic institutions and researchers. Many plagiarism detection tools are publicly available to address these problems; however, these tools mainly work for particular languages such as Arabic and English. In South Asian countries, specifically India and Pakistan, a large part of research content is available in Hindi and Urdu. Unfortunately, plagiarism detection in Urdu text has not received due attention from the research community, because Urdu has a complex sentence structure and lacks linguistic resources. In this paper, we propose a novel framework for plagiarism detection specifically for the Urdu language. No benchmark corpus is available for Urdu plagiarism detection, so we developed an Urdu corpus. We applied distance-measuring methods along with vector space methods to measure the similarity between suspicious and source text. For evaluation, we defined different classes of plagiarized text: paraphrased, heavily plagiarized, lightly plagiarized, and direct copy-paste. We then evaluated each class of plagiarized text in terms of precision, recall, and F-measure. The experimental results show that the Levenshtein distance and Jaccard containment methods produced a significant improvement in plagiarism detection performance compared with existing methods.
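The abstract names the measures but does not define them, so the following is only a minimal sketch of how such measures are commonly computed, not the authors' implementation. It assumes texts are already split into tokens (for example on whitespace), applies Levenshtein distance at the token level, uses the usual containment definition (the share of the suspicious text's n-grams that also occur in the source), and uses cosine similarity over term counts as one possible vector space measure. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Normalise the distance to [0, 1]; 1.0 means identical (an assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngrams(tokens, n):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=1):
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    grams = ngrams(suspicious, n)
    if not grams:
        return 0.0
    return len(grams & ngrams(source, n)) / len(grams)


def cosine(a, b):
    """Cosine similarity of raw term-count vectors (one vector space option)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def precision_recall_f1(predicted, actual):
    """Precision, recall and F-measure for sets of detected vs. true cases."""
    tp = len(predicted & actual)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(actual) if actual else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In a setup like this, a suspicious document would be flagged when one of the similarity scores exceeds a chosen threshold, and `precision_recall_f1` would then be computed separately for each class of plagiarized text. The threshold and the n-gram size are tuning choices that the abstract does not specify.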