Detection of Plagiarism in Urdu Text Documents

2018 14th International Conference on Emerging Technologies (ICET) Pub Date: 2018-11-01 DOI: 10.1109/ICET.2018.8603616

Waqar Ali, Tanveer Ahmed, Zobia Rehman, A. Rehman, M. Slaman

Citations: 7

Abstract

Detection of Plagiarism in Urdu Text Documents

Plagiarism, intellectual theft, and copyright violation are among the most important problems facing researchers and academic organizations such as universities. Well-known publicly available plagiarism detection (PD) tools include Turnitin, APlagramme, Plagscan, and Aplag. However, these tools mainly work for English, Persian, and Arabic. Copyrighted and intellectual documents are written in every language of the world, and in many South Asian countries, including Pakistan and India, a huge amount of academic content is available in Urdu. Unfortunately, owing to scarce resources and limited attention from researchers, little work has been done on Urdu PD. This paper presents a system for detecting plagiarism in Urdu. Most existing Urdu PD systems fail to identify paraphrase plagiarism when comparing a suspicious document against a source document. The proposed system, by contrast, can identify several types of plagiarism, such as sentence reordering, insertion/deletion, inter-textual similarity, and near-copy similarity. It is based on a distance measuring method, a structural alignment algorithm, and a vector space model. System performance is evaluated using machine learning classifiers, namely Support Vector Machine and Naïve Bayes. The experimental results show that the proposed method performs better than existing models such as the cosine method and the simple Jaccard measure.
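To make the two baselines named in the abstract concrete, the following is a minimal Python sketch of a simple Jaccard measure and a bag-of-words cosine similarity between a source and a suspicious text. It is not the authors' implementation: the function names, the whitespace tokenizer, and the placeholder tokens (`w1`, `w2`, ...) are assumptions made for illustration, and real Urdu text would need proper word segmentation and normalization.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: plain whitespace tokenization stands in for real Urdu segmentation.
    return text.split()


def jaccard(a: str, b: str) -> float:
    # Simple Jaccard measure: shared distinct tokens over all distinct tokens.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a: str, b: str) -> float:
    # Cosine similarity between term-frequency vectors (a basic vector space model).
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical placeholder tokens, not real Urdu data.
source = "w1 w2 w3 w4"
reordered = "w3 w4 w1 w2"   # same tokens, different order
substituted = "w1 w2 w3 w5"  # one token replaced

print(jaccard(source, reordered), cosine(source, reordered))
print(jaccard(source, substituted), cosine(source, substituted))
```

Both measures ignore token order, so they score a pure reordering as identical to the source, and a single substituted token lowers both scores. Neither measure has any notion of meaning, so a paraphrase that swaps many words for synonyms would score low on both.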
