Candidate document retrieval for Arabic-based text reuse detection on the web

2016 12th International Conference on Innovations in Information Technology (IIT). Pub Date: 2016-11-01. DOI: 10.1109/INNOVATIONS.2016.7880048

Leena Lulu, B. Belkhouche, S. Harous


Citations: 0

Abstract

Given an input document d, the problem of local text reuse detection is to find, within a given document collection, all passages that may be reused between d and the other documents. Comparing the passages of d with the passages of every other document in the collection is infeasible, especially for large collections such as the Web. Selecting a subset of documents that potentially share reused text with d therefore becomes a major step in the detection problem. This paper describes a new, efficient query formulation approach for retrieving Arabic-based candidate source documents from the Web. We evaluated the approach on a document collection constructed specifically for this work. The experiments show that, on average, 79.97% of the Web documents used in the reuse cases were successfully retrieved.
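
The abstract does not spell out the query formulation method, so the following is only a minimal sketch of the general candidate retrieval idea, not the authors' approach. It assumes a hypothetical heuristic (one query per fixed-size chunk of words, built from the chunk's longest distinct words) and shows how a retrieval rate such as the reported 79.97% could be computed as the fraction of true source documents found. The names `formulate_queries` and `retrieval_rate`, and the parameters `chunk_size` and `terms_per_query`, are illustrative assumptions.

```python
def formulate_queries(text, chunk_size=50, terms_per_query=5):
    """Build one search query per fixed-size chunk of words.

    Assumption (not from the paper): each query consists of the chunk's
    longest distinct words, a crude stand-in for keyword selection.
    """
    words = text.split()
    queries = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # dict.fromkeys drops duplicates while keeping first-seen order
        distinct = list(dict.fromkeys(chunk))
        # sorted() is stable, so equally long words keep their original order
        top = sorted(distinct, key=len, reverse=True)[:terms_per_query]
        queries.append(" ".join(top))
    return queries


def retrieval_rate(source_urls, retrieved_urls):
    """Fraction of the true source documents that were retrieved."""
    sources = set(source_urls)
    if not sources:
        return 0.0
    return len(sources & set(retrieved_urls)) / len(sources)
```

In such a setup, each query would be submitted to a Web search engine, the returned URLs pooled as candidates, and `retrieval_rate` evaluated against the known source documents of each reuse case. Whitespace tokenization is used here for simplicity; real Arabic text would likely need normalization and stop-word handling first.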
