Candidate document retrieval for Arabic-based text reuse detection on the web

2016 12th International Conference on Innovations in Information Technology (IIT). Pub Date: 2016-11-01. DOI: 10.1109/INNOVATIONS.2016.7880048

Leena Lulu, B. Belkhouche, S. Harous


Abstract

Given an input document d, the problem of local text reuse detection is to detect, from a given document collection, all passages reused between d and the other documents. Comparing the passages of d with the passages of every other document in the collection is infeasible, especially for large collections such as the Web. Therefore, selecting a subset of documents that potentially contain text reused with d becomes a major step in the detection problem. This paper describes a new, efficient query formulation approach for retrieving Arabic-based candidate source documents from the Web. We evaluated the approach using a collection of documents constructed specifically for this work. The experiments show that, on average, 79.97% of the Web documents used in the reuse cases were successfully retrieved.
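
The abstract does not spell out the query formulation itself, so the following is only a minimal sketch of the general idea of candidate retrieval, not the authors' method. It assumes a simple heuristic (split the document into fixed-size word chunks and use the most frequent non-stopword terms of each chunk as a search query) and shows how a retrieval rate like the one reported could be computed as the fraction of known source documents that appear among the retrieved results. The names `formulate_queries`, `retrieval_recall`, `chunk_size`, `terms_per_query`, and the tiny stopword set are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); a real system would
# use a full Arabic stopword list.
STOPWORDS = {"في", "من", "على", "إلى", "عن", "أن", "هذا", "التي", "و"}


def tokenize(text):
    # In Python 3, \w matches Unicode letters, so undiacritized Arabic
    # words are kept as single tokens.
    return re.findall(r"\w+", text)


def formulate_queries(text, chunk_size=50, terms_per_query=5):
    """Build one keyword query per chunk of `chunk_size` tokens."""
    tokens = tokenize(text)
    queries = []
    for start in range(0, len(tokens), chunk_size):
        chunk = [t for t in tokens[start:start + chunk_size]
                 if t not in STOPWORDS]
        # most_common orders equal counts by first occurrence.
        top = [term for term, _ in Counter(chunk).most_common(terms_per_query)]
        if top:
            queries.append(" ".join(top))
    return queries


def retrieval_recall(retrieved_urls, source_urls):
    """Fraction of the true source documents found among the results."""
    sources = set(source_urls)
    if not sources:
        return 0.0
    return len(sources & set(retrieved_urls)) / len(sources)
```

Each query would be submitted to a Web search engine, and the union of the returned documents forms the candidate set that is passed to the detailed passage comparison. Averaging `retrieval_recall` over all test documents gives a single percentage comparable in kind to the figure quoted above.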
