A Near-Duplicate Detection Algorithm to Facilitate Document Clustering

International Journal of Data Mining & Knowledge Management Process. Pub Date: 2014-11-30. DOI: 10.5121/IJDKP.2014.4604

Lavanya Pamulaparty, D. C. V. G. Rao, D. S. Rao

Citations: 6

Abstract

Web mining faces serious problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data such as the internet. The presence of these pages contributes significantly to performance degradation when integrating data from heterogeneous sources, because they increase both index storage space and serving costs. Detecting them has many potential applications; for example, a near duplicate may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents as a preprocessing step for document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs well in terms of similarity measures. Identifying duplicate and near-duplicate documents also reduces the memory required in repositories.
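The abstract does not specify the detection algorithm, so the following is only a minimal illustrative sketch of the general idea of near-duplicate removal before clustering. It assumes word-level shingling with Jaccard similarity, a common generic technique that is not claimed to be the authors' method. The function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed representation)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Return indices of documents kept after near-duplicate removal.

    A document is dropped when its similarity to an already kept
    document reaches the (hypothetical) threshold. This pairwise
    comparison is quadratic and meant only for small collections.
    """
    kept = []  # list of (index, shingle_set) for retained documents
    for i, doc in enumerate(docs):
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for _, t in kept):
            kept.append((i, s))
    return [i for i, _ in kept]


if __name__ == "__main__":
    news = [
        "the quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog!",
        "stock markets fell sharply on monday morning",
    ]
    print(drop_near_duplicates(news))
```

The surviving indices could then be passed to any clustering step. For collections at web scale, the exhaustive pairwise comparison would normally be replaced by a hashing-based approximation, which this sketch does not attempt.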

