Near-duplicate detection by instance-level constrained clustering

Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Pub Date: 2006-08-06. DOI: 10.1145/1148170.1148243

G. Yang, Jamie Callan

Citations: 102

Abstract

For near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
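The abstract does not spell out the algorithm, so the following is only a minimal sketch of what clustering under instance-level constraints can look like, not the authors' method. It assumes pairwise must-link and cannot-link constraints (which might, for example, be derived from document attributes), a plain word-set Jaccard similarity, and a greedy single-link merge. The function names, the `threshold` parameter, and the sample comments are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def constrained_cluster(docs, must_link=(), cannot_link=(), threshold=0.6):
    """Greedy single-link clustering under pairwise constraints (illustrative).

    docs: dict mapping a document id to its text.
    must_link / cannot_link: iterables of (id, id) pairs (assumed inputs).
    Returns a sorted list of clusters, each a sorted list of ids.
    """
    tokens = {d: set(text.lower().split()) for d, text in docs.items()}
    parent = {d: d for d in docs}      # union-find forest
    members = {d: {d} for d in docs}   # cluster root -> member ids
    forbidden = {frozenset(p) for p in cannot_link}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Refuse any merge that would put a cannot-link pair in one cluster.
        if any(frozenset((x, y)) in forbidden
               for x in members[ra] for y in members[rb]):
            return
        parent[rb] = ra
        members[ra] |= members.pop(rb)

    # Must-link pairs are merged first, regardless of textual similarity.
    for a, b in must_link:
        union(a, b)

    # Remaining pairs are merged from most to least similar.
    pairs = sorted(combinations(docs, 2),
                   key=lambda p: jaccard(tokens[p[0]], tokens[p[1]]),
                   reverse=True)
    for a, b in pairs:
        if jaccard(tokens[a], tokens[b]) < threshold:
            break  # pairs are sorted, so no later pair can qualify
        union(a, b)

    return sorted(sorted(m) for m in members.values())


if __name__ == "__main__":
    comments = {
        "a": "please protect our national forests from new roads",
        "b": "please protect our national forests from new roads now",
        "c": "please protect our national forests from logging roads",
        "d": "i support the proposed rule on mercury emissions",
    }
    print(constrained_cluster(comments))
    print(constrained_cluster(comments, cannot_link=[("a", "c")]))
```

In this sketch a cannot-link constraint takes precedence over both similarity and a conflicting must-link pair; a real system would need an explicit policy for resolving contradictory constraints.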


