An incremental clustering scheme for duplicate detection in large databases

9th International Database Engineering & Application Symposium (IDEAS'05). Pub Date: 2005-07-25. DOI: 10.1109/IDEAS.2005.10

Eugenio Cesario, Francesco Folino, G. Manco, L. Pontieri


Citations: 12

Abstract

We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns any new tuple t to the cluster containing the database tuples most similar to t, which are therefore likely to refer to the same real-world entity as t. The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method achieves a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
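The abstract does not say which hash functions or similarity measure the authors use, so the following is only a minimal sketch of the general idea, under stated assumptions: tuples are rendered as strings, similarity is Jaccard over character 3-grams, and the bucket index is MinHash with banding (a swapped-in, well-known locality-sensitive hashing scheme, not necessarily the paper's). The names `IncrementalDeduper`, `qgrams`, `minhash`, and the parameters `bands`, `rows`, and `threshold` are hypothetical.

```python
import hashlib


def qgrams(text, q=3):
    """Set of character q-grams of a tuple rendered as one lowercase string."""
    s = text.lower()
    return {s[i:i + q] for i in range(max(1, len(s) - q + 1))}


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, num_hashes):
    """MinHash signature: similar sets tend to agree on many positions."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


class IncrementalDeduper:
    """Assigns each incoming tuple to the cluster of its most similar neighbor."""

    def __init__(self, bands=8, rows=2, threshold=0.5):
        self.bands, self.rows, self.threshold = bands, rows, threshold
        self.buckets = {}      # (band index, band signature) -> list of tuple ids
        self.tokens = []       # tuple id -> q-gram set
        self.cluster_of = []   # tuple id -> cluster id
        self.next_cluster = 0

    def add(self, text):
        toks = qgrams(text)
        sig = minhash(toks, self.bands * self.rows)
        keys = [(b, sig[b * self.rows:(b + 1) * self.rows]) for b in range(self.bands)]
        # Only tuples that share at least one bucket are compared exactly.
        candidates = {i for k in keys for i in self.buckets.get(k, ())}
        best = max(candidates, key=lambda i: jaccard(toks, self.tokens[i]), default=None)
        if best is not None and jaccard(toks, self.tokens[best]) >= self.threshold:
            cluster = self.cluster_of[best]   # join the nearest neighbor's cluster
        else:
            cluster = self.next_cluster       # no close neighbor: open a new cluster
            self.next_cluster += 1
        new_id = len(self.tokens)
        self.tokens.append(toks)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets.setdefault(k, []).append(new_id)
        return cluster


dedup = IncrementalDeduper()
first = dedup.add("John Smith, 12 Main Street")
again = dedup.add("John Smith, 12 Main Street")   # exact repeat joins the same cluster
other = dedup.add("zzzz qqqq xxxx")               # shares no 3-grams, gets a new cluster
```

The point of the bucket lookup is that each new tuple is compared exactly only against the tuples it collides with, rather than against the whole database. Whether a near-duplicate (as opposed to an exact repeat) collides depends on the chosen `bands` and `rows`, which would need tuning against real data.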

