模糊重复的鲁棒识别

21st International Conference on Data Engineering (ICDE'05) Pub Date : 2005-04-05 DOI:10.1109/ICDE.2005.125

S. Chaudhuri, Venkatesh Ganti, R. Motwani

{"title":"模糊重复的鲁棒识别","authors":"S. Chaudhuri, Venkatesh Ganti, R. Motwani","doi":"10.1109/ICDE.2005.125","DOIUrl":null,"url":null,"abstract":"Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"227","resultStr":"{\"title\":\"Robust identification of fuzzy duplicates\",\"authors\":\"S. Chaudhuri, Venkatesh Ganti, R. Motwani\",\"doi\":\"10.1109/ICDE.2005.125\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.\",\"PeriodicalId\":297231,\"journal\":{\"name\":\"21st International Conference on Data Engineering (ICDE'05)\",\"volume\":\"193 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-04-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"227\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"21st International Conference on Data Engineering (ICDE'05)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2005.125\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"21st International Conference on Data Engineering (ICDE'05)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2005.125","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 227

摘要

检测和消除模糊重复是许多应用程序需要的关键数据清理任务。模糊副本是多个看似不同的元组，它们表示相同的现实世界实体。我们提出了两个新的标准，使模糊重复的表征比现有的技术更准确。利用这些准则，我们提出了一个新的模糊重复消除问题的框架。我们表明，新框架内的解决方案比以前的方法具有更好的准确性。我们提出了一个在框架内求解实例化的有效算法。我们在实际数据集上对其进行了评估，以证明我们的算法的准确性和可扩展性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Robust identification of fuzzy duplicates

Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

21st International Conference on Data Engineering (ICDE'05)

自引率

0.00%

发文量