A Scalable Parallel Deduplication Algorithm

19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07) | Pub Date: 2007-11-19 | DOI: 10.1109/SBAC-PAD.2007.32

W. Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr, R. Ferreira, Dorgival Olavo Guedes Neto, A. D. Silva


Citations: 21

Abstract

Identifying replicas in a database is fundamental to improving the quality of the information it holds. Deduplication is the task of identifying replicas in a database that refer to the same real-world entity. This process is not always trivial, because data may be corrupted while being gathered, stored, or manipulated. Problems such as misspelled names, data truncation, data entered in the wrong format, lack of conventions (such as how to abbreviate a name), missing data, or even fraud may lead to the insertion of replicas in a database. Deduplication may be very hard, if not impossible, to perform manually, since real databases may have hundreds of millions of records. In this paper, we present our parallel deduplication algorithm, called FERAPARDA. By using probabilistic record linkage, we were able to successfully detect replicas in synthetic datasets with more than 1 million records in about 7 minutes using a 20-computer cluster, achieving an almost linear speedup. We believe that our results have no parallel in the literature with respect to data set size and processing time.
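The abstract names probabilistic record linkage but does not describe how it is applied. As a rough illustration of the general technique only, and not of FERAPARDA itself, the sketch below scores record pairs with Fellegi-Sunter-style log-likelihood weights and compares only records that share a blocking key. The field names, the m/u probabilities, the blocking key, and the threshold are all assumptions made for this example; none of them come from the paper.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical per-field probabilities (assumed for illustration, not from the paper):
#   m = P(field agrees | the two records are the same entity)
#   u = P(field agrees | the two records are different entities)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "city": (0.90, 0.10),
    "birth_year": (0.98, 0.02),
}


def match_weight(a, b):
    """Sum the log2 agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement: positive weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement: negative weight
    return total


def blocking_key(record):
    """Assumed blocking key: first letter of the name plus the birth year."""
    return (record["name"][:1].lower(), record["birth_year"])


def find_replicas(records, threshold=5.0):
    """Return sorted index pairs whose match weight reaches the threshold."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    pairs = []
    for ids in blocks.values():
        # Only records within the same block are compared with each other.
        for i, j in combinations(ids, 2):
            if match_weight(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


# Made-up sample records, used only to exercise the functions above.
SAMPLE_RECORDS = [
    {"name": "Maria Silva", "city": "Belo Horizonte", "birth_year": 1980},
    {"name": "Maria Silva", "city": "B. Horizonte", "birth_year": 1980},
    {"name": "Marcos Souza", "city": "Recife", "birth_year": 1980},
    {"name": "Joao Lima", "city": "Recife", "birth_year": 1975},
]

if __name__ == "__main__":
    print(find_replicas(SAMPLE_RECORDS))
```

Because each block is scored independently of the others, blocks are a natural unit of work to spread across machines. How the paper actually partitions its work is not stated in the abstract.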


