A Scalable Parallel Deduplication Algorithm

W. Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr, R. Ferreira, Dorgival Olavo Guedes Neto, A. D. Silva
Published in: 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)
Publication date: 2007-11-19
DOI: 10.1109/SBAC-PAD.2007.32 (https://doi.org/10.1109/SBAC-PAD.2007.32)
Citations: 21

Abstract

Identifying replicas in a database is fundamental to improving the quality of its information. Deduplication is the task of identifying records in a database that refer to the same real-world entity. The process is not always trivial, because data may be corrupted while being gathered, stored, or manipulated. Problems such as misspelled names, data truncation, data entered in the wrong format, lack of conventions (such as how to abbreviate a name), missing data, or even fraud may lead to the insertion of replicas in a database. Deduplication may be very hard, if not impossible, to perform manually, since real databases may hold hundreds of millions of records. In this paper, we present our parallel deduplication algorithm, called FERAPARDA. Using probabilistic record linkage, we successfully detected replicas in synthetic datasets with more than 1 million records in about 7 minutes on a 20-computer cluster, achieving almost linear speedup. We believe our results have no counterpart in the literature with respect to data set size and processing time.
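To make the idea of probabilistic record linkage concrete, the following is a minimal single-machine sketch in the general Fellegi-Sunter style. It is not the FERAPARDA implementation, which the abstract does not describe. The field names, the m/u probabilities, the threshold, and the function names `pair_weight` and `find_replicas` are all hypothetical, and exact string equality stands in for the approximate comparison a real system would likely use.

```python
import math
from itertools import combinations

# Hypothetical per-field probabilities (illustrative, not from the paper):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "city": (0.90, 0.10),
    "birth_year": (0.98, 0.02),
}


def pair_weight(a, b):
    """Sum the log2 agreement or disagreement weight of every compared field."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def find_replicas(records, block_key, threshold):
    """Group records by a blocking key, then score pairs inside each block."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec[block_key], []).append(rec)

    matches = []
    for block in blocks.values():
        # Only records sharing the blocking key are compared, which avoids
        # scoring every possible pair in the dataset.
        for a, b in combinations(block, 2):
            if pair_weight(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches
```

Because each block is scored independently, blocks are a natural unit of work to spread across the machines of a cluster. Whether FERAPARDA partitions its work this way is an assumption, not something the abstract states.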