W. Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr, R. Ferreira, Dorgival Olavo Guedes Neto, A. D. Silva
{"title":"A Scalable Parallel Deduplication Algorithm","authors":"W. Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr, R. Ferreira, Dorgival Olavo Guedes Neto, A. D. Silva","doi":"10.1109/SBAC-PAD.2007.32","DOIUrl":null,"url":null,"abstract":"The identification of replicas in a database is fundamental to improve the quality of the information. Deduplication is the task of identifying replicas in a database that refer to the same real world entity. This process is not always trivial, because data may be corrupted during their gathering, storing or even manipulation. Problems such as misspelled names, data truncation, data input in a wrong format, lack of conventions (like how to abbreviate a name), missing data or even fraud may lead to the insertion of replicas in a database. The deduplication process may be very hard, if not impossible, to be performed manually, since actual databases may have hundreds of millions of records. In this paper, we present our parallel deduplication algorithm, called FER- APARDA. By using probabilistic record linkage, we were able to successfully detect replicas in synthetic datasets with more than 1 million records in about 7 minutes using a 20- computer cluster, achieving an almost linear speedup. We believe that our results do not have similar in the literature when it comes to the size of the data set and the processing time.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2007.32","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21
Abstract
The identification of replicas in a database is fundamental to improve the quality of the information. Deduplication is the task of identifying replicas in a database that refer to the same real world entity. This process is not always trivial, because data may be corrupted during their gathering, storing or even manipulation. Problems such as misspelled names, data truncation, data input in a wrong format, lack of conventions (like how to abbreviate a name), missing data or even fraud may lead to the insertion of replicas in a database. The deduplication process may be very hard, if not impossible, to be performed manually, since actual databases may have hundreds of millions of records. In this paper, we present our parallel deduplication algorithm, called FER- APARDA. By using probabilistic record linkage, we were able to successfully detect replicas in synthetic datasets with more than 1 million records in about 7 minutes using a 20- computer cluster, achieving an almost linear speedup. We believe that our results do not have similar in the literature when it comes to the size of the data set and the processing time.