Efficient Density-Based Blocking for Record Matching
Chenxiao Dou, Ruoyu Wang, Daniel W. Sun, M. Atif
Proceedings of the 21st International Database Engineering & Applications Symposium, 2017-07-12
DOI: 10.1145/3105831.3105844 (https://doi.org/10.1145/3105831.3105844)
Abstract
Record matching in data engineering refers to searching for data records that originate from the same entities across different data sources. In practice, the main challenge of record matching is that the number of non-matches typically far exceeds the number of matches. This is called the imbalance problem, and it notoriously degrades both the efficiency and the effectiveness of matching algorithms. To address the imbalance problem, density-based blocking algorithms have recently been studied and have demonstrated effective blocking performance. However, the efficiency of density-based blocking approaches is not as good as their effectiveness. In this paper, we improve the efficiency of density-based blocking by exploiting the ideas of pre-computing and pruning. Our approach optimizes the method of computing density to speed up the blocking process. In experiments on real-world datasets, the proposed approach demonstrated high performance in both blocking efficiency and blocking effectiveness.
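To make the terms in the abstract concrete, the following is a minimal sketch of a generic density-based blocking pass with pre-computation and pruning. It is not the algorithm from the paper, whose details the abstract does not give. Everything here is an assumption made for illustration: the token-based Jaccard similarity, the inverted index used for pruning, the DBSCAN-style expansion, and the names `neighbor_sets`, `density_blocks`, `threshold`, and `min_neighbors`.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record: str) -> frozenset:
    """Split a record into a set of lowercase whitespace-separated tokens."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def neighbor_sets(records, threshold=0.5):
    """Return (neighbors, number_of_compared_pairs).

    neighbors[i] holds the indices whose similarity to record i is at
    least `threshold`. Hypothetical helper, not taken from the paper.
    """
    # Pre-computing: tokenize each record once and build an inverted index.
    token_sets = [tokens(r) for r in records]
    index = defaultdict(set)  # token -> indices of records containing it
    for i, ts in enumerate(token_sets):
        for t in ts:
            index[t].add(i)

    # Pruning: pairs that share no token have similarity 0, so they are
    # never compared. This is where the non-matching majority is skipped.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    neighbors = defaultdict(set)
    for i, j in candidates:
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors, len(candidates)


def density_blocks(records, threshold=0.5, min_neighbors=1):
    """Group records into blocks by expanding from dense records.

    A record is treated as dense if it has at least `min_neighbors`
    neighbors (an assumed, DBSCAN-like definition of density).
    """
    neighbors, compared = neighbor_sets(records, threshold)
    dense = {i for i in range(len(records)) if len(neighbors[i]) >= min_neighbors}
    blocks, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        block, stack = set(), [start]
        while stack:
            i = stack.pop()
            if i in block:
                continue
            block.add(i)
            if i in dense:  # only dense records extend the block further
                stack.extend(neighbors[i] - block)
        seen |= block
        blocks.append(sorted(block))
    return blocks, compared


# Made-up example records, for illustration only.
records = [
    "john smith 12 oak street",
    "john smith 12 oak st",
    "jon smith 12 oak street",
    "mary jones 4 elm road",
    "mary jones 4 elm rd",
    "peter brown 77 pine avenue",
]
blocks, compared = density_blocks(records)
total_pairs = len(records) * (len(records) - 1) // 2
print(blocks, compared, total_pairs)
```

On these six made-up records, only pairs that share a token are scored, so most of the 15 possible pairs are never compared. In this sketch, the final matching step would then run only on records inside the same block.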