An Approach of Standardization and Searching based on Hierarchical Bayesian Clustering (HBC) for Record Linkage System
Zin War Tun, N. Thein
Fifth International Conference on Creating, Connecting and Collaborating through Computing (C5 '07), 2007-01-24
DOI: 10.1109/C5.2007.5 (https://doi.org/10.1109/C5.2007.5)
Citations: 1
Abstract
Information sources on the Web use different text formats and contain varying inconsistencies. Data from many online sources do not contain enough information to link records accurately. To link records from different data sources, a system must identify the common entities in those sources. The major challenges in record linkage are therefore computational complexity and linkage accuracy. To reduce the number of record pairs that must be compared, record linkage uses similarity search techniques to find candidate similar records. Various searching methods have been used in record linkage systems. In this paper, we propose a record linkage framework that focuses on standardization and enhances the searching step by adopting a cluster-based searching method called Hierarchical Bayesian Clustering (HBC), which both makes record pair comparison more efficient and improves record linkage accuracy. The purpose of this method is to place similar records into clusters, which restricts the search scope for record comparison and also enhances matching accuracy.
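The general idea of standardizing records and then restricting comparison to records in the same cluster can be illustrated with a minimal sketch. It does not implement HBC: a simple greedy clustering on token-set Jaccard similarity stands in for it, and the function names, the 0.5 threshold, and the sample records are all assumptions made for illustration, not details from the paper.

```python
from itertools import combinations


def standardize(record: str) -> frozenset:
    """Lowercase, replace punctuation with spaces, and return the token set."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in record)
    return frozenset(cleaned.split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Token-set overlap in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_records(records, threshold=0.5):
    """Greedy single-pass clustering (a stand-in for HBC, not HBC itself).

    A record joins the first cluster whose first member is similar enough;
    otherwise it starts a new cluster. The threshold is an assumed value.
    """
    tokens = [standardize(r) for r in records]
    clusters = []  # each cluster is a list of record indices
    for i, tok in enumerate(tokens):
        for cluster in clusters:
            if jaccard(tok, tokens[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def candidate_pairs(clusters):
    """Only records that share a cluster become comparison candidates."""
    pairs = []
    for cluster in clusters:
        pairs.extend(combinations(cluster, 2))
    return pairs


# Hypothetical sample records with inconsistent formatting.
records = [
    "Smith, John  12 Main St.",
    "john smith 12 main st",
    "Mary Jones, 5 Oak Avenue",
    "JONES MARY 5 OAK AVENUE",
    "Peter Brown 99 Elm Road",
]
clusters = cluster_records(records)
pairs = candidate_pairs(clusters)
all_pairs = list(combinations(range(len(records)), 2))
print(f"{len(pairs)} candidate pairs instead of {len(all_pairs)}")
```

In this toy example, clustering reduces the comparison step from all ten record pairs to the two pairs that share a cluster, which is the kind of search-scope restriction the abstract describes.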