Data-driven name reduction for record linkage
M. Schraagen, W. Kosters
Second International Conference on the Innovative Computing Technology (INTECH 2012), published 2012-09-01
DOI: 10.1109/INTECH.2012.6457783 (https://doi.org/10.1109/INTECH.2012.6457783)
Citations: 2
Abstract
Automatic record linkage of data containing personal names is difficult in the presence of name variation and spelling errors. This paper presents a standardization procedure for personal names to address the variation problem. A classification-tree-based model is constructed using a training set of 65,002 name-variant pairs. The method provides an efficient procedure for record linkage (3,500 records per second, F-measure 0.96 on a sample of Dutch historical civil records). The results include links with a large edit distance between the records; however, recall is lower for this category. A bootstrapping procedure is used to improve recall.
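The idea of linking records through standardized names, and the two quantities the abstract reports (edit distance and F-measure), can be illustrated with a minimal sketch. This is not the paper's method: the hand-written `STANDARD_FORM` table is a hypothetical stand-in for the learned classification tree model, and the sample names, record ids, and function names are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical variant -> standard form table. In the paper this mapping is
# produced by a learned model; here it is hard-coded for illustration.
STANDARD_FORM = {
    "jan": "johannes",
    "johannes": "johannes",
    "hannes": "johannes",
    "maria": "maria",
    "marie": "maria",
    "mietje": "maria",
}


def reduce_name(name):
    """Map a name to its standard form; unknown names map to themselves."""
    key = name.strip().lower()
    return STANDARD_FORM.get(key, key)


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def link_records(records):
    """Link every pair of record ids whose reduced names are identical."""
    buckets = {}
    for rec_id, name in records:
        buckets.setdefault(reduce_name(name), []).append(rec_id)
    links = set()
    for ids in buckets.values():
        links.update(combinations(sorted(ids), 2))
    return links


def f_measure(predicted, truth):
    """Harmonic mean of precision and recall over sets of linked pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative records: (record id, first name).
records = [(1, "Jan"), (2, "Johannes"), (3, "Marie"),
           (4, "Mietje"), (5, "Maria"), (6, "Hannes")]
links = link_records(records)
```

Because matching is done on the reduced form rather than on string similarity, a pair such as "jan" and "johannes" is linked even though its edit distance is 5. Grouping by reduced name also avoids comparing every record against every other record, which is one plausible reason a standardization approach can be fast.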