{"title":"Data-driven name reduction for record linkage","authors":"M. Schraagen, W. Kosters","doi":"10.1109/INTECH.2012.6457783","journal":{"name":"Second International Conference on the Innovative Computing Technology (INTECH 2012)","volume":"35 1","pages":"0"},"publicationDate":"2012-09-01","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"2","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/INTECH.2012.6457783"}
Automatic record linkage of data containing personal names is difficult in the presence of name variation and spelling errors. This paper presents a standardization procedure for personal names to address the variation problem. A classification-tree-based model is constructed using a training set of 65,002 name-variant pairs. The method provides an efficient procedure for record linkage (3,500 records per second, F-measure 0.96 on a sample of Dutch historical civil records). The results include links with large edit distance between the records; however, recall is lower for this category. A bootstrapping procedure is used to improve recall.
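To make the linkage-by-standardization idea concrete, here is a minimal Python sketch. It is not the authors' method: the paper derives name reductions from a trained classification-tree model, whereas this sketch substitutes a small hand-written lookup table (`TOY_REDUCTIONS`) so that the linking and scoring steps can be shown in isolation. The names, record IDs, and gold links are invented toy data, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical variant -> standard-form table. In the paper this mapping is
# produced by a trained classification-tree model, which is NOT reproduced here.
TOY_REDUCTIONS = {
    "johannes": "jan",
    "joannes": "jan",
    "jan": "jan",
    "maria": "maria",
    "marija": "maria",
}


def reduce_name(name: str) -> str:
    """Map a name to its standard form; unknown names fall back to lowercase."""
    key = name.strip().lower()
    return TOY_REDUCTIONS.get(key, key)


def link_records(records):
    """Link every pair of records whose reduced names are identical."""
    buckets = defaultdict(list)
    for rec_id, name in records:
        buckets[reduce_name(name)].append(rec_id)
    links = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            links.add((a, b))
    return links


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over sets of record-ID pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration only).
records = [
    (1, "Johannes"), (2, "Jan"), (3, "Joannes"),
    (4, "Marija"), (5, "Maria"),
    (6, "Pieter"), (7, "Petrus"),
]
# Gold links include (6, 7), a variant pair the toy table does not cover.
gold = {(1, 2), (1, 3), (2, 3), (4, 5), (6, 7)}

predicted = link_records(records)
score = f_measure(predicted, gold)
```

Because records are grouped by their reduced name, linking is a single pass over the data followed by pair generation within each group. In this toy run, the pair missing from the table is never linked, so precision stays at 1.0 while recall drops to 0.8. This loosely mirrors the abstract's remark that recall is lower for variants that are hard to reduce to a common form.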