Linking individuals across historical sources: A fully automated approach*

Ran Abramitzky, R. Mill, Santiago Pérez

Journal: Historical Methods: A Journal of Quantitative and Interdisciplinary History
Published: 2018-02-01
DOI: 10.1080/01615440.2018.1543034 (https://doi.org/10.1080/01615440.2018.1543034)
Citations: 61

Abstract

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we suggest a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I (false positive) and type II (false negative) errors. The first step guides researchers in choosing which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that each pair of records corresponds to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates using our method are remarkably similar to those using IPUMS' method, which relies on hand linking to create a training sample. We created R code and a Stata command that implement this method.
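
To make the second step concrete, here is a minimal sketch of how EM can turn field-level agreement into a match probability for each candidate pair. It is not the authors' R or Stata implementation. It assumes a simplified two-class mixture in which every field (for example first name, last name, age) is reduced to a 0/1 agreement indicator and fields are independent given match status. The function names, the starting values, and the 0.9 cutoff are all hypothetical choices made for illustration.

```python
import numpy as np


def posterior_match(agree, p_match, m, u):
    """P(match | agreement pattern) for each candidate pair, by Bayes' rule."""
    like_m = p_match * np.prod(m ** agree * (1 - m) ** (1 - agree), axis=1)
    like_u = (1 - p_match) * np.prod(u ** agree * (1 - u) ** (1 - agree), axis=1)
    return like_m / (like_m + like_u)


def em_match_probabilities(agree, n_iter=200, eps=1e-6):
    """Fit a two-class mixture by EM on a (n_pairs, n_fields) 0/1 array.

    m[k] = P(field k agrees | pair is a match)
    u[k] = P(field k agrees | pair is not a match)
    """
    agree = np.asarray(agree, dtype=float)
    n_fields = agree.shape[1]
    # Starting values are assumptions: matches are rare and usually agree.
    p_match = 0.1
    m = np.full(n_fields, 0.9)
    u = np.full(n_fields, 0.1)
    for _ in range(n_iter):
        # E-step: posterior match probability under the current parameters.
        post = posterior_match(agree, p_match, m, u)
        # M-step: posterior-weighted shares, clipped away from 0 and 1.
        p_match = np.clip(post.mean(), eps, 1 - eps)
        m = np.clip(post @ agree / post.sum(), eps, 1 - eps)
        u = np.clip((1 - post) @ agree / (1 - post).sum(), eps, 1 - eps)
    return posterior_match(agree, p_match, m, u), p_match, m, u


# Toy agreement patterns (made up for illustration, not census data).
toy_pairs = np.array(
    [[1, 1, 1]] * 30 + [[1, 1, 0]] * 3 + [[1, 0, 1]] * 3 + [[0, 1, 1]] * 3
    + [[1, 0, 0]] * 10 + [[0, 1, 0]] * 10 + [[0, 0, 1]] * 10 + [[0, 0, 0]] * 150
)
toy_post, toy_p, toy_m, toy_u = em_match_probabilities(toy_pairs)
toy_keep = toy_post >= 0.9  # hypothetical cutoff for the third step
```

Raising the cutoff keeps fewer links and reduces false positives at the cost of more false negatives, which is the trade-off the abstract describes as the frontier between type I and type II errors.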