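Examining the role of training data for supervised methods of automated record linkage: Lessons for best practice in economic history

IF 2.6 · Zone 1 (History) · JCR Q1 (Economics)
James J Feigenbaum, Jonas Helgertz, Joseph Price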
Explorations in Economic History, Volume 96, Article 101656. Published 2025-01-30. DOI: 10.1016/j.eeh.2025.101656
https://www.sciencedirect.com/science/article/pii/S0014498325000038
Citations: 0

Abstract

During the past decade, scholars have produced a vast amount of research using linked historical individual-level data, shaping and changing our understanding of the past. This linked data revolution has been powered by methodological and computational advances, partly focused on supervised machine-learning methods that rely on training data. The importance of obtaining high-quality training data for the performance of the record linkage algorithm largely, however, remains unknown. This paper comprehensively examines the role of training data, and—by extension—improves our understanding of best practices in supervised methods of probabilistic record linkage. First, we compare the speed and costs of building training data using different methods. Second, we document high rates of conditional accuracy across the training data sets, rates that are especially high when built with access to more information. Third, we show that data constructed by record linking algorithms learning from different training-data-generation methods do not substantially differ in their accuracy, either overall or across demographic groups, though algorithms tend to perform best when their feature space aligns with the features used to build the training data. Lastly, we introduce errors in the training data and find that the examined record linking algorithms are remarkably capable of making accurate links even working with flawed training data.
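
The last finding, that linking algorithms stay accurate when trained on flawed labels, can be illustrated with a toy sketch. This is not the authors' method or data: the similarity distributions, the one-feature threshold classifier, and every function name below are assumptions chosen only to show the shape of such an experiment (train on labels with injected errors, evaluate on clean labels).

```python
import random


def make_pairs(n, rng):
    """Synthetic candidate pairs as (name_similarity, is_true_match).

    Assumed toy distributions: true matches score high, non-matches lower,
    with a small overlap between 0.80 and 0.85.
    """
    pairs = []
    for _ in range(n):
        is_match = rng.random() < 0.5
        sim = rng.uniform(0.80, 1.00) if is_match else rng.uniform(0.30, 0.85)
        pairs.append((sim, is_match))
    return pairs


def flip_labels(pairs, error_rate, rng):
    """Return a copy in which each label is flipped with probability error_rate."""
    return [(sim, (not y) if rng.random() < error_rate else y) for sim, y in pairs]


def fit_threshold(pairs):
    """Supervised step: pick the similarity cutoff with the best training accuracy."""
    best_cut, best_correct = 0.0, -1
    for cut in sorted({sim for sim, _ in pairs}):
        correct = sum((sim >= cut) == y for sim, y in pairs)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


def accuracy(cut, pairs):
    """Share of pairs whose predicted link status (sim >= cut) equals the label."""
    return sum((sim >= cut) == y for sim, y in pairs) / len(pairs)


rng = random.Random(0)
train = make_pairs(500, rng)
test = make_pairs(2000, rng)  # evaluation always uses clean labels

for rate in (0.0, 0.1, 0.2):
    noisy_train = flip_labels(train, rate, rng)
    cut = fit_threshold(noisy_train)
    print(f"label error rate {rate:.0%}: cutoff={cut:.3f}, "
          f"clean test accuracy={accuracy(cut, test):.3f}")
```

In this sketch the label errors are symmetric and random, so they shrink the accuracy gap between cutoffs without changing which cutoff is best in expectation. Systematic labeling errors would be a different case, and the toy example says nothing about them.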
Source journal
CiteScore: 2.50
Self-citation rate: 8.70%
Articles published: 27
About the journal: Explorations in Economic History provides broad coverage of the application of economic analysis to historical episodes. The journal has a tradition of innovative applications of theory and quantitative techniques, and it explores all aspects of economic change, all historical periods, all geographical locations, and all political and social systems. The journal includes papers by economists, economic historians, demographers, geographers, and sociologists. Explorations in Economic History is the only journal where you will find "Essays in Exploration." This unique department alerts economic historians to the potential in a new area of research, surveying the recent literature and then identifying the most promising issues to pursue.