Robust regression using probabilistically linked data

IF 5.4 | Zone 2 (Mathematics) | Q1 STATISTICS & PROBABILITY

Wiley Interdisciplinary Reviews-Computational Statistics Pub Date : 2022-07-07 DOI:10.1002/wics.1596

R. Chambers, E. Fabrizi, M. Ranalli, N. Salvati, Suojin Wang


Citations: 1

Abstract

There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic, based on how likely a linking decision corresponds to a correct match, that is, it brings together the same individual in all sources. High quality linking will ensure that the probability of this happening is high. Analysis of the linked data should take account of this additional source of error when this is not the case. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modeling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.




Source journal

Wiley Interdisciplinary Reviews-Computational Statistics (STATISTICS & PROBABILITY)

CiteScore

6.20

Self-citation rate

0.00%

Articles published