Variable selection for latent class analysis in the presence of missing data with application to record linkage

IF 1.6 · CAS Zone 3 (Medicine) · JCR Q3, Health Care Sciences & Services

Statistical Methods in Medical Research · Pub Date: 2024-04-09 · DOI: 10.1177/09622802241242317

Huiping Xu, Xiaochun Li, Zuoyi Zhang, Shaun Grannis


Citations: 0


Variable selection for latent class analysis in the presence of missing data with application to record linkage

The Fellegi-Sunter model is a latent class model widely used in probabilistic linkage to identify records that belong to the same entity. Record linkage practitioners typically employ all available matching fields in the model with the premise that more fields convey greater information about the true match status and hence result in improved match performance. In the context of model-based clustering, it is well known that such a premise is incorrect and the inclusion of noisy variables could compromise the clustering. Variable selection procedures have therefore been developed to remove noisy variables. Although these procedures have the potential to improve record matching, they cannot be applied directly due to the ubiquity of the missing data in record linkage applications. In this paper, we modify the stepwise variable selection procedure proposed by Fop, Smart, and Murphy and extend it to account for missing data common in record linkage. Through simulation studies, our proposed method is shown to select the correct set of matching fields across various settings, leading to better-performing algorithms. The improved match performance is also seen in a real-world application. We therefore recommend the use of our proposed selection procedure to identify informative matching fields for probabilistic record linkage algorithms.
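To make the modelling idea concrete, the following is a minimal sketch of how a two-class Fellegi-Sunter latent class model can be fitted by EM when some field comparisons are missing. It is not the authors' implementation and does not perform their variable selection. It assumes binary agree/disagree comparisons, conditional independence of fields given match status, and that missing comparisons can simply be skipped in the likelihood (an ignorability assumption). The function name `fit_fellegi_sunter` and all parameter names are hypothetical.

```python
import numpy as np


def _e_step(X, obs, p, m, u):
    """Posterior match probabilities and log-likelihood, observed fields only."""
    log_m = np.log(p) + (obs * (X * np.log(m) + (1 - X) * np.log1p(-m))).sum(axis=1)
    log_u = np.log1p(-p) + (obs * (X * np.log(u) + (1 - X) * np.log1p(-u))).sum(axis=1)
    log_mix = np.logaddexp(log_m, log_u)
    return np.exp(log_m - log_mix), log_mix.sum()


def fit_fellegi_sunter(G, n_iter=500, tol=1e-8, p_init=0.1, eps=1e-6):
    """EM for a two-class latent class model on agreement vectors.

    G: (n_pairs, n_fields) array; 1 = agree, 0 = disagree, NaN = missing.
    Returns a dict with match prevalence p, m- and u-probabilities,
    posterior match probabilities w, and the observed-data log-likelihood.
    """
    G = np.asarray(G, dtype=float)
    obs = ~np.isnan(G)               # True where the comparison was observed
    X = np.where(obs, G, 0.0)        # placeholder 0 for NaN; masked out by obs
    k = G.shape[1]
    p = p_init                       # P(pair is a true match)
    m = np.full(k, 0.9)              # P(field agrees | match)
    u = np.full(k, 0.1)              # P(field agrees | non-match)
    prev = -np.inf
    for _ in range(n_iter):
        w, loglik = _e_step(X, obs, p, m, u)
        if loglik - prev < tol:      # stop once the likelihood stops improving
            break
        prev = loglik
        # M-step: weighted agreement rates over observed comparisons only
        wm = w[:, None] * obs
        wu = (1 - w)[:, None] * obs
        p = float(np.clip(w.mean(), eps, 1 - eps))
        m = np.clip((wm * X).sum(axis=0) / np.maximum(wm.sum(axis=0), eps), eps, 1 - eps)
        u = np.clip((wu * X).sum(axis=0) / np.maximum(wu.sum(axis=0), eps), eps, 1 - eps)
    w, loglik = _e_step(X, obs, p, m, u)  # make w consistent with final parameters
    return {"p": p, "m": m, "u": u, "w": w, "loglik": loglik}
```

Starting the m-probabilities high and the u-probabilities low is a simple way to keep the "match" label attached to the high-agreement class. A field whose fitted m- and u-probabilities are nearly equal carries little information about match status, which is the kind of noisy field a selection procedure would aim to drop.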


Source journal

Statistical Methods in Medical Research (Medicine: Mathematical & Computational Biology)

CiteScore: 4.10

Self-citation rate: 4.30%

Articles published: 127

Review time: >12 weeks

About the journal: Statistical Methods in Medical Research is a peer-reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE).