Convergence Diagnostics for Entity Resolution

IF 7.4 1区 数学 Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS
Serge Aleshin-Guendel, Rebecca C. Steorts
{"title":"Convergence Diagnostics for Entity Resolution","authors":"Serge Aleshin-Guendel, Rebecca C. Steorts","doi":"10.1146/annurev-statistics-040522-114848","DOIUrl":null,"url":null,"abstract":"Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.","PeriodicalId":48855,"journal":{"name":"Annual Review of Statistics and Its Application","volume":"4 1","pages":""},"PeriodicalIF":7.4000,"publicationDate":"2024-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annual Review of Statistics and Its Application","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1146/annurev-statistics-040522-114848","RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.
实体解析的会聚诊断
实体解析是合并和删除来自多个数据源的重复记录的过程,通常缺乏唯一标识符。实体解析的贝叶斯模型允许我们在重要应用中包含先验信息、量化不确定性,并直接估计记录的分区。马尔科夫链蒙特卡罗(MCMC)采样是在这种情况下进行近似后验推断的主要计算方法,但由于分区空间的维度很高,目前还没有公认的标准来诊断 MCMC 采样的不收敛性。在本文中,我们将回顾贝叶斯实体解析,重点讨论它对马尔可夫链收敛性提出的具体挑战。我们回顾了先前的收敛性诊断方法,讨论了它们的弱点。我们为使用 MCMC 采样进行贝叶斯实体解析提供了建议,重点是使用应用贝叶斯统计中常见的现代诊断方法。通过模拟数据,我们发现常用的 Gibbs 采样器与两种替代方法相比表现不佳。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Annual Review of Statistics and Its Application
Annual Review of Statistics and Its Application MATHEMATICS, INTERDISCIPLINARY APPLICATIONS-STATISTICS & PROBABILITY
CiteScore
13.40
自引率
1.30%
发文量
29
期刊介绍: The Annual Review of Statistics and Its Application publishes comprehensive review articles focusing on methodological advancements in statistics and the utilization of computational tools facilitating these advancements. It is abstracted and indexed in Scopus, Science Citation Index Expanded, and Inspec.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信