Learning with Sparsely Permuted Data: A Robust Bayesian Approach

Abhisek Chakraborty, Saptati Datta
arXiv - STAT - Statistics Theory. Published 2024-09-16. DOI: arxiv-2409.10678 (https://doi.org/arxiv-2409.10678)

Abstract

Data dispersed across multiple files are commonly integrated through probabilistic linkage methods, where even small error rates in record matching can significantly contaminate subsequent statistical analyses. In regression problems, we examine scenarios where the identifiers of predictors or responses are subject to an unknown permutation, which violates the assumption that each response is paired with its own predictors. Many emerging approaches in the literature focus on sparsely permuted data, where only a small subset of pairs ($k \ll n$) is affected by the permutation, and treat the permuted entries as outliers in order to restore the original correspondence and obtain consistent estimates of the regression parameters. In this article, we complement the existing literature by introducing a novel generalized robust Bayesian formulation of the problem. We develop an efficient posterior sampling scheme by adapting the fractional posterior framework and addressing key computational bottlenecks via careful use of discrete optimal transport and sampling in the space of binary matrices with fixed margins. Further, we establish new posterior contraction results within this framework, providing theoretical guarantees for our approach. The utility of the proposed framework is demonstrated via extensive numerical experiments.
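To make the sparsely permuted setting concrete, the following is a minimal illustrative sketch, not the authors' Bayesian procedure. It assumes a one-predictor linear model without intercept, mismatches exactly k of n responses, and compares a naive least-squares slope with a simple trimmed refit that treats the k largest residuals as outliers. The function names (`simulate`, `ols_slope`, `trimmed_slope`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate(n=200, k=20, beta=2.0, sigma=0.1, seed=0):
    """Draw y = beta * x + noise, then mismatch exactly k of the n responses."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + sigma * rng.normal(size=n)
    perm = np.arange(n)
    moved = rng.choice(n, size=k, replace=False)
    # Cyclic shift among the chosen indices, so each of them is mismatched.
    perm[moved] = np.roll(moved, 1)
    return x, y[perm], perm

def ols_slope(x, y):
    """Least-squares slope for a no-intercept model."""
    return float(x @ y / (x @ x))

def trimmed_slope(x, y, k):
    """Refit after dropping the k largest absolute residuals of the naive fit.

    This is a crude stand-in for 'treating permuted entries as outliers';
    it is not the fractional-posterior sampler described in the abstract.
    """
    resid = np.abs(y - ols_slope(x, y) * x)
    keep = np.argsort(resid)[: len(x) - k]
    return ols_slope(x[keep], y[keep])

true_beta = 2.0
x, y_obs, perm = simulate(beta=true_beta)
naive = ols_slope(x, y_obs)
robust = trimmed_slope(x, y_obs, k=20)
print(f"naive slope:   {naive:.3f}")
print(f"trimmed slope: {robust:.3f}")
```

Under these assumptions the naive slope is attenuated toward zero, because mismatched responses are unrelated to their paired predictors, while the trimmed refit discards most of the mismatched pairs. The sketch assumes k is known; the framework described in the abstract instead addresses the unknown permutation through posterior sampling.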