Learning with Sparsely Permuted Data: A Robust Bayesian Approach
Abhisek Chakraborty, Saptati Datta
arXiv:2409.10678
arXiv - STAT - Statistics Theory, published 2024-09-16
Citations: 0
Abstract
Data dispersed across multiple files are commonly integrated through
probabilistic linkage methods, where even minimal error rates in record
matching can significantly contaminate subsequent statistical analyses. In
regression problems, we examine scenarios where the identifiers of predictors
or responses are subject to an unknown permutation, which breaks the assumed
correspondence between them. Many emerging approaches in the literature focus on sparsely
permuted data, where only a small subset of pairs ($k \ll n$) are affected by
the permutation, treating these permuted entries as outliers to restore
original correspondence and obtain consistent estimates of regression
parameters. In this article, we complement the existing literature by
introducing a novel generalized robust Bayesian formulation of the problem. We
develop an efficient posterior sampling scheme by adapting the fractional
posterior framework and addressing key computational bottlenecks via careful
use of discrete optimal transport and sampling in the space of binary matrices
with fixed margins. Further, we establish new posterior contraction results
within this framework, providing theoretical guarantees for our approach. The
utility of the proposed framework is demonstrated via extensive numerical
experiments.
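
To make the sparsely permuted setting concrete, the following is a minimal simulation sketch, not the authors' Bayesian procedure. It permutes the responses of $k$ out of $n$ rows and compares ordinary least squares with a generic robust fit that treats the mismatched pairs as outliers. The Huber M-estimator is swapped in here purely for illustration; the function names (`ols`, `huber_irls`), the sample sizes, the noise level, and the tuning constant `c` are all assumptions and do not come from the paper.

```python
import numpy as np


def ols(X, y, w=None):
    """Weighted least squares; w=None gives ordinary least squares."""
    if w is None:
        w = np.ones(len(y))
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares (illustrative)."""
    beta = ols(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust residual scale from the median absolute deviation (MAD).
        scale = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-12)
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, c/u for large ones.
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))
        beta = ols(X, y, w)
    return beta


# Hypothetical simulation settings (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n, p, k = 500, 3, 50
beta_true = np.array([3.0, -2.0, 1.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Sparse permutation: cyclically shift the responses of k randomly chosen rows,
# so exactly k (x, y) pairs are mismatched and the other n - k stay intact.
idx = rng.choice(n, size=k, replace=False)
y_obs = y.copy()
y_obs[idx] = y[np.roll(idx, 1)]

beta_naive = ols(X, y_obs)
beta_robust = huber_irls(X, y_obs)
err_naive = np.linalg.norm(beta_naive - beta_true)
err_robust = np.linalg.norm(beta_robust - beta_true)
print("naive OLS error: ", err_naive)
print("robust fit error:", err_robust)
```

In this sketch the mismatched rows pull the naive estimate toward zero, while the downweighting step limits their influence. The approach described in the abstract goes further by placing a posterior over the unknown correspondence itself, which this example does not attempt.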