Distributed Bayesian Inference in Massive Spatial Data

Impact factor 3.9; Zone 1 (Mathematics); JCR Q1, Statistics & Probability

Statistical Science Pub Date : 2023-01-01 DOI:10.1214/22-sts868

Rajarshi Guhaniyogi, Cheng Li, T. Savitsky, Sanvesh Srivastava


Citations: 1

Abstract


Distributed Bayesian Inference in Massive Spatial Data

Gaussian process (GP) regression is computationally expensive in spatial applications involving massive data. Various methods address this limitation, including a small number of Bayesian methods based on distributed computations (or the divide-and-conquer strategy). Focusing on the latter literature, we achieve three main goals. First, we develop an extensible Bayesian framework for distributed spatial GP regression that embeds many popular methods. The proposed framework has three steps: partition the entire data into many subsets, apply a readily available Bayesian spatial process model in parallel on all the subsets, and combine the posterior distributions estimated on all the subsets into a pseudo posterior distribution that conditions on the entire data. The combined pseudo posterior distribution replaces the full data posterior distribution in prediction and inference problems. Demonstrating our framework’s generality, we extend posterior computations for (non-distributed) spatial process models with stationary full-rank and nonstationary low-rank GP priors to the distributed setting. Second, we contrast the empirical performance of popular distributed approaches with some widely used non-distributed alternatives and highlight their relative advantages and shortcomings. Third, we provide theoretical support for our numerical observations and show that the Bayes L2-risks of the combined posterior distributions obtained from a subclass of the divide-and-conquer methods achieve the near-optimal convergence rate in estimating the true spatial surface with various types of covariance functions. Additionally, we provide upper bounds on the number of subsets to achieve these near-optimal rates.
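The three steps named in the abstract (partition, fit on each subset, combine) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the squared-exponential kernel with fixed hyperparameters, the random partition, the closed-form subset posteriors, and the combination rule, which averages subset posterior means and standard deviations pointwise (the equal-weight 2-Wasserstein barycenter of one-dimensional Gaussians). The function names `sq_exp_kernel`, `gp_posterior`, and `distributed_gp` are hypothetical.

```python
import numpy as np


def sq_exp_kernel(a, b, length_scale=0.3, variance=1.0):
    """Squared-exponential covariance between location sets a (n, d) and b (m, d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_test, noise_var=0.01,
                 length_scale=0.3, variance=1.0):
    """Closed-form GP posterior mean and pointwise std of the surface at x_test.

    Hyperparameters are held fixed (an assumption); a full Bayesian
    treatment would sample them as well.
    """
    k_tt = sq_exp_kernel(x_train, x_train, length_scale, variance)
    k_tt += noise_var * np.eye(len(x_train))
    k_ts = sq_exp_kernel(x_train, x_test, length_scale, variance)
    chol = np.linalg.cholesky(k_tt)  # lower-triangular factor of k_tt
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_ts)
    mean = k_ts.T @ alpha
    var = variance - (v**2).sum(axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))


def distributed_gp(x, y, x_test, n_subsets, seed=0):
    """Divide-and-conquer GP prediction: partition, fit per subset, combine."""
    rng = np.random.default_rng(seed)
    # Step 1: random partition of the observation indices.
    parts = np.array_split(rng.permutation(len(x)), n_subsets)
    # Step 2: independent subset fits (a serial loop here; each fit could
    # run on its own worker because no subset depends on another).
    fits = [gp_posterior(x[idx], y[idx], x_test) for idx in parts]
    means = np.stack([m for m, _ in fits])
    stds = np.stack([s for _, s in fits])
    # Step 3: combine pointwise Gaussian posteriors by averaging means and
    # standard deviations (illustrative rule, not the paper's).
    return means.mean(axis=0), stds.mean(axis=0)
```

Each subset fit costs a Cholesky factorization of a matrix whose size is the subset size rather than the full sample size, which is where the computational saving comes from. This sketch applies no adjustment for the smaller subset sample size, so its combined uncertainty stays at the subset scale; treat it as a structural illustration only.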


Source journal

Statistical Science (Mathematics: Statistics & Probability)

CiteScore: 6.50

Self-citation rate: 1.80%

Articles published:

Review time: >12 weeks

About the journal: The central purpose of Statistical Science is to convey the richness, breadth and unity of the field by presenting the full range of contemporary statistical thought at a moderate technical level, accessible to the wide community of practitioners, researchers and students of statistics and probability.