Zeyu Ding, Simon Omlor, Katja Ickstadt, Alexander Munteanu
Advances in Data Analysis and Classification. Published 2024-07-04. DOI: 10.1007/s11634-024-00599-1 (https://doi.org/10.1007/s11634-024-00599-1)
Scalable Bayesian p-generalized probit and logistic regression
The logit and probit link functions are arguably the two most common choices for binary regression models. Many studies have extended the choice of link functions to avoid possible misspecification and to improve the model fit to the data. We introduce the p-generalized Gaussian distribution (p-GGD) to binary regression in a Bayesian framework. The p-GGD has received considerable attention due to its flexibility in modeling the tails, while generalizing, for instance, over the standard normal distribution where \(p=2\) or the Laplace distribution where \(p=1\). Here, we extend from maximum likelihood estimation (MLE) to Bayesian posterior estimation using Markov Chain Monte Carlo (MCMC) sampling for the model parameters \(\beta\) and the link function parameter p. We use simulated and real-world data to verify the effect of different values of p on the estimation results, and to show how logistic regression and probit regression can be incorporated into a broader framework. To make our Bayesian methods scalable to large data, we also incorporate coresets to reduce the data before running the complex and time-consuming MCMC analysis. This allows us to perform very efficient calculations while retaining the original posterior parameter distributions up to small distortions, both in practice and with theoretical guarantees.
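To make the role of p concrete, the following is a minimal sketch of a p-GGD link and the resulting binary-regression likelihood. It is not the authors' implementation. It assumes one common parameterization of the p-GGD density, \(f_p(t) = \frac{p^{1-1/p}}{2\Gamma(1/p)} \exp(-|t|^p/p)\), which reduces to the standard normal at \(p=2\) and to the Laplace distribution at \(p=1\); the paper's exact normalization may differ. The function names `pggd_pdf`, `pggd_cdf`, and `neg_log_likelihood` are hypothetical, and the CDF is computed by simple numerical integration rather than a closed form.

```python
import math


def pggd_pdf(t, p):
    """p-GGD density in the parameterization assumed above (an assumption)."""
    norm = p ** (1.0 - 1.0 / p) / (2.0 * math.gamma(1.0 / p))
    return norm * math.exp(-abs(t) ** p / p)


def pggd_cdf(x, p, n=2000):
    """CDF via composite Simpson integration of the density on [0, |x|].

    n is the number of subintervals and must be even.
    """
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / n
    total = pggd_pdf(0.0, p) + pggd_pdf(a, p)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pggd_pdf(i * h, p)
    half_mass = total * h / 3.0  # probability mass between 0 and |x|
    # The density is symmetric, so the CDF is 1/2 plus or minus that mass.
    return 0.5 + math.copysign(half_mass, x)


def neg_log_likelihood(beta, X, y, p):
    """Negative log-likelihood of binary labels y under P(y=1|x) = F_p(x . beta)."""
    eps = 1e-12  # keeps log() finite when a probability hits 0 or 1
    nll = 0.0
    for row, label in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, row))
        prob = min(max(pggd_cdf(eta, p), eps), 1.0 - eps)
        nll -= math.log(prob) if label == 1 else math.log(1.0 - prob)
    return nll
```

In a Bayesian treatment, this negative log-likelihood would be combined with priors on \(\beta\) and p and passed to an MCMC sampler; on a coreset, each term would additionally carry a weight. Both steps are left out here because the abstract does not specify them.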
About the journal:
The international journal Advances in Data Analysis and Classification (ADAC) is designed as a forum for high-standard publications on research and applications concerning the extraction of knowable aspects from many types of data. It publishes articles on such topics as structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering, and pattern recognition methods; strategies for modeling complex data and mining large data sets; methods for the extraction of knowledge from data; and applications of advanced methods in specific domains of practice. Articles illustrate how new domain-specific knowledge can be made available from data by skillful use of data analysis methods. The journal also publishes survey papers that outline and illuminate the basic ideas and techniques of special approaches.