Inference in latent factor regression with clusterable features

Impact factor 1.7 · Zone 2 (Mathematics) · JCR Q2, STATISTICS & PROBABILITY

Bernoulli Pub Date : 2022-05-01 DOI:10.3150/21-bej1374

Xin Bing, F. Bunea, M. Wegkamp


Citations: 5

Abstract


Inference in latent factor regression with clusterable features

Regression models, in which the observed features $X \in \mathbb{R}^p$ and the response $Y \in \mathbb{R}$ depend, jointly, on a lower dimensional, unobserved, latent vector $Z \in \mathbb{R}^K$, with $K \ll p$, are popular in a large array of applications, and mainly used for predicting a response from correlated features. In contrast, methodology and theory for inference on the regression coefficient $\beta \in \mathbb{R}^K$ relating $Y$ to $Z$ are scarce, since typically the unobservable factor $Z$ is hard to interpret. Furthermore, the determination of the asymptotic variance of an estimator of $\beta$ is a long-standing problem, with solutions known only in a few particular cases. To address some of these outstanding questions, we develop inferential tools for $\beta$ in a class of factor regression models in which the observed features are signed mixtures of the latent factors. The model specifications are both practically desirable, in a large array of applications, render interpretability to the components of $Z$, and are sufficient for parameter identifiability. Without assuming that the number of latent factors $K$ or the structure of the mixture is known in advance, we construct computationally efficient estimators of $\beta$, along with estimators of other important model parameters. We benchmark the rate of convergence of $\beta$ by first establishing its $\ell_2$-norm minimax lower bound, and show that our proposed estimator $\hat{\beta}$ is minimax-rate adaptive. Our main contribution is the provision of a unified analysis of the component-wise Gaussian asymptotic distribution of $\hat{\beta}$ and, especially, the derivation of a closed form expression of its asymptotic variance, together with consistent variance estimators. The resulting inferential tools can be used when both $K$ and $p$ are independent of the sample size $n$, and also when both, or either, $p$ and $K$ vary with $n$, while allowing for $p > n$. This complements the only asymptotic normality results obtained for a particular case of the model under consideration, in the regime $K = O(1)$ and $p \to \infty$, but without a variance estimate. As an application, we provide, within our model specifications, a statistical platform for inference in regression on latent cluster centers, thereby increasing the scope of our theoretical results. We benchmark the newly developed methodology on a recently collected data set for the study of the effectiveness of a new SIV vaccine. Our analysis enables the determination of the top latent antibody-centric mechanisms associated with the vaccine response.
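
For orientation, the setting described in the abstract can be written schematically as below. This is only a sketch under stated assumptions: the loading matrix $A$, the noise terms $E$ and $\varepsilon$, and the standard error $\widehat{\mathrm{se}}$ (built from a consistent variance estimate) are notation introduced here for illustration and do not appear in the abstract, so the paper's exact model conditions and scaling may differ.

```latex
% Schematic latent factor regression (notation assumed for illustration).
% "Signed mixtures" is read here as: entries of A may be positive or negative.
\begin{align*}
  X &= A Z + E, & X &\in \mathbb{R}^p,\; Z \in \mathbb{R}^K,\; K \ll p, \\
  Y &= Z^\top \beta + \varepsilon, & \beta &\in \mathbb{R}^K .
\end{align*}
% Generic component-wise interval that asymptotic normality together with
% a consistent variance estimate would support, for k = 1, \dots, K:
\[
  \hat{\beta}_k \;\pm\; z_{1-\alpha/2}\, \widehat{\mathrm{se}}(\hat{\beta}_k)
\]
```

Here $z_{1-\alpha/2}$ denotes the standard normal quantile, so $\alpha = 0.05$ gives $z_{0.975} \approx 1.96$.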


Source journal

Bernoulli (Mathematics: Statistics & Probability)

CiteScore: 3.40
Self-citation rate: 0.00%
Articles published: 116
Review time: 6-12 weeks

About the journal: BERNOULLI is the journal of the Bernoulli Society for Mathematical Statistics and Probability, issued four times per year. The journal provides a comprehensive account of important developments in the fields of statistics and probability, offering an international forum for both theoretical and applied work. BERNOULLI will publish papers containing original and significant research contributions, with background, mathematical derivation and discussion of the results in suitable detail and, where appropriate, with discussion of interesting applications in relation to the methodology proposed. Papers of the following two types will also be considered for publication, provided they are judged to enhance the dissemination of research: review papers which provide an integrated critical survey of some area of probability and statistics and discuss important recent developments; and scholarly written papers on some historically significant aspect of statistics and probability.