{"title":"Inference in latent factor regression with clusterable features","authors":"Xin Bing, F. Bunea, M. Wegkamp","doi":"10.3150/21-bej1374","journal":{"name":"Bernoulli"},"publicationDate":"2022-05-01","publicationTypes":"Journal Article","ListUrlMain":"https://doi.org/10.3150/21-bej1374"}
Inference in latent factor regression with clusterable features
Regression models, in which the observed features X ∈ ℝ^p and the response Y ∈ ℝ depend, jointly, on a lower dimensional, unobserved, latent vector Z ∈ ℝ^K, with K ≪ p, are popular in a large array of applications, and mainly used for predicting a response from correlated features. In contrast, methodology and theory for inference on the regression coefficient β ∈ ℝ^K relating Y to Z are scarce, since typically the unobservable factor Z is hard to interpret. Furthermore, the determination of the asymptotic variance of an estimator of β is a long-standing problem, with solutions known only in a few particular cases. To address some of these outstanding questions, we develop inferential tools for β in a class of factor regression models in which the observed features are signed mixtures of the latent factors. The model specifications are practically desirable in a large array of applications, render the components of Z interpretable, and are sufficient for parameter identifiability. Without assuming that the number of latent factors K or the structure of the mixture is known in advance, we construct computationally efficient estimators of β, along with estimators of other important model parameters. We benchmark the rate of convergence of β by first establishing its ℓ₂-norm minimax lower bound, and show that our proposed estimator β̂ is minimax-rate adaptive. Our main contribution is the provision of a unified analysis of the component-wise Gaussian asymptotic distribution of β̂ and, especially, the derivation of a closed form expression of its asymptotic variance, together with consistent variance estimators. The resulting inferential tools can be used when both K and p are independent of the sample size n, and also when both, or either, p and K vary with n, while allowing for p > n.
This complements the only asymptotic normality results obtained for a particular case of the model under consideration, in the regime K = O(1) and p → ∞, but without a variance estimate. As an application, we provide, within our model specifications, a statistical platform for inference in regression on latent cluster centers, thereby increasing the scope of our theoretical results. We benchmark the newly developed methodology on a recently collected data set for the study of the effectiveness of a new SIV vaccine. Our analysis enables the determination of the top latent antibody-centric mechanisms associated with the vaccine response.
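To make the setting in the abstract concrete, the following is a minimal simulation sketch, not the authors' estimator. It assumes a linear form X = AZ + W and Y = Zᵀβ + ε, which the abstract does not spell out, and it additionally assumes that the loading matrix A is known and that Z has identity covariance. Under those assumptions Cov(X, Y) = Aβ, so β can be recovered by applying the pseudo-inverse of A to the sample cross-covariance. The names `simulate`, `moment_estimate`, `A`, `beta`, and `noise_sd` are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate(n, A, beta, noise_sd=0.5, seed=0):
    """Draw n samples from the assumed model X = A Z + W, Y = Z^T beta + eps."""
    rng = np.random.default_rng(seed)
    p, K = A.shape
    Z = rng.standard_normal((n, K))                        # latent factors, Cov(Z) = I (assumption)
    X = Z @ A.T + noise_sd * rng.standard_normal((n, p))   # observed features
    Y = Z @ beta + noise_sd * rng.standard_normal(n)       # response
    return X, Y


def moment_estimate(X, Y, A):
    """Oracle moment estimate of beta, valid only when A is known and Cov(Z) = I."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    cov_xy = Xc.T @ Yc / n               # sample version of Cov(X, Y) = A beta
    return np.linalg.pinv(A) @ cov_xy    # solve A beta = Cov(X, Y) in the least-squares sense


# Hypothetical loadings: p = 6 features, K = 2 factors.
# The first four rows load on a single factor with sign +1 or -1;
# the last two rows are signed mixtures of both factors.
A = np.array([
    [1.0, 0.0],
    [-1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.5, -0.5],
    [0.3, 0.7],
])
beta = np.array([1.0, -2.0])

X, Y = simulate(200_000, A, beta)
beta_hat = moment_estimate(X, Y, A)
print(beta_hat)
```

With a large sample the printed estimate should be close to `beta`. The paper's setting is harder than this sketch: K and the mixture structure are not assumed known, and the covariance of Z is not assumed to be the identity.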
About the journal:
BERNOULLI is the journal of the Bernoulli Society for Mathematical Statistics and Probability, issued four times per year. The journal provides a comprehensive account of important developments in the fields of statistics and probability, offering an international forum for both theoretical and applied work.
BERNOULLI will publish:
Papers containing original and significant research contributions: with background, mathematical derivation and discussion of the results in suitable detail and, where appropriate, with discussion of interesting applications in relation to the methodology proposed.
Papers of the following two types will also be considered for publication, provided they are judged to enhance the dissemination of research:
Review papers which provide an integrated critical survey of some area of probability and statistics and discuss important recent developments.
Scholarly written papers on some historically significant aspect of statistics and probability.