Valid post-selection inference in model-free linear regression

IF 3.2 · Zone 1 (Mathematics) · Q1 STATISTICS & PROBABILITY

Annals of Statistics · Pub Date: 2020-10-01 · DOI: 10.1214/19-AOS1917

Arun K. Kuchibhotla, L. Brown, A. Buja, Junhui Cai, E. George, Linda H. Zhao

Annals of Statistics, vol. 48, pp. 2953-2981. Journal Article. https://doi.org/10.1214/19-AOS1917

Citations: 20

Abstract

S.1. Simulations Continued. The simulation setting in this section is the same as in Section 9. We first describe the reason for using the null situation $\beta_0 = 0_p$ in the model. If $\beta_0$ is an arbitrary non-zero vector, then, for fixed covariates, the $X_i Y_i$ cannot be identically distributed, and hence only (asymptotically) conservative inference is possible. In simulations this conservativeness is confounded with the simultaneity, so that the coverage becomes close to 1 (if not exactly 1). In the main manuscript, we have shown plots comparing our method with Berk et al. (2013) and selective inference. We label our confidence region $\hat{R}^{\dagger}_{n,M}$ (12) as “UPoSI,” the projected confidence region $\hat{B}_{n,M}$ (28) as “UPoSIBox,” and Berk et al. (2013) as “PoSI.” Tables 1, 2, and 3 show exact numbers for the comparison of our method with Berk et al. (2013). Note that the size of each dot in the row plot of Figure 9 indicates the proportion of confidence regions of that volume among same-sized models. In Settings A and B, the confidence region volumes of same-sized models are the same. In Setting C, the volumes of the confidence regions of Berk and PoSI Box enlarge (hence smaller $\log(\mathrm{Vol})/|M|$) if the last covariate is included. Tables 4 and 5 show the numbers for the comparison of our method with selective inference when the selection procedure is forward stepwise and LARS, respectively. Sample splitting is a simple procedure that provides valid inference after selection, as discussed in Section 1.3. We stress here that it is valid only for independent observations and that the model selected in the first split half could be different from the one selected in the full data. The comparison results with $n = 1000$, $p = 500$ and the selection methods forward stepwise, LARS, and BIC are summarized in Figure S.1. For sample splitting we have used the Bonferroni correction to obtain simultaneous inference for all coefficients in a model. Table 6 shows the comparison of our method with sample splitting.
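The sample-splitting baseline with a Bonferroni correction, as described in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the selection step is a hypothetical marginal-screening rule (`select_top_k`) standing in for forward stepwise, LARS, or BIC; the standard errors are the classical homoskedastic OLS ones with normal quantiles; and all function names are invented for illustration. The data are simulated under the null situation $\beta_0 = 0_p$.

```python
import random
from statistics import NormalDist


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * q for a, q in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols_with_se(X, y):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    n, k = len(X), len(X[0])
    G = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    beta = solve(G, Xty)
    rss = sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    sigma2 = rss / (n - k)
    se = []
    for j in range(k):
        e_j = [1.0 if i == j else 0.0 for i in range(k)]
        se.append((sigma2 * solve(G, e_j)[j]) ** 0.5)  # j-th diagonal of G^{-1}
    return beta, se


def select_top_k(X, y, k):
    """Hypothetical selection rule: keep the k covariates with largest |x_j'y|."""
    p = len(X[0])
    score = [abs(sum(r[j] * yi for r, yi in zip(X, y))) for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: -score[j])[:k])


def split_and_infer(X, y, k, alpha=0.05, seed=0):
    """Select a model on one random half, then build Bonferroni intervals on the other."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    first, second = idx[:half], idx[half:]
    model = select_top_k([X[i] for i in first], [y[i] for i in first], k)
    X2 = [[X[i][j] for j in model] for i in second]
    y2 = [y[i] for i in second]
    beta, se = ols_with_se(X2, y2)
    # Bonferroni: split alpha across the |M| two-sided intervals.
    z = NormalDist().inv_cdf(1 - alpha / (2 * len(model)))
    return {j: (b - z * s, b + z * s) for j, b, s in zip(model, beta, se)}


# Null situation: the response is independent of all covariates.
rng = random.Random(1)
n, p = 200, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
intervals = split_and_infer(X, y, k=3)
for j, (lo, hi) in intervals.items():
    print(f"x{j}: [{lo:.3f}, {hi:.3f}]")
```

Because the model is chosen on the first half only, it may differ from the model that the same rule would select on the full data, which is the caveat the abstract stresses.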

Source journal

Annals of Statistics (Mathematics: Statistics & Probability)

CiteScore: 9.30

Self-citation rate: 8.90%

Articles published: 119

Review time: 6-12 weeks

About the journal: The Annals of Statistics aims to publish research papers of the highest quality reflecting the many facets of contemporary statistics. Primary emphasis is placed on importance and originality, not on formalism. The journal aims to cover all areas of statistics, especially mathematical statistics and applied and interdisciplinary statistics. Of course, many of the best papers will touch on more than one of these general areas, because the discipline of statistics has deep roots in mathematics and in substantive scientific fields.