Grouping predictors via network-wide metrics

arXiv - MATH - Statistics Theory | Pub Date: 2024-05-04 | DOI: arxiv-2405.02715

Brandon Woosuk Park, Anand N. Vidyashankar, Tucker S. McElroy



Abstract

When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, the identification of groups of predictors significantly associated with the response variable eases further downstream analysis and decision-making. This paper proposes a new data analysis methodology that utilizes the high-dimensional predictor space to construct an implicit network with weighted edges to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed that determines the correct group with probability tending to one as the sample size diverges to infinity. To this end, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability. The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data.
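The abstract does not spell out how the network or its metrics are defined, so the following is only a generic illustration of the idea of a weighted predictor network with a node-level summary and a bootstrap uncertainty estimate. It is not the authors' algorithm. The edge weight (absolute pairwise correlation), the metric (node strength), and the plain nonparametric bootstrap are all assumptions swapped in for illustration, as are the function names `correlation_network`, `node_strength`, and `bootstrap_se`; in particular, the bootstrap here is not the model-assisted procedure the paper develops.

```python
import numpy as np


def correlation_network(X):
    """Weighted adjacency matrix over predictors (assumed weighting).

    Edge weight = absolute Pearson correlation between two predictor
    columns; the diagonal is zeroed so nodes have no self-loops.
    """
    W = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(W, 0.0)
    return W


def node_strength(W):
    """Illustrative node-level metric: sum of edge weights at each node."""
    return W.sum(axis=1)


def bootstrap_se(X, n_boot=200, seed=0):
    """Plain nonparametric bootstrap standard error of node strength.

    Rows of X are resampled with replacement; this is a stand-in, not
    the model-assisted bootstrap described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    draws = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample observation indices
        draws[b] = node_strength(correlation_network(X[idx]))
    return draws.std(axis=0, ddof=1)


# Simulated data (hypothetical): predictors 0-2 share one latent factor,
# predictors 3-5 share another, so two groups should stand out.
rng = np.random.default_rng(42)
f1, f2 = rng.normal(size=(2, 200))
X_demo = np.column_stack(
    [f1 + 0.5 * rng.normal(size=200) for _ in range(3)]
    + [f2 + 0.5 * rng.normal(size=200) for _ in range(3)]
)
strength = node_strength(correlation_network(X_demo))
se = bootstrap_se(X_demo, n_boot=100)
for j, (s, e) in enumerate(zip(strength, se)):
    print(f"predictor {j}: strength={s:.3f}, bootstrap SE={e:.3f}")
```

In this sketch the bootstrap refits the whole network on every resample, which scales poorly as the number of predictors grows; the abstract indicates that reducing this kind of computational cost is one motivation for the paper's model-assisted procedure.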


