Grouping predictors via network-wide metrics

Brandon Woosuk Park, Anand N. Vidyashankar, Tucker S. McElroy
arXiv - MATH - Statistics Theory, published 2024-05-04. DOI: arxiv-2405.02715 (https://doi.org/arxiv-2405.02715)

Abstract

When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, identifying groups of predictors that are significantly associated with the response variable eases downstream analysis and decision-making. This paper proposes a new data analysis methodology that uses the high-dimensional predictor space to construct an implicit network with weighted edges in order to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed that determines the correct group with probability tending to one as the sample size diverges to infinity. To this end, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability. The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data.
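The abstract does not spell out the network construction, the metrics, or the bootstrap, so the following is only a rough illustration of the general idea, not the authors' algorithm. It assumes edge weights given by absolute pairwise correlations between predictors, a made-up network-wide metric (`supervised_strength`, a hypothetical name) that weights each node's edges by its neighbours' correlation with the response, and an ordinary nonparametric bootstrap in place of the paper's model-assisted one. All function names and the simulated data are assumptions for illustration.

```python
import numpy as np

def edge_weights(X):
    """Assumed edge weights: absolute pairwise correlations, zero diagonal."""
    W = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(W, 0.0)
    return W

def supervised_strength(X, y):
    """Hypothetical network-wide metric (not from the paper): for each node,
    the sum of its edge weights, each weighted by the neighbour's absolute
    correlation with the response."""
    W = edge_weights(X)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return W @ r

def bootstrap_se(X, y, n_boot=200, seed=0):
    """Plain nonparametric bootstrap standard errors of the metric.
    This resamples rows; it is NOT the model-assisted procedure."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    stats = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample observations with replacement
        stats[b] = supervised_strength(X[idx], y[idx])
    return stats.std(axis=0, ddof=1)

# Simulated data (an assumption): predictors 0-2 share a latent factor
# and drive the response; predictors 3-5 are pure noise.
rng = np.random.default_rng(1)
n, p = 200, 6
Z = rng.normal(size=(n, 1))
X = rng.normal(size=(n, p))
X[:, :3] += 2.0 * Z
y = X[:, :3].sum(axis=1) + rng.normal(size=n)

scores = supervised_strength(X, y)
se = bootstrap_se(X, y)
for j in range(p):
    print(f"predictor {j}: score={scores[j]:.3f}, bootstrap SE={se[j]:.3f}")
```

In this toy setting the three predictors tied to the response receive visibly larger scores than the noise predictors, and the bootstrap standard errors give a rough sense of how stable that separation is. A grouping rule could then threshold the scores, though how the paper defines its groups and thresholds is not stated in the abstract.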