Grouping predictors via network-wide metrics
Brandon Woosuk Park, Anand N. Vidyashankar, Tucker S. McElroy
arXiv - MATH - Statistics Theory, 2024-05-04. DOI: arxiv-2405.02715 (https://doi.org/arxiv-2405.02715)
When multitudes of features can plausibly be associated with a response, both
privacy considerations and model parsimony suggest grouping them to increase
the predictive power of a regression model. Specifically, the identification of
groups of predictors significantly associated with the response variable eases
further downstream analysis and decision-making. This paper proposes a new data
analysis methodology that utilizes the high-dimensional predictor space to
construct an implicit network with weighted edges to
identify significant associations between the response and the predictors.
Using a population model for groups of predictors defined via network-wide
metrics, a new supervised grouping algorithm is proposed to determine the
correct group, with probability tending to one as the sample size diverges to
infinity. To this end, we establish several theoretical properties of the
estimates of network-wide metrics. A novel model-assisted bootstrap procedure
that substantially decreases computational complexity is developed,
facilitating the assessment of uncertainty in the estimates of network-wide
metrics. The proposed methods account for several challenges that arise in the
high-dimensional data setting, including (i) a large number of predictors, (ii)
uncertainty regarding the true statistical model, and (iii) model selection
variability. The performance of the proposed methods is demonstrated through
numerical experiments, data from sports analytics, and breast cancer data.
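
To make the idea of a network-wide metric with bootstrap uncertainty concrete, the following is a minimal sketch and not the authors' method. It assumes edge weights given by absolute pairwise correlations between predictors, uses node strength (the sum of a node's edge weights) as the metric, and swaps in a plain nonparametric bootstrap in place of the model-assisted bootstrap the abstract describes. The sketch ignores the response, so it is unsupervised, unlike the proposed grouping algorithm. The function names `node_strength` and `bootstrap_se` and the synthetic data are hypothetical.

```python
import numpy as np


def node_strength(X):
    """Node strength in an absolute-correlation network (illustrative choice)."""
    W = np.abs(np.corrcoef(X, rowvar=False))  # assumed edge weights |corr(x_j, x_k)|
    np.fill_diagonal(W, 0.0)                  # drop self-loops
    return W.sum(axis=1)                      # sum of edge weights at each node


def bootstrap_se(X, n_boot=200, seed=0):
    """Standard error of node strength from a plain nonparametric bootstrap."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    draws = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        draws[b] = node_strength(X[idx])
    return draws.std(axis=0, ddof=1)


# Synthetic data: three predictors share a latent factor, three are pure noise.
rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)
correlated = [z + 0.3 * rng.normal(size=n) for _ in range(3)]
noise = [rng.normal(size=n) for _ in range(3)]
X = np.column_stack(correlated + noise)

strength = node_strength(X)
se = bootstrap_se(X)
for j, (s, e) in enumerate(zip(strength, se)):
    print(f"predictor {j}: strength={s:.3f}, bootstrap SE={e:.3f}")
```

In this toy setting the three predictors that share a latent factor have much larger strength than the noise predictors, which is the kind of separation a grouping rule based on a network-wide metric could exploit.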