{"title":"Generic Feature Selection with Short Fat Data.","authors":"B Clarke, J-H Chu","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Consider a regression problem in which there are many more explanatory variables than data points, <i>i.e</i>., <i>p</i> ≫ <i>n</i>. Essentially, without reducing the number of variables inference is impossible. So, we group the <i>p</i> explanatory variables into blocks by clustering, evaluate statistics on the blocks and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of <i>n</i>, <i>p</i>, classes of statistics, clustering algorithms, penalty terms, and data types. When <i>n</i> is not large, the discrimination over number of statistics is weak, but computations suggest regressing on approximately [<i>n</i>/<i>K</i>] statistics where <i>K</i> is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an <i>L<sup>q</sup></i> norm with high enough <i>q</i>.</p>","PeriodicalId":89431,"journal":{"name":"Journal of the Indian Society of Agricultural Statistics. Indian Society of Agricultural Statistics","volume":"68 2","pages":"145-162"},"PeriodicalIF":0.0000,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4208697/pdf/nihms619926.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Indian Society of Agricultural Statistics. Indian Society of Agricultural Statistics","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, inference is impossible without reducing the number of variables. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks, and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over the number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics, where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an L^q norm with high enough q.
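A minimal sketch of the cluster-then-penalize pipeline the abstract describes, assuming scikit-learn. The specific choices here (k-means clustering of the variables, the block mean as the statistic, one statistic per block, an L^1 penalty via Lasso, and the synthetic data) are illustrative stand-ins, not the authors' exact configuration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, K = 50, 2000, 10           # short fat data: p >> n

X = rng.normal(size=(n, p))      # explanatory variables
y = rng.normal(size=n)           # response (placeholder data)

# Step 1: cluster the p variables (columns of X) into K blocks.
# Transposing makes each variable a point in n-dimensional space.
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X.T)

# Step 2: evaluate a statistic on each block -- here, the block mean.
Z = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(K)])

# Step 3: regress y on the block statistics under a penalized error
# criterion (an L^1 penalty in this sketch; alpha is arbitrary).
model = Lasso(alpha=0.1).fit(Z, y)
print(model.coef_)

The paper compares many variants of each step (different clustering algorithms, classes of statistics, and L^q penalties); this sketch fixes one choice at each step only to make the data flow concrete.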