{"title":"Generic Feature Selection with Short Fat Data.","authors":"B Clarke, J-H Chu","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Consider a regression problem in which there are many more explanatory variables than data points, <i>i.e</i>., <i>p</i> ≫ <i>n</i>. Essentially, without reducing the number of variables inference is impossible. So, we group the <i>p</i> explanatory variables into blocks by clustering, evaluate statistics on the blocks and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of <i>n</i>, <i>p</i>, classes of statistics, clustering algorithms, penalty terms, and data types. When <i>n</i> is not large, the discrimination over number of statistics is weak, but computations suggest regressing on approximately [<i>n</i>/<i>K</i>] statistics where <i>K</i> is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an <i>L<sup>q</sup></i> norm with high enough <i>q</i>.</p>","PeriodicalId":89431,"journal":{"name":"Journal of the Indian Society of Agricultural Statistics. Indian Society of Agricultural Statistics","volume":"68 2","pages":"145-162"},"PeriodicalIF":0.0000,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4208697/pdf/nihms619926.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Indian Society of Agricultural Statistics. Indian Society of Agricultural Statistics","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, inference is impossible without reducing the number of variables. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks, and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over the number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics, where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an L^q norm with high enough q.
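A minimal sketch of the cluster-then-penalize pipeline the abstract describes, assuming scikit-learn. The specific choices here (k-means clustering of the variables, the block mean as the statistic, one statistic per block, an L^1 penalty via Lasso, and the synthetic data) are illustrative stand-ins, not the authors' exact configuration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, K = 50, 2000, 10           # short fat data: p >> n

X = rng.normal(size=(n, p))      # explanatory variables
y = rng.normal(size=n)           # response (placeholder data)

# Step 1: cluster the p variables (columns of X) into K blocks.
# Transposing makes each variable a point in n-dimensional space.
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X.T)

# Step 2: evaluate a statistic on each block -- here, the block mean.
Z = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(K)])

# Step 3: regress y on the block statistics under a penalized error
# criterion (an L^1 penalty in this sketch; alpha is arbitrary).
model = Lasso(alpha=0.1).fit(Z, y)
print(model.coef_)

The paper compares many variants of each step (different clustering algorithms, classes of statistics, and L^q penalties); this sketch fixes one choice at each step only to make the data flow concrete.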