Generic Feature Selection with Short Fat Data.

B Clarke, J-H Chu
Journal of the Indian Society of Agricultural Statistics, 68(2): 145-162, 2014.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4208697/pdf/nihms619926.pdf
Citations: 0

Abstract

Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, without reducing the number of variables, inference is impossible. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks, and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics, where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an L^q norm with high enough q.
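The pipeline described in the abstract can be sketched in a few lines: cluster the p columns into K blocks, summarize each block by a statistic, then fit a penalized regression on the block statistics. The sketch below is illustrative only, not the authors' exact procedure: it uses k-means as the clustering algorithm, the block mean as the single statistic per block, and an L1 (Lasso) penalty; all data, parameter values, and these particular choices are assumptions for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# "Short fat" data: n data points, p >> n explanatory variables.
n, p, K = 50, 1000, 10

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:20] = 1.0                       # only a few variables truly matter
y = X @ beta + rng.normal(scale=0.5, size=n)

# Step 1: group the p variables into K blocks by clustering the columns.
# (The paper compares several clustering algorithms; k-means is one choice.)
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X.T)

# Step 2: evaluate a statistic on each block -- here, the block mean.
Z = np.column_stack([X[:, labels == k].mean(axis=1) for k in range(K)])

# Step 3: regress the response on the block statistics under a penalty
# (an L1 penalty here; the paper also considers other L^q norms).
model = Lasso(alpha=0.1).fit(Z, y)
```

This uses one statistic per block, so the regression has K predictors; the paper's recommendation of roughly [n/K] statistics concerns how many statistics to carry forward, which would generalize this step to multiple statistics per block.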

