Adaptive threshold-based classification of sparse high-dimensional data

IF 1.3 Zone 4 (Mathematics) Q3 STATISTICS & PROBABILITY

Electronic Journal of Statistics Pub Date : 2022-01-01 DOI:10.1214/22-ejs1998

T. Pavlenko, N. Stepanova, Lee Thompson


Citations: 0

Abstract

We revisit the problem of designing an efficient binary classifier in a challenging high-dimensional framework. The model under study assumes a local dependence structure among the feature variables, represented by a block-diagonal covariance matrix with a growing number of blocks of arbitrary but fixed size. The blocks correspond to non-overlapping, independent groups of strongly correlated features. To assess the relevance of a particular block in predicting the response, we introduce a measure of “signal strength” for each feature block. This measure is then used to specify the sparse model of interest. We further propose a threshold-based feature selector that operates as a screen-and-clean scheme integrated into a linear classifier: the data are subjected to screening and hard-threshold cleaning to filter out the blocks that contain no signal. Asymptotic properties of the proposed classifiers are studied in the regime where the sample size n depends on the number of feature blocks b and tends to infinity with b, but at a slower rate than b. The new classifiers, which are fully adaptive to the unknown parameters of the model, are shown to perform asymptotically optimally in a large part of the classification region. A numerical study confirms the good analytical properties of the new classifiers, which compare favorably with an existing threshold-based procedure used in a similar context.
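The screen-and-clean idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not define the "signal strength" measure or the threshold, so the Hotelling-type block statistic, the default threshold `2 * log(b) + block_size`, and the names `fit_block_threshold_classifier` and `predict` are all assumptions made purely for illustration. The sketch assumes two classes sharing a block-diagonal covariance with equally sized, contiguous blocks.

```python
import numpy as np


def fit_block_threshold_classifier(X0, X1, block_size, threshold=None):
    """Illustrative block-wise hard-threshold linear classifier.

    X0, X1: training samples of class 0 and class 1, shape (n_k, p),
    with p = b * block_size and features grouped into contiguous blocks.
    The block statistic and default threshold are assumptions of this
    sketch, not the definitions used in the paper.
    """
    n0, p = X0.shape
    n1, _ = X1.shape
    if p % block_size != 0:
        raise ValueError("p must be a multiple of block_size")
    b = p // block_size
    if threshold is None:
        # Hypothetical default: grows like 2*log(b), the order of the
        # maximum of b chi-square statistics under no signal.
        threshold = 2.0 * np.log(b) + block_size

    m0 = X0.mean(axis=0)
    m1 = X1.mean(axis=0)
    diff = (m1 - m0).reshape(b, block_size)          # per-block mean difference
    mid = (0.5 * (m0 + m1)).reshape(b, block_size)   # per-block midpoint

    # Pooled covariance estimated separately inside each block: (b, s, s).
    R0 = (X0 - m0).reshape(n0, b, block_size)
    R1 = (X1 - m1).reshape(n1, b, block_size)
    S = (np.einsum("nbi,nbj->bij", R0, R0)
         + np.einsum("nbi,nbj->bij", R1, R1)) / (n0 + n1 - 2)

    # Per-block discriminant direction S_j^{-1} d_j.
    w = np.linalg.solve(S, diff[..., None])[..., 0]

    # Screening statistic: scaled Mahalanobis-type distance per block.
    stat = (n0 * n1 / (n0 + n1)) * np.einsum("bi,bi->b", diff, w)

    # Cleaning: hard threshold keeps only blocks that appear to carry signal.
    keep = stat > threshold
    return {"weights": w * keep[:, None], "mid": mid, "keep": keep, "stat": stat}


def predict(model, X):
    """Assign class 1 when the thresholded linear score is positive."""
    b, s = model["mid"].shape
    Z = X.reshape(X.shape[0], b, s) - model["mid"]
    score = np.einsum("nbi,bi->n", Z, model["weights"])
    return (score > 0).astype(int)
```

Blocks whose statistic falls below the threshold receive zero weight, so the resulting rule is a linear classifier that uses only the retained blocks. The sketch requires more training observations than the block size so that each within-block covariance estimate is invertible.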


Source journal

Electronic Journal of Statistics (STATISTICS & PROBABILITY)

CiteScore: 1.80

Self-citation rate: 9.10%

Articles published: 100

Review time: 3 months

Journal description: The Electronic Journal of Statistics (EJS) publishes research articles and short notes on theoretical, computational and applied statistics. The journal is open access. Articles are refereed and are held to the same standard as articles in other IMS journals. Articles become publicly available shortly after they are accepted.