Feature subset selection bias for classification learning

Proceedings of the 23rd International Conference on Machine Learning. Pub Date: 2006-06-25. DOI: 10.1145/1143844.1143951

Surendra K. Singhi, Huan Liu

Citations: 98

Abstract

Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset for both selection and learning can result in so-called feature subset selection bias. This bias can putatively exacerbate over-fitting and degrade classification performance. In current practice, however, separate datasets are seldom used for selection and learning, because dividing the training data into two datasets, one for feature selection and one for classifier learning, reduces the amount of data available for either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study, through various experiments, the factors that affect selection bias and how the bias impacts classification learning. This research endeavors to illustrate and explain why the bias may not harm classification as much as it is expected to harm regression.
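The dilemma described in the abstract can be made concrete with a small simulation. The sketch below is an illustration only and does not reproduce the paper's formalization or experiments. Everything in it is an assumption chosen for brevity: the synthetic Gaussian data, the mean-difference filter (`select_top_k`), the nearest-centroid classifier, and all parameter values. It compares a "shared" protocol, which selects features and fits the classifier on the same training set, with a "split" protocol, which selects on one half and fits on the other, and scores both on fresh test data.

```python
import numpy as np

def make_data(n, n_features, n_informative, shift, rng):
    """Balanced two-class Gaussian data; only the first n_informative features carry signal."""
    y = rng.permutation(np.repeat([0, 1], n // 2))
    X = rng.standard_normal((n, n_features))
    X[y == 1, :n_informative] += shift
    return X, y

def select_top_k(X, y, k):
    """Hypothetical filter: keep the k features with the largest absolute class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class whose training centroid is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def fit_and_score(X_sel, y_sel, X_fit, y_fit, X_test, y_test, k):
    """Select features on (X_sel, y_sel), fit on (X_fit, y_fit), return test accuracy."""
    feats = select_top_k(X_sel, y_sel, k)
    pred = nearest_centroid_predict(X_fit[:, feats], y_fit, X_test[:, feats])
    return float((pred == y_test).mean())

def compare_protocols(n_train=60, n_test=2000, n_features=200,
                      n_informative=5, shift=1.0, k=5, n_repeats=5, seed=0):
    """Average test accuracy of the shared and split protocols over several random datasets."""
    rng = np.random.default_rng(seed)
    shared, split = [], []
    for _ in range(n_repeats):
        X, y = make_data(n_train, n_features, n_informative, shift, rng)
        X_te, y_te = make_data(n_test, n_features, n_informative, shift, rng)
        half = n_train // 2
        # Shared: the same training samples drive both selection and learning.
        shared.append(fit_and_score(X, y, X, y, X_te, y_te, k))
        # Split: disjoint halves, so each task sees only half of the data.
        split.append(fit_and_score(X[:half], y[:half], X[half:], y[half:], X_te, y_te, k))
    return float(np.mean(shared)), float(np.mean(split))

if __name__ == "__main__":
    shared_acc, split_acc = compare_protocols()
    print(f"shared selection and learning: {shared_acc:.3f}")
    print(f"split selection and learning:  {split_acc:.3f}")
```

Which protocol scores higher depends on the training-set size, the number of features, and the signal strength, so varying `n_train`, `n_features`, and `shift` is the point of the exercise; the sketch makes no claim about which outcome the paper reports.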


