Graph-Based Iterative Hybrid Feature Selection

Erheng Zhong, Sihong Xie, W. Fan, Jiangtao Ren, Jing Peng, Kun Zhang
Published in: 2008 Eighth IEEE International Conference on Data Mining, 2008-12-15
DOI: 10.1109/ICDM.2008.63 (https://doi.org/10.1109/ICDM.2008.63)
Citations: 10

Abstract

When the number of labeled examples is limited, traditional supervised feature selection techniques often fail due to sample selection bias or the unrepresentative sample problem. To address this, semi-supervised feature selection techniques exploit the statistical information of both labeled and unlabeled examples at the same time. However, the results of semi-supervised feature selection can at times be unsatisfactory, and the difficulty lies in how to use the unlabeled data effectively. In contrast to both supervised and semi-supervised feature selection, we propose a "hybrid" framework based on graph models. We first apply supervised methods to select a small set of the most critical features from the labeled data. Importantly, these initial features might otherwise be missed when selection is performed on the labeled and unlabeled examples simultaneously. Next, this initial feature set is expanded and corrected with the use of unlabeled data. We formally analyze why the expected performance of the hybrid framework is better than that of both supervised and semi-supervised feature selection. Experimental results demonstrate that the proposed method outperforms both traditional supervised and state-of-the-art semi-supervised feature selection algorithms by at least 10% in accuracy on a number of text and biomedical problems with thousands of features to choose from. Software and datasets are available from the authors.
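The two-stage idea in the abstract (a supervised first pass on the labeled data, then expansion and correction using the unlabeled data) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the graph model, the scoring function, or the update rule, so everything below is an assumption chosen for illustration. The scoring function `class_separation`, the nearest-neighbor voting in `propagate_labels` (a simple stand-in for propagation over a graph), the function `hybrid_select`, and all parameter names and defaults are hypothetical. The sketch assumes binary labels 0/1 and that both classes appear in the labeled set.

```python
import numpy as np

def class_separation(X, y):
    """Per-feature score: |difference of class means| over the summed class stds.
    (An assumed, Fisher-style criterion; the paper's criterion is not given here.)"""
    a, b = X[y == 0], X[y == 1]
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)

def propagate_labels(X, y_lab, n_lab, k=5):
    """Pseudo-label each unlabeled row by majority vote of its k nearest labeled rows.
    Rows 0..n_lab-1 of X are labeled; the remaining rows are unlabeled."""
    d = ((X[n_lab:, None, :] - X[None, :n_lab, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_lab[nearest].mean(axis=1) > 0.5).astype(int)

def hybrid_select(X_lab, y_lab, X_unl, n_init=2, n_final=4, n_iter=3):
    """Hypothetical hybrid selection: supervised seed set, then iterative refinement."""
    n_lab = len(y_lab)
    X_all = np.vstack([X_lab, X_unl])
    # Stage 1: pick a small seed set using the labeled examples only.
    selected = np.argsort(class_separation(X_lab, y_lab))[::-1][:n_init]
    for _ in range(n_iter):
        # Stage 2a: pseudo-label the unlabeled rows in the current feature subspace.
        y_unl = propagate_labels(X_all[:, selected], y_lab, n_lab)
        if len(np.unique(y_unl)) < 2:
            break  # degenerate pseudo-labels; keep the current feature set
        # Stage 2b: re-score every feature on labeled + pseudo-labeled data,
        # which can both add new features and drop earlier ones.
        scores = class_separation(X_all, np.concatenate([y_lab, y_unl]))
        selected = np.argsort(scores)[::-1][:n_final]
    return np.sort(selected)

# Synthetic demo: columns 0-3 carry the class signal, columns 4-9 are noise.
rng = np.random.default_rng(0)
y_lab = np.array([0] * 10 + [1] * 10)
X_lab = rng.normal(size=(20, 10))
X_lab[y_lab == 1, :4] += 5.0
y_unl_true = rng.integers(0, 2, size=100)
X_unl = rng.normal(size=(100, 10))
X_unl[y_unl_true == 1, :4] += 5.0

print(hybrid_select(X_lab, y_lab, X_unl))
```

Re-scoring all features in each round is what lets this sketch both expand the seed set and correct it, mirroring the "expanded and corrected" wording of the abstract. A closer reproduction would replace the nearest-neighbor vote with the paper's actual graph model.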