An Improvement to Feature Selection of Random Forests on Spark

2014 IEEE 17th International Conference on Computational Science and Engineering. Pub Date: 2014-12-19. DOI: 10.1109/CSE.2014.159

Ke Sun, Wansheng Miao, Xin Zhang, Ruonan Rao

Citations: 9

Abstract

The Random Forests algorithm belongs to the class of ensemble learning methods, which are commonly used in classification problems. In this paper, we study the problem of adopting the Random Forests algorithm to learn from raw data drawn from real usage scenarios. We propose an improvement that is stable, strict, highly efficient, data-driven, and problem-independent, and that has no impact on algorithm performance, to address two practical issues in feature selection for the Random Forests algorithm. The first is to eliminate noisy features, which are irrelevant to the classification. The second is to eliminate redundant features, which are highly correlated with other features but otherwise useless. We implemented our approach on Spark. Experiments were performed to evaluate the improvement, and the results show that our approach achieves ideal performance.
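The abstract does not say how noisy and redundant features are detected, so the following is only a minimal sketch of the two ideas, not the authors' method. It assumes that relevance and redundancy can both be approximated with Pearson correlation, runs on a single machine in plain Python rather than on Spark, and uses hypothetical names (`pearson`, `select_features`) and hypothetical thresholds (`noise_threshold`, `redundancy_threshold`) chosen purely for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, noise_threshold=0.2, redundancy_threshold=0.95):
    """Return the names of features kept after dropping noisy, then redundant, ones.

    columns: dict mapping feature name -> list of numeric values.
    Both thresholds are hypothetical illustration values, not taken from the paper.
    """
    # Step 1: drop noisy features, i.e. those only weakly correlated with the label.
    relevance = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    relevant = [name for name in columns if relevance[name] >= noise_threshold]

    # Step 2: drop redundant features. Visit the most relevant first, and keep a
    # feature only if it is not strongly correlated with one already kept.
    kept = []
    for name in sorted(relevant, key=lambda n: -relevance[n]):
        if all(abs(pearson(columns[name], columns[k])) < redundancy_threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic data (an assumption for illustration): one informative feature,
# one linear copy of it, and one feature independent of the labels.
random.seed(0)
n = 1000
signal = [random.gauss(0, 1) for _ in range(n)]
labels = [1 if s > 0 else 0 for s in signal]
columns = {
    "signal": signal,
    "signal_copy": [2 * s + 1 for s in signal],       # redundant with "signal"
    "noise": [random.gauss(0, 1) for _ in range(n)],  # unrelated to the labels
}
selected = select_features(columns, labels)
```

On this synthetic data the sketch is expected to discard `noise` in the first step and to keep only one of `signal` and `signal_copy` in the second. Correlation only captures linear relationships, so a real pipeline would likely need a stronger relevance measure.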
