A novel random forests-based feature selection method for microarray expression data analysis

IF 0.4 4区生物学 Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics Pub Date : 2015-07-01 DOI:10.1504/IJDMB.2015.070852

Dengju Yao, Jing Yang, Xiaojuan Zhan, Xiaorong Zhan, Zhiqiang Xie

{"title":"A novel random forests-based feature selection method for microarray expression data analysis","authors":"Dengju Yao, Jing Yang, Xiaojuan Zhan, Xiaorong Zhan, Zhiqiang Xie","doi":"10.1504/IJDMB.2015.070852","DOIUrl":null,"url":null,"abstract":"High-dimensional data and a large number of redundancy features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that adopts the idea of stratifying feature space and combines generalised sequence backward searching and generalised sequence forward searching strategies. A random forest variable importance score is used to rank features, and different classifiers are used as a feature subset evaluating function. The proposed method is examined on five microarray expression datasets, including leukaemia, prostate, breast, nervous and DLBCL, and the average accuracies of the SVM classifier in these datasets are 100%, 95.24%, 85%, 91.67%, and 91.67%, respectively. The results show that the proposed method could not only improve the classification accuracy but also greatly reduce the computation time of the feature selection process.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":"13 1 1","pages":"84-101"},"PeriodicalIF":0.4000,"publicationDate":"2015-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.070852","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Data Mining and Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1504/IJDMB.2015.070852","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 21

Abstract

High-dimensional data and a large number of redundancy features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that adopts the idea of stratifying feature space and combines generalised sequence backward searching and generalised sequence forward searching strategies. A random forest variable importance score is used to rank features, and different classifiers are used as a feature subset evaluating function. The proposed method is examined on five microarray expression datasets, including leukaemia, prostate, breast, nervous and DLBCL, and the average accuracies of the SVM classifier in these datasets are 100%, 95.24%, 85%, 91.67%, and 91.67%, respectively. The results show that the proposed method could not only improve the classification accuracy but also greatly reduce the computation time of the feature selection process.

查看原文本刊更多论文

基于随机森林特征选择的微阵列表达数据分析方法

生物信息学研究中的高维数据和大量冗余特征对特征选择产生了迫切的需求。本文提出了一种新的基于随机森林的特征选择方法，该方法采用分层特征空间的思想，结合广义序列后向搜索和广义序列前向搜索策略。使用随机森林变量重要性评分对特征进行排序，并使用不同的分类器作为特征子集评估函数。在白血病、前列腺癌、乳腺癌、神经癌和DLBCL等5个微阵列表达数据集上进行检验，SVM分类器在这些数据集上的平均准确率分别为100%、95.24%、85%、91.67%和91.67%。结果表明，该方法不仅提高了分类精度，而且大大减少了特征选择过程的计算时间。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Data Mining and Bioinformatics 生物-数学与计算生物学

CiteScore

1.00

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： Mining bioinformatics data is an emerging area at the intersection between bioinformatics and data mining. The objective of IJDMB is to facilitate collaboration between data mining researchers and bioinformaticians by presenting cutting edge research topics and methodologies in the area of data mining for bioinformatics. This perspective acknowledges the inter-disciplinary nature of research in data mining and bioinformatics and provides a unified forum for researchers/practitioners/students/policy makers to share the latest research and developments in this fast growing multi-disciplinary research area.