Cost-sensitive classification on class-balanced ensembles for imbalanced non-coding RNA data

2016 IEEE EMBS International Student Conference (ISC). Pub Date: 2016-05-29. DOI: 10.1109/EMBSISC.2016.7508607

Bashier ElKarami, A. Alkhateeb, L. Rueda


Citations: 10

Abstract

Many bioinformatics data sets are class-imbalanced: the number of samples differs from class to class. Most such data sets contrast usual with unusual cases, e.g. cancer versus normal or miRNAs versus other non-coding RNAs, and the minority class, the one with the fewest samples, is the class of interest because it contains the unusual cases. Learning models based on standard classifiers, such as the support vector machine (SVM), random forest and k-NN, are usually biased towards the majority class, so they are likely to misclassify samples from the class of interest. Handling class-imbalanced data sets has therefore attracted growing interest from researchers. A combination of proper feature selection, a cost-sensitive classifier and ensembling based on the random forest method (BCE-CSC-RF) is proposed to handle class-imbalanced data. Random class-balanced ensembles are built individually. Each ensemble is then used as a training pool to classify the remaining out-of-bag samples, using a cost-sensitive classifier that incorporates random forest. Each sample is assigned the class that receives the most votes across all of its appearances in the formed ensembles. A set of performance measures, including a geometric measure, suggests that the model can improve the classification of minority-class samples.
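The ensemble-and-vote procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the function names (`balanced_ensembles`, `oob_vote_predict`, `cost_sensitive_centroid`, `g_mean`), the 70% sampling fraction, the cost values, and the toy one-dimensional data are all hypothetical. The base learner here is a trivial cost-weighted nearest-centroid rule standing in for the cost-sensitive random forest; in practice one might plug in something like scikit-learn's `RandomForestClassifier` with its `class_weight` parameter. The "geometric measure" is assumed to be the geometric mean of per-class recalls (G-mean), a common choice for imbalanced data.

```python
import math
import random
from collections import Counter, defaultdict


def balanced_ensembles(labels, n_ensembles, rng, fraction=0.7):
    """Yield index lists holding the same number of samples from every class.

    Assumption: each class contributes `fraction` of the minority-class size,
    so some minority samples stay out of bag and can be classified.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    size = max(1, int(fraction * min(len(idx) for idx in by_class.values())))
    for _ in range(n_ensembles):
        bag = []
        for idx in by_class.values():
            bag.extend(rng.sample(idx, size))
        yield bag


def cost_sensitive_centroid(class_cost):
    """Stand-in base learner (NOT a random forest): nearest class mean in 1-D,
    with each distance divided by that class's misclassification cost."""
    def fit(train_x, train_y):
        sums, counts = defaultdict(float), Counter()
        for x, y in zip(train_x, train_y):
            sums[y] += x
            counts[y] += 1
        means = {c: sums[c] / counts[c] for c in counts}
        return lambda x: min(means, key=lambda c: abs(x - means[c]) / class_cost[c])
    return fit


def oob_vote_predict(xs, ys, fit, n_ensembles=25, seed=0):
    """Train on each balanced ensemble, vote on its out-of-bag samples, and
    return {sample index: most-voted class} for samples that received votes."""
    rng = random.Random(seed)
    votes = defaultdict(Counter)
    for bag in balanced_ensembles(ys, n_ensembles, rng):
        in_bag = set(bag)
        predict = fit([xs[i] for i in bag], [ys[i] for i in bag])
        for i, x in enumerate(xs):
            if i not in in_bag:
                votes[i][predict(x)] += 1
    return {i: c.most_common(1)[0][0] for i, c in votes.items()}


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1 / len(recalls))


# Toy imbalanced data: 20 majority ("other") and 5 minority ("miRNA") samples.
xs = [i * 0.1 for i in range(20)] + [10.0 + i * 0.1 for i in range(5)]
ys = ["other"] * 20 + ["miRNA"] * 5
fit = cost_sensitive_centroid({"other": 1.0, "miRNA": 2.0})
pred = oob_vote_predict(xs, ys, fit)
covered = sorted(pred)
print(g_mean([ys[i] for i in covered], [pred[i] for i in covered]))
```

Because the base learner is passed in as a `fit` callable, the balancing and voting logic stays independent of the classifier, so a cost-sensitive random forest could replace the stand-in without touching `oob_vote_predict`. A sample that happens to be in every bag receives no votes and is simply absent from the result.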
