Selection of Significant Clusters of Genes based on Ensemble Clustering and Recursive Cluster Elimination (RCE)

Loai Abdallah, Waleed Khalifa, L. Showe, M. Yousef
Journal of Proteomics & Bioinformatics, vol. 10, pp. 186-192. Published 2017-08-08. DOI: 10.4172/JPB.1000439 (https://doi.org/10.4172/JPB.1000439)
Citation count: 5

Abstract

Background: Advances in technology have facilitated the generation of gene expression data from large numbers of samples and the development of “Big Data” approaches to analysing gene expression in basic and biomedical systems. Even so, the data still typically include relatively few samples and tens of thousands of variables (gene expression values). A variety of approaches have been developed for searching these gene spaces to select the most informative variables, those that can accurately distinguish one class of subjects or samples from another. However, there is still a need for new approaches that can accurately distinguish biologically different classes of subjects with similar gene expression profiles. We describe a new and promising approach that addresses this problem: a method for identifying significant differentially expressed clusters of genes using Recursive Cluster Elimination (RCE) based on an ensemble clustering approach. We refer to this approach as SVM-RCE-EC (Ensemble Clustering). We show that SVM-RCE-EC improves gene selection and classification accuracy compared to other methods, including the traditional SVM-RCE approach, and that the improvement is particularly evident on difficult data sets that are poorly resolved by other approaches.

Methods: To implement SVM-RCE-EC, we first applied an ensemble clustering method to identify robust gene clusters. We then applied Support Vector Machines (SVMs) with cross-validation to score (rank) those clusters according to their contribution to classification accuracy. The RCE procedure progressively removes the least significant clusters and retains the most significant ones until the most robust, significantly differentially expressed genes between the two classes are identified. We compare the classification performance of SVM-RCE-EC to that of a variety of published classification algorithms.

Results and Conclusion: Using gene clusters selected by the ensemble method enhances classification performance compared to other methods and identifies sets of significant genes that appear to be more biologically meaningful to the system being analyzed. SVM-RCE-EC outperforms several other methods on data representing highly similar sample classes that are difficult to distinguish, and is comparable to other methods on data where the classes are more easily separated. The improved performance of SVM-RCE-EC on difficult data sets is likely because the significant clusters, as determined by the ensemble approach, capture the native structure of the data, whereas SVM-RCE leaves that determination to the user. This hypothesis is supported by the observation that the performance of the clusters generated by SVM-RCE-EC is more robust.

Availability: The Matlab version of SVM-RCE-EC is available upon request from the first author and on GitHub (https://github.com/malikyousef/svm-rce-ec).
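
The three steps in the Methods paragraph (ensemble clustering, cluster scoring, recursive elimination) can be illustrated with a small sketch. This is not the authors' Matlab implementation, and the abstract does not specify the details, so several things here are assumptions: the ensemble step groups genes that receive the same label in every base clustering run (one common way to build a consensus), a leave-one-out nearest-centroid classifier stands in for the cross-validated SVM to keep the example dependency-free, and the function names, the `keep` and `drop_fraction` parameters, and the toy data are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # rows = samples, columns = genes


def ensemble_clusters(labelings: Sequence[Sequence[int]]) -> List[List[int]]:
    """Group genes that share the same label in every base clustering.

    labelings[r][g] is the label of gene g in base clustering run r.
    (Assumed consensus rule; the abstract does not state the one used.)
    """
    groups: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for g in range(len(labelings[0])):
        signature = tuple(run[g] for run in labelings)
        groups[signature].append(g)
    return list(groups.values())


def loo_centroid_accuracy(X: Matrix, y: Sequence[int],
                          genes: Sequence[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`.

    Stand-in for the cross-validated SVM score described in the abstract.
    """
    correct = 0
    for held in range(len(X)):
        sums: Dict[int, List[float]] = {}
        counts: Dict[int, int] = defaultdict(int)
        for i, row in enumerate(X):
            if i == held:
                continue  # the held-out sample is not used for training
            acc = sums.setdefault(y[i], [0.0] * len(genes))
            for j, g in enumerate(genes):
                acc[j] += row[g]
            counts[y[i]] += 1

        def sq_dist(label: int) -> float:
            # Squared Euclidean distance to the class centroid.
            return sum((X[held][g] - sums[label][j] / counts[label]) ** 2
                       for j, g in enumerate(genes))

        predicted = min(sums, key=sq_dist)
        correct += predicted == y[held]
    return correct / len(X)


def recursive_cluster_elimination(
    X: Matrix,
    y: Sequence[int],
    clusters: Sequence[Sequence[int]],
    score: Callable[[Matrix, Sequence[int], Sequence[int]], float] = loo_centroid_accuracy,
    keep: int = 1,
    drop_fraction: float = 0.5,
) -> List[List[int]]:
    """Score clusters, drop the lowest-scoring ones, repeat until `keep` remain."""
    current = [list(c) for c in clusters]
    while len(current) > keep:
        ranked = sorted(current, key=lambda c: score(X, y, c), reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        current = ranked[:max(keep, len(ranked) - n_drop)]
    return current


# Toy data: 6 samples x 4 genes. Genes 0-1 separate the classes; genes 2-3 do not.
X = [
    [1.0, 1.1, 5.0, 2.0],
    [0.9, 1.0, 2.0, 5.0],
    [1.1, 0.9, 4.0, 3.0],
    [3.0, 3.1, 5.0, 2.0],
    [2.9, 3.0, 2.0, 5.0],
    [3.1, 2.9, 4.0, 3.0],
]
y = [0, 0, 0, 1, 1, 1]
base_labelings = [[0, 0, 1, 1], [1, 1, 0, 0]]  # two hypothetical base clusterings

clusters = ensemble_clusters(base_labelings)
selected = recursive_cluster_elimination(X, y, clusters, keep=1)
print(selected)  # prints [[0, 1]]
```

In the toy run, the two base clusterings agree that genes 0 and 1 belong together and genes 2 and 3 belong together, so the consensus yields two clusters; the elimination loop then keeps the cluster whose genes classify the held-out samples correctly. Passing a different `score` callable (for example, one wrapping a cross-validated SVM) would bring the sketch closer to the procedure the abstract describes.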