Gene selection for multiclass prediction of microarray data

Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003. Pub Date: 2003-08-11. DOI: 10.1109/CSB.2003.1227385

Dechang Chen, D. Hua, J. Reifman, Xiuzhen Cheng


Citations: 22

Abstract

Gene expression data from microarrays have been successfully applied to class prediction, where the purpose is to predict the diagnostic category of a sample from its gene expression profile. A typical microarray dataset consists of expression levels for a large number of genes measured on a relatively small number of samples. As a consequence, a basic and important question in class prediction is: how do we identify a small subset of informative genes that contribute the most to the classification task? Many methods have been proposed, but most focus on two-class problems, such as discriminating between normal and disease samples. This paper addresses the selection of informative genes for multiclass prediction problems by considering all the classes jointly. Our approach is based on the power of the genes to discriminate among the different classes (e.g., tumor types) and on the existing correlation between genes. We model the expression levels of a given gene with a one-way analysis of variance (ANOVA) model with heterogeneous variances, and determine the discriminatory power of the gene by a test statistic designed to test the equality of the class means. In other words, the discriminatory power of a gene is associated with a Behrens-Fisher problem. Informative genes are chosen such that each selected gene has high discriminatory power and the correlation between any pair of selected genes is low. The test statistics considered in this paper are the ANOVA F test statistic, the Brown-Forsythe test statistic, the Cochran test statistic, and the Welch test statistic. Their performance is evaluated with several classification methods applied to two publicly available microarray datasets. The results show that the Brown-Forsythe test statistic achieves the best performance.
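The selection idea described in the abstract can be illustrated with a minimal Python sketch. It scores each gene with the Brown-Forsythe statistic, F* = sum_i n_i (mean_i - grand_mean)^2 / sum_i (1 - n_i/N) s_i^2, and then keeps high-scoring genes whose correlation with every already selected gene is low. The abstract does not give the exact selection procedure, so the greedy loop, the function names `brown_forsythe_stat` and `select_genes`, and the `max_abs_corr` threshold of 0.8 are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def brown_forsythe_stat(x, labels):
    """Brown-Forsythe statistic for equality of class means of one gene.

    x      : 1-D array of expression levels, one value per sample
    labels : 1-D array of class labels, one per sample
    Assumes every class has at least two samples and nonzero variance overall.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n_total = x.size
    grand_mean = x.mean()
    numerator = 0.0
    denominator = 0.0
    for c in np.unique(labels):
        xc = x[labels == c]
        n_c = xc.size
        # Between-class spread, weighted by class size
        numerator += n_c * (xc.mean() - grand_mean) ** 2
        # Class variances enter separately (no pooled-variance assumption)
        denominator += (1.0 - n_c / n_total) * xc.var(ddof=1)
    return numerator / denominator


def select_genes(X, labels, n_genes, max_abs_corr=0.8):
    """Greedy selection (an assumed procedure, not taken from the paper).

    X : 2-D array, rows are samples and columns are genes.
    Genes are visited in order of decreasing score; a gene is kept only if
    its absolute Pearson correlation with every kept gene is <= max_abs_corr.
    """
    X = np.asarray(X, dtype=float)
    scores = np.array(
        [brown_forsythe_stat(X[:, j], labels) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores, kind="stable")  # highest score first
    selected = []
    for j in order:
        if len(selected) == n_genes:
            break
        if all(
            abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) <= max_abs_corr
            for s in selected
        ):
            selected.append(int(j))
    return selected, scores
```

The other statistics named in the abstract (ANOVA F, Cochran, Welch) could be swapped in by replacing `brown_forsythe_stat` with a function of the same signature; the correlation threshold would normally be tuned per dataset.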

