稳健和准确的癌症分类与基因表达谱。

Proceedings. IEEE Computational Systems Bioinformatics Conference Pub Date : 2005-01-01 DOI:10.1109/csb.2005.49

Haifeng Li, Keshu Zhang, Tao Jiang

{"title":"稳健和准确的癌症分类与基因表达谱。","authors":"Haifeng Li, Keshu Zhang, Tao Jiang","doi":"10.1109/csb.2005.49","DOIUrl":null,"url":null,"abstract":"Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples to features per class ratio. As a result, it can be used to classify new samples robustly with low and trustable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix S(w) be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when S(w) is nonsingular. Different from the conventional LDA, GLDA does not assume the nonsingularity of S(w), and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. Especially on some difficult instances that have very small samples to genes per class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines, random forests, etc.","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"310-21"},"PeriodicalIF":0.0000,"publicationDate":"2005-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2005.49","citationCount":"23","resultStr":"{\"title\":\"Robust and accurate cancer classification with gene expression profiling.\",\"authors\":\"Haifeng Li, Keshu Zhang, Tao Jiang\",\"doi\":\"10.1109/csb.2005.49\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples to features per class ratio. As a result, it can be used to classify new samples robustly with low and trustable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix S(w) be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when S(w) is nonsingular. Different from the conventional LDA, GLDA does not assume the nonsingularity of S(w), and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. Especially on some difficult instances that have very small samples to genes per class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines, random forests, etc.\",\"PeriodicalId\":87417,\"journal\":{\"name\":\"Proceedings. IEEE Computational Systems Bioinformatics Conference\",\"volume\":\" \",\"pages\":\"310-21\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2005-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/csb.2005.49\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. IEEE Computational Systems Bioinformatics Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/csb.2005.49\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE Computational Systems Bioinformatics Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/csb.2005.49","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 23

摘要

准确可靠的癌症分类对癌症治疗至关重要。基因表达谱分析有望使我们能够准确、系统地诊断肿瘤。然而，由于维数的诅咒和小样本量的问题，在这种情况下的分类任务非常具有挑战性。在本文中，我们提出了一种新的方法来解决这两个问题。我们的方法能够将基因表达数据映射到一个非常低维的空间，从而满足推荐的样本与每类特征的比率。因此，它可以用于对新样本进行鲁棒分类，并且错误率低且可靠。该方法基于线性判别分析(LDA)。然而，传统的LDA要求类内散射矩阵S(w)是非奇异的。不幸的是，由于样本量小的问题，在癌症分类中，Sw总是单一的。为了克服这个问题，我们开发了一个广义线性判别分析(GLDA)，它是优化Fisher准则的一般、直接和完整的解。当S(w)为非奇异时，GLDA在数学上有良好的基础，与传统的LDA一致。与传统的LDA不同，GLDA不假设S(w)的非奇异性，因此自然解决了小样本量问题。为了适应散点矩阵的高维性，本文还提出了一种快速的GLDA算法。我们在七个公共癌症数据集上的大量实验表明，该方法性能良好。特别是在一些具有非常小的样本与每类基因比率的困难实例中，我们的方法比广泛使用的分类方法(如支持向量机，随机森林等)实现了更高的准确性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Robust and accurate cancer classification with gene expression profiling.

Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples to features per class ratio. As a result, it can be used to classify new samples robustly with low and trustable (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix S(w) be nonsingular. Unfortunately, Sw is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when S(w) is nonsingular. Different from the conventional LDA, GLDA does not assume the nonsingularity of S(w), and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm of GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. Especially on some difficult instances that have very small samples to genes per class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines, random forests, etc.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings. IEEE Computational Systems Bioinformatics Conference

自引率

0.00%

发文量