Analysis of gene expression microarrays for phenotype classification.

A Califano, G Stolovitzky, Y Tu
Proceedings. International Conference on Intelligent Systems for Molecular Biology, 2000. Journal Article.

Abstract

Several microarray technologies that monitor the expression levels of a large number of genes have recently emerged. Given DNA microarray data for a set of cells characterized by a given phenotype and for a set of control cells, an important problem is to identify "patterns" of gene expression that can be used to predict cell phenotype. The potential number of such patterns is exponential in the number of genes. In this paper, we propose a solution to this problem based on a supervised learning algorithm that differs substantially from previous schemes. It couples a complex, non-linear similarity metric, which maximizes the probability of discovering discriminative gene expression patterns, with a pattern discovery algorithm called SPLASH. The latter efficiently and deterministically discovers all statistically significant gene expression patterns in the phenotype set. Statistical significance is evaluated based on the probability that a pattern occurs by chance in the control set. Finally, a greedy set covering algorithm is used to select an optimal subset of statistically significant patterns, which form the basis for a standard likelihood ratio classification scheme. We analyze data from 60 human cancer cell lines using this method and compare our results with those of other supervised learning schemes. Several phenotypes are studied, including cancer morphologies (such as melanoma), molecular targets (such as mutations in the p53 gene), and therapeutic targets related to sensitivity to anticancer compounds. We also analyze a synthetic data set, which shows that this technique is especially well suited for the analysis of sub-phenotype mixtures. For complex phenotypes, such as p53, our method produces an encouragingly low rate of false positives and false negatives and seems to outperform the others. Similarly low rates are reported when predicting the efficacy of experimental anticancer compounds. This counts among the first reported studies in which drug efficacy has been successfully predicted from large-scale expression data analysis.
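
The last two steps of the pipeline described above (greedy set covering over significant patterns, then likelihood ratio scoring) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the representation of a pattern as the set of phenotype samples it matches, the alphabetical tie-breaking, and the pseudocount smoothing are all assumptions made for illustration. A greedy cover is in general a heuristic and is not guaranteed to be the smallest possible cover.

```python
from math import log


def greedy_set_cover(samples, pattern_matches):
    """Greedily pick patterns until every sample is covered.

    samples: iterable of phenotype sample ids (hypothetical representation).
    pattern_matches: dict mapping a pattern name to the set of samples it matches.
    Returns (chosen pattern names in pick order, samples left uncovered).
    """
    uncovered = set(samples)
    chosen = []
    while uncovered:
        # Pick the pattern that covers the most still-uncovered samples.
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(
            sorted(pattern_matches),
            key=lambda name: len(pattern_matches[name] & uncovered),
            default=None,
        )
        if best is None:
            break  # no patterns were supplied
        gain = pattern_matches[best] & uncovered
        if not gain:
            break  # no remaining pattern covers anything new
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


def log_likelihood_ratio(hits_pheno, n_pheno, hits_ctrl, n_ctrl, pseudocount=1.0):
    """Log ratio of a pattern's match frequency in phenotype vs. control samples.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a pattern never occurs in the control set.
    """
    p_pheno = (hits_pheno + pseudocount) / (n_pheno + 2 * pseudocount)
    p_ctrl = (hits_ctrl + pseudocount) / (n_ctrl + 2 * pseudocount)
    return log(p_pheno / p_ctrl)


# Toy data: five phenotype samples and four hypothetical patterns.
samples = ["s1", "s2", "s3", "s4", "s5"]
pattern_matches = {
    "p1": {"s1", "s2", "s3"},
    "p2": {"s3", "s4"},
    "p3": {"s4", "s5"},
    "p4": {"s1"},
}
chosen, uncovered = greedy_set_cover(samples, pattern_matches)
print(chosen, uncovered)
```

In this toy run, `p1` is picked first because it covers three samples, and `p3` is picked next because it covers both remaining samples. A positive `log_likelihood_ratio` means the pattern is more frequent in the phenotype set than in the control set, so it counts as evidence for the phenotype.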
