Challenges in Prediction of different Cancer Stages using Gene Expression Profile of Cancer Patients

Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2017-08-20. DOI: 10.1145/3107411.3108211

Sherry Bhalla, Suresh Sharma, Gajendra P.S. Raghava

Abstract

Despite the plethora of gene expression-based cancer biomarkers in the scientific literature, few make their way to the clinic. Several efforts have been made to predict cancer biomarkers, with very limited success so far. One of the challenges in cancer biology is to predict cancer at an early stage. The success of various therapies for cancer patients depends on correct identification of the stage or progression of the cancer. Despite tremendous progress in genomics and proteomics, the performance of stage classification has not improved substantially. Our group recently developed CancerCSP, a server with prediction models for discriminating early- and late-stage clear cell renal cancer (ccRCC) samples based on their gene expression profile. We achieved a maximum accuracy of 72.64% with a ROC value of 0.81, even though we tried state-of-the-art techniques to improve the performance of our models. This raises the question of why the models fail to discriminate early- and late-stage ccRCC patients with high accuracy. In this poster, we analyze ccRCC samples obtained from The Cancer Genome Atlas (TCGA) data portal to understand why the stage classification models fail. Firstly, we performed a bin-wise analysis of the top 20 genes that discriminate early- and late-stage samples with the highest ROC (single-gene models using a threshold). A significant overlap was observed in the expression of each gene between early- and late-stage samples. Though the number of early- and late-stage samples varied across gene expression bins, this was not sufficient to classify both types of samples with high accuracy. As an example, the gene NR3C2 had a maximum ROC of 0.67 at an expression (log RSEM) of 7.61.
Nearly 70% of early-stage patients were above this threshold, which made it an average expression marker, but nearly 55% of late-stage patients were also above it, which increased the false positives. Secondly, hierarchical clustering of ccRCC samples using 64 gene expression features selected with Weka showed weak concordance with pathological stage. The k-means clustering of patients into four groups showed four separable clusters, but these clusters were not associated with the pathological stage. These observations led to the conclusion that the molecular parameters do not always comply with histopathological features. The third analysis was done to identify patients who were not predicted correctly by any of the four machine-learning algorithms (SVM, Random Forest, SMO and Naïve Bayes). Many samples were not predicted correctly by any of the four methods. The false positives and false negatives belonged to specific clusters obtained through clustering. This further points to the interspersed nature of the data, which makes it difficult to differentiate between histopathological stages of cancer. We conclude that the expression profile of genes is not adequate to classify different stages of cancer samples.
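
The single-gene threshold analysis described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the expression values are hypothetical, and the rule for picking the cutoff (largest gap between the early-stage and late-stage fractions above it) is an assumption, since the abstract does not say how thresholds were chosen. It shows how a marker can sit above the cutoff in most early-stage samples while a large share of late-stage samples is above it as well.

```python
from typing import List, Tuple


def auc_mann_whitney(early: List[float], late: List[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    early-stage value exceeds a random late-stage value (ties count 0.5)."""
    wins = 0.0
    for e in early:
        for v in late:
            if e > v:
                wins += 1.0
            elif e == v:
                wins += 0.5
    return wins / (len(early) * len(late))


def fraction_above(values: List[float], threshold: float) -> float:
    """Fraction of samples whose expression is strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)


def best_threshold(early: List[float], late: List[float]) -> Tuple[float, float, float]:
    """Try every observed value as a cutoff and return
    (threshold, early fraction above, late fraction above) for the cutoff
    with the largest gap between the two fractions (assumed criterion)."""
    best = None
    for t in sorted(set(early) | set(late)):
        early_frac = fraction_above(early, t)
        late_frac = fraction_above(late, t)
        if best is None or (early_frac - late_frac) > (best[1] - best[2]):
            best = (t, early_frac, late_frac)
    return best


# Hypothetical log-scale expression values for one gene; NOT TCGA data.
early_stage = [8.2, 7.9, 7.7, 8.5, 7.4, 8.0, 7.2, 8.1, 7.8, 6.9]
late_stage = [7.8, 7.1, 7.7, 6.8, 8.0, 7.3, 7.9, 6.5, 7.0, 7.65]

threshold, early_frac, late_frac = best_threshold(early_stage, late_stage)
print(f"AUC: {auc_mann_whitney(early_stage, late_stage):.2f}")
print(f"cutoff: {threshold}")
print(f"early-stage samples above cutoff: {early_frac:.0%}")
print(f"late-stage samples above cutoff (false positives): {late_frac:.0%}")
```

When the two expression distributions overlap heavily, no cutoff gives a high early-stage fraction and a low late-stage fraction at the same time, which is the pattern the abstract reports for NR3C2.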
