Challenges in Prediction of different Cancer Stages using Gene Expression Profile of Cancer Patients
Authors: Sherry Bhalla, Suresh Sharma, Gajendra P.S. Raghava
Published in: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017-08-20
DOI: 10.1145/3107411.3108211 (https://doi.org/10.1145/3107411.3108211)
Citation count: 0
Abstract
Despite the plethora of gene-expression-based cancer biomarkers in the scientific literature, few make their way to the clinic. Several efforts have been made in the past to predict cancer biomarkers, with very limited success so far. One of the challenges in cancer biology is to predict cancer at an early stage. The success of the various therapies used to treat cancer patients depends on correct identification of the stage or progression of the cancer. Despite tremendous progress in genomics and proteomics, the performance of stage classification has not improved substantially. Our group recently developed CancerCSP, a server with prediction models for discriminating early- and late-stage clear cell renal cell carcinoma (ccRCC) samples based on their gene expression profiles. We achieved a maximum accuracy of 72.64% with an ROC value of 0.81, even though we tried state-of-the-art techniques to improve the performance of our models. This raises the question of why the models fail to discriminate early- and late-stage ccRCC patients with high accuracy. In this poster, we analyze ccRCC samples obtained from The Cancer Genome Atlas (TCGA) data portal to understand why the stage classification models fail. First, we performed a bin-wise analysis of the top 20 genes that discriminate early- and late-stage samples with the highest ROC in single-gene, threshold-based models. A significant overlap was observed in the expression of each gene between early- and late-stage samples. Although the number of early- and late-stage samples varied across gene expression bins, this variation was not sufficient to classify the two types of samples with high accuracy. As an example, the gene NR3C2 had a maximum ROC of 0.67 at an expression (log RSEM) threshold of 7.61.
Nearly 70% of early-stage patients were above this threshold, which made it an average expression marker, but nearly 55% of late-stage patients were also above it, which increased the false positives. Second, hierarchical clustering of the ccRCC samples using 64 gene expression features selected with Weka showed weak concordance with pathological stage. k-means clustering of the patients into four groups produced four separable clusters, but these clusters were not associated with pathological stage. These observations led to the conclusion that molecular parameters do not always agree with histopathological features. The third analysis identified patients who were not predicted correctly by any of the four machine-learning algorithms (SVM, Random Forest, SMO and Naïve Bayes); many samples fell into this group. The false positives and false negatives belonged to distinct clusters obtained in the clustering analysis. This further points to the interspersed nature of the data, which makes it difficult to differentiate between histopathological stages of cancer. We conclude that the expression profile of genes is not adequate to classify cancer samples into different stages.
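
The NR3C2 figures quoted in the abstract can be turned into a small worked calculation that shows why such a marker separates the stages poorly. The sketch below is illustrative only and is not the authors' code. It assumes that "early stage" is the positive class, that the rule is "expression above the threshold implies early stage", and that the approximate rates of 70% and 55% can be read as sensitivity and false-positive rate. The function name `threshold_marker_summary` is hypothetical.

```python
# Illustrative arithmetic only: the two rates are the approximate figures
# quoted in the abstract for NR3C2 at a log RSEM threshold of 7.61.
# Treating "early stage" as the positive class is an assumption.


def threshold_marker_summary(frac_early_above, frac_late_above):
    """Summarise the rule 'expression above threshold => early stage'."""
    sensitivity = frac_early_above         # early samples called early
    false_positive_rate = frac_late_above  # late samples called early
    specificity = 1.0 - false_positive_rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
        "youden_j": sensitivity - false_positive_rate,
    }


nr3c2 = threshold_marker_summary(frac_early_above=0.70, frac_late_above=0.55)
for name, value in nr3c2.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the rule reaches a balanced accuracy of about 0.575 and a Youden's J of about 0.15, only modestly better than chance (0.5 and 0 respectively), which is consistent with the overlap in expression described above.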