Deep Learning-based Identification of Cancer or Normal Tissue using Gene Expression Data

T. Ahn, Taewan Goo, Chan-hee Lee, Sungmin Kim, Kyullhee Han, Sangick Park, T. Park
2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), published 2018-12-01. DOI: 10.1109/BIBM.2018.8621108 (https://doi.org/10.1109/BIBM.2018.8621108)
Citations: 26

Abstract

Background: Deep learning has shown outstanding performance on recognition and classification problems. As increasing amounts of cancer and normal gene expression data become publicly available, deep learning may become an integral tool for efficiently finding specific patterns within massive datasets. We therefore ask to what extent a machine can learn to recognize cancer. We integrated cancer and normal tissue data from the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), and Genotype-Tissue Expression (GTEx) databases, comprising 13,406 cancer and 12,842 normal gene expression samples from 24 different tissues. We first trained a deep neural network (DNN) to discriminate between cancer and normal samples using various gene selection strategies, as well as therapeutic target genes from commercial cancer panels and genes in NCI-curated cancer pathways. We also propose a systematic analysis method for interpreting the trained DNN, and applied it to find the genes that contribute most to classifying an individual sample as cancer.
Result: The best trained DNN classified cancer and normal samples with an accuracy of 0.997 on the training set of 13,123 samples (cancer: 6,703; normal: 6,420). On the independent test set of 13,125 samples (cancer: 6,703; normal: 6,422), the DNN achieved an accuracy of 0.979. Using the same training and test data, our DNN outperformed conventional prediction methods, with the support vector machine approach ranking second. For interpretation, we propose a method that extracts each gene’s contribution to an individual sample’s cancer probability from the trained DNN. This method distinguished samples dependent on one or a few genes, suggesting that these samples are possibly “oncogene addicted”.
Conclusion: A deep learning approach in conjunction with our interpretation method is not only a useful tool to identify cancer from gene expression data but can also contribute toward understanding the complex nature of cancer based on large public data.
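The abstract does not spell out how a gene's contribution to a sample's cancer probability is extracted, so the following is only an illustrative sketch of one common idea, occlusion-style attribution, and should not be read as the authors' method. It assumes a tiny one-hidden-layer network with hand-set weights; the weights, the expression values, and the all-zero reference profile are hypothetical and chosen only so the example runs. For each gene, the expression value is replaced by the reference value and the resulting drop in predicted cancer probability is recorded.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, W1, b1, w2, b2):
    """Cancer probability from a one-hidden-layer network (ReLU, sigmoid output)."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return sigmoid(hidden @ w2 + b2)

def gene_contributions(x, reference, params):
    """Occlusion-style attribution (an assumption, not the paper's method):
    replace one gene at a time with its reference value and record how much
    the predicted cancer probability drops."""
    base = predict_proba(x, *params)
    contributions = np.empty_like(x, dtype=float)
    for g in range(x.shape[0]):
        occluded = x.copy()
        occluded[g] = reference[g]
        contributions[g] = base - predict_proba(occluded, *params)
    return base, contributions

# Hypothetical toy weights for 4 genes and 2 hidden units (not from the paper).
W1 = np.array([[2.0, 0.0],
               [0.0, 0.5],
               [0.0, 0.5],
               [0.0, 0.0]])
b1 = np.zeros(2)
w2 = np.array([3.0, 1.0])
b2 = -2.0
params = (W1, b1, w2, b2)

sample = np.array([1.5, 0.2, 0.1, 0.9])  # hypothetical expression values
reference = np.zeros(4)                  # hypothetical baseline profile

prob, contrib = gene_contributions(sample, reference, params)
top_gene = int(np.argmax(contrib))       # index of the most influential gene
print(f"cancer probability: {prob:.3f}")
print("per-gene contributions:", np.round(contrib, 3))
```

In this toy setup a single gene accounts for almost all of the predicted probability, which is the kind of pattern the abstract describes as possibly “oncogene addicted”. Other attribution schemes (for example, gradient-based ones) would be equally plausible readings of the abstract.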