Identifying pan-cancer and cancer subtype miRNAs using interpretable convolutional neural network

IF 3.7 · Zone 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

Journal of Computational Science Pub Date : 2025-02-01 DOI:10.1016/j.jocs.2024.102510

Joginder Singh , Shubhra Sankar Ray , Sukriti Roy

Journal of Computational Science, volume 85, Article 102510. Full text: https://www.sciencedirect.com/science/article/pii/S187775032400303X

Citations: 0

Abstract

Background:

MicroRNAs (miRNAs) are short (∼22 nt) non-coding RNAs that are considered important biomarkers in pan-cancer analysis. Pan-cancer analysis is the study of commonalities and differences in genetic and cellular alterations across cancer types. A common computational challenge is that miRNA expression data are high dimensional and complex (HDC). Convolutional neural networks have performed well in this setting because they are suited to finding patterns in complex data.

Methodology:

An interpretable convolutional neural network model (ICNNM) is developed for classifying miRNA-expression-based pan-cancer data. The ICNNM is a one-dimensional model. Its layers and other hyperparameters are optimized using Bayesian optimization with a multivariate tree Parzen estimator (BoMTPE). To explain the behavior of the ICNNM, an interpretable approach based on SHapley Additive exPlanations (SHAP) values is developed. It introduces an attribution score, computed from SHAP values, for identifying relevant miRNAs. Because SHAP values rest on a game-theoretic notion of each feature's contribution, miRNAs that contribute more to the accurate prediction of a patient's tumor class receive higher attribution scores. The model is evaluated on nine datasets: six (four general pan-cancer and two subtype) are derived from a single TCGA pan-cancer dataset, one is a breast cancer subtype dataset downloaded from TCGA, and two rare-cancer datasets, nasopharyngeal carcinoma and bone and soft tissue sarcoma, are downloaded from GEO.
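The abstract does not give the exact formula for the attribution score, so the following is only a minimal sketch of the general idea: exact Shapley values are computed by enumerating feature coalitions for a tiny hypothetical model, and each feature is then scored by its mean absolute Shapley value across samples. The function names, the toy model, the all-zero baseline, and the mean-absolute-value aggregation are all assumptions for illustration, not the authors' method; with hundreds of miRNAs, exact enumeration is infeasible and an approximation such as those in the SHAP library would be needed.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all coalitions.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(masked)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def attribution_scores(predict, samples, baseline):
    """Mean absolute Shapley value per feature (an assumed scoring rule)."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, p in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(p)
    return [t / len(samples) for t in totals]


# Hypothetical stand-in for a trained classifier's tumor-class score,
# taking three miRNA expression values as input.
def toy_model(x):
    return 2.0 * x[0] + 0.5 * x[1] * x[2]


samples = [[1.0, 2.0, 0.0], [0.5, 1.0, 3.0], [2.0, 0.0, 1.0]]
baseline = [0.0, 0.0, 0.0]

scores = attribution_scores(toy_model, samples, baseline)
# Feature indices ordered from most to least relevant.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

In this toy setting the first feature dominates the model output, so it is ranked first. The Shapley values of a sample also sum to the difference between the model's output on that sample and on the baseline, which is a useful sanity check for any implementation.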

Results:

The ICNNM performs better than related techniques such as three variations of the CNN model, random forest (RF), SVM, Gboost, XGBoost, and CatBoost. Performance is evaluated in terms of F1-score, the discriminability of expression between normal and tumor classes, and the biological significance of the selected miRNAs. Biological significance is established through existing literature and online resources such as Gene Ontology and KEGG pathways, after obtaining target genes from the miRDB database. The F1-score of the ICNNM ranges from 0.95 to 0.99 on the four general pan-cancer datasets, from 0.91 to 0.99 on the three subtype datasets, and from 0.76 to 0.90 on the two rare-cancer datasets. Many of the selected miRNAs are reported as key biomarkers in various tumor classes by existing investigations. Three miRNAs, miR-503, miR-202, and miR-135a, can be considered novel predictions for the cancer classes prostate and rectum, mesothelioma, and testicular germ cells, respectively, because their target genes, obtained from miRDB, are involved in related cancer pathways.
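For readers unfamiliar with the metric, the sketch below shows how a multi-class F1-score can be computed from true and predicted labels. The abstract does not state which averaging is used, so the macro average (the unweighted mean of per-class F1) is an assumption here, and the labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            per_class.append(2 * precision * recall / (precision + recall))
        else:
            # Class never predicted correctly: its F1 is defined as 0 here.
            per_class.append(0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels for five samples.
y_true = ["normal", "tumor", "tumor", "normal", "tumor"]
y_pred = ["normal", "tumor", "normal", "normal", "tumor"]
print(macro_f1(y_true, y_pred))
```

Macro averaging treats every class equally regardless of its sample count, which matters for pan-cancer data where some tumor classes have far fewer samples than others. A weighted average would instead favor the larger classes.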


Source journal

Journal of Computational Science (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS)

CiteScore: 5.50

Self-citation rate: 3.00%

Publication volume: 227

Review time: 41 days

Journal description: Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory. Recent advances in experimental techniques, such as detectors, online sensor networks, and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation. This new discipline in science combines computational thinking, modern computational methods, devices, and collateral technologies to address problems far beyond the scope of traditional numerical methods. Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g., numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g., problem solving environments).