Identifying pan-cancer and cancer subtype miRNAs using interpretable convolutional neural network
Authors: Joginder Singh, Shubhra Sankar Ray, Sukriti Roy
Journal: Journal of Computational Science, volume 85, Article 102510
Publication date: 2025-02-01
DOI: 10.1016/j.jocs.2024.102510
URL: https://www.sciencedirect.com/science/article/pii/S187775032400303X
Impact factor: 3.1 (JCR Q2, Computer Science, Interdisciplinary Applications)
Citations: 0
Abstract
Background:
MiRNAs are short (~22 nt) non-coding RNAs that are considered important biomarkers in pan-cancer analysis. Pan-cancer analysis studies the commonalities and differences in genetic and cellular alterations across cancer types. A common computational challenge in handling miRNA expression data is that it is high dimensional and complex (HDC). Convolutional neural networks have performed well in this setting because they are good at finding patterns in complex data.
Methodology:
An interpretable convolutional neural network model (ICNNM) is developed for classifying miRNA-expression-based pan-cancer data. The ICNNM is a one-dimensional model. Its layers and other hyperparameters are optimized using Bayesian optimization with a multivariate tree Parzen estimator (BoMTPE). To explain the behavior of the ICNNM, an interpretability approach based on SHapley Additive exPlanations (SHAP) values is developed. It introduces an attribution score, computed from the SHAP values, for identifying relevant miRNAs. Because SHAP values rest on a game-theoretic notion of contribution, miRNAs that help predict a patient's tumor class accurately receive higher attribution scores. The model is evaluated on nine datasets: six (four general pan-cancer and two subtype) are derived from a single TCGA pan-cancer dataset, one is downloaded from TCGA as a breast cancer subtype dataset, and two rare-cancer datasets, nasopharyngeal carcinoma and bone and soft tissue sarcoma, are downloaded from GEO.
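The game-theoretic idea behind SHAP can be illustrated on a toy problem. The sketch below computes exact Shapley values by enumerating feature coalitions and then averages their absolute values per feature as one possible attribution score. It is not the paper's ICNNM or its actual scoring formula, which the abstract does not specify: `toy_model`, the zero baseline, the three-feature samples, and the mean-absolute-value aggregation are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Coalition members take their real value; the rest stay at baseline
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def attribution_scores(model, samples, baseline):
    """Mean absolute Shapley value per feature (an assumed aggregation)."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, value in enumerate(shapley_values(model, x, baseline)):
            totals[i] += abs(value)
    return [t / len(samples) for t in totals]


def toy_model(x):
    """Hypothetical stand-in for a trained classifier's score for one tumor class."""
    return 2.0 * x[0] + 0.5 * x[1] * x[2]


# Made-up expression values for three hypothetical miRNAs
samples = [[1.0, 2.0, 3.0], [0.5, 1.0, 0.0], [2.0, 0.0, 1.0]]
baseline = [0.0, 0.0, 0.0]

scores = attribution_scores(toy_model, samples, baseline)
# Feature indices ordered from most to least influential
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Exact enumeration grows exponentially with the number of features, so it is only practical for toy inputs like this one. For real miRNA panels with hundreds of features, approximations such as those in the SHAP library would be needed.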
Results:
The ICNNM performs better than related techniques such as three variations of the CNN model, random forest (RF), SVM, Gboost, XGBoost, and CatBoost. Performance is evaluated in terms of F1-score, the power of the expression values to discriminate between normal and tumor classes, and the biological significance of the selected miRNAs. Biological significance is established through existing literature and online resources such as Gene Ontology and KEGG pathways, after obtaining the target genes from the miRDB database. The F1-score of the ICNNM ranges from 0.95 to 0.99 on the four general pan-cancer datasets, from 0.91 to 0.99 on the three subtype datasets, and from 0.76 to 0.90 on the two rare-cancer datasets. Many of the selected miRNAs are key biomarkers in various tumor classes according to existing investigations. Three miRNAs, miR-503, miR-202, and miR-135a, can be considered novel predictions for the cancer classes prostate and rectum, mesothelioma, and testicular germ cells, respectively, because their target genes, obtained using the miRDB database, are involved in related cancer pathways.
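For readers unfamiliar with the metric, the following sketch shows how a multi-class F1-score can be computed from true and predicted tumor classes. The abstract does not say whether the reported F1-scores are macro-averaged, weighted, or micro-averaged, so the unweighted macro average used here is an assumption, and the labels and predictions are made up for illustration.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class label to its F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up labels for six hypothetical patients across three tumor classes
y_true = ["BRCA", "BRCA", "LUAD", "LUAD", "PRAD", "PRAD"]
y_pred = ["BRCA", "LUAD", "LUAD", "LUAD", "PRAD", "PRAD"]

print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Macro averaging gives every tumor class equal weight regardless of its sample count, which matters when rare classes are present. A weighted average would instead favor the larger classes.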
About the journal:
Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory.
Recent advances in experimental techniques, such as detectors, on-line sensor networks, and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation.
This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods.
Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g. numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g. problem solving environments).