Pan-cancer gene set discovery via scRNA-seq for optimal deep learning based downstream tasks
Jong Hyun Kim, Jongseong Jang
arXiv - QuanBio - Genomics, 2024-08-13
https://doi.org/arxiv-2408.07233
Abstract
The application of machine learning to transcriptomics data has led to
significant advances in cancer research. However, the high dimensionality and
complexity of RNA sequencing (RNA-seq) data pose significant challenges in
pan-cancer studies. This study hypothesizes that gene sets derived from
single-cell RNA sequencing (scRNA-seq) data will outperform those selected
using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data
from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene
co-expression network analysis (hdWGCNA) was performed to identify relevant
gene sets, which were further refined using XGBoost for feature selection.
These gene sets were applied to downstream tasks on TCGA pan-cancer RNA-seq
data and compared with six reference gene sets and oncogenes from OncoKB,
using deep learning models, including multilayer perceptrons (MLPs)
and graph neural networks (GNNs). The XGBoost-refined hdWGCNA gene set
demonstrated higher performance in most tasks, including tumor mutation burden
assessment, microsatellite instability classification, mutation prediction,
cancer subtyping, and grading. In particular, genes such as DPM1, BAD, and
FKBP4 emerged as important pan-cancer biomarkers, with DPM1 consistently
significant across tasks. This study presents an approach to feature
selection in cancer genomics that integrates scRNA-seq data with hdWGCNA
and XGBoost-based refinement, offering a promising avenue for improving
predictive accuracy in cancer research.
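
The two-stage selection described in the abstract (co-expression gene sets first, then importance-based refinement) can be illustrated with a toy sketch. This is not hdWGCNA or XGBoost, and it is not the authors' pipeline. It assumes the classic unsigned WGCNA-style adjacency, |r|^beta on Pearson correlation, to group genes, and it swaps in a much simpler stand-in for XGBoost feature importance: ranking genes by absolute correlation with a label. The gene names G1 to G4, the expression values, the labels, beta, and the cutoff are all hypothetical.

```python
from math import sqrt

# Hypothetical toy data: gene -> expression across 6 samples (not from the paper).
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "G2": [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],  # closely tracks G1
    "G3": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],  # largely unrelated to G1
    "G4": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # mirror image of G1
}
labels = [0, 0, 0, 1, 1, 1]  # hypothetical binary phenotype per sample


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def adjacency(x, y, beta=6):
    """Unsigned soft-threshold adjacency: |r| ** beta (beta is an assumed value)."""
    return abs(pearson(x, y)) ** beta


def module_of(seed, expr, beta=6, cutoff=0.5):
    """Genes whose adjacency to the seed gene meets the (assumed) cutoff."""
    return sorted(
        g for g in expr
        if g == seed or adjacency(expr[seed], expr[g], beta) >= cutoff
    )


def refine(genes, expr, labels, top_k):
    """Stand-in for importance ranking: keep the top_k genes by |r| with the label."""
    score = {g: abs(pearson(expr[g], labels)) for g in genes}
    return sorted(genes, key=lambda g: score[g], reverse=True)[:top_k]


module = module_of("G1", expr)
selected = refine(module, expr, labels, top_k=1)
```

Because the adjacency is unsigned, the anti-correlated gene G4 lands in the same group as G1 and G2, while G3 is excluded. A real analysis would replace the correlation ranking with importances from a trained model and build the network across all genes rather than from a single seed.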