Subclassification of lung adenocarcinoma through comprehensive multi-omics data to benefit survival outcomes

IF 2.6, Zone 4 (Biology), JCR Q2 (BIOLOGY)

Computational Biology and Chemistry Pub Date : 2024-07-14 DOI:10.1016/j.compbiolchem.2024.108150


Citations: 0

Abstract

Objectives

Lung adenocarcinoma (LUAD) is the most common subtype of non-small cell lung cancer. Understanding the molecular mechanisms underlying tumor progression is of great clinical significance. This study aims to identify novel molecular markers associated with LUAD subtypes, with the goal of improving the precision of LUAD subtype classification. It also evaluates the resulting classification from the perspective of patient survival analysis.

Materials and methods

We propose a comprehensive and robust feature-selection approach for LUAD classification. The method integrates multi-omics data from The Cancer Genome Atlas (TCGA) and combines three algorithms: max-relevance and min-redundancy (mRMR), least absolute shrinkage and selection operator (LASSO), and Boruta. The selected features were then used to train six machine-learning classifiers: logistic regression (LR), random forest, support vector machine (SVM), naive Bayes, k-nearest neighbor, and XGBoost.
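As an illustration of the max-relevance, min-redundancy idea only, the following is a minimal Python sketch and not the authors' pipeline. It assumes a simplified variant that uses absolute Pearson correlation in place of mutual information, and the feature names (`gene_a`, `gene_a_dup`, `gene_b`, `noise`) and toy values are hypothetical. How the study combines mRMR with LASSO and Boruta is not specified in the abstract and is not modeled here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(columns, labels, k):
    """Greedy mRMR-style selection (assumption: |Pearson r| replaces mutual information).

    columns: dict mapping feature name -> list of values, one per sample
    labels:  list of numeric class labels, one per sample
    k:       number of features to keep
    """
    relevance = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    selected = []
    remaining = set(columns)

    def score(name):
        # Relevance to the label minus mean redundancy with already-selected features.
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs(pearson(columns[name], columns[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 samples, two subtypes (0 and 1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 0],      # informative
    "gene_a_dup": [0, 0, 0, 0, 2, 2, 2, 0],  # rescaled copy of gene_a (fully redundant)
    "gene_b": [0, 1, 0, 0, 1, 0, 1, 1],      # informative, weakly correlated with gene_a
    "noise": [0, 1, 1, 0, 0, 1, 1, 0],       # uncorrelated with the label
}
selected = mrmr_select(columns, labels, k=2)
print(selected)
```

In this toy setting the redundancy penalty keeps the rescaled duplicate from being chosen alongside the feature it copies, so the second pick is the independent informative feature.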

Results

The proposed approach achieved an area under the receiver operating characteristic curve (AUC) of 0.9958 with the logistic regression (LR) classifier. Notably, a composite model incorporating copy number, methylation, and RNA-sequencing data for the expression of exons, genes, and miRNA mature strands surpassed models built on single-omics data or other multi-omics combinations in both accuracy and AUC. Survival analyses showed that the support vector machine (SVM) classifier produced the best subtype classification, outperforming the classification provided by TCGA. To enhance model interpretability, SHapley Additive exPlanations (SHAP) values were used to elucidate the impact of each feature on the predictions. Gene Ontology (GO) enrichment analysis identified significant biological processes, molecular functions, and cellular components associated with LUAD subtypes.
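For readers unfamiliar with the AUC metric reported above, here is a minimal Python sketch of how it can be computed for a binary classifier. It assumes the pairwise (rank-based) definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the example scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: list of 0/1 class labels
    scores: list of classifier scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four samples.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 1.0 means every positive sample is ranked above every negative one, and 0.5 corresponds to random ranking. The quadratic loop is kept for clarity and would be replaced by a sort-based version on large datasets.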

Conclusion

In summary, our feature selection process, based on TCGA multi-omics data and combined with multiple machine learning classifiers, proficiently identifies molecular subtypes of lung adenocarcinoma and their corresponding significant genes. Our method could enhance the early detection and diagnosis of LUAD, expedite the development of targeted therapies and, ultimately, lengthen patient survival.




Source journal

Computational Biology and Chemistry (Biology; Computer Science: Interdisciplinary Applications)

CiteScore: 6.10
Self-citation rate: 3.20%
Articles published: 142
Review time: 24 days

Journal description: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, complementary techniques such as molecular dynamics simulations with detailed free energy calculations should be used to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.