Multi-ensemble machine learning framework for omics data integration: A case study using breast cancer samples

Q1 Medicine

Informatics in Medicine Unlocked, Pub Date: 2024-01-01, DOI: 10.1016/j.imu.2024.101507

Kunal Tembhare, Tina Sharma, Sunitha M. Kasibhatla, Archana Achalere, Rajendra Joshi

Volume 47, Article 101507. Full text: https://www.sciencedirect.com/science/article/pii/S2352914824000637

Citations: 0

Abstract

Integrating voluminous omics data helps unravel the biological complexities associated with different disease phenotypes. Machine learning (ML) approaches provide techniques for systematic multi-omics data integration. In this study, survival prediction for breast cancer patients was undertaken using omics data from 302 female patients in The Cancer Genome Atlas (TCGA). The data comprised gene expression, miRNA expression, DNA methylation, and copy number variation. Three computational multi-ensemble ML pipelines were tested using the Support Vector Machine (SVM), Random Forest (RF), and Partial Least Squares-Discriminant Analysis (PLS-DA) algorithms. To overcome the limitations of univariate feature selection criteria, the ML pipelines were built together with latent factors obtained by a multivariate dimension reduction method. This facilitated the investigation of background genetic networks and the identification of potential hub genes. Analysis of the results showed that SVM with the PLS-DA method, integrating the gene expression, DNA methylation, and miRNA expression modalities, was the best-performing model, with an area under the curve (AUC) of 89% and an accuracy of 83% for survival prediction. The study not only corroborated previously reported breast cancer-specific prognostic biomarkers but also predicted additional potential biomarkers. The work demonstrates the effective use of a multi-ensemble ML model with efficient feature selection methods as a robust protocol for cancer genotype-to-phenotype correlation.
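The two figures quoted for the best model, AUC and accuracy, measure different things: accuracy depends on a decision threshold, while AUC depends only on how the classifier ranks patients. The abstract does not describe the evaluation code, so the following is only a minimal sketch of the standard definitions. The function names, the 0.5 threshold, and the toy labels and scores are all hypothetical and are not taken from the study.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample
    (ties count as one half)."""
    if len(y_true) != len(scores):
        raise ValueError("inputs must be of equal length")
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 and 0 stand for the two survival classes.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # made-up classifier scores

# Accuracy needs hard labels; 0.5 is an assumed cut-off, not the paper's.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"AUC      = {roc_auc(y_true, scores):.3f}")
```

The pairwise loop is quadratic in the number of samples, which is harmless at the scale of a few hundred patients; a rank-based formulation would be the usual choice for larger cohorts.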


Source journal

Informatics in Medicine Unlocked Medicine-Health Informatics

CiteScore

9.50

Self-citation rate

0.00%

Articles published

282

Review time

39 days

About the journal: Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers on imaging, pathology, teledermatology, public health, ophthalmological, nursing, and translational medicine informatics. Full papers published in the journal are accessible to everyone who visits the website.