Multi-ensemble machine learning framework for omics data integration: A case study using breast cancer samples

Q1 Medicine
Kunal Tembhare, Tina Sharma, Sunitha M. Kasibhatla, Archana Achalere, Rajendra Joshi
Informatics in Medicine Unlocked, vol. 47, Article 101507. Published 2024-01-01.
DOI: 10.1016/j.imu.2024.101507
Full text: https://www.sciencedirect.com/science/article/pii/S2352914824000637
Citations: 0

Abstract

Integrating large volumes of omics data helps unravel the biological complexity associated with different disease phenotypes. Machine learning (ML) approaches provide techniques for systematic multi-omics data integration. In this study, survival of breast cancer patients was predicted using omics data from 302 female patients in The Cancer Genome Atlas (TCGA). The data comprised gene expression, miRNA expression, DNA methylation and copy number variation. Three computational multi-ensemble ML pipelines were tested using the Support Vector Machine (SVM), Random Forest (RF) and Partial Least Squares-Discriminant Analysis (PLS-DA) algorithms. To overcome the limitations of univariate feature selection criteria, the ML pipelines were built on latent factors obtained by a multivariate dimension reduction method. This facilitated the investigation of background genetic networks and the identification of potential hub genes. The best-performing model was SVM combined with PLS-DA, integrating the gene expression, DNA methylation and miRNA expression modalities, with an Area Under the Curve (AUC) of 89% and an accuracy of 83% for survival prediction. The study corroborated previously reported breast cancer-specific prognostic biomarkers and also predicted additional potential biomarkers. The work demonstrates that a multi-ensemble ML model with efficient feature selection can serve as a robust protocol for correlating cancer genotype with phenotype.

Source journal
Informatics in Medicine Unlocked (Medicine, Health Informatics)
CiteScore: 9.50
Self-citation rate: 0.00%
Articles published: 282
Review time: 39 days
Journal description: Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.