Robust and consistent biomarker candidates identification by a machine learning approach applied to pancreatic ductal adenocarcinoma metastasis.

Impact factor 3.3; Zone 3, Medicine; JCR Q2, Medical Informatics
Tanakamol Mahawan, Teifion Luckett, Ainhoa Mielgo Iza, Natapol Pornputtapong, Eva Caamaño Gutiérrez
BMC Medical Informatics and Decision Making. Journal article, published 2024-06-20. DOI: 10.1186/s12911-024-02578-0 (https://doi.org/10.1186/s12911-024-02578-0). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11191155/pdf/

Abstract

Background: Machine learning (ML) plays a crucial role in biomedical research, but it still faces limitations in data integration and reproducibility, and robust methods are needed to address them. Pancreatic ductal adenocarcinoma (PDAC), a highly aggressive cancer with low early detection and survival rates, is used as a case study. PDAC lacks reliable diagnostic biomarkers, particularly for metastasis, and this remains an unmet need. In this study, we propose an ML-based approach for discovering disease biomarkers, apply it to identify a composite biomarker candidate for PDAC metastasis, and demonstrate the advantages of harnessing existing data resources.

Methods: We used primary tumour RNAseq data from five public repositories, pooling samples to maximise statistical power and integrating the data by correcting for technical variance. Data were split into training and validation sets. The training set underwent variable selection via a 10-fold cross-validation process that combined three algorithms in 100 models per fold. Genes found in at least 80% of models and five folds were considered robust and were used to build a consensus multivariate model. A random forest model was constructed using the selected genes in the training set and tested on the validation set. We also assessed the goodness of prediction by recalibrating a model using only the validation data. The biological context and relevance of the signals were explored through enrichment and pathway analyses using QIAGEN Ingenuity Pathway Analysis and GeneMANIA.
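The consensus rule in the Methods can be illustrated with a minimal sketch. It is not the authors' code: it assumes that "at least 80% of models and five folds" means a gene must be selected by at least 80% of the models within a fold, and that this must hold in at least five of the ten folds. The function name `consensus_genes`, the helper `toy_selections`, and the gene labels are hypothetical, and the toy data use 10 models per fold instead of 100 to keep the example small.

```python
from collections import Counter


def consensus_genes(selections, model_frac=0.8, min_folds=5):
    """Return genes that pass the assumed consensus rule.

    selections: one entry per fold; each entry is a list with one set of
    selected gene names per fitted model (hypothetical input layout).
    """
    fold_hits = Counter()
    for fold_models in selections:
        if not fold_models:
            continue
        # Count in how many models of this fold each gene was selected.
        counts = Counter(g for selected in fold_models for g in set(selected))
        threshold = model_frac * len(fold_models)
        for gene, n in counts.items():
            if n >= threshold:
                fold_hits[gene] += 1
    # Keep genes that passed the per-fold threshold in enough folds.
    return sorted(g for g, k in fold_hits.items() if k >= min_folds)


def toy_selections():
    """Build made-up selections: 10 folds with 10 models each."""
    folds = []
    for f in range(10):
        models = []
        for m in range(10):
            genes = {"GENE_A"}            # selected everywhere
            if f < 5 and m < 9:
                genes.add("GENE_B")       # 90% of models, in 5 folds
            if m < 7:
                genes.add("GENE_C")       # only 70% of models per fold
            if f < 4:
                genes.add("GENE_D")       # all models, but only 4 folds
            models.append(genes)
        folds.append(models)
    return folds


if __name__ == "__main__":
    print(consensus_genes(toy_selections()))
```

With the toy data, only `GENE_A` and `GENE_B` pass both thresholds; `GENE_C` fails the per-fold model fraction and `GENE_D` fails the fold count. In a real pipeline the selected genes would then feed the multivariate model, which this sketch does not cover.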

Results: We developed a pipeline that detects robust signatures from which composite biomarkers can be built. We tested the pipeline in PDAC using transcriptomics data from different sources and propose a composite biomarker candidate composed of fifteen consistently selected genes that showed promising predictive capability. Biological contextualisation revealed links between these genes and cancer progression and metastasis, underscoring their potential relevance. All code is available on GitHub.

Conclusion: This study establishes a robust framework for identifying composite biomarkers across various disease contexts. We demonstrate its potential by proposing a plausible composite biomarker candidate for PDAC metastasis. By reusing data from public repositories, we highlight the sustainability of our research and the wider applicability of our pipeline. The preliminary findings point to a promising path for validation and application.

Source journal metrics: CiteScore 7.20; self-citation rate 5.70%; articles published 297; review time 1 month.
About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles on the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.