Knowledge-slanted random forest method for high-dimensional data and small sample size with a feature selection application for gene expression data

IF 6.1 | Zone 3 (Biology) | Q1, Mathematical & Computational Biology

Biodata Mining Pub Date : 2024-09-10 DOI:10.1186/s13040-024-00388-8

Erika Cantor, Sandra Guauque-Olarte, Roberto León, Steren Chabert, Rodrigo Salas


Abstract

The use of prior knowledge in the machine learning framework has been considered a potential tool to handle the curse of dimensionality in genetic and genomics data. Although random forest (RF) is a flexible non-parametric approach with several advantages, it can provide poor accuracy in high-dimensional settings, particularly when sample sizes are small. We propose a knowledge-slanted RF that integrates biological networks as prior knowledge into the model to improve its performance and explainability, and we illustrate its use for selecting and identifying relevant genes. Knowledge-slanted RF combines two stages. First, prior knowledge represented by graphs is translated into gene-level relevance by running a random walk with restart algorithm, which determines the relevance of each gene from its connections and location in a protein-protein interaction network. Then, each relevance is used to modify the probability of drawing a gene as a candidate split feature in the conventional RF. Experiments on simulated datasets with very small sample sizes $(n \le 30)$, comparing knowledge-slanted RF against conventional RF and logistic lasso regression, suggest improved precision in outcome prediction relative to the other methods. The knowledge-slanted RF was completed by introducing a modified version of the Boruta feature selection algorithm. Finally, knowledge-slanted RF identified more biologically relevant genes, offering a higher level of explainability for users than conventional RF. These findings were corroborated in one real case study that identified genes relevant to calcific aortic valve stenosis.
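
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the seed genes used as restart points, the restart probability of 0.3, the toy five-gene network, and the rule that the selection probability is proportional to relevance are all assumptions made for illustration; the abstract specifies none of them.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-8, max_iter=1000):
    """Hypothetical helper: steady-state visiting probability of each node.

    adjacency: symmetric (n x n) matrix of a gene network.
    seeds: indices of the genes the walker restarts from (an assumption).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # isolated nodes: avoid division by zero
    transition = adjacency / col_sums      # column-normalised transition matrix
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)     # restart vector, uniform over the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:  # L1 change below tolerance: converged
            return p_next
        p = p_next
    return p


def draw_candidate_features(relevance, mtry, rng, eps=1e-12):
    """Hypothetical helper: sample mtry distinct split candidates, weighted by relevance.

    eps keeps every gene selectable even if its relevance is exactly zero.
    """
    probs = np.asarray(relevance, dtype=float) + eps
    probs /= probs.sum()
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Toy network (assumed): five genes connected in a chain 0-1-2-3-4, gene 0 is the seed.
adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
relevance = random_walk_with_restart(adjacency, seeds=[0])
rng = np.random.default_rng(0)
candidates = draw_candidate_features(relevance, mtry=2, rng=rng)
print("relevance:", np.round(relevance, 4))
print("candidate split features:", candidates)
```

In this toy network the relevance decreases with distance from the seed gene, so genes close to the seed are drawn as split candidates more often than under the uniform sampling of a conventional RF. In a full forest, a draw of this kind would replace the uniform draw of candidate features at each node.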


Source journal: Biodata Mining (Mathematical & Computational Biology)

CiteScore: 7.90

Self-citation rate: 0.00%

Review time: 23 weeks

Journal description: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:

- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.