Assessing Random Forest self-reproducibility for optimal short biomarker signature discovery.

Impact factor: 6.8; Zone 2 (Biology); JCR Q1 (Biochemical Research Methods)

Briefings in Bioinformatics; Published: 2025-07-02; DOI: 10.1093/bib/bbaf318

Ahmed Debit, Christophe Poulet, Claire Josse, Guy Jerusalem, Chloe-Agathe Azencott, Vincent Bours, Kristel Van Steen

Volume 26, issue 4. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12245662/pdf/

Citations: 0

Abstract

Biomarker signature discovery remains the main path to developing clinical diagnostic tools when biological knowledge of the pathology is weak. The shortest signatures are often preferred because they reduce the cost of the diagnostic. The ability to find the best and shortest signature relies on the robustness of the models that can be built on such a set of molecules. The classification algorithm is often selected based on the average Area Under the Curve (AUC) performance of its models. However, there is no guarantee that an algorithm with a large AUC distribution will maintain stable performance when facing data. Here, we propose two AUC-derived hyper-stability scores, the Hyper-stability Resampling Sensitive (HRS) and the Hyper-stability Signature Sensitive (HSS), as metrics complementary to the average AUC that should bring confidence to the choice of the best classification algorithm. To emphasize the importance of these scores, we compared 15 different Random Forest implementations. Our findings show that the Random Forest implementation should be chosen according to the data at hand and the classification question being evaluated. No Random Forest implementation can be used universally for every classification task and every dataset. Each should be tested for its average AUC performance and AUC-derived stability prior to analysis.
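The abstract's point that an average AUC says little about stability can be illustrated with a minimal sketch. The code below is not the paper's HRS or HSS score (the abstract does not define them); it only computes the AUC of a scored sample and its spread over bootstrap resamples, using the Python standard library. The function names (`auc`, `resampled_aucs`, `summarize`) and the toy labels and scores are hypothetical and chosen purely for illustration.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def resampled_aucs(labels, scores, n_resamples=200, seed=0):
    """AUC values computed on bootstrap resamples of the (label, score) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(labels, scores))
    aucs = []
    while len(aucs) < n_resamples:
        sample = [rng.choice(pairs) for _ in pairs]
        ys = [y for y, _ in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples that contain a single class
            aucs.append(auc(ys, [s for _, s in sample]))
    return aucs


def summarize(aucs):
    """Mean and spread of an AUC distribution (illustrative only, not HRS/HSS)."""
    return {
        "mean": statistics.mean(aucs),
        "sd": statistics.stdev(aucs),
        "min": min(aucs),
        "max": max(aucs),
    }


# Toy data (hypothetical): 4 negatives, 4 positives, classifier scores in [0, 1].
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90]

point_auc = auc(labels, scores)  # 13 of 16 pairs ranked correctly = 0.8125
boot = resampled_aucs(labels, scores)
print(point_auc, summarize(boot))
```

Two classifiers with the same mean AUC can differ widely in the `sd`, `min`, and `max` reported here, which is the kind of gap a stability score is meant to expose. In practice the scores would come from fitted models (for example, each Random Forest implementation under comparison) rather than from a hand-written list.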


Source journal: Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20
Self-citation rate: 13.70%
Articles published: 549
Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.