Nonparametric IPSS: fast, flexible feature selection with false discovery control.

Omar Melikechi, David B Dunson, Jeffrey W Miller
Bioinformatics (Oxford, England), published 2025-05-06. DOI: 10.1093/bioinformatics/btaf299

Abstract

Motivation: Feature selection is a critical task in machine learning and statistics. However, existing feature selection methods either (i) rely on parametric methods such as linear or generalized linear models, (ii) lack theoretical false discovery control, or (iii) identify few true positives.

Results: We introduce a general feature selection method with finite-sample false discovery control based on applying integrated path stability selection (IPSS) to arbitrary feature importance scores. The method is nonparametric whenever the importance scores are nonparametric, and it estimates q-values, which are better suited to high-dimensional data than P-values. We focus on two special cases using importance scores from gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive nonlinear simulations with RNA sequencing data show that both methods accurately control the false discovery rate and detect more true positives than existing methods. Both methods are also efficient, running in under 20 s when there are 500 samples and 5000 features. We apply IPSSGB and IPSSRF to detect microRNAs and genes related to cancer, finding that they yield better predictions with fewer features than existing approaches.
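The stability-selection idea underlying IPSS can be illustrated with a short NumPy sketch: compute a feature importance score on many random subsamples of the data and record how often each feature ranks near the top. This is a deliberately simplified illustration on hypothetical toy data, not the IPSS algorithm itself (IPSS integrates selection probabilities along a regularization path and converts them into q-values with finite-sample false discovery control), and the rank-correlation score below is a stand-in for the gradient boosting (IPSSGB) and random forest (IPSSRF) importances the paper actually uses.

```python
import numpy as np

def stability_frequencies(X, y, score_fn, n_subsamples=50, top_k=5, seed=0):
    """Fraction of random half-subsamples in which each feature's
    importance score ranks among the top_k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        scores = score_fn(X[idx], y[idx])
        counts[np.argsort(scores)[-top_k:]] += 1
    return counts / n_subsamples

def abs_rank_corr(X, y):
    """Absolute Spearman-style rank correlation of each feature with y:
    a simple nonparametric importance score (a stand-in for the tree
    ensemble importances used in the paper)."""
    yr = np.argsort(np.argsort(y)).astype(float)
    Xr = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    yc = yr - yr.mean()
    Xc = Xr - Xr.mean(axis=0)
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

# Toy data: 200 samples, 50 features; y depends nonlinearly on features 0-2.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = np.tanh(X[:, 0]) + np.tanh(X[:, 1]) + X[:, 2] + 0.1 * rng.normal(size=200)

freq = stability_frequencies(X, y, abs_rank_corr)
stable = np.flatnonzero(freq >= 0.8)  # features selected in >= 80% of subsamples
```

In practice one would use the released `ipss` package rather than a hand-rolled sketch; the point here is only that aggregating an arbitrary importance score over subsamples yields stable, model-agnostic selection frequencies, which IPSS then turns into error-controlled selections.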

Availability and implementation: All code and data used in this work are available on GitHub (https://github.com/omelikechi/ipss_bioinformatics) and permanently archived on Zenodo (https://doi.org/10.5281/zenodo.15335289). A Python package for implementing IPSS is available on GitHub (https://github.com/omelikechi/ipss) and PyPI (https://pypi.org/project/ipss/). An R implementation of IPSS is also available on GitHub (https://github.com/omelikechi/ipssR).

Supplementary information: Supplementary data are available at Bioinformatics online.
