RCE-IFE: recursive cluster elimination with intra-cluster feature elimination.

IF 3.5; Zone 4 (Computer Science); Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

PeerJ Computer Science. Pub Date: 2025-02-07. eCollection Date: 2025-01-01. DOI: 10.7717/peerj-cs.2528

Cihan Kuzudisli, Burcu Bakir-Gungor, Bahjat Qaqish, Malik Yousef

PeerJ Computer Science, volume 11, article e2528. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11888879/pdf/

Citations: 0

Abstract

The computational and interpretational difficulties caused by the ever-increasing dimensionality of biological data generated by new technologies pose a significant challenge. Feature selection (FS) methods aim to reduce the dimension, and feature grouping has emerged as a foundation for FS techniques that seek to detect strong correlations among features and identify irrelevant features. In this work, we propose the Recursive Cluster Elimination with Intra-Cluster Feature Elimination (RCE-IFE) method, which utilizes feature grouping and iterates grouping and elimination steps in a supervised context. We assess the dimensionality reduction and discriminatory capabilities of RCE-IFE on various high-dimensional datasets from different biological domains. For a set of gene expression, microRNA (miRNA) expression, and methylation datasets, the performance of RCE-IFE is compared with that of RCE-IFE-SVM (the SVM-adapted version of RCE-IFE) and SVM-RCE. On average, RCE-IFE attains an area under the curve (AUC) of 0.85 on the tested expression datasets with the fewest features and the shortest running time, while RCE-IFE-SVM and SVM-RCE achieve similar AUCs of 0.84 and 0.83, respectively. RCE-IFE and SVM-RCE yield AUCs of 0.79 and 0.68, respectively, when averaged over seven different metagenomics datasets, with RCE-IFE significantly reducing the size of the feature subsets. Furthermore, RCE-IFE surpasses several state-of-the-art FS methods, such as Minimum Redundancy Maximum Relevance (MRMR), Fast Correlation-Based Filter (FCBF), Information Gain (IG), Conditional Mutual Information Maximization (CMIM), SelectKBest (SKB), and eXtreme Gradient Boosting (XGBoost), obtaining an average AUC of 0.76 on five gene expression datasets. Compared with a similar tool, Multi-stage, RCE-IFE gives a similar average accuracy rate of 89.27% using fewer features on four cancer-related datasets. RCE-IFE is also shown to be comparable with other biological domain knowledge-based Grouping-Scoring-Modeling (G-S-M) tools, including mirGediNET, 3Mint, and miRcorrNet. Additionally, the biological relevance of the features selected by RCE-IFE is evaluated. The proposed method also exhibits high consistency in terms of the selected features across multiple runs. Our experimental findings imply that RCE-IFE provides robust classifier performance and significantly reduces feature size while maintaining feature relevance and consistency.
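The abstract describes the method only at a high level: features are grouped, groups are scored in a supervised way, weak groups are eliminated, and features are also eliminated inside the surviving groups, with these steps iterated. The sketch below is one possible reading of that loop, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the tiny k-means used for grouping, the nearest-centroid classifier and single holdout split used for scoring, the mean-separation score used inside clusters, the rule of dropping one cluster and one feature per cluster per round, and all function and parameter names.

```python
import numpy as np

def kmeans_labels(points, k, rng, n_iter=20):
    """Tiny k-means (assumed grouping step); returns one label per row of `points`."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def holdout_accuracy(X, y, train, test):
    """Held-out accuracy of a nearest-class-centroid classifier (assumed scorer)."""
    classes = np.unique(y[train])
    centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in classes])
    dists = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dists.argmin(axis=1)] == y[test]).mean())

def separation(X, y):
    """Per-feature |difference of class means| / summed std; assumes labels 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)

def rce_ife_sketch(X, y, n_clusters=6, min_clusters=2, seed=0):
    """Illustrative grouping / cluster-elimination / intra-cluster-elimination loop."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    train, test = idx[: len(y) // 2], idx[len(y) // 2 :]
    features = np.arange(X.shape[1])
    k = min(n_clusters, len(features))
    while k > min_clusters:
        # Grouping: cluster the surviving features by their standardized profiles.
        Z = X[train][:, features]
        Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
        labels = kmeans_labels(Z.T, k, rng)
        clusters = [features[labels == j] for j in range(k) if np.any(labels == j)]
        # Scoring: rate each cluster by how well its features classify held-out samples.
        scores = np.array([holdout_accuracy(X[:, c], y, train, test) for c in clusters])
        # Cluster elimination: keep only the k - 1 best-scoring clusters.
        k -= 1
        best = np.argsort(scores)[::-1][:k]
        # Intra-cluster feature elimination: drop the weakest feature of each survivor.
        kept = []
        for i in best:
            c = clusters[i]
            if len(c) > 1:
                c = c[np.argsort(separation(X[train][:, c], y[train]))[1:]]
            kept.append(c)
        features = np.concatenate(kept)
        k = min(k, len(features))
    return np.sort(features)

# Synthetic demo: 30 features, of which only the first five carry class signal.
demo_rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 100)
X_demo = demo_rng.normal(size=(200, 30))
X_demo[:, :5] += 3.0 * y_demo[:, None]
selected = rce_ife_sketch(X_demo, y_demo)
print("selected feature indices:", selected)
```

A real evaluation would replace the single holdout split with repeated cross-validation and report AUC, as the abstract does; the single split here only keeps the sketch short.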




Source journal

PeerJ Computer Science (Computer Science: General Computer Science)

CiteScore: 6.10

Self-citation rate: 5.30%

Articles published: 332

Review time: 10 weeks

About the journal: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.