Feature Identification Using Hypotheses of Relevance and a 2D-Cascade of SEQENS Ensembles

IF 2.3 · Zone 4, Computer Science · Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems Pub Date : 2025-02-09 DOI:10.1111/exsy.70002

Joaquim Arlandis, Rafael Llobet, J. Ramón Navarro Cerdán, Laura Arnal, François Signol, Juan-Carlos Perez-Cortes

Expert Systems, Volume 42, Issue 3 · https://onlinelibrary.wiley.com/doi/10.1111/exsy.70002

Citations: 0

Abstract

SEQENS is an ensemble method for feature identification that has demonstrated strong performance in identifying relevant genes in high-dimensional spaces across different synthetic tasks. In this paper, we first introduce the differences between the concepts of feature importance, feature selection (FS) and feature identification. We then present a framework based on SEQENS with the following contributions: (1) computing the hypergeometric p-value of the features in a SEQENS output ranking, so that a threshold can be established between relevant and non-relevant features; (2) extending SEQENS by using preselected features as hypotheses of relevance in the sequential FS, which may help to attract other features that exhibit weak correlation with the target on their own but gain relevance when combined with the preselected ones; and (3) designing an automated process based on a 2D-cascade of SEQENS ensembles to obtain a purged feature set (PFS), that is, a set with as many relevant features and as few non-relevant features as possible. The resulting framework, named pc–SEQENS, integrates these techniques so that the PFS is used as a hypothesis of relevance in a SEQENS ensemble. Performance is analysed in a gene expression identification task using the E-MTAB-3732 public database and synthetic targets. pc–SEQENS is compared to other state-of-the-art methods, including SEQENS itself, to check the effect of using hypotheses of relevance. On average, the proposed framework identifies the relevant genes better, especially at unfavourable sample-to-dimension ratios, and exhibits greater stability.
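The abstract does not say how the hypergeometric test in contribution (1) is parameterised, so the following is only a minimal sketch under an assumed null model, not the authors' procedure. It assumes an ensemble of `n_runs` selectors that each keep `k` of `d` features, and asks how surprising it is that one feature was kept `c` times if selections were random. All names (`hypergeom_sf`, `selection_pvalues`, `g_strong`, `g_weak`) and the 0.05 cut-off are hypothetical.

```python
from math import comb


def hypergeom_sf(x, M, K, n):
    """Return P(X >= x) for X ~ Hypergeometric(population M, K successes, n draws)."""
    total = comb(M, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    hits = sum(comb(K, i) * comb(M - K, n - i)
               for i in range(max(x, 0), min(K, n) + 1))
    return hits / total


def selection_pvalues(counts, n_runs, k, d):
    """P-value of each feature's selection count under random selection.

    Assumed null model (not taken from the paper): the n_runs * k selected
    (feature, run) pairs are spread at random over all n_runs * d pairs,
    and a given feature owns n_runs of those pairs.
    """
    M = n_runs * d  # all (feature, run) pairs
    K = n_runs * k  # pairs that were actually selected
    return {name: hypergeom_sf(c, M, K, n_runs) for name, c in counts.items()}


# Hypothetical selection counts for two genes across 50 runs.
counts = {"g_strong": 20, "g_weak": 1}
pvals = selection_pvalues(counts, n_runs=50, k=10, d=1000)

alpha = 0.05  # assumed significance threshold
relevant = [name for name, p in pvals.items() if p < alpha]
print(pvals)
print(relevant)
```

With many features tested at once, a multiple-testing correction on the p-values would normally be applied before thresholding; the abstract does not indicate whether the paper does so.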


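Source journal

Expert Systems (Engineering and Technology; Computer Science: Theory and Methods)

CiteScore: 7.40

Self-citation rate: 6.10%

Articles published: 266

Review time: 24 months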

About the journal: Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems (including expert systems) based on them. Detailed scientific evaluation is an essential part of any paper. Alongside traditional application areas such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, the journal targets new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.