Genetic Programming-based Feature Selection for Symbolic Regression on Incomplete Data.

IF 3.4 · Zone 2, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Evolutionary Computation · Pub Date: 2024-11-21 · DOI: 10.1162/evco_a_00362

Baligh Al-Helali, Qi Chen, Bing Xue, Mengjie Zhang

Pages: 1-27 · Publication type: Journal Article

Citations: 0

Abstract

High-dimensionality is one of the serious real-world data challenges in symbolic regression and it is more challenging if the data are incomplete. Genetic programming has been successfully utilised for high-dimensional tasks due to its natural feature selection ability, but it is not directly applicable to incomplete data. Commonly, it needs to impute the missing values first and then perform genetic programming on the imputed complete data. However, in the case of having many irrelevant features being incomplete, intuitively, it is not necessary to perform costly imputations on such features. For this purpose, this work proposes a genetic programming-based approach to select features directly from incomplete high-dimensional data to improve symbolic regression performance. We extend the concept of identity/neutral elements from mathematics into the function operators of genetic programming, thus they can handle the missing values in incomplete data. Experiments have been conducted on a number of data sets considering different missingness ratios in high-dimensional symbolic regression tasks. The results show that the proposed method leads to better symbolic regression results when compared with state-of-the-art methods that can select features directly from incomplete data. Further results show that our approach not only leads to better symbolic regression accuracy but also selects a smaller number of relevant features, and consequently improves both the effectiveness and the efficiency of the learning process.
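The abstract does not give the exact operator definitions, so the following is only a minimal sketch of one plausible reading of the identity/neutral-element idea: when an operand is missing, the operator substitutes its own identity element (0 for addition, 1 for multiplication), so the missing value leaves the result unchanged. The names `MISSING`, `is_missing`, and `neutral_op`, the use of NaN to encode missing values, and the choice to propagate missingness when both operands are missing are all assumptions made for illustration, not details taken from the paper.

```python
import math

# Assumption: missing values are encoded as NaN.
MISSING = float("nan")


def is_missing(x):
    """Return True if x is the NaN marker used here for a missing value."""
    return isinstance(x, float) and math.isnan(x)


def neutral_op(op, identity):
    """Wrap a binary operator so that a single missing operand is replaced
    by the operator's identity element and therefore has no effect."""
    def wrapped(a, b):
        # Assumption: if both operands are missing, the result stays missing.
        if is_missing(a) and is_missing(b):
            return MISSING
        if is_missing(a):
            a = identity
        if is_missing(b):
            b = identity
        return op(a, b)
    return wrapped


# Hypothetical function set for a GP tree: 0 is neutral for +, 1 for *.
add = neutral_op(lambda a, b: a + b, 0.0)
mul = neutral_op(lambda a, b: a * b, 1.0)


def model(row):
    """A hand-written stand-in for an evolved tree: (x0 * x1) + x2."""
    return add(mul(row[0], row[1]), row[2])


# Feature x1 is missing in this instance, yet the tree still evaluates.
result = model([2.0, MISSING, 3.0])
```

With operators of this kind, a tree can be evaluated on an incomplete instance without imputing the missing feature first, which is the property the abstract relies on for selecting features directly from incomplete data.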




Source journal

Evolutionary Computation (Engineering and Technology - Computer Science: Theory and Methods)

CiteScore

6.40

Self-citation rate

1.50%

Articles published

Review time

3 months

About the journal: Evolutionary Computation is a leading journal in its field. It provides an international forum for facilitating and enhancing the exchange of information among researchers involved in both the theoretical and practical aspects of computational systems drawing their inspiration from nature, with particular emphasis on evolutionary models of computation such as genetic algorithms, evolutionary strategies, classifier systems, evolutionary programming, and genetic programming. It welcomes articles from related fields such as swarm intelligence (e.g. Ant Colony Optimization and Particle Swarm Optimization), and other nature-inspired computation paradigms (e.g. Artificial Immune Systems). As well as publishing articles describing theoretical and/or experimental work, the journal also welcomes application-focused papers describing breakthrough results in an application domain or methodological papers where the specificities of the real-world problem led to significant algorithmic improvements that could possibly be generalized to other areas.