Single and Ensemble Based Filters in Environmental Data
Authors: Yousra Cherif, Ali Idri
Journal: Expert Systems, vol. 42, no. 7 (published 2025-06-12)
DOI: 10.1111/exsy.70076
URL: https://onlinelibrary.wiley.com/doi/10.1111/exsy.70076
Impact factor: 2.3; JCR: Q2 (Computer Science, Artificial Intelligence)
Citation count: 0
Abstract
Researchers rely on species distribution models (SDMs) to relate species occurrence records to environmental data. These models offer insights into the ecology and evolution of the species under study. Feature selection (FS) aims to retain useful, interrelated features and remove unnecessary or redundant ones, which also makes the induced model easier to understand. Although FS plays a crucial role in SDMs, only a limited number of studies have addressed it, and those studies share several key shortcomings: they do not use multivariate techniques, do not compare univariate with multivariate filters, and do not compare univariate filter ensembles with multivariate filter ensembles. This study therefore presents a rigorous empirical evaluation that assesses and compares six univariate filter-based FS methods, each applied with two thresholds, against two multivariate techniques, using four classifiers: Extreme Gradient Boosting (XGB), Random Forest (RF), Decision Tree (DT), and Light Gradient-Boosting Machine (LGBM). Furthermore, the study proposes a novel approach to ensemble construction: univariate filter ensembles that use 40% of the features ranked by Borda Count and Reciprocal Rank, and multivariate filter ensembles that are fusion-based and intersection-based. We also compared the performance of the single univariate and multivariate techniques with that of their ensembles, and compared the best ensemble techniques across datasets. The empirical evaluation relies on 5-fold cross-validation, the Scott-Knott (SK) test, and Borda Count, with three performance metrics: accuracy, Kappa, and F1-score.
Experiments showed that Consistency-based subset selection combined with RF outperformed all other univariate and multivariate FS techniques, with an accuracy of 91.63% across all datasets. However, Fisher score with RF was the best choice when the number of selected features is also taken into account. Moreover, the univariate and multivariate ensembles generally outperformed their single counterparts. Among the ensembles, the fusion-based ensemble performed best, achieving an accuracy of 91.77% with RF across datasets. Nevertheless, when both performance and number of features are considered, the ensemble constructed using Reciprocal Rank performed better than all other FS techniques regardless of the classifier used, achieving an accuracy of 91.61% across datasets with RF.
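The ensemble constructions named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulations, so several choices here are assumptions: the Borda point scheme (best feature earns n points), alphabetical tie-breaking, reading "40% of features" as the top 40% rounded up, and treating the fusion-based ensemble as a set union. The function names and the feature names are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate rankings (best feature first) by summed Borda points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - position  # assumption: best earns n points
    # Highest total first; ties broken alphabetically (assumption).
    return sorted(scores, key=lambda f: (-scores[f], f))


def reciprocal_rank(rankings):
    """Aggregate rankings by the sum of 1/rank (rank starts at 1)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, feature in enumerate(ranking, start=1):
            scores[feature] += 1.0 / rank
    return sorted(scores, key=lambda f: (-scores[f], f))


def top_percent(ranked, percent=40):
    """Keep the first `percent` of a ranked list, rounded up, at least one."""
    k = max(1, -(-len(ranked) * percent // 100))  # integer ceiling
    return ranked[:k]


def fusion(subsets):
    """Fusion-based ensemble, assumed here to be the union of the subsets."""
    return set().union(*subsets)


def intersection(subsets):
    """Intersection-based ensemble: features chosen by every method."""
    return set.intersection(*map(set, subsets))


# Hypothetical rankings from three univariate filters over five features.
rankings = [
    ["bio1", "bio12", "elev", "slope", "ndvi"],
    ["bio12", "bio1", "ndvi", "elev", "slope"],
    ["bio1", "elev", "bio12", "ndvi", "slope"],
]
borda_selected = top_percent(borda_count(rankings))
rr_selected = top_percent(reciprocal_rank(rankings))

# Hypothetical subsets returned by two multivariate filters.
subsets = [{"bio1", "elev"}, {"bio1", "ndvi"}]
fused = fusion(subsets)
common = intersection(subsets)
```

In this toy example both aggregators keep the same two features, while fusion keeps three features and intersection keeps one, which mirrors the trade-off between accuracy and number of features that the abstract reports.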
About the journal:
Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper.
As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.