Effects of feature selection methods in estimating SO₂ concentration variations using machine learning and stacking ensemble approach
Pei-Yi Wong, Yu-Ting Zeng, Huey-Jen Su, Shih-Chun Candice Lung, Yu-Cheng Chen, Pau-Chung Chen, Ta-Chih Hsiao, Gary Adamkiewicz, Chih-Da Wu
Environmental Technology & Innovation, volume 37, Article 103996. Published 2025-02-01. DOI: 10.1016/j.eti.2024.103996. https://www.sciencedirect.com/science/article/pii/S2352186424004723
Citations: 0
Abstract
Statistical feature selection methods have been used for dimension reduction, but only a few studies have explored the impact of the selected features on machine learning models. This study investigates the effects of statistical and machine learning-based feature selection methods on spatial prediction models for estimating variations in SO₂ concentrations. We collected daily SO₂ observations from 1994 to 2018 along with predictor variables such as land-use/land-cover allocations, roads, landmarks, meteorological factors, and satellite images, for a total of 428 geographic predictors. Important features were identified using three statistical feature selection methods (SelectKBest, stepwise feature selection, and elastic net) and one machine learning-based method (random forest). The features selected by each of the four methods were fitted to machine learning algorithms, including gradient boosting, CatBoost, XGBoost, and a stacking ensemble, to establish prediction models for estimating SO₂ concentrations. SHapley Additive exPlanations (SHAP) was applied to explain the contribution of each selected feature to the model's prediction capability. The results showed that the stacking ensemble model outperformed the three single machine learning algorithms. With the stacking ensemble model, the random forest method yielded the highest prediction accuracy in the training model (R² = 0.80), followed by stepwise selection (R² = 0.75), SelectKBest (R² = 0.75), and elastic net (R² = 0.72). These results were robust after several validation tests. Our findings suggest that the random forest feature selection method is more suitable for developing machine learning models for air pollution estimation. The identified features also provide important information for urban air pollution management.
About the journal:
Environmental Technology & Innovation adopts a challenge-oriented approach to solutions by integrating natural sciences to promote a sustainable future. The journal aims to foster the creation and development of innovative products, technologies, and ideas that enhance the environment, with impacts across soil, air, water, and food in rural and urban areas.
As a platform for disseminating scientific evidence for environmental protection and sustainable development, the journal emphasizes fundamental science, methodologies, tools, techniques, and policy considerations. It emphasizes the importance of science and technology in environmental benefits, including smarter, cleaner technologies for environmental protection, more efficient resource processing methods, and the evidence supporting their effectiveness.