Effects of feature selection methods in estimating SO2 concentration variations using machine learning and stacking ensemble approach

IF 6.7 | Zone 2: Environmental Science & Ecology | JCR Q1: Biotechnology & Applied Microbiology
Pei-Yi Wong, Yu-Ting Zeng, Huey-Jen Su, Shih-Chun Candice Lung, Yu-Cheng Chen, Pau-Chung Chen, Ta-Chih Hsiao, Gary Adamkiewicz, Chih-Da Wu
Environmental Technology & Innovation, Volume 37, Article 103996. Published 2025-02-01. DOI: 10.1016/j.eti.2024.103996
https://www.sciencedirect.com/science/article/pii/S2352186424004723

Abstract

Statistical feature selection methods are widely used for dimension reduction, but few studies have examined how the selected features affect machine learning models. This study investigates the effects of statistical and machine learning-based feature selection methods on spatial prediction models for estimating variations in SO2 concentrations. We collected daily SO2 observations from 1994 to 2018 along with predictor variables such as land-use/land-cover allocations, roads, landmarks, meteorological factors, and satellite images, yielding a total of 428 geographic predictors. Important features were identified using three statistical feature selection methods (SelectKBest, stepwise feature selection, and elastic net) and one machine learning-based method (random forest). The features selected by each of the four methods were fitted to machine learning algorithms, namely gradient boosting, CatBoost, XGBoost, and a stacking ensemble, to build prediction models for estimating SO2 concentrations. SHapley Additive exPlanations (SHAP) was applied to explain the contribution of each selected feature to the model's predictions. The stacking ensemble model outperformed the three single machine learning algorithms. Within the stacking ensemble model, the random forest method yielded the highest prediction accuracy in the training model (R² = 0.80), followed by stepwise selection (R² = 0.75), SelectKBest (R² = 0.75), and elastic net (R² = 0.72). These results remained robust across several validation tests. Our findings suggest that random forest feature selection is more suitable for developing machine learning models for air pollution estimation. The identified features also provide important information for urban air pollution management.
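
The workflow described in the abstract (select features with several methods, then fit a stacking ensemble on each selected subset and compare R²) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the number of retained features `K` is hypothetical, forward sequential selection stands in for the paper's stepwise procedure, and two scikit-learn gradient boosting configurations stand in for CatBoost and XGBoost so that the example needs only scikit-learn. SHAP analysis is omitted because it requires the separate `shap` package.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    SequentialFeatureSelector,
    f_regression,
)
from sklearn.linear_model import ElasticNet, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the predictor table (NOT the study's SO2 data).
X, y = make_regression(
    n_samples=400, n_features=40, n_informative=8, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

K = 10  # hypothetical number of features to keep; the paper's value is not given

# Four selectors mirroring the four method families named in the abstract.
# threshold=-np.inf makes SelectFromModel keep exactly the top-K features.
selectors = {
    "SelectKBest": SelectKBest(score_func=f_regression, k=K),
    "stepwise (forward)": SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=K, direction="forward", cv=3
    ),
    "elastic net": SelectFromModel(
        ElasticNet(alpha=1.0, l1_ratio=0.5), max_features=K, threshold=-np.inf
    ),
    "random forest": SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=0),
        max_features=K,
        threshold=-np.inf,
    ),
}


def make_stack():
    """Stacking ensemble; the base learners are assumed stand-ins."""
    base = [
        ("gbr_shallow", GradientBoostingRegressor(max_depth=2, random_state=0)),
        ("gbr_deep", GradientBoostingRegressor(max_depth=4, random_state=0)),
    ]
    # Ridge combines the out-of-fold predictions of the base learners.
    return StackingRegressor(estimators=base, final_estimator=Ridge(), cv=3)


results = {}
for name, selector in selectors.items():
    model = make_pipeline(selector, make_stack())
    model.fit(X_train, y_train)
    results[name] = r2_score(y_test, model.predict(X_test))

for name, r2 in results.items():
    print(f"{name}: held-out R2 = {r2:.3f}")
```

Wrapping the selector and the ensemble in one pipeline means feature selection is fitted only on the training split, which avoids leaking test information into the selection step. The scores printed here reflect the synthetic data only and say nothing about the paper's reported values.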
Source journal: Environmental Technology & Innovation (Environmental Science: General Environmental Science)
CiteScore: 14.00
Self-citation rate: 4.20%
Articles published: 435
Review time: 74 days
About the journal: Environmental Technology & Innovation adopts a challenge-oriented approach to solutions by integrating natural sciences to promote a sustainable future. The journal aims to foster the creation and development of innovative products, technologies, and ideas that enhance the environment, with impacts across soil, air, water, and food in rural and urban areas. As a platform for disseminating scientific evidence for environmental protection and sustainable development, the journal emphasizes fundamental science, methodologies, tools, techniques, and policy considerations. It also stresses the role of science and technology in delivering environmental benefits, including smarter, cleaner technologies for environmental protection, more efficient resource processing methods, and the evidence supporting their effectiveness.