An integrated feature selection and machine learning framework for PM10 concentration prediction
Authors: Elham Kalantari, Hamid Gholami, Hossein Malakooti, Dimitris G. Kaskaoutis, Poorya Saneei
Journal: Atmospheric Pollution Research, 16(5), Article 102456 (impact factor 3.9; JCR Q2, Environmental Sciences)
Published: 2025-02-12
DOI: 10.1016/j.apr.2025.102456
URL: https://www.sciencedirect.com/science/article/pii/S1309104225000583
Citations: 0
Abstract
The Sistan Basin in eastern Iran is a major dust source that poses significant atmospheric, ecological, socio-economic, and health challenges. This study employed machine learning (ML) algorithms, including Random Forest (RF), K-Nearest Neighbor (KNN), Weighted K-Nearest Neighbor (WKNN), Support Vector Regression (SVR), and Least Absolute Shrinkage and Selection Operator (LASSO), to model and predict PM10 concentrations in Zabol City (2013–2022) from independent meteorological variables such as temperature, relative humidity, wind speed, and wind direction. Three families of feature selection methods were applied to identify significant predictors: Filter (Information Gain, F-Test, Correlation Coefficient), Wrapper (Recursive Feature Elimination, Sequential Forward/Backward Selection), and Embedded (LASSO, Elastic Net, Ridge Regression, RF Importance). Embedded methods provided the best balance of simplicity, accuracy, and cost-efficiency. Among the models, RF achieved the highest seasonal performance (R² = 0.75), in summer. RF's prediction R² values for PM10 remained above 0.5 in all seasons, consistently outperforming the other models. The WKNN model performed reasonably well across all seasons, ranking second among the models, while the LASSO model performed more weakly. The SVR model showed satisfactory performance in specific seasons, such as summer and autumn. All models performed better in summer than in the other seasons. Importantly, the models relied solely on readily available meteorological data, enabling accurate predictions of PM10 in this arid region of eastern Iran. The findings highlight the potential of ML techniques for developing air pollution prediction and warning systems, offering valuable support to policymakers in designing effective pollution control strategies and safeguarding public health.
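As a rough illustration of two ideas named in the abstract, the following pure-Python sketch ranks features with a filter-style correlation coefficient and then predicts with a distance-weighted KNN regressor scored by R². It is not the authors' pipeline: the data are synthetic and noise-free, the feature names and the linear PM10 formula are invented for the example, and inverse-distance weighting is only one common WKNN scheme (the abstract does not say which weighting was used).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_features(rows, target, names):
    """Filter-style selection: sort features by |r| with the target."""
    scores = [(name, abs(pearson([row[j] for row in rows], target)))
              for j, name in enumerate(names)]
    return sorted(scores, key=lambda item: item[1], reverse=True)

def wknn_predict(train_x, train_y, query, k=3, eps=1e-9):
    """KNN regression where each neighbour is weighted by 1 / distance."""
    nearest = sorted((math.dist(x, query), y)
                     for x, y in zip(train_x, train_y))[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic grid (assumption): PM10 depends on wind speed and humidity only.
names = ["wind_speed", "relative_humidity", "unrelated"]
rows = [[float(w), float(h), float(u)]
        for w in range(5) for h in (10, 20, 30, 40) for u in (0, 1)]
pm10 = [30.0 * r[0] - 1.0 * r[1] + 100.0 for r in rows]

# Keep the two features most correlated with the target.
ranked = rank_features(rows, pm10, names)
keep = [names.index(name) for name, _ in ranked[:2]]

def project(row):
    """Reduce a row to the selected feature columns."""
    return [row[j] for j in keep]

# Hold out every row with wind_speed == 2 and predict it from the rest.
train = [(project(r), y) for r, y in zip(rows, pm10) if r[0] != 2.0]
test = [(project(r), y) for r, y in zip(rows, pm10) if r[0] == 2.0]
train_x, train_y = zip(*train)
preds = [wknn_predict(train_x, train_y, q, k=4) for q, _ in test]
r2 = r2_score([y for _, y in test], preds)

print(ranked)
print(round(r2, 3))
```

Because the toy target is an exact linear function of the inputs, the score here says nothing about the reported seasonal results. With real meteorological records, features on different scales would normally be standardized before computing distances.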
About the journal:
Atmospheric Pollution Research (APR) is an international journal designed for the publication of articles on air pollution. Papers should present novel experimental results, theory and modeling of air pollution on local, regional, or global scales. Areas covered are research on inorganic, organic, and persistent organic air pollutants, air quality monitoring, air quality management, atmospheric dispersion and transport, air-surface (soil, water, and vegetation) exchange of pollutants, dry and wet deposition, indoor air quality, exposure assessment, health effects, satellite measurements, natural emissions, atmospheric chemistry, greenhouse gases, and effects on climate change.