From accurate to actionable: Interpretable PM2.5 forecasting with feature engineering and SHAP for the Liverpool–Wirral region
Author: Seyed Matin Malakouti
Journal: Environmental Challenges, volume 21, Article 101290 (JCR Q2, Environmental Science)
Published: 2025-09-06
DOI: 10.1016/j.envc.2025.101290
URL: https://www.sciencedirect.com/science/article/pii/S2667010025002094
Citation count: 0
Abstract
Fine particulate matter (PM2.5) poses a significant threat to public health worldwide, contributing to millions of premature deaths annually, according to the World Health Organization. At the regional scale, Europe and the United Kingdom continue to experience PM2.5 episodes influenced by transboundary pollution, industrial emissions, and meteorological patterns. Locally, the Liverpool–Wirral area faces specific challenges due to dense urban traffic, port activities, and mixed industrial–residential land use, which can lead to localized pollution hotspots. Despite advances in air quality modeling, there remains a gap in producing highly accurate and interpretable short-term PM2.5 forecasts tailored to local conditions using dense networks of low-cost sensors. To address this gap, machine learning models (ExtraTrees, LightGBM, and a weighted ensemble) were developed in this study to forecast daily PM2.5 concentrations from 2019 to 2024 using a rich set of engineered time-series features. These features included lagged values (1, 7, 14, 30 days), rolling averages (3, 7, 14, 30 days), day-over-day raw and percentage changes, day of the week, weekend indicator, month, and cyclical day-of-year components (sine and cosine) to capture short-term autocorrelation, medium- and long-term trends, and seasonal effects. The models were trained and validated on a Liverpool–Wirral dataset, and their performance was evaluated on early-2024 observations. To interpret feature contributions, SHAP (SHapley Additive exPlanations) values were computed for the LightGBM model, revealing that the 3-day rolling mean, day-over-day change, and 1-day lag dominated the predictive power. The ensemble model achieved the lowest test-set RMSE (0.54 μg/m³, R² = 0.990). A full-year 2025 forecast indicated modest seasonal variability, with ensemble predictions remaining stable around 6–6.2 μg/m³.
These results demonstrate that careful feature engineering, coupled with SHAP-based interpretation, can yield highly accurate and transparent PM2.5 forecasts.
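The feature set listed in the abstract can be illustrated with a short sketch. This is not the author's code: the function name `build_features`, the column names, and the synthetic input series are hypothetical. The abstract also does not say whether the rolling windows and day-over-day changes include the target day, so the sketch assumes they use only prior days, which keeps the target out of its own features.

```python
import math
from datetime import date, timedelta

LAGS = (1, 7, 14, 30)      # lag lengths in days, as listed in the abstract
WINDOWS = (3, 7, 14, 30)   # rolling-mean window lengths in days


def build_features(dates, values):
    """Build one feature row per day that has at least 30 days of history.

    Assumption: every feature for day i is computed from days before i,
    so the target value never appears in its own feature row.
    """
    rows = []
    start = max(LAGS + WINDOWS)  # first index with a full history
    for i in range(start, len(values)):
        d = dates[i]
        row = {"date": d, "target": values[i]}

        # Lagged values: concentration k days before the target day.
        for k in LAGS:
            row[f"lag_{k}"] = values[i - k]

        # Rolling means over the w days preceding the target day.
        for w in WINDOWS:
            window = values[i - w:i]
            row[f"roll_mean_{w}"] = sum(window) / w

        # Day-over-day raw and percentage change, using the two prior days.
        prev, prev2 = values[i - 1], values[i - 2]
        row["diff_1"] = prev - prev2
        row["pct_change_1"] = (prev - prev2) / prev2 if prev2 else 0.0

        # Calendar features.
        row["day_of_week"] = d.weekday()            # Monday = 0
        row["is_weekend"] = int(d.weekday() >= 5)
        row["month"] = d.month

        # Cyclical day-of-year encoding (sine and cosine).
        angle = 2 * math.pi * d.timetuple().tm_yday / 365.25
        row["doy_sin"] = math.sin(angle)
        row["doy_cos"] = math.cos(angle)

        rows.append(row)
    return rows


# Synthetic, illustrative series only (not Liverpool-Wirral measurements).
dates = [date(2019, 1, 1) + timedelta(days=n) for n in range(60)]
values = [6.0 + 0.1 * (n % 7) for n in range(60)]
rows = build_features(dates, values)
```

Each row could then be passed to a tree-based regressor. The sine and cosine pair places late December next to early January, which a raw day-of-year integer would not.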