From accurate to actionable: Interpretable PM2.5 forecasting with feature engineering and SHAP for the Liverpool–Wirral region

JCR quartile: Q2 (Environmental Science)
Seyed Matin Malakouti
Journal: Environmental Challenges, volume 21, Article 101290
Published: 2025-09-06
DOI: 10.1016/j.envc.2025.101290
URL: https://www.sciencedirect.com/science/article/pii/S2667010025002094
Citations: 0

Abstract

Fine particulate matter (PM2.5) poses a significant threat to public health worldwide, contributing to millions of premature deaths annually, according to the World Health Organization. At the regional scale, Europe and the United Kingdom continue to experience PM2.5 episodes influenced by transboundary pollution, industrial emissions, and meteorological patterns. Locally, the Liverpool–Wirral area faces specific challenges due to dense urban traffic, port activities, and mixed industrial–residential land use, which can lead to localized pollution hotspots. Despite advances in air quality modeling, there remains a gap in producing highly accurate and interpretable short-term PM2.5 forecasts tailored to local conditions using dense networks of low-cost sensors. To address this gap, machine learning models (ExtraTrees, LightGBM, and a weighted ensemble) were developed in this study to forecast daily PM2.5 concentrations from 2019 to 2024 using a rich set of engineered time-series features. These features included lagged values (1, 7, 14, 30 days), rolling averages (3, 7, 14, 30 days), day-over-day raw and percentage changes, day of the week, weekend indicator, month, and cyclical day-of-year components (sine and cosine) to capture short-term autocorrelation, medium- and long-term trends, and seasonal effects. The models were trained and validated on a Liverpool–Wirral dataset, and their performance was evaluated on early-2024 observations. To interpret feature contributions, SHAP (SHapley Additive exPlanations) values were computed for the LightGBM model, revealing that the 3-day rolling mean, day-over-day change, and 1-day lag dominated the predictive power. The ensemble model achieved the lowest test-set RMSE (0.54 μg/m³, R² = 0.990). A full-year 2025 forecast indicated modest seasonal variability, with ensemble predictions remaining stable around 6–6.2 μg/m³. These results demonstrate that careful feature engineering, coupled with SHAP-based interpretation, can yield highly accurate and transparent PM2.5 forecasts.
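
The feature families listed in the abstract can be illustrated with a short pandas sketch. This is not the author's code: the function name `build_features`, the column names, and the synthetic input series are hypothetical. The abstract also does not say whether rolling windows and change features include the day being predicted, so the sketch assumes they use values only up to the previous day.

```python
import numpy as np
import pandas as pd


def build_features(pm25: pd.Series) -> pd.DataFrame:
    """Build lag, rolling, change, calendar and cyclical features from daily PM2.5.

    Assumption (not stated in the abstract): rolling and change features use
    only values up to the previous day, so no feature sees the target day.
    """
    df = pd.DataFrame({"pm25": pm25})
    past = df["pm25"].shift(1)  # series as known at the end of the previous day

    # Lagged values: 1, 7, 14 and 30 days back.
    for lag in (1, 7, 14, 30):
        df[f"lag_{lag}"] = df["pm25"].shift(lag)

    # Rolling averages over 3, 7, 14 and 30 days, each window ending yesterday.
    for window in (3, 7, 14, 30):
        df[f"roll_mean_{window}"] = past.rolling(window).mean()

    # Day-over-day raw and percentage change (yesterday versus the day before).
    df["diff_1"] = past.diff()
    df["pct_change_1"] = past.diff() / past.shift(1)

    # Calendar features: day of week (Monday = 0), weekend flag, month.
    df["day_of_week"] = df.index.dayofweek.to_numpy()
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["month"] = df.index.month.to_numpy()

    # Cyclical day-of-year encoding, so 31 December sits next to 1 January.
    day_of_year = df.index.dayofyear.to_numpy()
    df["doy_sin"] = np.sin(2 * np.pi * day_of_year / 365.25)
    df["doy_cos"] = np.cos(2 * np.pi * day_of_year / 365.25)

    # Drop the warm-up rows where the longest lag or window is undefined.
    return df.dropna()


# Synthetic stand-in for a daily PM2.5 series; these are not the study's data.
dates = pd.date_range("2019-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
series = pd.Series(8.0 + rng.normal(0.0, 1.0, size=120), index=dates, name="pm25")
features = build_features(series)
```

Shifting the series before computing rolling means and differences is a deliberate choice in this sketch. If a 3-day window, the 1-day lag, and the day-over-day change were all allowed to include the target day, the target could be recovered from the features almost exactly.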
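
The weighted ensemble and the two reported metrics (RMSE and R²) can be sketched as follows. The abstract gives neither the ensemble weights nor which predictions are blended, so the two-model blend, the weight of 0.5, and all numbers below are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def weighted_ensemble(pred_a, pred_b, weight_a=0.5):
    """Blend two prediction vectors; weight_a is a hypothetical weight in [0, 1]."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (here μg/m³)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values: one model overshoots by 0.5, the other undershoots by 0.5.
y_true = np.array([6.0, 7.0, 5.0, 8.0])
pred_extratrees = np.array([6.5, 7.5, 5.5, 8.5])
pred_lightgbm = np.array([5.5, 6.5, 4.5, 7.5])

blend = weighted_ensemble(pred_extratrees, pred_lightgbm, weight_a=0.5)

print(rmse(y_true, pred_extratrees), r2(y_true, pred_extratrees))  # 0.5 0.8
print(rmse(y_true, blend), r2(y_true, blend))                      # 0.0 1.0
```

In this toy case the two models' errors cancel exactly, which is the best case for blending. On real data the weight would normally be tuned on a validation split rather than fixed in advance.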
Source journal: Environmental Challenges (Environmental Science: Environmental Engineering)
CiteScore: 8.00
Self-citation rate: 0.00%
Articles published: 249
Review time: 8 weeks