Extracting regional and temporal features to improve machine learning for hourly air pollutants in urban India

IF 4.2 · Zone 2, Environmental Science and Ecology · Q2 ENVIRONMENTAL SCIENCES

Atmospheric Environment · Pub Date: 2024-09-18 · DOI: 10.1016/j.atmosenv.2024.120834

Shuai Wang, Mengyuan Zhang, Hui Zhao, Peng Wang, Sri Harsha Kota, Qingyan Fu, Hongliang Zhang

Atmospheric Environment, Volume 338, Article 120834. https://www.sciencedirect.com/science/article/pii/S1352231024005090

Citations: 0

Abstract

India suffers from severe particulate matter (PM, including PM₂.₅ and PM₁₀) pollution, yet its limited ground observations are insufficient to support a comprehensive understanding of the associated health risks. Machine learning (ML) can efficiently improve estimates of PM distribution and exposure. Regional transport, together with the accumulation and dispersion of PM and its components, strongly affects PM concentrations and is therefore crucial when building ML models, especially for sparsely observed regions such as India. Here, geographic weighting and temporal-rolling weighting methods were used to extract regional and temporal features, respectively, to improve ML model performance. Incorporating these features significantly improved the model: root mean square error (RMSE) fell by 21% for PM₂.₅ and 19% for PM₁₀ estimation, and the model's underestimation in heavy-pollution scenarios was reduced. In out-of-sample cross-validation (CV) tests, the spatial-temporal model achieves coefficients of determination (R²) of 0.87 and 0.88 for hourly PM₂.₅ and PM₁₀, respectively. The ML model predicts an annual nationwide PM₂.₅ concentration of 68.3 μg/m³, with a north (high, especially in the Indo-Gangetic Plain) to south (low) distribution, which is consistent with high satellite aerosol optical depth (AOD) values. Boundary layer height is identified as the main meteorological factor influencing PM₂.₅ concentrations in winter. Extracting features that characterize the regional transport and the cumulative dispersion of pollutants can aid ML training, and the method can be further improved and applied to other studies.
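The abstract names the two feature families (geographic weighting and temporal-rolling weighting) but does not give their formulas. The sketch below is therefore only one plausible reading, not the authors' method: it assumes inverse-distance weighting over neighbouring stations for the regional feature and an exponentially decaying weighted mean over the preceding hours for the temporal feature. All function names, the `power`, `window`, and `decay` parameters, and the sample values are hypothetical.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlmb = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

def regional_feature(target, stations, values, power=2.0):
    """Inverse-distance-weighted mean of neighbouring station values.

    ASSUMPTION: the paper's geographic weighting may differ; inverse
    distance is used here only as a common stand-in.
    target   : (lat, lon) of the location of interest
    stations : array of shape (n, 2) with (lat, lon) per station
    values   : array of shape (n,) with the pollutant value per station
    """
    stations = np.asarray(stations, dtype=float)
    values = np.asarray(values, dtype=float)
    dist = haversine_km(target[0], target[1], stations[:, 0], stations[:, 1])
    weights = 1.0 / np.maximum(dist, 1e-6) ** power  # guard against zero distance
    return float(np.sum(weights * values) / np.sum(weights))

def temporal_rolling_feature(series, window=3, decay=0.5):
    """Weighted mean of the previous `window` hours for each time step.

    ASSUMPTION: weights decay geometrically with lag (lag 1 -> 1,
    lag 2 -> decay, ...). The first `window` entries are NaN because
    they lack a full history.
    """
    series = np.asarray(series, dtype=float)
    out = np.full(series.shape, np.nan)
    weights = decay ** np.arange(window)
    for t in range(window, len(series)):
        past = series[t - window:t][::-1]  # most recent hour first
        out[t] = np.sum(weights * past) / np.sum(weights)
    return out

def rmse(y_true, y_pred):
    """Root mean square error, the metric quoted in the abstract."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up illustrative values, not data from the paper.
regional = regional_feature((0.0, 0.0), [(0.0, 1.0), (0.0, -1.0)], [10.0, 30.0])
temporal = temporal_rolling_feature([1.0, 2.0, 3.0, 4.0], window=2, decay=0.5)
```

Features like `regional` and `temporal` would then be appended as extra input columns alongside meteorological predictors before training, so that the model sees both what neighbouring stations report and how concentrations have been building up or dispersing over recent hours.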




Source journal

Atmospheric Environment (Environmental Sciences)

CiteScore: 9.40

Self-citation rate: 8.00%

Articles published: 458

Review time: 53 days

About the journal: Atmospheric Environment has an open access mirror journal Atmospheric Environment: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. Atmospheric Environment is the international journal for scientists in different disciplines related to atmospheric composition and its impacts. The journal publishes scientific articles with atmospheric relevance of emissions and depositions of gaseous and particulate compounds, chemical processes and physical effects in the atmosphere, as well as impacts of the changing atmospheric composition on human health, air quality, climate change, and ecosystems.