Forecasting 24 h averaged PM2.5 concentration in the Aburrá Valley using tree-based machine learning models, global forecasts, and satellite information

Q1 Mathematics

Advances in Statistical Climatology, Meteorology and Oceanography. Pub Date: 2023-12-22. DOI: 10.5194/ascmo-9-121-2023

Jhayron S. Pérez-Carrasquilla, Paola A. Montoya, Juan Manuel Sánchez, K. Hernández, Mauricio Ramírez



Abstract

We develop a framework to forecast 24 h averaged particulate matter (PM2.5) concentrations 4 d in advance at ground-based stations across the metropolitan area of the Aburrá Valley, Colombia. The input variables come from a diverse set of sources, including in situ real-time PM2.5 observations, meteorological forecasts from the Global Forecast System (GFS), aerosol optical depth (AOD) forecasts from the European Copernicus Atmosphere Monitoring Service (CAMS), and the Moderate Resolution Imaging Spectroradiometer (MODIS) active fire products. We compare the performance of two tree-based machine learning (ML) methods, random forests (RFs) and gradient boosting (GB), using linear regression as a baseline for the error metrics. One disadvantage of tree-based models is their inability to make skillful predictions outside the domain in which they were trained. To address that problem, we implement piecewise linear regression learners within the models. Additionally, to enhance model performance, we use a customized loss function that considers the probability distribution of the target values. Tree-based models substantially outperform linear regression, with GB showing the best results at most of the 19 stations used in this study. We also test two approaches to the multi-step output problem, a direct multi-output (MO) scheme and a recursive (RC) scheme, with the GB–MO approach showing the best results. According to the performance analysis, predictability is lower for values far from the mean and decreases between 06:00 LT (local time) and the early afternoon, when the boundary layer expands. To help understand the sources of predictability and uncertainty of air quality in the city, we perform a feature importance analysis, which reveals that the relevance of the different independent variables is a function of the lead time. In particular, apart from past concentrations, the variables that most affect predictability are the forecasted AOD, the integrated fire radiative power over a forecasted back trajectory (BT-IFRP), and the predicted planetary boundary layer height (PBLH). In the testing period, the models were able to forecast poor-air-quality events in the valley more than 1 d in advance. This study serves as a framework for developing and evaluating ML-based air quality forecasting models over the Andean region.
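The difference between the recursive (RC) and direct multi-step schemes mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single univariate series, swaps the tree-based learners for a one-variable least-squares fit so that it runs with the standard library only, and approximates the direct scheme by fitting one independent model per lead time, which may differ from the paper's multi-output setup. The function names `fit_line`, `recursive_forecast`, and `direct_forecast` are hypothetical.

```python
from typing import Callable, List, Sequence


def fit_line(x: Sequence[float], y: Sequence[float]) -> Callable[[float], float]:
    """Least-squares fit of y ~ a + b * x (stand-in for any regressor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def recursive_forecast(series: Sequence[float], horizon: int) -> List[float]:
    """Recursive scheme: one one-step model, fed its own predictions."""
    model = fit_line(series[:-1], series[1:])
    forecasts = []
    last = series[-1]
    for _ in range(horizon):
        last = model(last)  # the prediction becomes the next input
        forecasts.append(last)
    return forecasts


def direct_forecast(series: Sequence[float], horizon: int) -> List[float]:
    """Direct scheme: a separate model for each lead time h = 1..horizon."""
    forecasts = []
    for h in range(1, horizon + 1):
        # Pair each value with the value observed h steps later.
        model = fit_line(series[:-h], series[h:])
        forecasts.append(model(series[-1]))
    return forecasts


if __name__ == "__main__":
    # Synthetic series (not real PM2.5 data): x[t+1] = 0.5 * x[t] + 1.
    demo = [0.0]
    for _ in range(11):
        demo.append(0.5 * demo[-1] + 1.0)
    print("recursive:", recursive_forecast(demo, 3))
    print("direct:   ", direct_forecast(demo, 3))
```

In the recursive scheme, errors at short lead times feed into later steps, whereas the direct scheme fits each lead time against observed values only. The sketch shows how the two are structured. It does not reproduce the paper's finding that GB–MO performed best.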


