Retrieving Ground-Level PM2.5 Concentrations in China (2013–2021) with a Numerical Model-Informed Testbed to Mitigate Sample Imbalance-Induced Biases

IF 11.6 · Zone 1, Earth Sciences · JCR Q1, GEOSCIENCES, MULTIDISCIPLINARY

Earth System Science Data · Pub Date: 2024-05-15 · DOI: 10.5194/essd-2024-170

Siwei Li, Yu Ding, Jia Xing, Joshua S. Fu


Citations: 0

Abstract


Retrieving Ground-Level PM2.5 Concentrations in China (2013–2021) with a Numerical Model-Informed Testbed to Mitigate Sample Imbalance-Induced Biases

Abstract. Ground-level PM2.5 data derived from satellites with machine learning are crucial for health and climate assessments; however, uncertainties persist because spatially complete observations are lacking. To address this, we propose a novel testbed that uses unconventional numerical simulations to evaluate PM2.5 estimation across the entire spatial domain. The testbed emulates the general machine-learning approach by training the model on the grid cells that correspond to ground monitoring sites and then testing its predictive accuracy at all other locations. For the first time, this enables a comprehensive evaluation of how various machine-learning methods perform in estimating PM2.5 across the spatial domain. Applying the testbed to China yields unexpected results: across all benchmark models, larger PM2.5 biases are found in densely populated regions with abundant ground observations, which challenges conventional expectations and has not been explored in the recent literature. The main reason is the imbalance in training samples, which come mostly from urban areas with high emissions; because monitors are lacking in downwind areas, where PM2.5 is transported from urban areas with varying vertical profiles, this imbalance leads to significant overestimation. The proposed testbed also provides an efficient strategy for optimizing the model structure or the training samples to improve the performance of satellite-retrieval models. Integrating spatiotemporal features, especially with CNN-based deep-learning approaches such as the ResNet model, successfully mitigates PM2.5 overestimation (by 5–30 µg m⁻³) and the corresponding exposure (by 3 million people·µg m⁻³) in the downwind area over the nine years 2013–2021, compared with the traditional approach. Furthermore, incorporating 600 strategically positioned ground-measurement sites identified through the testbed is essential to achieve a more balanced distribution of training samples, thereby ensuring precise PM2.5 estimation and facilitating the assessment of associated impacts in China. In addition to presenting the retrieved surface PM2.5 concentrations in China from 2013 to 2021, this study provides the community with a testbed dataset derived from physical modeling simulations, which can be used to evaluate the performance of data-driven methods, such as machine learning, in estimating spatial PM2.5 concentrations.
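
The testbed idea described in the abstract (train only on grid cells that host monitors, then score every other cell against the simulated field) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the data are synthetic, a one-variable least-squares fit stands in for the machine-learning models, and the names `evaluate_testbed`, `exposure_bias`, `column_signal`, and `surface_per_column` are hypothetical.

```python
import numpy as np

def fit_linear(features, target):
    """Ordinary least squares with an intercept (stand-in for an ML model)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict_linear(coef, features):
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

def evaluate_testbed(features, simulated_pm25, monitor_mask):
    """Train on monitor cells only; return (prediction - simulated truth) for all cells."""
    coef = fit_linear(features[monitor_mask], simulated_pm25[monitor_mask])
    return predict_linear(coef, features) - simulated_pm25

def exposure_bias(bias, population, mask):
    """Population-weighted bias summed over the masked cells (people * ug m^-3)."""
    return float(np.sum(bias[mask] * population[mask]))

# Synthetic "simulated truth": all values below are made up for illustration.
rng = np.random.default_rng(0)
n_cells = 2000
urban = rng.random(n_cells) < 0.2
column_signal = rng.gamma(shape=2.0, scale=1.0, size=n_cells)  # AOD-like predictor
# Assumed: the surface share of the column differs between urban and other cells,
# mimicking different vertical profiles.
surface_per_column = np.where(urban, 30.0, 15.0)
simulated_pm25 = surface_per_column * column_signal + rng.normal(0.0, 2.0, n_cells)
population = np.where(urban, 50_000, 5_000)

# Imbalanced monitor network: 90 urban sites, 10 elsewhere.
monitor_mask = np.zeros(n_cells, dtype=bool)
monitor_mask[rng.choice(np.flatnonzero(urban), size=90, replace=False)] = True
monitor_mask[rng.choice(np.flatnonzero(~urban), size=10, replace=False)] = True

bias = evaluate_testbed(column_signal[:, None], simulated_pm25, monitor_mask)
unmonitored = ~monitor_mask
print("mean bias, unmonitored urban cells:", bias[urban & unmonitored].mean())
print("mean bias, unmonitored other cells:", bias[~urban & unmonitored].mean())
print("exposure bias, unmonitored other cells:",
      exposure_bias(bias, population, ~urban & unmonitored))
```

Because the simulated field is known everywhere, the bias can be mapped over the whole domain rather than only at held-out monitors. Swapping in a different model or a different `monitor_mask` reuses the same scoring step, which is how a testbed of this kind could compare model structures or candidate monitor placements.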


Source journal

Earth System Science Data · GEOSCIENCES, MULTIDISCIPLINARY; METEOROLOGY & ATMOSPHERIC SCIENCES

CiteScore: 18.00

Self-citation rate: 5.30%

Publication volume: 231

Review time: 35 weeks

About the journal: Earth System Science Data (ESSD) is an international, interdisciplinary journal that publishes articles on original research data in order to promote the reuse of high-quality data in the field of Earth system sciences. The journal welcomes submissions of original data or data collections that meet the required quality standards and have the potential to contribute to the goals of the journal. It includes sections dedicated to regular-length articles, brief communications (such as updates to existing data sets), commentaries, review articles, and special issues. ESSD is abstracted and indexed in several databases, including Science Citation Index Expanded, Current Contents/PCE, Scopus, ADS, CLOCKSS, CNKI, DOAJ, EBSCO, Gale/Cengage, GoOA (CAS), and Google Scholar, among others.