Positive matrix factorization outperforms machine learning in imputing missing PM2.5 and further identifying spatial patterns in multi-sites without external data

IF 6.9 Zone 2 Engineering and Technology Q1 ENVIRONMENTAL SCIENCES

Urban Climate Pub Date : 2025-08-01 DOI:10.1016/j.uclim.2025.102552

Youngkwon Kim , Philip K. Hopke , Seung-Muk Yi , Woojoo Lee , Ho Kim , JongBae Heo , Hwajin Kim , Young Su Lee , Kwonho Jeon , Jieun Park

Urban Climate, Volume 62, Article 102552. Journal Article. https://www.sciencedirect.com/science/article/pii/S2212095525002688

Citations: 0

Abstract

Missing observations of fine particulate matter (PM2.5) distort air pollution studies by reducing the available concentration information. Machine learning (ML) and statistical methods are commonly used for imputation, but they typically rely on external datasets, which limits reproducibility. This study addresses this gap by evaluating five techniques, namely positive matrix factorization (PMF), random forest (RF), denoising autoencoder (DAE), multiple imputation by chained equations (MICE), and k-nearest neighbor (kNN), for imputing missing PM2.5 concentrations from 25 districts in Seoul, South Korea, without external data. First, a completely filled dataset was obtained. Then, some observations were artificially masked to mimic the actual missingness rate. Using 5-fold cross-validation, imputation accuracy was assessed via the mean absolute percentage error (MAPE). PMF showed the lowest MAPE (19.1%), outperforming RF (21.3%), DAE (23.7%), MICE (24.6%), and kNN (25.9%). The concentrations imputed by PMF were sufficiently accurate to be used in air pollution studies with missing data when uncertainties are taken into account. The superior accuracy of PMF is attributed to its ability to resolve latent factors that represent spatial patterns contributing to PM2.5 in Seoul and to use them to impute missing values. These spatial patterns grouped the 25 districts into six areas, each associated with PM2.5 concentrations from specific districts that are mainly affected by the same pollution sources. This work demonstrates that PMF outperforms ML and statistical methods in accurately imputing missing concentrations and in identifying spatial PM2.5 patterns across multiple sites without external data. Missing PM2.5 data in Seoul need to be imputed using PMF analysis for reliable air quality investigations.
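
The evaluation protocol described in the abstract (start from a complete matrix, mask known cells, impute them, and score with MAPE over 5 folds) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, the function names (`mape`, `masked_nmf_impute`, `cv_mape`) and all parameter values are hypothetical, and a mask-weighted non-negative matrix factorization with multiplicative updates is swapped in as a simplified stand-in for PMF, which additionally weights each observation by its uncertainty. MAPE is taken here as 100 times the mean of |observed - imputed| / |observed| over the held-out cells.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))


def masked_nmf_impute(X, observed, n_factors=2, n_iter=500, seed=0, eps=1e-9):
    """Fit X ~ G @ F with non-negative G and F, using observed cells only.

    Simplified stand-in for PMF (assumption): binary weights instead of
    per-observation uncertainties, solved with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    G = rng.random((n_rows, n_factors)) + 0.1   # time-by-factor contributions
    F = rng.random((n_factors, n_cols)) + 0.1   # factor-by-site profiles
    W = observed.astype(float)                  # 1 = observed, 0 = masked
    Xw = np.where(observed, X, 0.0)             # masked cells never enter the fit
    for _ in range(n_iter):
        G *= (Xw @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ Xw) / (G.T @ (W * (G @ F)) + eps)
    return G @ F                                # dense reconstruction fills the gaps


def cv_mape(X, n_folds=5, n_factors=2, seed=0):
    """Hold out each fold of cells in turn, impute them, and score with MAPE."""
    rng = np.random.default_rng(seed)
    cell_order = rng.permutation(X.size)
    scores = []
    for fold in np.array_split(cell_order, n_folds):
        observed = np.ones(X.size, dtype=bool)
        observed[fold] = False
        observed = observed.reshape(X.shape)
        X_hat = masked_nmf_impute(X, observed, n_factors=n_factors, seed=seed)
        scores.append(mape(X[~observed], X_hat[~observed]))
    return scores


# Synthetic "complete" dataset (assumption): 120 days x 25 sites, driven by
# two latent factors with 5% multiplicative noise. Not the Seoul data.
data_rng = np.random.default_rng(42)
G_true = data_rng.uniform(5.0, 40.0, size=(120, 2))
F_true = data_rng.random((2, 25)) + 0.2
X = (G_true @ F_true) * data_rng.lognormal(0.0, 0.05, size=(120, 25))

scores = cv_mape(X)
print("MAPE per fold (%):", np.round(scores, 2))
print("Mean MAPE (%):", round(float(np.mean(scores)), 2))
```

Because the factorization is fitted only on observed cells, the held-out values never influence the fit, so the fold-wise MAPE reflects genuine imputation error. The rows of `F` play the role of the spatial patterns mentioned in the abstract: sites that load on the same factor vary together.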

Source journal

Urban Climate (Social Sciences: Urban Studies)

CiteScore: 9.70

Self-citation rate: 9.40%

Articles published: 286

Journal description: Urban Climate serves the scientific and decision-making communities with the publication of research on theory, science and applications relevant to understanding urban climatic conditions and change in relation to their geography and to demographic, socioeconomic, institutional, technological and environmental dynamics and global change. Targeted towards both disciplinary and interdisciplinary audiences, this journal publishes original research papers, comprehensive review articles, book reviews, and short communications on topics including, but not limited to, the following: Urban meteorology and climate[...] Urban environmental pollution[...] Adaptation to global change[...] Urban economic and social issues[...] Research Approaches[...]