Positive matrix factorization outperforms machine learning in imputing missing PM2.5 and further identifying spatial patterns in multi-sites without external data
Youngkwon Kim, Philip K. Hopke, Seung-Muk Yi, Woojoo Lee, Ho Kim, JongBae Heo, Hwajin Kim, Young Su Lee, Kwonho Jeon, Jieun Park
Urban Climate, volume 62, Article 102552. Published 2025-08-01. DOI: 10.1016/j.uclim.2025.102552. URL: https://www.sciencedirect.com/science/article/pii/S2212095525002688. Journal impact factor: 6.9 (JCR Q1, Environmental Sciences).
Abstract
Missing observations of fine particulate matter (PM2.5) distort air pollution studies by reducing the available concentration information. Although machine learning (ML) and statistical methods are commonly used for imputation, they typically rely on external datasets, which limits reproducibility. This study addresses that gap by evaluating five techniques, namely positive matrix factorization (PMF), random forest (RF), denoising autoencoder (DAE), multiple imputation by chained equations (MICE), and k-nearest neighbor (kNN), for imputing missing PM2.5 concentrations from 25 districts in Seoul, South Korea, without external data. First, a completely filled dataset was obtained. Then, some observations were artificially masked to mimic the actual missingness rate. Imputation accuracy was assessed with 5-fold cross-validation using the mean absolute percentage error (MAPE). PMF showed the lowest MAPE (19.1%), outperforming RF (21.3%), DAE (23.7%), MICE (24.6%), and kNN (25.9%). The concentrations imputed by PMF were sufficiently accurate for use in air pollution studies with missing data, provided their uncertainties are considered. PMF's higher accuracy is attributed to its ability to resolve latent factors that represent spatial patterns contributing to PM2.5 in Seoul and to use them to impute missing values. These spatial patterns grouped the 25 districts into six areas, each associated with PM2.5 concentrations from specific districts that are mainly affected by the same pollution sources. This work demonstrates that PMF outperforms ML and statistical methods in imputing missing concentrations and additionally identifies spatial PM2.5 patterns across multiple sites without external data. Missing PM2.5 data in Seoul should therefore be imputed using PMF for reliable air quality investigations.
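The mask-then-score procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the factorization is a plain non-negative matrix factorization fitted on observed cells only with multiplicative updates (a simplified stand-in for PMF, which weights each residual by a measurement uncertainty), and the names `masked_nmf_impute` and `cross_validated_mape`, the fold layout (each fold hides a random fifth of the cells), and the choice of three factors are all hypothetical.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))


def masked_nmf_impute(X, observed, n_factors=3, n_iter=500, seed=0, eps=1e-9):
    """Fit X ~ G @ F using observed cells only and return the full G @ F.

    Assumption: binary weights (1 = observed, 0 = masked) replace the
    per-cell uncertainty weights that a real PMF analysis would use.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    G = rng.random((n_rows, n_factors)) + 0.1   # time-series contributions
    F = rng.random((n_factors, n_cols)) + 0.1   # site (spatial) profiles
    W = observed.astype(float)
    Xw = np.where(observed, X, 0.0)             # masked cells contribute nothing
    for _ in range(n_iter):
        # Multiplicative updates keep G and F non-negative.
        G *= (Xw @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ Xw) / (G.T @ (W * (G @ F)) + eps)
    return G @ F


def cross_validated_mape(X, n_folds=5, n_factors=3, seed=0):
    """Hide each fold of cells in turn, impute them, and average the MAPE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.size), n_folds)
    scores = []
    for held_out in folds:
        observed = np.ones(X.size, dtype=bool)
        observed[held_out] = False
        observed = observed.reshape(X.shape)
        X_hat = masked_nmf_impute(X, observed, n_factors=n_factors, seed=seed)
        scores.append(mape(X[~observed], X_hat[~observed]))
    return float(np.mean(scores))


# Synthetic stand-in for a (days x 25 sites) concentration matrix.
rng = np.random.default_rng(42)
G_true = rng.uniform(5.0, 40.0, size=(200, 3))
F_true = rng.random((3, 25)) + 0.2
X = (G_true @ F_true) * rng.lognormal(0.0, 0.05, size=(200, 25))

score = cross_validated_mape(X, n_folds=5, n_factors=3)
print(f"5-fold MAPE on synthetic data: {score:.1f}%")
```

Because the columns of `F` play the role of site profiles, inspecting which sites load on the same factor is the analogue of the spatial grouping the abstract reports. The number printed here depends entirely on the synthetic data and says nothing about the values reported for Seoul.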
About the journal:
Urban Climate serves the scientific and decision-making communities with the publication of research on theory, science and applications relevant to understanding urban climatic conditions and change in relation to their geography and to demographic, socioeconomic, institutional, technological and environmental dynamics and global change. Targeted towards both disciplinary and interdisciplinary audiences, this journal publishes original research papers, comprehensive review articles, book reviews, and short communications on topics including, but not limited to, the following:
Urban meteorology and climate[...]
Urban environmental pollution[...]
Adaptation to global change[...]
Urban economic and social issues[...]
Research Approaches[...]