Evaluation of multivariate time series clustering for imputation of air pollution data

IF 2.3 | Zone 4, Earth Science | Q3 GEOSCIENCES, MULTIDISCIPLINARY

Geoscientific Instrumentation Methods and Data Systems Pub Date : 2021-11-03 DOI:10.5194/gi-10-265-2021

Wedad Alahamade, I. Lake, C. Reeves, B. de la Iglesia


Citations: 0

Abstract

Air pollution is one of the world's leading risk factors for death, with 6.5 million deaths per year worldwide attributed to air-pollution-related diseases. Understanding the behaviour of certain pollutants through air quality assessment can produce improvements in air quality management that will translate to health and economic benefits. However, problems with missing data and uncertainty hinder that assessment. We are motivated by the need to enhance the air pollution data available. We focus on the problem of missing air pollutant concentration data either because a limited set of pollutants is measured at a monitoring site or because an instrument is not operating, so a particular pollutant is not measured for a period of time. In our previous work, we have proposed models which can impute a whole missing time series to enhance air quality monitoring. Some of these models are based on a multivariate time series (MVTS) clustering method. Here, we apply our method to real data and show how different graphical and statistical model evaluation functions enable us to select the imputation model that produces the most plausible imputations. We then compare the Daily Air Quality Index (DAQI) values obtained after imputation with observed values incorporating missing data. Our results show that using an ensemble model that aggregates the spatial similarity obtained by the geographical correlation between monitoring stations and the fused temporal similarity between pollutant concentrations produces very good imputation results. Furthermore, the analysis enhances understanding of the different pollutant behaviours and of the characteristics of different stations according to their environmental type.
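The abstract does not spell out how the ensemble is computed, so the following is only an illustrative sketch, not the authors' model. It assumes that a station's missing pollutant series is estimated as the plain average of two hour-by-hour estimates: one from stations in the same cluster (standing in for temporal similarity) and one from geographically nearby stations (standing in for spatial similarity). The function names `group_average`, `ensemble_impute` and `rmse`, the station labels, and the equal weighting are all assumptions made for illustration.

```python
import numpy as np


def group_average(series_by_station, stations):
    """Hour-by-hour mean over the given stations, ignoring NaN gaps."""
    stacked = np.vstack([series_by_station[name] for name in stations])
    return np.nanmean(stacked, axis=0)


def ensemble_impute(series_by_station, cluster_members, nearby_stations):
    """Average a cluster-based estimate with a nearby-station estimate.

    Equal weighting of the two estimates is an assumption of this sketch.
    """
    temporal = group_average(series_by_station, cluster_members)
    spatial = group_average(series_by_station, nearby_stations)
    return np.nanmean(np.vstack([temporal, spatial]), axis=0)


def rmse(observed, imputed):
    """Root-mean-square error over the hours where both values are present."""
    mask = ~np.isnan(observed) & ~np.isnan(imputed)
    return float(np.sqrt(np.mean((observed[mask] - imputed[mask]) ** 2)))


if __name__ == "__main__":
    # Hypothetical hourly concentrations at three stations (NaN = missing).
    series = {
        "A": np.array([10.0, 12.0, np.nan, 16.0]),
        "B": np.array([14.0, 16.0, 18.0, 20.0]),
        "C": np.array([20.0, 20.0, 20.0, 20.0]),
    }
    estimate = ensemble_impute(series, ["A", "B"], ["B", "C"])
    # Hypothetical held-out observations at the target station.
    held_out = np.array([15.0, 15.0, 19.0, np.nan])
    print(estimate, rmse(held_out, estimate))
```

A statistic such as RMSE against held-out observations is one example of the kind of statistical evaluation function the abstract mentions for choosing between imputation models; the abstract itself does not name the specific functions used.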




Source journal

Geoscientific Instrumentation Methods and Data Systems | GEOSCIENCES, MULTIDISCIPLINARY | METEOROLOGY & ATMOSPHERIC SCIENCES

CiteScore: 3.70

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 37 weeks

Journal description: Geoscientific Instrumentation, Methods and Data Systems (GI) is an open-access interdisciplinary electronic journal for swift publication of original articles and short communications in the area of geoscientific instruments. It covers three main areas: (i) atmospheric and geospace sciences, (ii) earth science, and (iii) ocean science. A unique feature of the journal is the emphasis on synergy between science and technology that facilitates advances in GI. These advances include but are not limited to the following: concepts, design, and description of instrumentation and data systems; retrieval techniques of scientific products from measurements; calibration and data quality assessment; uncertainty in measurements; newly developed and planned research platforms and community instrumentation capabilities; major national and international field campaigns and observational research programs; new observational strategies to address societal needs in areas such as monitoring climate change and preventing natural disasters; networking of instruments for enhancing high temporal and spatial resolution of observations. GI has an innovative two-stage publication process involving the scientific discussion forum Geoscientific Instrumentation, Methods and Data Systems Discussions (GID), which has been designed to do the following: foster scientific discussion; maximize the effectiveness and transparency of scientific quality assurance; enable rapid publication; make scientific publications freely accessible.