Towards low-cost and high-performance air pollution measurements using machine learning calibration techniques

P. Nowack, Lev Konstantinovskiy, Hannah Gardiner, J. Cant
Journal: Atmospheric Measurement Techniques Discussions
Published: 2020-12-22
DOI: 10.5194/amt-2020-473 (https://doi.org/10.5194/amt-2020-473)
Citations: 2

Abstract

Air pollution is a key public health issue in urban areas worldwide. The development of low-cost air pollution sensors is consequently a major research priority. However, low-cost sensors often fail to attain sufficient measurement performance compared to state-of-the-art measurement stations, and typically require calibration procedures in expensive laboratory settings. As a result, there has been much debate about calibration techniques that could make their performance more reliable, and about calibration procedures that can be carried out without access to advanced laboratories. One repeatedly proposed strategy is low-cost sensor calibration through co-location with public measurement stations. The idea is that, using a regression function, the low-cost sensor signals can be calibrated against the station reference signal, so that the sensors can then be deployed separately with performance similar to that of the original stations. Here we test the idea of using machine learning algorithms for such regression tasks, using hourly-averaged co-location data for nitrogen dioxide (NO₂) and particulate matter with particle sizes smaller than 10 μm (PM₁₀) at three different locations in the urban area of London, UK. Specifically, we compare the performance of Ridge regression, a linear statistical learning algorithm, to two non-linear algorithms in the form of Random Forest (RF) regression and Gaussian Process regression (GPR). We further benchmark the performance of all three machine learning methods against the more common Multiple Linear Regression (MLR). We obtain very good out-of-sample R² scores (coefficient of determination) > 0.7, frequently exceeding 0.8, for the machine learning calibrated low-cost sensors. In contrast, the performance of MLR is more dependent on random variations in the sensor hardware and co-located signals, and is also more sensitive to the length of the co-location period.
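As an illustration of the co-location idea described above, the following is a minimal sketch and not the authors' implementation. It fits a one-variable Ridge regression in closed form, mapping a raw sensor signal to a reference signal, and then scores it on held-out hours with the coefficient of determination. The data are synthetic, and the sensor response (gain 0.6, offset 5, Gaussian noise), the penalty `alpha`, and all function names are assumptions made for this example only. A real calibration would likely use several calibration variables rather than one.

```python
import random


def fit_ridge(xs, ys, alpha=1.0):
    """Fit y ~ slope * x + intercept, penalising slope**2 (intercept unpenalised)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope towards zero
    intercept = y_mean - slope * x_mean
    return slope, intercept


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def run_colocation_demo(seed=0, n_train=200, n_test=100):
    """Calibrate a synthetic sensor on n_train hours, score it on n_test unseen hours."""
    rng = random.Random(seed)
    # Hypothetical hourly reference concentrations at the station.
    reference = [rng.uniform(10.0, 80.0) for _ in range(n_train + n_test)]
    # Assumed raw sensor response: linear gain and offset plus noise.
    raw = [0.6 * r + 5.0 + rng.gauss(0.0, 2.0) for r in reference]
    slope, intercept = fit_ridge(raw[:n_train], reference[:n_train])
    calibrated = [slope * x + intercept for x in raw[n_train:]]
    return slope, intercept, r2_score(reference[n_train:], calibrated)


if __name__ == "__main__":
    slope, intercept, r2 = run_colocation_demo()
    print(f"slope={slope:.3f} intercept={intercept:.3f} out-of-sample R2={r2:.3f}")
```

Scoring only on hours that were not used for fitting is what makes the R² value an out-of-sample score in the sense used in the abstract.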
We find that, subject to certain conditions, GPR is typically the best performing method in our calibration setting, followed by Ridge regression and RF regression. However, we also highlight several key limitations of the machine learning methods, which will be crucial to consider in any co-location calibration. In particular, none of the methods is able to extrapolate to pollution levels well outside those encountered during training. Ultimately, this is one of the key limiting factors when sensors are deployed away from the co-location site itself. Consequently, we find that the linear Ridge method, which best mitigates such extrapolation effects, typically performs as well as, or even better than, GPR after sensor re-location. Overall, our results highlight the potential of co-location methods paired with machine learning calibration techniques to reduce the cost of air pollution measurements, subject to careful consideration of the co-location training conditions, the choice of calibration variables, and the features of the calibration algorithm.
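The extrapolation limitation can be made concrete with a small sketch on synthetic data. This is an illustration under stated assumptions, not the paper's experiment. A straight-line fit is compared with k-nearest-neighbour averaging, which is used here only as a simple stand-in for methods whose predictions are averages of training targets (as is the case for tree-based regressors such as Random Forest). Such averages can never exceed the largest reference value seen during training. The sensor response, the training range of 10 to 40, the new pollution level of 80, and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def knn_predict(xs, ys, x_new, k=5):
    """Average the targets of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def run_extrapolation_demo(seed=1, n_train=200):
    rng = random.Random(seed)
    # Training only covers low reference levels (assumed range 10 to 40).
    train_ref = [rng.uniform(10.0, 40.0) for _ in range(n_train)]
    train_raw = [0.6 * r + 5.0 + rng.gauss(0.0, 1.0) for r in train_ref]
    # After re-location the sensor meets a level far outside that range.
    true_level = 80.0
    raw_new = 0.6 * true_level + 5.0  # noise-free raw reading for clarity
    slope, intercept = fit_line(train_raw, train_ref)
    linear_pred = slope * raw_new + intercept
    knn_pred = knn_predict(train_raw, train_ref, raw_new)
    return true_level, linear_pred, knn_pred, max(train_ref)


if __name__ == "__main__":
    true_level, linear_pred, knn_pred, train_max = run_extrapolation_demo()
    print(f"true={true_level:.1f} linear={linear_pred:.1f} "
          f"knn={knn_pred:.1f} training max={train_max:.1f}")
```

In this toy setting the linear model follows the trend beyond the training range, while the averaging model saturates near the training maximum. The linear model only succeeds because the synthetic sensor response is itself linear, so the sketch shows why extrapolation depends on the features of the calibration algorithm rather than establishing that a linear method is always preferable.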