Active Learning for Air Quality Station Location Recommendation
S. Deepak Narayanan, Apoorv Agnihotri, Nipun Batra
Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, 2020-01-05
DOI: 10.1145/3371158.3371208 (https://doi.org/10.1145/3371158.3371208)
Citations: 2
Abstract
Motivation: Air quality has declined across the planet in recent years, with studies suggesting that a significant proportion of the global population has had its life expectancy reduced by up to 4 years [1, 2, 5]. To tackle this growth in air pollution and its adverse effects, governments across the world have set up air quality monitoring stations that measure concentrations of pollutants such as NO2, SO2 and PM2.5. Of these, PM2.5 has an especially significant health impact and is used for measuring air quality. One major issue with deploying these stations is the massive cost involved: owing to high installation and maintenance costs, the spatial resolution of air quality monitoring is generally poor. Motivated by this sparse spatial coverage and the expense of sensing equipment, we propose active learning methods to choose the next location at which to install an air quality monitor.

Related Work: Previous work has predominantly focused on interpolation and forecasting of air quality [7, 8]. Work on air quality station location recommendation has been limited [4]. Previous work [4, 7, 8] has shown that installing air quality stations uniformly to maximize spatial coverage does not work well in practice, which is a major motivation for our work.

Problem Statement: Given a set S of air quality monitoring stations, along with their PM2.5 values over a period of time {d1, d2, ..., dn}, where di represents day i, we want to choose a new location s′ such that installing a station at s′ gives us the best estimate of air quality at unknown locations.

Approach: We perform active learning using Query by Committee (QBC) [6]. We maintain three sets of stations: the train set, the test set, and the pool set. The train set contains the currently monitored locations, the test set contains the locations where we wish to estimate air quality, and the pool set contains candidate stations for querying; that is, we query from the pool set and observe how our estimation improves on the test set. To query from the pool set, we need a measure of uncertainty for each station in it. To obtain this uncertainty, we train an ensemble of learners and take the standard deviation of their predictions for each station in the pool set. We add the station with the maximum standard deviation to the train set and remove it from the pool set. We repeat this process as time progresses. We use a K Neighbors Regressor (KNN) as our main model, inspired by the fact that nearby days will likely have similar air quality (temporal locality), and so will nearby stations (spatial
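
The query step described in the Approach can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the committee is composed, so the sketch assumes a committee of KNN regressors that differ only in their number of neighbors, works on a single day using spatial coordinates only, and uses made-up station coordinates and PM2.5 readings. The function names (`knn_predict`, `committee_std`, `query_next_station`, `active_learning_step`) are hypothetical, and the KNN regressor is written in plain Python so that the example runs without dependencies.

```python
import math
import statistics


def knn_predict(train, query, k):
    """Average the PM2.5 values of the k training stations nearest to `query`."""
    nearest = sorted(train, key=lambda station: math.dist(station[0], query))[:k]
    return statistics.fmean(value for _, value in nearest)


def committee_std(train, location, ks):
    """Committee disagreement: standard deviation of the members' predictions."""
    predictions = [knn_predict(train, location, k) for k in ks]
    return statistics.pstdev(predictions)


def query_next_station(train, pool, ks=(1, 2, 3)):
    """Return the index of the pool station the committee disagrees on most."""
    scores = [committee_std(train, location, ks) for location, _ in pool]
    return max(range(len(pool)), key=scores.__getitem__)


def active_learning_step(train, pool, ks=(1, 2, 3)):
    """Move the most uncertain pool station, with its reading, into the train set."""
    chosen = pool.pop(query_next_station(train, pool, ks))
    train.append(chosen)
    return chosen


# Toy data (assumption): ((x, y), pm25) pairs with invented values.
train = [((0.0, 0.0), 40.0), ((1.0, 0.0), 42.0), ((10.0, 10.0), 120.0)]
pool = [((0.5, 0.1), 41.0), ((6.0, 6.0), 90.0), ((0.9, 0.2), 43.0)]

chosen = active_learning_step(train, pool)
print("queried station:", chosen)
```

In this toy setup the committee members disagree most about the pool station that lies between the two clusters of monitored stations, so that station is queried first. Calling `active_learning_step` repeatedly mirrors the repeated querying described above; evaluating the error on a held-out test set after each step would show whether the estimate improves.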