Data Utility Maximization When Leveraging Crowdsensing in Machine Learning
Authors: Juan Li, Jie Wu, Yanmin Zhu
Published in: 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS)
Publication date: 2018-06-01
DOI: 10.1109/IWQoS.2018.8624185 (https://doi.org/10.1109/IWQoS.2018.8624185)
Citations: 2
Abstract
With the increasingly wide adoption of crowdsensing services, we can leverage the crowd to obtain labeled data instances for training machine learning models. In this paper, we focus on the critical problem of deciding which data instances should be collected to maximize the performance of the trained model under a budget limit. Solving this problem is nontrivial for three reasons: the relationship between the performance of the trained model and the data collection process is unclear, the problem is NP-hard, and workers arrive online. To overcome these challenges, we first propose a crowdsensing framework with multiple rounds of data collection and model training, based on stream-based batch-mode active learning. Within this framework, we devise a novel data utility model that measures the contribution of a data batch to the performance of the learning model; it combines uncertainty and weighted density to measure the contribution of a single instance. Finally, we propose an online algorithm to select a data batch in each round. The algorithm achieves fairness, computational efficiency, and a competitive ratio of 0.1218 when the ratio of the largest contribution of one data instance to the optimal offline total data utility is infinitely small. Through evaluations based on a real data set, we demonstrate the efficiency of our data utility model and our online algorithm.
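
The abstract does not give the formula of the data utility model, so the following is only an illustrative sketch of how an uncertainty term and a density term can be combined into a per-instance score. It uses the generic information-density heuristic from the active learning literature (prediction entropy multiplied by average similarity to a reference pool) as a stand-in, not the paper's weighted-density model. The function names, the `reference_pool` argument, and the `beta` exponent are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def instance_score(probs, features, reference_pool, beta=1.0):
    """Hypothetical score: uncertainty times (average similarity to the pool) ** beta.

    This is an assumed stand-in for "uncertainty and weighted density";
    the paper's actual utility model may differ.
    """
    if not reference_pool:
        density = 1.0  # no pool available: fall back to pure uncertainty
    else:
        total = sum(cosine_similarity(features, ref) for ref in reference_pool)
        density = total / len(reference_pool)
    # Negative average similarity is clamped to zero so the score stays non-negative.
    return entropy(probs) * (max(density, 0.0) ** beta)


if __name__ == "__main__":
    pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    uncertain = instance_score([0.5, 0.5], [1.0, 0.0], pool)
    confident = instance_score([0.95, 0.05], [1.0, 0.0], pool)
    print(uncertain > confident)
```

Under this assumed scoring, an instance the current model is unsure about and that lies in a well-populated region of feature space ranks above a confidently predicted or isolated one, which is the intuition behind combining the two terms.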