Authors: Hengzhen Xie, Lingli Li, Ping Xuan
Published in: 2019 IEEE International Conference on Power Data Science (ICPDS), 2019-11-01
DOI: https://doi.org/10.1109/ICPDS47662.2019.9017179
An Effective Source Selection Algorithm for Filling Missing Tuples
Completeness is one of the central criteria for data quality. Incomplete data is a data set that does not contain enough information to answer a query; the incompleteness can take the form of missing values or missing tuples. This paper presents a technique that leverages other data sources to fill missing tuples in the target data. Because accessing too many data sources incurs a high cost, we investigate how to select a proper subset of sources to fill the missing tuples. First, we define a gain model for sources and formulate source selection as an optimization problem from the perspective of missing tuples, in which the gain is maximized while the cost stays under a threshold. To solve this problem, we propose a data source selection strategy based on a genetic algorithm. Experimental results demonstrate the effectiveness of our algorithm.
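The abstract does not give the gain model or the genetic operators, so the following Python sketch is only an illustration under stated assumptions, not the authors' algorithm. It assumes that the gain of a selection is the number of distinct missing tuples its sources can supply, that each source has a fixed access cost, and that selections over the cost threshold receive a fitness of zero. All names (`gain`, `cost`, `fitness`, `select_sources`) and the parameter defaults are hypothetical.

```python
import random
from typing import FrozenSet, List, Tuple

# Assumed representation: (ids of missing tuples the source can supply, access cost).
Source = Tuple[FrozenSet[int], float]


def gain(selection: List[int], sources: List[Source]) -> int:
    """Assumed gain: number of distinct missing tuples covered by the selection."""
    covered = set()
    for bit, (tuples, _) in zip(selection, sources):
        if bit:
            covered |= tuples
    return len(covered)


def cost(selection: List[int], sources: List[Source]) -> float:
    """Total access cost of the selected sources."""
    return sum(c for bit, (_, c) in zip(selection, sources) if bit)


def fitness(selection: List[int], sources: List[Source], budget: float) -> int:
    """Gain if the selection stays within the budget, otherwise 0."""
    return gain(selection, sources) if cost(selection, sources) <= budget else 0


def select_sources(sources: List[Source], budget: float, pop_size: int = 30,
                   generations: int = 60, mutation_rate: float = 0.05,
                   seed: int = 0) -> List[int]:
    """Return a 0/1 vector over sources found by a simple genetic algorithm."""
    rng = random.Random(seed)
    n = len(sources)

    def score(s: List[int]) -> int:
        return fitness(s, sources, budget)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    population[0] = [0] * n  # the empty selection is feasible for any budget >= 0
    best = max(population, key=score)

    def pick() -> List[int]:
        # Binary tournament: keep the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if score(a) >= score(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            point = rng.randint(1, n - 1) if n > 1 else 0
            child = p1[:point] + p2[point:]  # one-point crossover
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=score)
    return best


if __name__ == "__main__":
    demo_sources = [
        (frozenset({1, 2, 3}), 4.0),
        (frozenset({3, 4}), 2.0),
        (frozenset({5, 6, 7, 8}), 5.0),
        (frozenset({1, 8}), 1.0),
    ]
    chosen = select_sources(demo_sources, budget=7.0)
    print(chosen, gain(chosen, demo_sources), cost(chosen, demo_sources))
```

Scoring over-budget selections as zero keeps the sketch short; a repair step or a graded penalty would be a reasonable alternative. Because the best individual is carried forward unchanged and the initial population contains the empty selection, the returned selection always respects the budget, although it is not guaranteed to be optimal.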