Evaluating Data Consistency with Matching Dependencies from Multiple Sources
Mi Huang, Lingli Li, Ping Xuan
2019 IEEE International Conference on Power Data Science (ICPDS), published 2019-11-01
DOI: 10.1109/ICPDS47662.2019.9017191 (https://doi.org/10.1109/ICPDS47662.2019.9017191)
With the rapid growth of data, data quality issues have attracted increasing attention in both industry and academia. Because data consistency is one of the critical issues in data quality, we study how to evaluate the consistency of target data from multiple relevant sources under matching dependencies (MDs). Accessing the data sources directly incurs a huge cost in data comparisons, so this paper aims to design an efficient approximate consistency evaluation method with linear time complexity. First, we build a signature for each data source to approximate the pattern sets that the MDs define in that source. Second, we develop a signature-based evaluation method that computes the consistency of the target data from the signatures of all data sources related to it. Experimental results on real datasets show that our algorithm performs well in both accuracy and efficiency.
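The two steps described above can be illustrated with a minimal sketch. The abstract does not say what kind of signature the paper uses, so everything here is an assumption made for illustration: a Bloom filter stands in for the signature, a case- and whitespace-insensitive comparison stands in for the MD similarity operators, and the names `BloomSignature`, `normalize`, `build_signature` and `consistency` are hypothetical. It is not the authors' algorithm, only one way a per-source signature can replace direct access to the source records.

```python
import hashlib


class BloomSignature:
    """Fixed-size Bloom filter, used here as a stand-in for a source signature."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def normalize(value):
    # Assumed similarity operator: equal after lowercasing and collapsing spaces.
    return " ".join(str(value).lower().split())


def _patterns(record, lhs, rhs):
    key = "|".join(normalize(record[a]) for a in lhs)
    val = "|".join(normalize(record[a]) for a in rhs)
    return "L:" + key, "P:" + key + "=>" + val


def build_signature(records, lhs, rhs):
    """Step 1: one pass over a source, storing its LHS keys and LHS=>RHS patterns."""
    sig = BloomSignature()
    for rec in records:
        lhs_pattern, full_pattern = _patterns(rec, lhs, rhs)
        sig.add(lhs_pattern)
        sig.add(full_pattern)
    return sig


def consistency(target, signatures, lhs, rhs):
    """Step 2: one pass over the target, consulting only the signatures.

    Returns the share of target records that agree with some source, among
    the records whose LHS key appears in at least one source, or None if no
    target record is covered by any source.
    """
    covered = agreeing = 0
    for rec in target:
        lhs_pattern, full_pattern = _patterns(rec, lhs, rhs)
        hits = [s for s in signatures if lhs_pattern in s]
        if not hits:
            continue
        covered += 1
        if any(full_pattern in s for s in hits):
            agreeing += 1
    return agreeing / covered if covered else None


if __name__ == "__main__":
    lhs, rhs = ["name", "phone"], ["city"]
    sources = [
        [{"name": "Ann Lee", "phone": "555-0100", "city": "Boston"}],
        [{"name": "Bob Ray", "phone": "555-0199", "city": "Austin"}],
    ]
    target = [
        {"name": "ann  lee", "phone": "555-0100", "city": "boston"},
        {"name": "Bob Ray", "phone": "555-0199", "city": "Dallas"},
        {"name": "Cy Doe", "phone": "555-0123", "city": "Reno"},
    ]
    sigs = [build_signature(src, lhs, rhs) for src in sources]
    print(consistency(target, sigs, lhs, rhs))
```

Both passes touch each record once and each lookup costs a fixed number of hash evaluations, so the running time is linear in the number of source and target records for a fixed number of sources. The price is approximation: a Bloom filter can report a pattern as present when it is not, so the score may be slightly off.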