{"title":"Metadata-driven error detection","authors":"L. Visengeriyeva, Ziawasch Abedjan","doi":"10.1145/3221269.3223028","DOIUrl":null,"url":null,"abstract":"Scientific data often originates from multiple sources and human agents. The integration of data from different sources must also resolve data quality problems that might occur because of inconsistency or different quality assurance levels of the sources. To identify various data quality problems in a dataset, it is necessary to use several error detection methods. Existing error detection solutions are usually tailored towards one specific type of data errors, such as rule violations or outliers, requiring the application of multiple strategies. Using all possible error detection methods is also not satisfying, as some systems might perform poorly on a particular dataset by producing a large number of false positives and missing some results. However, it is not trivial to assess the effectiveness of each strategy upfront. We propose two new holistic approaches for effectively combining off-the-shelf error detection systems. Our approaches are learning-based and incorporate metadata extracted from the dataset at hand. We empirically show, using four real-world datasets, that our method of combining error-detecting strategies achieves an average F1 score 15% higher than multiple heuristics-based baselines.","PeriodicalId":365491,"journal":{"name":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","volume":"173 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3221269.3223028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25
Abstract
Scientific data often originates from multiple sources and human agents. The integration of data from different sources must also resolve data quality problems that might occur because of inconsistency or different quality assurance levels of the sources. To identify various data quality problems in a dataset, it is necessary to use several error detection methods. Existing error detection solutions are usually tailored towards one specific type of data errors, such as rule violations or outliers, requiring the application of multiple strategies. Using all possible error detection methods is also not satisfying, as some systems might perform poorly on a particular dataset by producing a large number of false positives and missing some results. However, it is not trivial to assess the effectiveness of each strategy upfront. We propose two new holistic approaches for effectively combining off-the-shelf error detection systems. Our approaches are learning-based and incorporate metadata extracted from the dataset at hand. We empirically show, using four real-world datasets, that our method of combining error-detecting strategies achieves an average F1 score 15% higher than multiple heuristics-based baselines.