An Application of Distributed Data Mining to Identify Data Quality Problems
Eshref Januzaj, Visar Januzaj, P. Mandl
Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, 2019-12-02
DOI: https://doi.org/10.1145/3366030.3366103
Citations: 1
Abstract
Integrating huge, distributed data sets into a single data warehouse raises not only time and security concerns but also the well-known problem of low data quality. To address these issues, we present an approach that applies distributed data mining to analyze the quality of the data while they are still in their distributed state. Data quality problems are identified by a classifier that uses the knowledge gained from a clustering (subspace clustering) process performed on the distributed data. Experiments on real data show that the results of the distributed analysis are comparable to those obtained on the central data warehouse using classical data mining.
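The general idea of clustering at each site and classifying records against the resulting cluster knowledge can be illustrated with a minimal sketch. It is not the authors' method: plain k-means stands in for subspace clustering, the radius-based rule stands in for the classifier, and every name (`kmeans`, `local_model`, `is_suspect`, `slack`) and every data value is a hypothetical chosen for illustration.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(g) for c in zip(*g))
    return centroids, groups


def local_model(points, k):
    """Summarise one site's data as (centroid, radius) pairs.

    Only these summaries leave the site; the raw records stay local.
    """
    centroids, groups = kmeans(points, k)
    return [
        (c, max(math.dist(c, p) for p in g))
        for c, g in zip(centroids, groups)
        if g
    ]


def is_suspect(record, model, slack=1.5):
    """Assumed rule: flag a record lying outside every cluster's radius * slack."""
    return all(math.dist(record, c) > slack * max(r, 1e-9) for c, r in model)


# Two hypothetical sites, each holding records from the same two dense regions.
site_a = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.9), (8.1, 7.9), (0.9, 1.1), (7.9, 8.2)]
site_b = [(1.1, 1.0), (7.8, 8.1), (0.8, 1.2), (8.2, 8.0)]

# The "global" knowledge is simply the union of the local cluster summaries.
global_model = local_model(site_a, k=2) + local_model(site_b, k=2)

for record in [(1.0, 1.05), (4.5, 4.5), (1.0, 8.0)]:
    label = "suspect" if is_suspect(record, global_model) else "ok"
    print(record, label)
```

In this sketch a record is treated as a potential data quality problem when it fits none of the clusters found at any site, so only compact cluster summaries, not the raw records, need to be exchanged between sites.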