{"title":"一种用于约束发现和故障检测的自动化数据质量测试方法","authors":"Hajar Homayouni, Sudipto Ghosh, I. Ray","doi":"10.1109/IRI.2019.00023","DOIUrl":null,"url":null,"abstract":"Data quality tests validate the data stored in databases and data warehouses to detect violations of syntactic and semantic constraints. Domain experts grapple with the issues related to the capturing of all the important constraints and checking that they are satisfied. Domain experts often define the constraints in an ad hoc manner based on their knowledge of the application domain and needs of the stakeholders. We propose ADQuaTe, which is an automated data quality test approach that uses an unsupervised machine learning technique to discover constraints that may have been missed by experts. ADQuaTe marks records that violate the constraints as suspicious and explains the violations. We evaluate ADQuaTe on real-world applications using a health data warehouse and a plant diagnosis database to demonstrate that the approach can uncover previously detected as well as new faults in the data.","PeriodicalId":295028,"journal":{"name":"2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"ADQuaTe: An Automated Data Quality Test Approach for Constraint Discovery and Fault Detection\",\"authors\":\"Hajar Homayouni, Sudipto Ghosh, I. Ray\",\"doi\":\"10.1109/IRI.2019.00023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data quality tests validate the data stored in databases and data warehouses to detect violations of syntactic and semantic constraints. Domain experts grapple with the issues related to the capturing of all the important constraints and checking that they are satisfied. Domain experts often define the constraints in an ad hoc manner based on their knowledge of the application domain and needs of the stakeholders. We propose ADQuaTe, which is an automated data quality test approach that uses an unsupervised machine learning technique to discover constraints that may have been missed by experts. ADQuaTe marks records that violate the constraints as suspicious and explains the violations. We evaluate ADQuaTe on real-world applications using a health data warehouse and a plant diagnosis database to demonstrate that the approach can uncover previously detected as well as new faults in the data.\",\"PeriodicalId\":295028,\"journal\":{\"name\":\"2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IRI.2019.00023\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IRI.2019.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ADQuaTe: An Automated Data Quality Test Approach for Constraint Discovery and Fault Detection
Data quality tests validate the data stored in databases and data warehouses to detect violations of syntactic and semantic constraints. Domain experts grapple with the issues related to the capturing of all the important constraints and checking that they are satisfied. Domain experts often define the constraints in an ad hoc manner based on their knowledge of the application domain and needs of the stakeholders. We propose ADQuaTe, which is an automated data quality test approach that uses an unsupervised machine learning technique to discover constraints that may have been missed by experts. ADQuaTe marks records that violate the constraints as suspicious and explains the violations. We evaluate ADQuaTe on real-world applications using a health data warehouse and a plant diagnosis database to demonstrate that the approach can uncover previously detected as well as new faults in the data.