Methods and analyses for determining quality
W. Winkler
Information Quality in Information Systems
Published: 2005-06-17
DOI: 10.1145/1077501.1077505 (https://doi.org/10.1145/1077501.1077505)
Citations: 4
Abstract
In a possibly ideal world, records in a database would be complete and would contain fields having values that correspond to an underlying reality. An individual's name, address, and date of birth would be present without typographical error. An income field might be a reasonably close approximation of a "true income" and would not be missing. A list of customers would be complete, unduplicated, and current. In this ideal world, a database could be used for several purposes and would be considered to have high quality. A set of databases might be linked using name, address, and other weakly identifying information. In this paper, we describe situations where properly chosen metrics may indicate that data quality is not sufficiently high for monitoring processes, for modeling, and for data mining. Some of the metrics are supplementary to those in the quality literature or have rarely been used. Additionally, we describe generalized methods and software tools that allow a skilled individual to perform massive clean-up of files in some situations. The clean-up, while possibly sub-optimal in recreating "truth", can replace exceptionally large amounts of clerical review and allow many uses of the "cleaned" files.