Methods and analyses for determining quality

Information Quality in Information Systems. Pub Date: 2005-06-17. DOI: 10.1145/1077501.1077505

W. Winkler


Citations: 4

Abstract

In a possibly ideal world, records in a database would be complete and would contain fields having values that correspond to an underlying reality. An individual's name, address, and date of birth would be present without typographical error. An income field might be a reasonably close approximation of a "true income" and would not be missing. A list of customers would be complete, unduplicated, and current. In this ideal world, a database could be used for several purposes and would be considered to have high quality. A set of databases might be linked using name, address, and other weakly identifying information. In this paper, we describe situations where properly chosen metrics may indicate that data quality is not sufficiently high for monitoring processes, for modeling, and for data mining. Some of the metrics are supplementary to those in the quality literature or have rarely been used. Additionally, we describe generalized methods and software tools that allow a skilled individual to perform massive clean-up of files in some situations. The clean-up, while possibly sub-optimal in recreating "truth", can replace exceptionally large amounts of clerical review and allow many uses of the "cleaned" files.
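As a rough illustration of what a quality metric over a customer list might look like, the sketch below computes a per-field completeness rate and a duplicate rate on name and address. It is not taken from the paper: the function names, the sample records, and the choice of exact matching after light normalization are all assumptions made for this example, and exact matching is far simpler than linkage on weakly identifying information, which has to tolerate typographical variation.

```python
from collections import Counter


def is_present(value):
    """True when a field holds something other than None or blank text."""
    return value is not None and str(value).strip() != ""


def completeness(records, field):
    """Fraction of records in which `field` is present (hypothetical metric)."""
    if not records:
        return 0.0
    return sum(1 for r in records if is_present(r.get(field))) / len(records)


def normalize(value):
    """Lowercase and collapse whitespace so trivial spelling variants compare equal."""
    text = "" if value is None else str(value)
    return " ".join(text.lower().split())


def duplicate_rate(records, key_fields):
    """Fraction of records that repeat another record on the normalized key fields."""
    if not records:
        return 0.0
    keys = Counter(tuple(normalize(r.get(f)) for f in key_fields) for r in records)
    extras = sum(count - 1 for count in keys.values())
    return extras / len(records)


# Made-up sample data, for illustration only.
customers = [
    {"name": "Ann Lee", "address": "12 Oak St", "dob": "1970-01-02"},
    {"name": "ann  lee", "address": "12 Oak St", "dob": ""},
    {"name": "Bob Ray", "address": "", "dob": "1985-07-30"},
    {"name": "Cy Dorn", "address": "9 Elm Rd", "dob": None},
]

print(completeness(customers, "dob"))                  # 0.5
print(completeness(customers, "address"))              # 0.75
print(duplicate_rate(customers, ["name", "address"]))  # 0.25
```

A low completeness rate or a high duplicate rate on such a file would be one signal that it is not yet fit for modeling or data mining. The thresholds that matter depend on the intended use and are not specified here.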

