重复检测数据准备

Journal of Data and Information Quality (JDIQ) Pub Date : 2020-06-13 DOI:10.1145/3377878

Ioannis K. Koumarelas, Lan Jiang, Felix Naumann

{"title":"重复检测数据准备","authors":"Ioannis K. Koumarelas, Lan Jiang, Felix Naumann","doi":"10.1145/3377878","DOIUrl":null,"url":null,"abstract":"Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 24"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Data Preparation for Duplicate Detection\",\"authors\":\"Ioannis K. Koumarelas, Lan Jiang, Felix Naumann\",\"doi\":\"10.1145/3377878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.\",\"PeriodicalId\":15582,\"journal\":{\"name\":\"Journal of Data and Information Quality (JDIQ)\",\"volume\":\"1 1\",\"pages\":\"1 - 24\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Data and Information Quality (JDIQ)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3377878\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3377878","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

数据错误是大多数应用程序工作流中的一个主要问题。在执行任何重要任务之前，必须通过消除数据中可能出现的许多不同错误来保证一定的数据质量。通常，这些错误中的大多数都可以通过数据准备方法修复，例如删除空白。但是，重复记录的特定错误，即多个记录指向同一个实体，通常可以通过专门的技术独立地消除。我们的工作是第一个将这两个领域结合在一起，在执行重复检测之前，在系统的方法下应用数据准备操作。我们的流程工作流可以总结如下:它开始于用户提供一个黄金标准的样本作为输入，实际数据集，以及可选的一些特定于领域的数据准备的约束，比如地址规范化。制剂选择分两个连续的阶段进行。首先，为了大大减少无效数据准备的搜索空间，基于对相似度的改善或恶化进行决策。其次，利用剩余的数据准备，迭代的“留一”分类过程逐个去除准备，并根据precision-recall curve (AUC-PR)下的实现面积确定冗余准备。使用此工作流程，我们成功地将AUC-PR的重复检测结果提高了19%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Data Preparation for Duplicate Detection

Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Data and Information Quality (JDIQ)

自引率

0.00%

发文量