On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records

Impact Factor 3.0 | CAS Zone 2, Computer Science | JCR Q2, Computer Science, Information Systems
Witold Andrzejewski, Bartosz Bębel, Paweł Boiński, Robert Wrembel
{"title":"On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records","authors":"Witold Andrzejewski ,&nbsp;Bartosz Bębel ,&nbsp;Paweł Boiński ,&nbsp;Robert Wrembel","doi":"10.1016/j.is.2023.102323","DOIUrl":null,"url":null,"abstract":"<div><p><span><span>Data stored in information systems are often erroneous. Duplicate data are one of the typical error type. To discover and handle duplicates, the so-called deduplication methods are applied. They are complex and time costly algorithms. In </span>data deduplication<span><span>, pairs of records are compared and their similarities are computed. For a given deduplication problem, challenging tasks are: (1) to decide which similarity measures are the most adequate to given attributes being compared and (2) defining the importance of attributes being compared, and (3) defining adequate similarity thresholds between similar and not similar pairs of records. In this paper, we summarize our experience gained from a real R&amp;D project run for a large financial institution. In particular, we answer the following three research questions: (1) what are the adequate similarity measures for comparing attributes of text data types, (2) what are the adequate weights of attributes in the procedure of comparing pairs of records, and (3) what are the similarity thresholds between classes: duplicates, probably duplicates, non-duplicates? The answers to the questions are based on the experimental evaluation of 54 similarity measures for text values. The measures were compared on five different </span>real data sets of different data characteristic. The similarity measures were assessed based on: (1) similarity values they produced for given values being compared and (2) their execution time. Furthermore, we present our method, based on </span></span>mathematical programming, for computing weights of attributes and similarity thresholds for records being compared. The experimental evaluation of the method and its assessment by experts from the financial institution proved that it is adequate to the deduplication problem at hand. The whole data deduplication pipeline that we have developed has been deployed in the financial institution and is run in their production system, processing batches of over 20 million of customer records.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102323"},"PeriodicalIF":3.0000,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S030643792300159X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Data stored in information systems are often erroneous. Duplicate data are one of the typical error types. To discover and handle duplicates, so-called deduplication methods are applied. These algorithms are complex and time-consuming. In data deduplication, pairs of records are compared and their similarities are computed. For a given deduplication problem, the challenging tasks are: (1) deciding which similarity measures are the most adequate for the attributes being compared, (2) defining the importance of the attributes being compared, and (3) defining adequate similarity thresholds between similar and dissimilar pairs of records. In this paper, we summarize our experience gained from a real R&D project run for a large financial institution. In particular, we answer the following three research questions: (1) what are the adequate similarity measures for comparing attributes of text data types, (2) what are the adequate weights of attributes in the procedure of comparing pairs of records, and (3) what are the similarity thresholds between the classes: duplicates, probable duplicates, and non-duplicates? The answers to these questions are based on an experimental evaluation of 54 similarity measures for text values. The measures were compared on five real data sets with different data characteristics. The similarity measures were assessed based on: (1) the similarity values they produced for the values being compared and (2) their execution time. Furthermore, we present our method, based on mathematical programming, for computing the weights of attributes and the similarity thresholds for records being compared. The experimental evaluation of the method and its assessment by experts from the financial institution proved that it is adequate for the deduplication problem at hand. The whole data deduplication pipeline that we have developed has been deployed in the financial institution and runs in their production system, processing batches of over 20 million customer records.
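To make the scheme described in the abstract concrete: each record pair is scored by combining per-attribute similarities with attribute weights, and the aggregate score is sorted into the three classes by two thresholds. The following is a minimal sketch, not the authors' code; the similarity measure, attribute names, weights, and threshold values are all illustrative assumptions.

```python
# Sketch of weighted record-pair scoring with two class thresholds
# (all concrete values below are hypothetical, not from the paper).
from difflib import SequenceMatcher

def attr_similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]; the paper evaluates 54 such measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical attribute weights (the paper derives them via mathematical programming).
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

# Hypothetical thresholds separating the three classes named in the abstract.
T_DUP, T_PROB = 0.90, 0.75

def classify_pair(r1: dict, r2: dict) -> str:
    score = sum(w * attr_similarity(r1[attr], r2[attr]) for attr, w in WEIGHTS.items())
    if score >= T_DUP:
        return "duplicate"
    if score >= T_PROB:
        return "probable duplicate"
    return "non-duplicate"

if __name__ == "__main__":
    a = {"name": "Jan Kowalski", "address": "10 Main St", "phone": "555-0101"}
    b = {"name": "Jan Kowalsky", "address": "10 Main Street", "phone": "555-0101"}
    print(classify_pair(a, b))  # likely "duplicate" given the near-identical values
```

The abstract further states that the weights and thresholds are computed via mathematical programming, but does not give the model itself. The sketch below shows one plausible linear-programming formulation under that assumption: given labeled pairs, find weights summing to one and two separated thresholds while minimizing total constraint violation. The formulation, margin, and variable layout are assumptions, not the paper's method.

```python
# A hypothetical LP for fitting weights and thresholds from labeled pairs.
import numpy as np
from scipy.optimize import linprog

def fit_weights_and_thresholds(S, labels, margin=0.05):
    """S: (n, m) matrix of per-attribute similarities for n labeled pairs;
    labels: 1 for known duplicates, 0 for known non-duplicates.
    Returns weights w (w >= 0, sum w = 1) and thresholds t_lo < t_hi that
    separate the classes, minimizing the total slack over violated pairs."""
    n, m = S.shape
    # Variable layout: [w_1..w_m, t_lo, t_hi, xi_1..xi_n]
    c = np.concatenate([np.zeros(m + 2), np.ones(n)])  # minimize total slack
    A_ub, b_ub = [], []
    for i in range(n):
        row = np.zeros(m + 2 + n)
        if labels[i] == 1:   # duplicate: w.s_i + xi_i >= t_hi
            row[:m] = -S[i]; row[m + 1] = 1.0; row[m + 2 + i] = -1.0
        else:                # non-duplicate: w.s_i - xi_i <= t_lo
            row[:m] = S[i]; row[m] = -1.0; row[m + 2 + i] = -1.0
        A_ub.append(row); b_ub.append(0.0)
    row = np.zeros(m + 2 + n)  # keep thresholds apart: t_lo + margin <= t_hi
    row[m] = 1.0; row[m + 1] = -1.0
    A_ub.append(row); b_ub.append(-margin)
    A_eq = np.zeros((1, m + 2 + n)); A_eq[0, :m] = 1.0  # weights sum to 1
    bounds = [(0, 1)] * (m + 2) + [(0, None)] * n       # slacks are non-negative
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:m], res.x[m], res.x[m + 1]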

Source journal: Information Systems (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 9.40 | Self-citation rate: 2.70% | Articles per year: 112 | Review time: 53 days
Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.