Detecting duplicate objects in XML documents

Information Quality in Information Systems, Pub Date: 2004-06-18, DOI: 10.1145/1012453.1012456

Melanie Herschel, Felix Naumann


Citations: 71

Abstract

Detecting duplicate entities that describe the same real-world object (and purging them) is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required. In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured using their edit distance. To increase efficiency, we avoid computing the edit distance for many pairs of strings by applying three filtering methods in succession. Initial experiments show that our approach detects XML duplicates accurately and efficiently.
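The pipeline outlined in the abstract (thresholded string similarity based on edit distance, a cheap filter that avoids some edit-distance computations, and clustering by transitive closure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the normalization of edit distance by the longer string's length, the threshold value, and the single length-difference filter are all assumptions made for illustration. The abstract does not say which three filters the paper uses, and the sketch works on plain strings rather than on nested XML elements.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: float) -> bool:
    """True if 1 - edit_distance / max_length is at least `threshold`.

    The normalization is an assumption of this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    # Hypothetical length filter: the length difference is a lower bound
    # on the edit distance, so it gives an upper bound on the similarity.
    # If even that bound is below the threshold, skip the expensive step.
    if 1 - abs(len(a) - len(b)) / longest < threshold:
        return False
    return 1 - edit_distance(a, b) / longest >= threshold


def cluster_duplicates(values, threshold=0.85):
    """Group values into the transitive closure of the `similar` relation."""
    parent = list(range(len(values)))  # union-find forest over indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(values)), 2):
        if similar(values[i], values[j], threshold):
            parent[find(i)] = find(j)

    clusters = {}
    for i, value in enumerate(values):
        clusters.setdefault(find(i), []).append(value)
    return list(clusters.values())


if __name__ == "__main__":
    names = ["Felix Naumann", "Felix Nauman", "Felx Nauman", "Melanie Herschel"]
    for cluster in cluster_duplicates(names):
        print(cluster)
```

In this example, "Felix Naumann" and "Felx Nauman" fall just below the 0.85 threshold when compared directly, but both are similar to "Felix Nauman", so the transitive closure places all three in one cluster. "Melanie Herschel" is rejected by the length filter alone, without any edit-distance computation.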

