Schema matching using duplicates

21st International Conference on Data Engineering (ICDE'05). Pub Date: 2005-04-05. DOI: 10.1109/ICDE.2005.126

Alexander Bilke, Felix Naumann

Citations: 241

Abstract

Most data integration applications require a matching between the schemas of the respective data sets. We show how the existence of duplicates within these data sets can be exploited to automatically identify matching attributes. We describe an algorithm that first discovers duplicates among data sets with unaligned schemas and then uses these duplicates to perform schema matching between schemas with opaque column names. Discovering duplicates among data sets with unaligned schemas is more difficult than in the usual setting, because it is not clear which fields in one object should be compared with which fields in the other. We have developed a new algorithm that efficiently finds the most likely duplicates in such a setting. Now, our schema matching algorithm is able to identify corresponding attributes by comparing data values within those duplicate records. An experimental study on real-world data shows the effectiveness of this approach.
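The second step described in the abstract, matching attributes by comparing values inside duplicate records, can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes the duplicate pairs have already been found (the abstract's first step is skipped entirely), uses `difflib.SequenceMatcher` as a stand-in similarity measure, and picks a 1:1 mapping greedily. The function names `field_similarity` and `match_schemas`, the opaque column names, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; an assumed stand-in, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(duplicates):
    """Map columns of data set A to columns of data set B.

    `duplicates` is a list of (record_a, record_b) dict pairs that are
    assumed to describe the same real-world object.
    """
    first_a, first_b = duplicates[0]
    cols_a, cols_b = list(first_a), list(first_b)

    # Average the field-wise similarity of every column pair over all duplicates.
    score = {
        (ca, cb): sum(field_similarity(ra[ca], rb[cb]) for ra, rb in duplicates)
        / len(duplicates)
        for ca, cb in product(cols_a, cols_b)
    }

    # Greedy 1:1 assignment: take the best-scoring pair whose columns are both unused.
    mapping, used_b = {}, set()
    for (ca, cb), _ in sorted(score.items(), key=lambda kv: kv[1], reverse=True):
        if ca not in mapping and cb not in used_b:
            mapping[ca] = cb
            used_b.add(cb)
    return mapping


# Hypothetical duplicates from two sources with opaque column names.
duplicates = [
    ({"A1": "John Smith", "A2": "Berlin", "A3": "030-1234567"},
     {"B1": "0301234567", "B2": "J. Smith", "B3": "Berlin"}),
    ({"A1": "Maria Garcia", "A2": "Madrid", "A3": "091-7654321"},
     {"B1": "0917654321", "B2": "Maria Garcia", "B3": "Madrid"}),
]
mapping = match_schemas(duplicates)
print(mapping)
```

Because the column names carry no information here, the mapping is driven purely by the values in the duplicate records. A greedy assignment is the simplest choice for a sketch; an optimal bipartite assignment could be substituted when scores are close.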


