Efficiently Computing Inclusion Dependencies for Schema Discovery

22nd International Conference on Data Engineering Workshops (ICDEW'06). Pub Date: 2006-04-03. DOI: 10.1109/ICDEW.2006.54

Jana Bauckmann, U. Leser, Felix Naumann


Citations: 31

Abstract



Large data integration projects must often cope with undocumented data sources. Schema discovery aims at automatically finding structures in such cases. An important class of relationships between attributes that can be detected automatically is that of inclusion dependencies (INDs), which provide an excellent basis for guessing foreign key constraints. INDs can be discovered by comparing the sets of distinct values of pairs of attributes. In this paper we present efficient algorithms for finding unary INDs. We first show that (and why) SQL is not suitable for this task. We then develop two algorithms that compute inclusion dependencies outside of the database. Both are much faster than the SQL-based methods; in fact, for larger schemas they are the only feasible solution. Our experiments show that we can compute all unary INDs in a schema of 1,680 attributes with a total database size of 3.2 GB in approximately 2.5 hours.
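
To make the idea of "comparing the sets of distinct values of pairs of attributes" concrete, here is a minimal Python sketch of the naive pairwise test for unary INDs. It is not either of the two algorithms the paper develops; it only illustrates the definition. The function name `unary_inds`, the attribute names, and the toy data are all invented for illustration, and the choice to ignore nulls and skip empty columns is an assumption rather than something the abstract specifies.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dep, ref) where every non-null value of
    attribute `dep` also occurs in attribute `ref`.

    `columns` maps a (hypothetical) attribute name to its list of values.
    """
    # Reduce each attribute to its set of distinct non-null values.
    distinct = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    inds = []
    # Test every ordered pair of different attributes (quadratic in the
    # number of attributes, which is why large schemas are expensive).
    for dep, ref in permutations(distinct, 2):
        # Assumption: an empty dependent column is skipped, because it
        # would be trivially included in every other attribute.
        if distinct[dep] and distinct[dep] <= distinct[ref]:
            inds.append((dep, ref))
    return inds


# Toy data, invented for illustration only.
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.name": ["Ann", "Bob", "Cy", "Di"],
}
print(unary_inds(columns))
```

Each reported pair is only a candidate foreign key: inclusion of values is necessary for a foreign key constraint but does not prove that one was intended. With n attributes the loop tests n(n-1) ordered pairs, so a schema of 1,680 attributes yields roughly 2.8 million candidates, which gives a sense of why efficiency matters here.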
